Re: [9fans] awk, not utf aware... - Plan9


  1. Re: [9fans] awk, not utf aware...

    i had to dig this off 9fans.net/archive. htmlfmt does some very bad things
    with non-ascii characters. i hope i put them back correctly.

    > Yes, and then there is locale: does [a-z] include ij when you run it
    > in Holland (it should)? Does it include á, è, ô in France (it should)?
    > Does it include ø, å in Norway (it should not)? And what happens when
    > you evaluate "è" < "o" (it depends)?
    >
    > Fixing awk is much harder than anyone thinks. I had a chat about it with
    > Brian Kernighan and he says he's been thinking about fixing awk for a
    > long time, but that it really is a hard problem.


    how does a program know where it's being run? ☺ how do you write a
    program that processes byte streams from a dutch user and from a
    norwegian? how does one deal with a multi-language file?

    i see some problems with localized regexps. like pre-utf character
    sets, it's impossible to tell from a byte stream what the character
    set is. two users can run the same program and get different results.
    (how do you test in an environment like this?) and, of course, you
    can't switch locale within a file, making multi-language files
    difficult.

    perhaps it would be more effective to break down the concept
    a bit. instead of a general locale hammer, why not expose some
    operations that could go into a locale? for example, have a base-
    character folding switch that allows regexps to fold codepoints into
    base codepoints so that *ïìîi -> i. this information is in the unicode
    tables. perhaps the language-dependent character mapping should
    be specified explicitly. &c.
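
A minimal sketch of the base-character folding idea, in Python. The name `fold_base` is hypothetical, and using canonical decomposition (NFD) from the Unicode tables is an assumption about how such a switch might work, not something the post specifies.

```python
import unicodedata

def fold_base(text: str) -> str:
    """Map each codepoint to its base codepoint by dropping combining marks."""
    # NFD splits precomposed letters such as U+00EF into base + combiner.
    decomposed = unicodedata.normalize("NFD", text)
    # combining() returns 0 for base characters, non-zero for combining marks.
    return "".join(c for c in decomposed if unicodedata.combining(c) == 0)

print(fold_base("\u00ef\u00ec\u00eei"))  # iiii
```

Letters with no canonical decomposition, such as ø (U+00F8), pass through unchanged, so this covers only part of what a locale would have to decide.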

    - erik


  2. Re: [9fans] awk, not utf aware...

    On Thu, Feb 28, 2008 at 6:10 AM, erik quanstrom wrote:
    > perhaps it would be more effective to break down the concept
    > a bit. instead of a general locale hammer, why not expose some
    > operations that could go into a locale? for example, have a base-
    > character folding switch that allows regexps to fold codepoints into
    > base codepoints so that *ïìîi -> i. this information is in the unicode
    > tables. perhaps the language-dependent character mapping should
    > be specified explicitly. &c.


    Loosely-related tangent:

    http://www.mail-archive.com/rsync@li.../msg20395.html

    > On the LINUX machines running utf-8 the ä is coded as $C3A4 which is
    > in utf-8 equal to the character E4. The ä occupies in that way 2 bytes.
    >
    > I was very astonished, when I copied a mac-filename, pasted into a
    > texteditor and looked at the file:
    >
    > In the mac-filename the letter ä is coded as: $61CC88, which in utf-8
    > means the letter "a" followed by a $0308. (Combining diacritical marks)
    > So the Mac combines the letter a with the two points above it instead
    > of using the E4 letter.
    > Now the things are clear: The filenames are different, in spite of
    > looking equally.


    So, if folding codepoints is a reasonable tactic, how many
    representations do you need to fold? How many binary representations
    are needed to fold *ïìîi -> i?
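
The two encodings from the quoted rsync message can be reproduced with a short Python sketch. The helper `same_name` is hypothetical. Normalizing both sides to NFC before comparing is one possible way to treat the two filenames as equal; it is an assumption here, not something either poster proposes.

```python
import unicodedata

precomposed = "\u00e4"   # ä as a single codepoint, U+00E4
decomposed = "a\u0308"   # a followed by U+0308 COMBINING DIAERESIS

print(precomposed.encode("utf-8").hex())  # c3a4
print(decomposed.encode("utf-8").hex())   # 61cc88

def same_name(a: str, b: str) -> bool:
    """Compare two names after normalizing both to NFC."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)

print(precomposed == decomposed)           # False
print(same_name(precomposed, decomposed))  # True
```

For canonically equivalent sequences like these, normalizing first means a fold only has to handle one representation per letter.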

    -Jack

  3. Re: [9fans] awk, not utf aware...

    > > On the LINUX machines running utf-8 the ä is coded as $C3A4 which is
    > > in utf-8 equal to the character E4. The ä occupies in that way 2 bytes.
    > >
    > > I was very astonished, when I copied a mac-filename, pasted into a
    > > texteditor and looked at the file:
    > >
    > > In the mac-filename the letter ä is coded as: $61CC88, which in utf-8
    > > means the letter "a" followed by a $0308. (Combining diacritical marks)
    > > So the Mac combines the letter a with the two points above it instead
    > > of using the E4 letter.
    > > Now the things are clear: The filenames are different, in spite of
    > > looking equally.
    >
    > So, if folding codepoints is a reasonable tactic, how many
    > representations do you need to fold? How many binary representations
    > are needed to fold *ïìîi -> i?


    i didn't make my point very well. in this case i was suggesting a -f flag
    for grep that would map codepoints to their base codepoints. the match
    result would be the original text, in the manner of the -i flag.
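
A rough Python sketch of how such a folding match could behave. The names `fold_base` and `grep_fold` are hypothetical, the fold is assumed to be NFD decomposition with combining marks dropped, and the -f flag is the proposal in this post, not the meaning -f has in existing greps.

```python
import re
import unicodedata

def fold_base(text: str) -> str:
    """Drop combining marks after canonical decomposition (assumed fold)."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(c for c in decomposed if unicodedata.combining(c) == 0)

def grep_fold(pattern: str, lines: list[str]) -> list[str]:
    """Return the original lines whose folded form matches the folded pattern."""
    rx = re.compile(fold_base(pattern))
    # Match against the folded text, but report the line untouched,
    # the same way -i matches case-insensitively yet prints the original.
    return [line for line in lines if rx.search(fold_base(line))]

print(grep_fold("nai", ["na\u00efve", "naive", "native"]))  # ['naïve', 'naive']
```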

    separately, however ...

    utf combining characters are a really unfortunate choice, imho. there
    is no limit to the number of combining codepoints one can add to
    a base codepoint. you can, for example, build a single letter like this:
        U+0061 U+0302 ... U+0302
    i don't think it's possible to build legible glyphs from bitmaps using
    combining diacriticals.
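
The unbounded stacking described here is easy to demonstrate in Python. The helper `letters` is hypothetical and is a simplified grouping by combining class, not a full implementation of Unicode text segmentation; the count of 50 marks is arbitrary.

```python
import unicodedata

def letters(text: str) -> list[str]:
    """Group each base codepoint with the combining marks that follow it."""
    groups: list[str] = []
    for c in text:
        if unicodedata.combining(c) and groups:
            groups[-1] += c   # attach the mark to the preceding base
        else:
            groups.append(c)  # start a new letter
    return groups

# U+0061 followed by fifty U+0302 COMBINING CIRCUMFLEX ACCENT, then a plain b.
stacked = "a" + "\u0302" * 50 + "b"
print(len(stacked), len(letters(stacked)))  # 52 2
```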

    therefore, i would argue for reducing letters made up of base+combiners
    to a precombined codepoint whenever possible. it would be helpful
    if tcs did this. unfortunately, some transliterations of russian into the roman
    alphabet use characters with no precombined form in unicode.
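
Reducing base+combiner sequences to precombined codepoints "whenever possible" is roughly what Unicode normalization form NFC does. The Python sketch below uses hypothetical helper names and shows one sequence that composes and one that cannot. The q with diaeresis is only a stand-in for a letter with no precombined form, not a claim about any particular russian transliteration scheme, and nothing here describes what tcs does.

```python
import unicodedata

def precombine(text: str) -> str:
    """Compose base + combiner pairs into single codepoints where one exists."""
    return unicodedata.normalize("NFC", text)

def leftover_combiners(text: str) -> list[str]:
    """List the combining marks that survive composition."""
    return [f"U+{ord(c):04X}" for c in precombine(text) if unicodedata.combining(c)]

print(leftover_combiners("a\u0308"))  # []
print(leftover_combiners("q\u0308"))  # ['U+0308']
```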

    rob probably has a more informed opinion on this than i.

    - erik
