Re: [9fans] awk, not utf aware...


  1. Re: [9fans] awk, not utf aware...

    > There is split and other functions,
    > for example:
    >
    > toupper("a*")
    > gives
    > A*
    >
    > My guess is that there are many more little (or not) corners where it
    > doesn't work.


    Yes, and then there is locale: does [a-z] include ij when you run it
    in Holland (it should)? Does it include á, è, ô in France (it should)?
    Does it include ø, å in Norway (it should not)? And what happens when
    you evaluate "è" < "o" (it depends)?
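
    A sketch of the "it depends", assuming golang.org/x/text is the library
    at hand; the comparison a plain byte-wise engine makes is shown alongside
    the French collation order:

        package main

        import (
                "fmt"

                "golang.org/x/text/collate"
                "golang.org/x/text/language"
        )

        func main() {
                // Byte-wise, "è" (0xC3 0xA8 in UTF-8) sorts after "o" (0x6F).
                fmt.Println("è" < "o") // false

                // Under French collation rules, è sorts before o.
                c := collate.New(language.French)
                fmt.Println(c.CompareString("è", "o")) // -1
        }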

    Fixing awk is much harder than anyone thinks. I had a chat about it with
    Brian Kernighan and he says he's been thinking about fixing awk for a
    long time, but that it really is a hard problem.

    Sape


  2. [9fans] localization, unicode, regexps (was: awk, not utf aware...)

    > erik | Sape * uriel

    I have been pondering character sets rather a lot recently (mostly wishful
    thinking, by my estimation), so this conversation set me thinking more...

    > how does one deal with a multi-language file.

    By not dealing in languages? Unicode (however flawed) solves multi-script
    files; why mire ourselves in mutable language rules when scripts are
    already plenty to deal with?

    > for example, have a base-character folding switch that allows regexps
    > to fold codepoints into base codepoints so that íïìîi -> i.

    I would favor decomposing codepoints (í→i+U+0301, ï→i+U+0308, ì→i+U+0300,
    î→i+U+0302) with the switch to ignore combining characters; that has the
    disadvantage of lengthening your text by a byte or rune at a time, but it
    does allow you to match accents.
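
    Roughly that, sketched in Go (fold is a made-up helper; the mechanism is
    NFD decomposition followed by dropping marks in category Mn, using the
    golang.org/x/text normalizer):

        package main

        import (
                "fmt"
                "strings"
                "unicode"

                "golang.org/x/text/unicode/norm"
        )

        // fold decomposes to NFD and drops combining marks (category Mn),
        // so í, ï, ì and î all reduce to a plain i.
        func fold(s string) string {
                return strings.Map(func(r rune) rune {
                        if unicode.Is(unicode.Mn, r) {
                                return -1 // drop the combining mark
                        }
                        return r
                }, norm.NFD.String(s))
        }

        func main() {
                fmt.Println(fold("íïìî")) // iiii
        }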

    | Yes, and then there is locale: does [a-z] include ij when you run it
    | in Holland (it should)? Does it include á, è, ô in France (it should)?
    | Does it include ø, å in Norway (it should not)? And what happens when
    | you evaluate "è" < "o" (it depends)?
    Does Spanish [a-c] match the c in ch (depends on when and where you ask)?
    More Unicode-centric: does 'a' match (the first byte of) 'à' (U+0061
    U+0300), or all three bytes, or not at all?
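
    One engine's answer, for what it is worth: Go's regexp works on
    codepoints and does no normalization, so the two readings come out
    differently (other engines answer differently again, which is rather
    the point):

        package main

        import (
                "fmt"
                "regexp"
        )

        func main() {
                decomposed := "a\u0300" // 'a' + combining grave, renders as à

                // Plain 'a' matches the base letter inside the decomposed form,
                fmt.Println(regexp.MustCompile("a").MatchString(decomposed)) // true
                // but the precomposed à (U+00E0) does not match it at all.
                fmt.Println(regexp.MustCompile("\u00e0").MatchString(decomposed)) // false
        }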

    I would write [a-z] in a regexp on two occasions: for a letter of the
    Latin alphabet (better served by something like [[:latin:]], so I needn't
    add a bunch of other things ([þðæœø])), or for the byte range 0x61-0x7a.
    As any sort of public project is stuck with Unicode (not advocating the
    earlier hysteria, just wishing Unicode had left some of it behind),
    regexps reflecting Unicode, not the user's language, make sense to me.
    Unicode is at least codified.
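
    Go's regexp spells something like my [[:latin:]] as \p{Latin}; a small
    sketch of the difference against [a-z]:

        package main

        import (
                "fmt"
                "regexp"
        )

        func main() {
                letters := "zþðæœø"

                // [a-z] is only the codepoint range 0x61-0x7a...
                fmt.Println(regexp.MustCompile("^[a-z]+$").MatchString(letters)) // false

                // ...while the script class covers the rest of the Latin letters.
                fmt.Println(regexp.MustCompile(`^\p{Latin}+$`).MatchString(letters)) // true
        }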

    * I think the plan9 tools demonstrate that it is not so hard to find a
    * 'good enough' solution; and the lunix locale debacle demonstrates that
    * if you want to get it 'right' you will end up with a nightmare.
    Yet some things that are good enough (I'll pick on Unicode) for one idea
    (lumping character sets together does a fine job of letting you write
    multiple scripts in the same file) spawn nightmares elsewhere, such as
    ǭ = ǭ = ǭ = ǭ = ǭ, with "good enough" being ill-thought-out. Yet mayhap
    you mean well-compromised (that seems right).
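
    The nightmare does at least have a documented cure: canonical
    normalization collapses the equivalent spellings into one. A sketch with
    the golang.org/x/text normalizer (assuming that package):

        package main

        import (
                "fmt"

                "golang.org/x/text/unicode/norm"
        )

        func main() {
                // Five canonically equivalent spellings of ǭ.
                forms := []string{
                        "\u01ed",        // precomposed
                        "\u01eb\u0304",  // ǫ + combining macron
                        "\u014d\u0328",  // ō + combining ogonek
                        "o\u0328\u0304", // o + ogonek + macron
                        "o\u0304\u0328", // o + macron + ogonek
                }
                for _, f := range forms {
                        // All five come out as the single codepoint U+01ED.
                        fmt.Printf("%q -> %q\n", f, norm.NFC.String(f))
                }
        }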

    To those who were at IWP9 this year: cast your mind back to the question
    of whether any plan9 people have a vested interest in RtL rendering and
    the like. I should have stood up then and cried out, I! Imagine either
    that I did so or that I do now.

    If anyone has an interest in playing with this at a character-set level, tell?

    enjoy,
    tristan

    --
    All original matter is hereby placed immediately under the public domain.

