Re: [9fans] awk, not utf aware... - Plan9

Thread: Re: [9fans] awk, not utf aware...

  1. Re: [9fans] awk, not utf aware...

    > Date: Wed, 27 Feb 2008 21:01:33 +0100
    > From: Uriel
    > Subject: Re: [9fans] awk, not utf aware...
    > To: Fans of the OS Plan 9 from Bell Labs <9fans@cse.psu.edu>
    >
    > None of those issues are specific to AWK, they apply just as well to
    > sed(1) or any program dealing with regexps. I think the plan9 tools
    > demonstrate that it is not so hard to find a 'good enough' solution;
    > and the lunix locale debacle demonstrates that if you want to get it
    > 'right' you will end up with a nightmare.


    Plan 9 had the luxury of starting over with Unicode from the ground
    up. Many of the C mb* interfaces predate Unicode, as do many of the
    character encodings in use in different parts of the world. Unix vendors
    (and standards bodies) have the very real problems of trying to make
    their software work, and continue to work for the foreseeable future,
    in different countries, encodings, etc.

    I am not saying that the POSIX locale stuff is wonderful, elegant,
    clean, etc. It has real problems; as of the most recent release,
    gawk no longer uses the locale's decimal point for numeric output
    by default.

    But one has to give the standards groups and Unix vendors credit for
    trying to grapple with a real problem instead of sidestepping it and
    then crowing about it.

    > The problem with awk is that it is not a native plan9 app, and its
    > simian nature shows in too many places. For example system() and | are
    > badly broken:
    >
    > % echo |awk '{print |"echo $KSH_VERSION"}'
    > @(#)PD KSH v5.2.14 99/07/13.2


    Why is this broken? If the shell that awk is running is PDKSH, or
    KSH_VERSION exists in the environment, this is to be expected.

    For awk specifically, off the top of my head, the functions that have to
    be character-set aware are: index, substr, length, tolower, toupper, and
    match. Gawk has been multibyte aware for several years, although there
    were some bugs initially. And someone recently pointed out another one:

    str = sprintf("%.5s", otherstr)

    has to work in terms of characters, not bytes, which I overlooked
    and still have to fix.

    > Boyd made a native port of awk that fixed most (all?) of these issues,
    > it can be found somewhere in his contrib dir but I don't think it is
    > production-ready.


    I remember talking to him about this some, since for a long while the Plan
    9 awk was one that was forked from BWK's circa 1993 and needed updating.

    > On Wed, Feb 27, 2008 at 4:54 PM, Sape Mullender
    > wrote:
    > > > There is split and other functions,
    > > > for example:
    > > >
    > > > toupper("a*")
    > > > gives
    > > > A*
    > > >
    > > > My guess is that there are many more little (or not) corners where it
    > > > doesn't work.

    > >
    > > Yes, and then there is locale: does [a-z] include ij when you run it
    > > in Holland (it should)? Does it include á, è, ô in France (it should)?
    > > Does it include ø, å in Norway (it should not)? And what happens when
    > > you evaluate "è" < "o" (it depends)?
    > >
    > > Fixing awk is much harder than anyone thinks. I had a chat about it with
    > > Brian Kernighan and he says he's been thinking about fixing awk for a
    > > long time, but that it really is a hard problem.
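The "it depends" in the comparison question can be made concrete with a small Python sketch. It does not use any real locale: `fold_key` is a hypothetical stand-in for a locale-style collation key that ignores accents, shown next to plain code-point ordering and an ASCII `[a-z]` range.

```python
# Sketch only: shows how the answer to "è" < "o" changes with the ordering rule.
# fold_key is an illustrative assumption, not a real locale's collation.
import re
import unicodedata

def fold_key(s: str) -> str:
    """Decompose to NFD and drop combining marks, so that è compares as e."""
    decomposed = unicodedata.normalize("NFD", s)
    return "".join(c for c in decomposed if not unicodedata.combining(c))

E_GRAVE = "\u00e8"  # è as a single precomposed code point

if __name__ == "__main__":
    # Code-point order: U+00E8 comes after U+006F, so è sorts after o.
    print(E_GRAVE < "o")
    # Accent-insensitive order: è folds to e, which sorts before o.
    print(fold_key(E_GRAVE) < fold_key("o"))
    # A code-point range [a-z] does not include è at all.
    print(re.fullmatch(r"[a-z]", E_GRAVE) is not None)
```

A tool that compares by code point and one that consults a locale's collation rules can therefore disagree on the same input, which is the kind of corner the quoted questions point at.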


    Indeed. I bit the bullet; Brian hasn't been willing to suffer the complaints,
    and I don't blame him. :-) You can see some of his travails by looking
    at the CHANGES file in his distribution, available from his Bell Labs
    and Princeton web pages.

    As far as I know, gawk and the Solaris /usr/xpg4/bin/awk are the only
    awks that are multibyte aware. The Solaris version is derived from the MKS
    one (see the code from opensolaris.org) with multibyte fixes. I can supply
    simple patches to make it compile on Linux if anyone wants. This version
    doesn't handle some dark corners, but has the advantage of being
    very small.

    Arnold

  2. Re: [9fans] awk, not utf aware...

    > > % echo |awk '{print |"echo $KSH_VERSION"}'
    > > @(#)PD KSH v5.2.14 99/07/13.2

    >
    > Why is this broken? If the shell that awk is running is PDKSH, or
    > KSH_VERSION exists in the environment, this is to be expected.


    I thought it was obvious that the output was from a 'standard' Plan 9
    terminal. But given the percentage of people actually using plan9 on
    this list, I guess I should have been much more explicit.

    And the problem is precisely that the environment under which awk runs
    commands is completely different from the one awk is run in; in other
    words, awk spreads its 'simian' (ape-ish) nature.

    uriel

  3. Re: [9fans] awk, not utf aware...

    > I thought it was obvious that the output was from a 'standard' Plan 9
    > terminal. But given the percentage of people actually using plan9 on
    > this list, I guess I should have been much more explicit.
    >
    > And the problem is precisely that the environment under which awk runs
    > commands is completely different from the one awk is run in; in other
    > words, awk spreads its 'simian' (ape-ish) nature.


    i think that awk is in a no-win situation here. if it used rc, then
    awk scripts from plan 9 would break on unix and vice versa. sam and
    acme have similar issues in p9p's environment. i don't see how either
    using the native shell or using the shell from the original
    environment is wrong a priori. awk picks a lane and sticks to it.
    i'd bet that benefits other ape stuff like lp.

    if you really don't like this situation, perhaps the solution is to
    improve upon awk. a plan 9 scripting language based on sre's --- as
    suggested by rob --- could be really cool.

    - erik

