[9fans] awk, not utf aware... - Plan9

  1. [9fans] awk, not utf aware...

    I think this has come up before, but I couldn't find a reply.
    If I do something like this in awk:

    split($0, c, "");

    c should be an array of Runes internally, UTF externally, but apparently
    it is not. Is it just broken? Is there a replacement? Is it just the
    builtins, or is the whole of awk broken?

    Example, freqpair

    ------
    #!/bin/awk -f

    {
        n = split($0, c, "");
        for(i=1; i<n; i++){
            pair=c[i] c[i+1];
            f[pair]++;
        }
    }
    END{
        for(h in f)
            printf("%d %s\n", f[h], h);
    }

    ------

    % echo abcd|freqpair
    1 ab
    1 cd
    1 bc
    % echo a*cd|freqpair
    1 cd
    1 �c
    1 *
    1 a�


    where the � is a Peter face...

    Thanks.

    --
    - curiosity sKilled the cat

  2. Re: [9fans] awk, not utf aware...

    Awk is one of the few programs in the distribution that is maintained
    externally (by Brian Kernighan) and is pulled in via ape and pcc (it might
    actually be the only one - I didn't bother to check). A quick glance at
    lex.c suggests that awk scans input one char at a time. In hindsight I'm a
    bit surprised that I haven't been bitten by this, but I probably never split
    within multibyte sequences. It's probably not too hard to change awk to read
    runes, at the price of creating ``the other one true awk.''

    Martin

    * Gorka Guardiola (paurea@gmail.com) wrote:
    > I think this has come up before, but I didn't found reply.
    > If I do in awk something like:
    >
    > split($0, c, "");
    >
    > c should be an array of Runes internally, UTF externally, but apparently,
    > it is not. Is it just broken?, is there a replacement?, is it just the
    > builtins or
    > is the whole awk broken?.
    >
    > Example, freqpair
    >
    > ------
    > #!/bin/awk -f
    >
    > {
    > n = split($0, c , "");
    > for(i=1; i<n; i++){
    > pair=c[i] c[i+1]
    > f[pair]++;
    > }
    > }
    > END{
    > for(h in f)
    > printf("%d %s\n", f[h], h);
    > }
    >
    > ------
    >
    > % echo abcd|freqpair
    > 1 ab
    > 1 cd
    > 1 bc
    > % echo a*cd|freqpair
    > 1 cd
    > 1 �c
    > 1 *
    > 1 a�
    >
    >
    > where the ? is a Peter face...
    >
    > Thanks.
    >
    > --
    > - curiosity sKilled the cat


  3. Re: [9fans] awk, not utf aware...

    On Tue, Feb 26, 2008 at 2:16 PM, Martin Neubauer wrote:
    > Awk is one of the few programs in the distribution that is maintained
    > externally (by Brian Kernighan) and is pulled in via ape and pcc (it might
    > actually be the only one - I didn't bother to check.) A quick glimpse at
    > lex.c suggests that awk scans input one char at a time. In hindsight I'm a
    > bit surprised that I haven't got bitten by this, but I probably didn't split
    > within multibyte sequences. It's probably not too hard to change awk to read
    > runes for the price of creating ``the other one true awk.''
    >


    I don't know if it is that easy. I'll leave it on my todo list for the future :-).
    Anyway, the BUGS section should say it does not know about UTF.
    I'll send a patch.


    --
    - curiosity sKilled the cat

  4. Re: [9fans] awk, not utf aware...

    > I think this has come up before, but I didn't found reply.
    > If I do in awk something like:
    >
    > split($0, c, "");
    >
    > c should be an array of Runes internally, UTF externally, but apparently,
    > it is not. Is it just broken?, is there a replacement?, is it just the
    > builtins or
    > is the whole awk broken?.


    i think the comments about this problem are missing the point
    a bit. utf8 should be transparent to awk unless the situation demands
    that awk needs to know the length of a character. it's not necessary
    to keep strings as Rune*s internally to work with utf8. splitting on
    "" is a special case where awk does need to know the length of
    a character. e.g. this script should work fine

    ; cat /tmp/smile
    #!/bin/awk -f
    {
    n = split($0, c, "☺");
    for(i = 1; i <= n; i++)
    print c[i]
    }
    ; echo fu☺bar|/tmp/smile
    fu
    bar

    but splitting on "" won't. i attached a patch that fixes this problem
    as an illustration. i'm not using utflen because pcc won't see it.
    it's an ugly patch.

    i don't think i know what a proper fix for awk would be. i wouldn't
    think there are many cases like this, but i haven't spent much time
    with awk internals.

    - erik

    ------

    9diff run.c
    /n/sources/plan9//sys/src/cmd/awk/run.c:1191,1196 - run.c:1191,1219
    return(False);
    }

    + static int
    + utf8len(char *s)
    + {
    + int c, n, i;
    +
    + c = *(unsigned char*)s++;
    + if ((c&0xe0) == 0xc0)
    + n = 2;
    + else if ((c&0xf0) == 0xe0)
    + n = 3;
    + else if ((c&0xf8) == 0xf0)
    + n = 4;
    + else
    + return 1; //-1;
    + i = n-1;
    + if(strlen(s) < i)
    + return 1; // -1;
    + for(; i-- && (c = *(unsigned char*)s++); )
    + if(0x80 != (c&0xc0))
    + return 1; //-1;
    + return n;
    + }
    +
    Cell *split(Node **a, int nnn) /* split(a[0], a[1], a[2]); a[3] is type */
    {
    Cell *x = 0, *y, *ap;
    /n/sources/plan9//sys/src/cmd/awk/run.c:1279,1290 - run.c:1302,1316
    s++;
    }
    } else if (sep == 0) { /* new: split(s, a, "") => 1 char/elem */
    - for (n = 0; *s != 0; s++) {
    - char buf[2];
    + int i, len;
    + char buf[5];
    + for (n = 0; *s != 0; s += len) {
    n++;
    sprintf(num, "%d", n);
    - buf[0] = *s;
    - buf[1] = 0;
    + len = utf8len(s);
    + for(i = 0; i < len; i++)
    + buf[i] = s[i];
    + buf[len] = 0;
    if (isdigit(buf[0]))
    setsymtab(num, buf, atof(buf), STR|NUM, (Array *) ap->sval);
    else

  5. Re: [9fans] awk, not utf aware...

    Plan 9 awk is an APE program, so it uses the unpronounceable ANSI
    mbtowc/wctomb functions to deal with UTF. Thus it uses mblen rather
    than utflen or utf8len.


  6. Re: [9fans] awk, not utf aware...

    And it's wonderful that the C standard defines a character literal as
    so:

    char-literal:
    ' characters '
    characters:
    character
    characters character

    (or something like that)

    Question, then: why do we need wchar_t/Rune?

    On Feb 26, 2008, at 4:08 PM, geoff@plan9.bell-labs.com wrote:

    > Plan 9 awk is an APE program, so it uses the unpronounceable ANSI
    > mbtowc/wctomb functions to deal with UTF. Thus it uses mblen rather
    > than utflen or utf8len.
    >



  7. Re: [9fans] awk, not utf aware...

    On Tue, 2008-02-26 at 16:21 -0500, Pietro Gagliardi wrote:
    > And it's wonderful that the C standard defines a character literal as
    > so:
    >
    > char-literal:
    > ' characters '
    > characters:
    > character
    > characters character
    >
    > (or something like that)
    >
    > Question, then: why do we need wchar_t/Rune?


    The definitions are (<> used to indicate non-terminals in the
    grammar...):

    (6.4.4.4) character-constant:
    ' <c-char-sequence> '
    L' <c-char-sequence> '

    (6.4.4.4) c-char-sequence:
    <c-char>
    <c-char-sequence> <c-char>

    (6.4.4.4) c-char:
    any member of the source character set except the single-quote ',
    backslash \, or new-line character
    <escape-sequence>


    Steven Vormwald
    sdvormwa@mtu.edu



  8. Re: [9fans] awk, not utf aware...

    > And it's wonderful that the C standard defines a character literal as
    > so:
    >
    > char-literal:
    > ' characters '
    > characters:
    > character
    > characters character
    >
    > (or something like that)
    >
    > Question, then: why do we need wchar_t/Rune?
    >


    because we have more than 255 characters.

    - erik

  9. Re: [9fans] awk, not utf aware...

    Yes. I'm too lazy to pick up my copy of the standard.

    On Feb 26, 2008, at 4:32 PM, Steven Vormwald wrote:

    > On Tue, 2008-02-26 at 16:21 -0500, Pietro Gagliardi wrote:
    >> And it's wonderful that the C standard defines a character literal as
    >> so:
    >>
    >> char-literal:
    >> ' characters '
    >> characters:
    >> character
    >> characters character
    >>
    >> (or something like that)
    >>
    >> Question, then: why do we need wchar_t/Rune?

    >
    > The definitions are (<> used to indicate non-terminals in the
    > grammar...):
    >
    > (6.4.4.4) character-constant:
    > ' <c-char-sequence> '
    > L' <c-char-sequence> '
    >
    > (6.4.4.4) c-char-sequence:
    > <c-char>
    > <c-char-sequence> <c-char>
    >
    > (6.4.4.4) c-char:
    > any member of the source character set except the single-quote ',
    > backslash \, or new-line character
    > <escape-sequence>
    >
    >
    > Steven Vormwald
    > sdvormwa@mtu.edu
    >
    >



  10. Re: [9fans] awk, not utf aware...

    (which I have sitting next to me)

    On Feb 26, 2008, at 4:40 PM, Pietro Gagliardi wrote:

    > Yes. I'm too lazy to pick up my copy of the standard.
    >
    > On Feb 26, 2008, at 4:32 PM, Steven Vormwald wrote:
    >
    >> On Tue, 2008-02-26 at 16:21 -0500, Pietro Gagliardi wrote:
    >>> And it's wonderful that the C standard defines a character
    >>> literal as
    >>> so:
    >>>
    >>> char-literal:
    >>> ' characters '
    >>> characters:
    >>> character
    >>> characters character
    >>>
    >>> (or something like that)
    >>>
    >>> Question, then: why do we need wchar_t/Rune?

    >>
    >> The definitions are (<> used to indicate non-terminals in the
    >> grammar...):
    >>
    >> (6.4.4.4) character-constant:
    >> ' <c-char-sequence> '
    >> L' <c-char-sequence> '
    >>
    >> (6.4.4.4) c-char-sequence:
    >> <c-char>
    >> <c-char-sequence> <c-char>
    >>
    >> (6.4.4.4) c-char:
    >> any member of the source character set except the single-quote ',
    >> backslash \, or new-line character
    >> <escape-sequence>
    >>
    >>
    >> Steven Vormwald
    >> sdvormwa@mtu.edu
    >>
    >>

    >



  11. Re: [9fans] awk, not utf aware...

    thanks for catching that.

    my brain's not on today. generally i avoid the mb functions because they
    rely on locale. of course this doesn't apply on plan 9 and so there's no reason
    for utf8len.

    it looks like mblen is used elsewhere; perhaps this would now be a worthwhile
    patch.

    - erik

    > Plan 9 awk is an APE program, so it uses the unpronounceable ANSI
    > mbtowc/wctomb functions to deal with UTF. Thus it uses mblen rather
    > than utflen or utf8len.
    >


  12. Re: [9fans] awk, not utf aware...

    On Tue, 2008-02-26 at 16:40 -0500, Pietro Gagliardi wrote:
    > Yes. I'm too lazy to pick up my copy of the standard.


    I just happened to be reading through Annex A (the grammar) at the time,
    so I thought I'd send it out.

    Steven Vormwald
    sdvormwa@mtu.edu

  13. Re: [9fans] awk, not utf aware...

    On Tue, Feb 26, 2008 at 4:21 PM, Pietro Gagliardi wrote:
    > And it's wonderful that the C standard defines a character literal as
    > so:


    But it leaves the meaning of a literal like 'abcd' up to the compiler.
    I did something very perverse -- but 'legal' -- in the compiler I
    started writing for class...

    Also recall that sizeof('c') == sizeof(int). I suspect, though, that
    literals like 'abcd' are left from the B (word-addressable, not
    byte-addressable) days.

    A quick check of /sys/src/cmd/cc/lex.c shows that kenc disallows such horrors.

    --Joel

  14. Re: [9fans] awk, not utf aware...

    "Joel C. Salomon" wrote:
    > Also recall that sizeof('c') == sizeof(int). I suspect, though, that
    > literals like 'abcd' are left from the B (word-addressable, not
    > byte-addressable) days.


    Yes, in C ordinary character constants have always had type int.
    Multi-character constants were used in the first C version of "troff",
    for one example, so the language permits them even though their use
    has nonportable aspects.
