Re: [9fans] awk, not utf aware...
> There is split and other functions,
> for example:
>
> toupper("aĆ*")
> gives
> AĆ*
>
> My guess is that there are many more little (or not) corners where it
> doesn't work.
> We can go on and on looking for crevices and hiding the bugs further
> under the rug, so that they are not evident and find everyone
> completely unaware; leave awk as it is now; or really fix the problem.
> The first approach doesn't work. I am going to take the second until I
> have time to take the third, which means using runes or at least
> revising all the code so that it is uniformly aware of the existence
> of non-ascii characters.
i don't understand this approach. you propose redoing a fundamental
part of awk. yet at the end you won't have solved the bug that's bothering
you.
ignoring the fact that awk is an ape program and doesn't use runes, the
problem with toupper is independent of the internal representation
of strings. as far as i can tell, ape doesn't even have towupper and towlower.
so if you provide those functions, fixing toupper and tolower could be
a 5 minute fix. and you know you won't have broken anything else.
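to make the idea concrete, here is a minimal sketch in c, assuming the input is utf-8. it is not awk's or ape's actual code. sketch_towupper is a hypothetical stand-in for the missing towupper and only knows ascii and latin-1; utf8_decode, utf8_encode and utf8_toupper are likewise names made up for illustration. the point is only that the case mapping works on decoded characters, whatever the string representation is.

```c
#include <stddef.h>

/* hypothetical stand-in for towupper: ascii and latin-1 only */
static unsigned long
sketch_towupper(unsigned long c)
{
	if(c >= 'a' && c <= 'z')
		return c - ('a' - 'A');
	/* U+00E0..U+00FE map to U+00C0..U+00DE, except U+00F7 (division sign) */
	if(c >= 0xE0 && c <= 0xFE && c != 0xF7)
		return c - 0x20;
	return c;
}

/*
 * decode one utf-8 sequence at s into *cp and return its length in bytes.
 * returns 0 for a malformed sequence.  overlong forms are not rejected.
 */
static int
utf8_decode(const unsigned char *s, unsigned long *cp)
{
	unsigned long c;
	int i, n;

	if(s[0] < 0x80){
		*cp = s[0];
		return 1;
	}
	if((s[0] & 0xE0) == 0xC0){ n = 2; c = s[0] & 0x1F; }
	else if((s[0] & 0xF0) == 0xE0){ n = 3; c = s[0] & 0x0F; }
	else if((s[0] & 0xF8) == 0xF0){ n = 4; c = s[0] & 0x07; }
	else
		return 0;
	for(i = 1; i < n; i++){
		/* a NUL or any non-continuation byte ends the scan here */
		if((s[i] & 0xC0) != 0x80)
			return 0;
		c = (c << 6) | (s[i] & 0x3F);
	}
	*cp = c;
	return n;
}

/* encode c as utf-8 into out (at least 4 bytes); return the length */
static int
utf8_encode(unsigned long c, unsigned char *out)
{
	if(c < 0x80){
		out[0] = (unsigned char)c;
		return 1;
	}
	if(c < 0x800){
		out[0] = (unsigned char)(0xC0 | (c >> 6));
		out[1] = (unsigned char)(0x80 | (c & 0x3F));
		return 2;
	}
	if(c < 0x10000){
		out[0] = (unsigned char)(0xE0 | (c >> 12));
		out[1] = (unsigned char)(0x80 | ((c >> 6) & 0x3F));
		out[2] = (unsigned char)(0x80 | (c & 0x3F));
		return 3;
	}
	out[0] = (unsigned char)(0xF0 | (c >> 18));
	out[1] = (unsigned char)(0x80 | ((c >> 12) & 0x3F));
	out[2] = (unsigned char)(0x80 | ((c >> 6) & 0x3F));
	out[3] = (unsigned char)(0x80 | (c & 0x3F));
	return 4;
}

/*
 * copy src to dst, upper-casing character by character.
 * malformed bytes are copied through unchanged.
 * returns 0 on success, -1 if dst is too small.
 */
int
utf8_toupper(const char *src, char *dst, size_t dstsize)
{
	const unsigned char *s = (const unsigned char*)src;
	unsigned char buf[4];
	unsigned long c;
	size_t used = 0;
	int i, m, n;

	while(*s != '\0'){
		n = utf8_decode(s, &c);
		if(n == 0){
			buf[0] = *s;
			m = n = 1;
		}else
			m = utf8_encode(sketch_towupper(c), buf);
		if(used + (size_t)m + 1 > dstsize)
			return -1;
		for(i = 0; i < m; i++)
			dst[used++] = (char)buf[i];
		s += n;
	}
	if(used + 1 > dstsize)
		return -1;
	dst[used] = '\0';
	return 0;
}
```

the output goes to a separate buffer because a full case table can change the encoded length of a character, even though this small table does not. characters outside the table, cyrillic for instance, pass through unchanged, which is exactly the gap a real towupper would have to fill.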
/sys/doc/utf.ps is worth a read. it's not too hard to think of situations
that depend on character boundaries or operate on non-ascii characters.
generally there are few. for example, rc only bothers with character
boundaries in matching. perhaps you could build a utf testsuite for awk.
make sure to use non-latin1 languages, too.
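as a rough illustration of what such a testsuite would be checking, here is a small c sketch of the two places where character boundaries matter most: counting characters and finding where the k-th character starts. utf8_charcount and utf8_offset are hypothetical names, not functions from awk or ape, and both assume well-formed utf-8.

```c
#include <stddef.h>

/*
 * count characters by counting the bytes that are not
 * continuation bytes (10xxxxxx).  assumes well-formed utf-8.
 */
size_t
utf8_charcount(const char *s)
{
	size_t n = 0;

	for(; *s != '\0'; s++)
		if(((unsigned char)*s & 0xC0) != 0x80)
			n++;
	return n;
}

/*
 * byte offset of the start of character k (0-based).
 * if s has k or fewer characters, returns the byte length of s.
 */
size_t
utf8_offset(const char *s, size_t k)
{
	size_t i;

	for(i = 0; s[i] != '\0'; i++){
		if(((unsigned char)s[i] & 0xC0) != 0x80){
			if(k == 0)
				return i;
			k--;
		}
	}
	return i;
}
```

for the three-character string 日本語, which is nine bytes in utf-8, utf8_charcount gives 3 and utf8_offset for character 1 gives 3. the matching awk tests would be that length reports 3 rather than 9, and that substr never cuts a character in half.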