Encoding issues with literal strings (C++) - Linux


  1. Encoding issues with literal strings (C++)


    Hi,

    I'm a bit puzzled by the following.

    My application is a client/server, where the server runs on Linux
    and is written in C++. The client runs on Windows and is written
    in Borland C++ Builder 6.

    Since it is in Spanish (most of the users are hispanophones), I
    have many messages that the server sends that include characters
    with accent (in HTML, &aacute;, &eacute;, etc.).

    Some of these messages come from literal strings, with embedded
    \x sequences to represent the special characters in ISO-8859-1
    (or rather, Windows-1252).

    For instance, in LATIN1 (ISO-8859-1) and in Windows-1252 encodings,
    the a with acute accent has the code 0xE1; the o with acute accent
    has code 0xF3 ... So I write those just like that (well, \xE1 and
    \xF3 in the literal strings), and it works.

    But I have two puzzling problems:

    1) When I write the i with acute accent (which has code 0xED), that
    one doesn't work (shows up as a greek letter beta on the client,
    and the letter after that one doesn't show).

    When I do a hexdump -C of the executable, I see that the string
    is not the same!!! The \xED character has been replaced by a
    0xDF, and the character after the \xED is missing !!! Here:

    The literal string is: " ..... espec\xEDficos ..... "

    The hexdump output (the relevant line) is:

    65 73 20 65 73 70 65 63 df 69 63 6f 73 2e 20 20 |es espec.icos. |

    Why did that happen? How do I avoid it? (Without having to
    manually edit the executable, that is.) I have the feeling that
    it has to do with UTF-8 encoding, perhaps invalid UTF-8 sequences
    that the compiler is "fixing" --- but, if that is the case, why?


    2) The other thing is that I'm getting a compiler warning of hex
    escape sequence out of range for the \xF3 --- yet that character
    shows up ok (the o with acute accent).


    Thanks for any ideas!

    Carlos
    --

  2. Re: Encoding issues with literal strings (C++)

    Carlos Moreno wrote:
    > [...]
    > 1) When I write the i with acute accent (which has code 0xED), that
    > one doesn't work (shows up as a greek letter beta on the client,
    > and the letter after that one doesn't show).
    > [...]
    > The literal string is: " ..... espec\xEDficos ..... "
    >
    > The hexdump output (the relevant line) is:
    >
    > 65 73 20 65 73 70 65 63 df 69 63 6f 73 2e 20 20 |es espec.icos. |
    > [...]
    > 2) The other thing is that I'm getting a compiler warning of hex
    > escape sequence out of range for the \xF3 --- yet that character
    > shows up ok (the o with acute accent).


    They're both the same problem. I'm not sure if this is a bug or not, but
    gcc is taking more than two hex digits for the escape. In your
    example:
    "espec\xEDficos"

    Here gcc reads the escape as \xEDF, i.e. the value 0xEDF, which is out
    of range for a char. The value modulo 256 --- 0xDF --- is what shows up
    in your output. I confirmed this behavior in gcc 3.4.4.

    Again, I always thought C only uses two digits for \x escapes, so this
    smells like non-conformance to me. However, you can work around it by
    terminating the sequence with whitespace, or you can make it two strings
    as follows:
    "espec\xED""ficos"

    This is valid C syntax. The compiler will concatenate these two strings
    and produce the correct characters.

    Cheers,
    John

  3. Re: Encoding issues with literal strings (C++)

    John Fusco writes:
    > [...]
    > However, you can work around it by
    > terminating the sequence with whitespace, or you can make it two
    > strings as follows:
    > "espec\xED""ficos"
    >
    > This is valid C syntax. The compiler will concatenate these two
    > strings and produce the correct characters.


    What I would do, is to keep my sources encoded in utf-8, and just be
    sure to output the HTML with the right "Content-type:...;charset..."
    and META tag.


    --
    __Pascal Bourguignon__ http://www.informatimago.com/
    Our enemies are innovative and resourceful, and so are we. They never
    stop thinking about new ways to harm our country and our people, and
    neither do we. -- Georges W. Bush

  4. Re: Encoding issues with literal strings (C++)

    John Fusco wrote:

    [...]
    > gcc is taking more than two digits to make a string literal. In your
    > example:
    > "espec\xEDficos"
    >
    > Here gcc is taking the literal as 0xedf, which is out of range. The
    > modulo value of 0xdf is what shows up in your output. [...]



    Thank you SO MUCH for noticing and pointing it out!! It wouldn't
    have occurred to me in a million years!!! (well, ok, I'm exaggerating,
    but still, thanks so much!!!!)


    > Again, I always thought C only uses two digits for \x escapes, so this
    > smells like non-conformance to me.


    With much horror, I have to confirm that gcc/g++ is correct, as per
    the C++ ISO/IEC standard (the 1998 one --- dunno if it's going to
    be changed in the next revision, or if it has been already, in the
    current draft):

    From 2.13.2:

    "The escape \ooo consists of the backslash followed by one, two, or
    three octal digits [...]. The escape \xhhh consists of the backslash
    followed by x followed by one or more hexadecimal digits that are
    taken to specify the value of the desired character. There is no
    limit to the number of digits in a hexadecimal sequence. A sequence
    of octal or hexadecimal digits is terminated by the first character
    that is not an octal digit or a hexadecimal digit, respectively."


    I'm completely speechless !!!

    > However, you can work around it by
    > terminating the sequence with whitespace, or you can make it two strings
    > as follows:
    > "espec\xED""ficos"


    This one would work --- terminating with whitespace is not an option,
    since the string is what it is; I cannot choose to put a space,
    newline, or tab after the i-acute-accent character: the word is
    "específicos" (with the acute accent on the first i).

    But yeah, relying on the automatic concatenation of string literals
    is definitely an option --- it seems ridiculous that I would have to
    do that; but again, that goes with the "I'm speechless" part and how
    horrifying I find this feature, which IMHO is more of a gratuitous
    defect of the language (I guess both C and C++ share this "defect").

    BTW, it would be a *really* nice feature for the text editors (e.g.,
    Kwrite in KDevelop) that they highlight the three-digit sequence, no?

    Thanks,

    Carlos
    --

  5. Re: Encoding issues with literal strings (C++)

    Pascal Bourguignon wrote:
    > John Fusco writes:
    >
    >> [...]

    >
    >
    > What I would do, is to keep my sources encoded in utf-8, and just be
    > sure to output the HTML with the right "Content-type:...;charset..."
    > and META tag.


    Except --- who said that the output is HTML?

    If it were HTML, I'd rather encode it in HTML, and not in UTF-8;
    that is, I would have written that as: espec&iacute;ficos (which
    is, BTW, how I write it whenever I need to write an HTML document
    containing Spanish text).

    I was, in fact, considering the possibility of having my application
    decode (at run-time) the literal string containing HTML entities,
    or some other encoding; even URL-encoding, perhaps --- just a %
    instead of a \x .

    Thanks,

    Carlos
    --

  6. Re: Encoding issues with literal strings (C++)

    On 2006-12-20, Carlos Moreno wrote:

    > The literal string is: " ..... espec\xEDficos ..... "
    >
    > The hexdump output (the relevant line) is:
    >
    > 65 73 20 65 73 70 65 63 df 69 63 6f 73 2e 20 20 |es espec.icos. |
    >
    > Why did that happen? How do I avoid it? (Without having to
    > manually edit the executable, that is.) I have the feeling that
    > it has to do with UTF-8 encoding, perhaps invalid UTF-8 sequences
    > that the compiler is "fixing" --- but, if that is the case, why?


    The compiler definitely should not mess with string literals
    like that.

    > 2) The other thing is that I'm getting a compiler warning of hex
    > escape sequence out of range for the \xF3 --- yet that character
    > shows up ok (the o with acute accent).


    odd...

    Post something we can compile, and the full text of the warning,
    and the full version number of the compiler you are using.

    Bye.
    Jasen

  7. Re: Encoding issues with literal strings (C++)

    jasen wrote:

    > [...]
    > Post something we can compile, and the full text of the warning,
    > and the full version number of the compiler you are using.


    Thanks for your comments --- however, maybe your newsreader is not
    showing the other replies in this thread, where John Fusco pointed
    out the problem (the f that follows is being taken as part of the
    hex sequence), and it later turned out that this is standard-conforming
    behaviour. (now *that*, I found odd :-))

    Thanks,

    Carlos
    --

  8. Re: Encoding issues with literal strings (C++)

    On 2006-12-22, Carlos Moreno wrote:
    > jasen wrote:
    >
    >> [...]
    >> Post something we can compile, and the full text of the warning,
    >> and the full version number of the compiler you are using.

    >
    > Thanks for your comments --- however, maybe your newsreader is not
    > showing the other replies in this thread, where John Fusco pointed
    > out the problem (the f that follows is being taken as part of the
    > hex sequence), and it later turned out that this is standard-conforming
    > behaviour. (now *that*, I found odd :-))


    Yeah, as is evident, it's a new one to me.

    It makes sense though... the Unicode character set has over a million
    code points, so C needs a way to represent them all...

    It was (but no longer is) my understanding that after the \ only up to
    three characters were interpreted as digits (3 octal, or x + 2 hex).


    --

    Bye.
    Jasen

  9. Re: Encoding issues with literal strings (C++)

    jasen wrote:

    >> [...]
    >>Thanks for your comments --- however, maybe your newsreader is not
    >>showing the other replies in this thread, where John Fusco pointed
    >>out the problem (the f that follows is being taken as part of the
    >>hex sequence), and it later turned out that this is standard-conforming
    >>behaviour. (now *that*, I found odd :-))

    >
    > yeah as is evident it's a new one to me.
    >
    > It makes sense though... the Unicode character set has over a million
    > code points, so C needs a way to represent them all...
    >
    > It was (but no longer is) my understanding that after the \ only up to
    > three characters were interpreted as digits (3 octal, or x + 2 hex).


    Odd, isn't it? (that was exactly my understanding as well)

    In comp.lang.c++.moderated, it was pointed out that C++ has universal
    character names --- after re-checking my copy of the ISO standard (the
    1998 one), I see that my \xED could (should) have been \u00ED, which
    wouldn't suffer from the same problem, as the \u is followed by
    *exactly* four hexadecimal digits (and there is also \U, which is
    followed by exactly eight digits).

    Thanks,

    Carlos
    --
