UTF-8 encoding problem - Linux



  1. UTF-8 encoding problem

    Hi All,

    I have a GUI that accepts a Unicode string and searches a given
    set of XML files for that string.

    Now, I have 2 XML files, both saved in UTF-8 format and containing
    characters from different languages.

    Although both of them have a UTF-8 BOM, only the first file also has
    UTF-8 declared in the XML declaration at the top of the file.

    Now, when I search that directory for a character from another
    language using a third-party desktop search GUI, it shows that the
    character exists in the first file (the one that also has the XML
    declaration), but not in the second file (which has only the BOM).

    Initially I thought the problem was mainly because of UTF-8
    supporting both multibyte and Unicode, but I could not find much on it,
    because both files had the same contents when opened in binary mode
    (except for the XML declaration in one of them).
    Please help.

    Regards,
    Shreshth
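
    The setup described in this post (two files, both starting with a
    UTF-8 BOM, only one carrying an XML declaration) can be checked with a
    small script. This is only a rough sketch: the function name
    describe_xml_header is made up for illustration, and it assumes the
    data is UTF-8 or otherwise ASCII-compatible, so it will not work on
    UTF-16 files.

```python
import codecs
import re

# Matches encoding="..." (or '...') inside an XML declaration that sits
# at the very start of the document, right after any BOM.
_DECL = re.compile(
    rb'''^<\?xml[^>]*?encoding\s*=\s*["']([A-Za-z][A-Za-z0-9._-]*)["']'''
)


def describe_xml_header(data):
    """Return (has_utf8_bom, declared_encoding) for the raw bytes of a file.

    Hypothetical helper: assumes ASCII-compatible data such as UTF-8.
    """
    has_bom = data.startswith(codecs.BOM_UTF8)
    body = data[len(codecs.BOM_UTF8):] if has_bom else data
    match = _DECL.match(body)
    declared = match.group(1).decode("ascii") if match else None
    return has_bom, declared


# Two in-memory stand-ins for the files from the post.
first = codecs.BOM_UTF8 + '<?xml version="1.0" encoding="UTF-8"?><r>\u0915</r>'.encode("utf-8")
second = codecs.BOM_UTF8 + "<r>\u0915</r>".encode("utf-8")

print(describe_xml_header(first))   # (True, 'UTF-8')
print(describe_xml_header(second))  # (True, None)
```

    For real files, the first few hundred bytes read in binary mode
    should be enough to pass in as data.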


  2. Re: UTF-8 encoding problem

    shreshth.luthra@gmail.com wrote:
    > Although both of them are having UTF-8 as BoM


    The BOM (byte order mark) is needed for Unicode files stored as UTF-16
    or UCS-2, not for UTF-8. UTF-8 streams are read byte-by-byte, so byte
    order is not an issue. UTF-16 is read 2 bytes (a word) at a time, so
    the BOM is needed to tell little-endian data from big-endian data.
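
    To make the byte-order point concrete, here is a small Python sketch
    (the helper name encoded_forms is just for illustration). It encodes
    one character in UTF-8 and in both UTF-16 byte orders: the UTF-8 bytes
    are the same on every platform, while the two UTF-16 forms are
    byte-swapped versions of each other.

```python
import codecs


def encoded_forms(text):
    """Return the raw bytes of `text` in UTF-8 and both UTF-16 byte orders."""
    return {
        "utf-8": text.encode("utf-8"),
        "utf-16-le": text.encode("utf-16-le"),
        "utf-16-be": text.encode("utf-16-be"),
    }


forms = encoded_forms("\u00e9")  # U+00E9, LATIN SMALL LETTER E WITH ACUTE
for name, raw in forms.items():
    print(name, raw.hex())
# utf-8 c3a9
# utf-16-le e900
# utf-16-be 00e9

# The byte order marks themselves, as defined in the standard library.
print(codecs.BOM_UTF16_LE.hex(), codecs.BOM_UTF16_BE.hex(), codecs.BOM_UTF8.hex())
# fffe feff efbbbf
```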

    Therefore, the BOM is useless for detecting UTF-8 files, unless the
    file was written by some smart-ass programmer who didn't understand how
    it works but put a BOM in there since he thought it was the correct
    thing to do. Unfortunately, many M$-based tools do exactly that, as
    newer versions of Windows (2000, XP) use UTF-16 internally in their
    APIs. Windows NT uses UCS-2, so it is also needed there.
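
    Since such files exist in practice, a search tool has to cope with a
    leading BOM either way. A minimal sketch in Python, with made-up
    helper names (detect_bom, contains_text) and no claim about how any
    particular desktop search tool works: the first function reports which
    BOM, if any, the data starts with, and the second decodes with
    Python's "utf-8-sig" codec, which drops a leading UTF-8 BOM if there
    is one and otherwise behaves like plain UTF-8.

```python
import codecs

# Longest signatures first: the UTF-32 LE BOM begins with the UTF-16 LE BOM.
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def detect_bom(data):
    """Return the encoding named by a leading BOM, or None if there is none."""
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name
    return None


def contains_text(data, needle):
    """Search UTF-8 bytes for `needle`, ignoring a leading BOM if present."""
    return needle in data.decode("utf-8-sig")


with_bom = codecs.BOM_UTF8 + "<r>\u0915</r>".encode("utf-8")
without_bom = "<r>\u0915</r>".encode("utf-8")

print(detect_bom(with_bom), detect_bom(without_bom))  # utf-8 None
print(contains_text(with_bom, "\u0915"), contains_text(without_bom, "\u0915"))  # True True
```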


    Probably not the best reference, but here is some info (pages 25-42):

    http://www.destructor.de/talks/fb2005-charsets.zip

    --
    Milan Babuskov
    http://njam.sourceforge.net
    http://swoes.blogspot.com
