A Question on multibyte string comparisons - DICOM


Thread: A Question on multibyte string comparisons

  1. A Question on multibyte string comparisons

    Hi,
    I had a couple of questions on comparing "multibyte character"
    strings as may result from encodings like GB18030, ISO-IR-159,
    ISO-IR-87 and the like...

    1. Would the use of the traditional "strcmp" guarantee that the string
    comparison would return correct results? My reasoning was that if a
    given encoding is a minimal-length encoding, then I could use strcmp
    to compare two given strings, and if they match, I could check the
    escape sequences of the encodings specified by the 8,5 tag to see
    whether the strings compared are really the same, or whether strcmp
    accidentally matched an escape sequence of one encoding with a
    character in the other. Is this a safe assumption to make?

    2. The second option would be to convert from the "traditional"
    encoding to Unicode using libraries like ICU and compare. However,
    this has the overhead that, just for query comparison, I would have to
    convert to Unicode and compare, and the results would in any case have
    to be returned in the encodings specified by the DICOM standard. I
    could use some suggestions here...

    3. Is there any other way of accomplishing the same?

    Regards,
    Vikram.

  2. Re: A Question on multibyte string comparisons

    1. Never. strcmp works with ASCII character strings only and relies on
    the NUL-termination of strings. A multi-byte character string will
    never be interpreted correctly by the strxxx functions.
    You should use multi-byte comparison routines - I know of _mbsnbcmp on
    Windows - or create your own comparison routines (or comparison
    wrappers that call the appropriate functions given the string
    encoding).

    But after all, strings are just bytes in memory. If you have such
    character-set problems, the best thing you can do is use memcmp to
    perform all your string comparisons.

    In DICOM you usually know the length of the strings you're dealing
    with, so memcmp (which is an ANSI C function) should be the choice.
    The above should address 2 and 3 as well.
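
    A minimal sketch of that approach, assuming both values use the same
    Specific Character Set and their byte lengths are already known from
    the element headers (the helper name is illustrative, not from any
    particular toolkit):

        #include <string.h>

        /* Byte-wise equality test for two attribute values of known length. */
        static int dicom_values_equal(const void *a, size_t a_len,
                                      const void *b, size_t b_len)
        {
            if (a_len != b_len)
                return 0;                    /* different lengths cannot match */
            return memcmp(a, b, a_len) == 0; /* raw bytes, encoding-agnostic   */
        }
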
    ~~Razvan


  3. Re: A Question on multibyte string comparisons

    Hi,
    Thanks for the information on the use of memcmp as against strcmp for
    comparison of multi-byte char strings.

    1. I find that ISO-IR-87 is a 2-byte encoding, i.e. all characters are
    represented by 2 bytes instead of one. In such a case memcmp per se
    would not suffice; I would need a modified version of it that compares
    characters considering 2 bytes per char. Is my understanding correct?

    2. Secondly, what about wildcard matching? E.g. a query comes in for a
    patient name of John*. This could match any number of records, since
    even records with the 8,5 tag set to GB18030 etc. can contain ASCII.
    Also, if the 8,5 tag of the incoming query is set to some encoding,
    then in a comparison like "x?z", how would I know how many bytes to
    skip for the "?"? In ASCII this is one byte, but in that particular
    encoding "?" or "*" could be multibyte.

    Regards,
    Vikram.


  4. Re: A Question on multibyte string comparisons

    Vix wrote:
    > Hi,
    > Thanks for the information on the use of memcmp as against strcmp for
    > comparison of multi-byte char strings.
    >
    > 1. I find that ISO-IR-87 is a 2-byte encoding, i.e. all characters are
    > represented by 2 bytes instead of one. In such a case memcmp per se
    > would not suffice; I would need a modified version of it that compares
    > characters considering 2 bytes per char. Is my understanding correct?


    Why wouldn't memcmp suffice? You pass it two pointers and the number of
    bytes to be checked. It does not care about the particular character
    encoding.

    What you have to determine is the number of bytes per character, as
    used by a particular encoding, prior to the memcmp call, and you'd be
    set. And since you do know what the 0008,0005 tag said about that while
    you parse your IOD, you shouldn't be in too much trouble.
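
    A hypothetical lookup along those lines (simplified on purpose: the
    ISO 2022-based sets can switch widths mid-string via escape sequences
    and GB18030 is variable-width, so treat this purely as an illustration
    of deriving a byte count from the 0008,0005 value):

        #include <string.h>

        /* Map a Specific Character Set defined term to a fixed character
           width. Only two 2-byte sets are handled here; variable-width
           sets (GB18030, ISO_IR 192/UTF-8) would need real decoding. */
        static size_t bytes_per_char(const char *specific_character_set)
        {
            if (specific_character_set == NULL || *specific_character_set == '\0')
                return 1;                                  /* default repertoire */
            if (strcmp(specific_character_set, "ISO 2022 IR 87") == 0)
                return 2;                                  /* JIS X 0208         */
            if (strcmp(specific_character_set, "ISO 2022 IR 159") == 0)
                return 2;                                  /* JIS X 0212         */
            return 1;                                      /* single-byte sets   */
        }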

    Of course, all this string comparison talk refers to character strings
    with the same character encoding. Obviously, you can't compare two
    strings with different character encodings.


    > 2. Secondly, what about wildcard matching? E.g. a query comes in for a
    > patient name of John*. This could match any number of records, since
    > even records with the 8,5 tag set to GB18030 etc. can contain ASCII.
    > Also, if the 8,5 tag of the incoming query is set to some encoding,
    > then in a comparison like "x?z", how would I know how many bytes to
    > skip for the "?"? In ASCII this is one byte, but in that particular
    > encoding "?" or "*" could be multibyte.


    You do know the character encoding by examining the 0008,0005 tag, and
    so you would know the number of bytes to skip. But this is string
    matching, not string comparison, and again, beware of comparing or
    matching strings with different character encodings.
    In such a case, you would _have_ to convert the strings and then do the
    matching, as you suggested in your original post.
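
    A rough sketch of such fixed-width matching, assuming every character
    of the stored value occupies exactly width bytes (true for pure
    ISO-IR 87 text, not for variable-width sets like GB18030) and that the
    "*" and "?" wildcards in the query are single ASCII bytes - both are
    assumptions, and the function name is illustrative:

        #include <stddef.h>
        #include <string.h>

        /* Match pattern against value; '?' skips one character (width bytes),
           '*' matches any run of whole characters. */
        static int wild_match(const char *pat, size_t pat_len,
                              const char *val, size_t val_len, size_t width)
        {
            if (pat_len == 0)
                return val_len == 0;
            if (pat[0] == '*') {
                /* try every suffix of the value, one character at a time */
                for (size_t skip = 0; skip <= val_len; skip += width)
                    if (wild_match(pat + 1, pat_len - 1,
                                   val + skip, val_len - skip, width))
                        return 1;
                return 0;
            }
            if (val_len < width)
                return 0;
            if (pat[0] == '?')                  /* any single character */
                return wild_match(pat + 1, pat_len - 1,
                                  val + width, val_len - width, width);
            /* literal: compare one character's worth of bytes */
            if (pat_len < width || memcmp(pat, val, width) != 0)
                return 0;
            return wild_match(pat + width, pat_len - width,
                              val + width, val_len - width, width);
        }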

    However, if you had two strings with different encodings, you would
    probably want to convert from the encoding with fewer bytes per
    character to the one with more, so that you make only one conversion
    plus one comparison, as opposed to converting both strings to some
    common number of bytes per char.


    These are sticky operations, though, and you want to be careful with
    them. You also have to take into account the transfer syntax used -
    the order of bytes in the IOD - or you'll get a major headache.

    Think about this:
    Compare "John" ASCII with "\0J\0o\0h\0n" in Unicode, Big Endian (MACs
    use that).

    Compare "John" ASCII with "J\0o\0h\0n\0" in Unicode, Little Endian.

    In this case, you want to convert from ASCII to Unicode, then take into
    account the endianness of the IOD, and only after that make the memcmp.
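
    A toy illustration of that last point, widening ASCII to UTF-16 with
    the byte order of the data set before the memcmp (the function name
    and fixed buffer are for illustration only):

        #include <stdio.h>
        #include <string.h>

        /* Widen a NUL-terminated ASCII string to UTF-16, big- or
           little-endian. Returns the number of bytes written. */
        static size_t ascii_to_utf16(const char *ascii, unsigned char *out,
                                     int big_endian)
        {
            size_t n = 0;
            for (; *ascii != '\0'; ++ascii) {
                unsigned char c = (unsigned char)*ascii;
                if (big_endian) { out[n++] = 0x00; out[n++] = c; }
                else            { out[n++] = c;    out[n++] = 0x00; }
            }
            return n;
        }

        int main(void)
        {
            /* "John" as stored in a UTF-16, little-endian data set */
            const unsigned char stored_le[] = { 'J', 0, 'o', 0, 'h', 0, 'n', 0 };
            unsigned char widened[64];
            size_t len = ascii_to_utf16("John", widened, 0 /* little-endian */);

            if (len == sizeof stored_le && memcmp(widened, stored_le, len) == 0)
                puts("match");
            return 0;
        }
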
    HTH,
    ~~Razvan

