A Question on multibyte string comparisons - DICOM
-
A Question on multibyte string comparisons
Hi,
I had a couple of questions on comparing "multibyte character"
strings as may result from encodings like GB18030, ISO-IR-159,
ISO-IR-87 and the like...
1. Would the traditional "strcmp" guarantee that the string
comparison returns correct results? My reasoning was that if a
given encoding is a minimal-length encoding, I could use strcmp
to compare two given strings and then, if they match, check the
escape sequences of the encodings specified by the 8,5 tag to
confirm that the strings are really the same and that strcmp did not
accidentally match an escape sequence of one encoding with a character
in the other. Is this a safe assumption to make?
2. The second option would be to convert from the "traditional" encoding to
Unicode using a library like ICU and then compare. However, this adds
overhead: just for query comparison I would have to convert to
Unicode, and the results would in any case have to be returned
in the encodings specified by the DICOM standard. I could use some
suggestions here...
3. Is there any other way of accomplishing the same?
Regards,
Vikram.
-
Re: A Question on multibyte string comparisons
1. Never. strcmp is meant for single-byte, NUL-terminated strings such
as ASCII: it stops at the first zero byte and knows nothing about
character boundaries or escape sequences. A multi-byte character
string will not be interpreted correctly by the strxxx functions.
You should use multi-byte comparison routines (I know _mbsnbcmp on
Windows) or create your own comparison routines (or comparison
wrappers that call the appropriate functions given the string
encoding).
But after all, strings are just bytes in memory. If you have such
character-set problems, the best thing you can do is to use memcmp
for all your string comparisons.
In DICOM you usually know the length of the strings you're dealing with,
so memcmp (which is an ANSI C function) should be the choice.
The above should address 2 and 3 as well.
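As a rough sketch of the length-aware memcmp comparison described above (the helper name value_bytes_equal is made up for illustration, and it assumes both values use the same encoding and that their byte lengths are already known from the data element):

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper, not part of any DICOM toolkit.
 * Byte-wise equality of two values whose lengths are known.
 * ASSUMPTION: both values are in the same character encoding;
 * otherwise equal bytes do not imply equal text. */
static bool value_bytes_equal(const void *a, size_t a_len,
                              const void *b, size_t b_len)
{
    assert(a != NULL && b != NULL);

    /* Different byte lengths can never be byte-identical. */
    if (a_len != b_len)
        return false;

    /* memcmp does not stop at zero bytes, unlike strcmp. */
    return memcmp(a, b, a_len) == 0;
}
```

Unlike strcmp, this keeps comparing past an embedded zero byte, so "a\0b" and "a\0c" (3 bytes each) are reported as different.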
~~Razvan
-
Re: A Question on multibyte string comparisons
Hi,
Thanks for the information on the use of memcmp as against strcmp for
comparison of multi-byte char strings.
1. I find that ISO-IR-87 is a 2-byte encoding, i.e. every character is
represented by 2 bytes rather than one. In such a case memcmp per se
would not suffice, and I would need a modified version that
compares characters 2 bytes at a time. Is my understanding
correct?
2. Secondly, what about wildcard matching? E.g. a query comes in
for patient name John*. This could match any number of records, since
even records with the 8,5 tag set to GB18030 etc. can contain ASCII. Also,
if the 8,5 tag of the incoming query is set to some encoding, then
in a comparison like "x?z", how would I know how many bytes to skip for
"?"? In ASCII this is one byte, but in that particular encoding
"?" or "*" could be multibyte.
Regards,
Vikram.
-
Re: A Question on multibyte string comparisons
Vix wrote:
> Hi,
> Thanks for the information on the use of memcmp as against strcmp for
> comparison of multi-byte char strings.
>
> 1. I find that ISO-IR-87 is a 2 byte encoding i.e. all characters will
> be represented by 2 bytes as against one. In such a case memcmp per se
> would not suffice and would need a modified version of the same to
> compare characters considering 2 bytes per char. Is my understanding
> correct ?
Why wouldn't memcmp suffice? You pass it two pointers and the number of
bytes to be checked. It does not care about the particular character
encoding.
What you have to determine, prior to the memcmp call, is the number of
bytes per character used by the particular encoding, and you'd be
set. And since you know what the 0008,0005 tag said about that when
you parsed your IOD, you shouldn't be in too much trouble.
Of course, all this string comparison talk refers to character strings
with the same character encoding. Obviously, you can't compare two
strings with different character encodings.
> 2. Secondly, what about wild character matching ? Eg: A query comes in
> for patient name as John*. This could match any number of records since
> even records with 8,5 tag set as GB18030 etc can have ASCII. Also, in
> the incoming query if the 8,5 tag is set to some encoding, then would
> in a comparison like "x?z", how would I know how many bytes to skip for
> "?". In ASCII this is one byte, but maybe in that particular encoding
> "?" or "*" could be multibyte.
You know the character encoding from examining the character encoding
tag (0008,0005), so you know the number of bytes to skip. But this is
string matching, not string comparison, and again, beware of
comparing/matching strings with different character encodings.
In such a case, you would _have_ to convert the strings and
then do the matching, as you suggested in your original post.
However, if you had 2 strings with different encodings, you probably
would want to convert from the fewer-bytes-per-char encoding to the
more-bytes-per-char encoding and make only one conversion + one
comparison, as opposed to converting both strings to the
least-common-multiple number of bytes per char.
These are sticky operations, so you want to be careful with them.
You also have to take into account the transfer syntax used
(the order of bytes in the IOD) or you'd get a major headache.
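Going back to the wildcard question, here is a minimal sketch of "*" and "?" matching that skips whole characters rather than single bytes. It is only an illustration: the names char_len and wild_match are made up, and char_len uses a deliberately simplified rule (a byte below 0x80 is one character, a byte of 0x80 or above starts a 2-byte character). A real implementation would have to replace that rule with one derived from the actual encoding named in 0008,0005, including any escape sequences.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical rule for the byte length of the character at s.
 * ASSUMPTION: bytes < 0x80 are single-byte characters, bytes >= 0x80
 * start a 2-byte character. Real encodings need their own rule. */
static size_t char_len(const unsigned char *s, size_t remaining)
{
    assert(remaining > 0);
    size_t n = (s[0] < 0x80) ? 1 : 2;
    return (n <= remaining) ? n : remaining; /* tolerate a cut-off tail */
}

/* Match str against pat, where '*' matches any run of characters
 * (including none) and '?' matches exactly one character.
 * Both buffers are assumed to be in the same encoding.
 * Simple recursive version; patterns with many '*' can be slow. */
static bool wild_match(const unsigned char *pat, size_t pat_len,
                       const unsigned char *str, size_t str_len)
{
    if (pat_len == 0)
        return str_len == 0;

    if (pat[0] == '*') {
        /* Try the rest of the pattern at every character boundary. */
        size_t skip = 0;
        for (;;) {
            if (wild_match(pat + 1, pat_len - 1,
                           str + skip, str_len - skip))
                return true;
            if (skip == str_len)
                return false;
            skip += char_len(str + skip, str_len - skip);
        }
    }

    if (str_len == 0)
        return false;

    size_t sn = char_len(str, str_len);
    if (pat[0] == '?')   /* skip one whole character, 1 or 2 bytes */
        return wild_match(pat + 1, pat_len - 1, str + sn, str_len - sn);

    /* Literal character: byte length and bytes must both agree. */
    size_t pn = char_len(pat, pat_len);
    if (pn != sn || memcmp(pat, str, pn) != 0)
        return false;
    return wild_match(pat + pn, pat_len - pn, str + sn, str_len - sn);
}
```

With this rule, "x?z" matches a 4-byte value made of "x", one 2-byte character and "z", because "?" consumes both bytes of the middle character.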
Think about this:
Compare "John" in ASCII with "\0J\0o\0h\0n" in Unicode, big-endian (Macs
use that).
Compare "John" in ASCII with "J\0o\0h\0n\0" in Unicode, little-endian.
In this case, you want to convert from ASCII to Unicode, then take into
account the endianness of the IOD, and only after that do the memcmp.
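A small sketch of that convert-then-memcmp idea, assuming the "Unicode" side is 16-bit code units (as in the two examples above) and the other side is plain 7-bit ASCII. The function names and the 128-byte scratch buffer are made up for illustration:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: widen 7-bit ASCII to 16-bit code units in the
 * requested byte order. Returns the number of bytes written, or 0 if
 * the output buffer is too small or the input is not pure ASCII. */
static size_t ascii_to_utf16(const char *in, size_t in_len,
                             unsigned char *out, size_t out_cap,
                             bool big_endian)
{
    assert(in != NULL && out != NULL);
    if (out_cap / 2 < in_len)
        return 0;
    for (size_t i = 0; i < in_len; i++) {
        unsigned char c = (unsigned char)in[i];
        if (c >= 0x80)
            return 0;                 /* not ASCII, refuse to guess */
        out[2 * i]     = big_endian ? 0 : c;
        out[2 * i + 1] = big_endian ? c : 0;
    }
    return in_len * 2;
}

/* Convert the ASCII side once, then compare raw bytes with memcmp.
 * ASSUMPTION: values are short enough for the 128-byte scratch buffer
 * (64 ASCII characters); longer input is reported as not equal. */
static bool ascii_equals_utf16(const char *ascii, size_t ascii_len,
                               const unsigned char *utf16, size_t utf16_len,
                               bool big_endian)
{
    unsigned char buf[128];
    size_t n = ascii_to_utf16(ascii, ascii_len, buf, sizeof buf, big_endian);
    if (n == 0 && ascii_len != 0)
        return false;                 /* conversion failed */
    return n == utf16_len && memcmp(buf, utf16, n) == 0;
}
```

Passing the wrong byte order makes the comparison fail even though the text is the same, which is the headache mentioned above.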
HTH,
~~Razvan