Any portable way get a filename in UTF-8 or to get the FS encoding? - Unix




  1. Any portable way get a filename in UTF-8 or to get the FS encoding?

    Hello,

    I am trying to devise a simple tool in which I read many directory and
    file names (to compare two directories).

    I never wrote code that I would port to different systems, but I would
    not mind doing it now.

    So I downloaded and read SUS v2 and SUS v3 to look at the
    opendir/readdir/closedir functions, but they only return char[] strings
    for file names and say nothing about the encoding of the file names.

    A computer system may mount and/or access many kinds of file systems.
    NTFS, as far as I know, is a Unicode file system (sorry, I do not know
    about ufs or extfs). When mounting FAT file systems one can explicitly
    specify a charset for all the file names.

    I have seen a _wreaddir function in some implementations, but is there a
    portable way to get a file's name in UTF-8, or to get a file name in the
    underlying encoding of its file system and to find out that encoding?

    Are POSIX implementations required to convert the file name returned by
    readdir to the application's execution character set?

    Thank you,
    Timothy Madden,
    Romania

  2. Re: Any portable way get a filename in UTF-8 or to get the FS encoding?

    Timothy Madden wrote:
    > Hello,
    >
    > I am trying to devise a simple tool in which I read many directory and
    > file names (to compare two directories).
    >
    > I never wrote code that I would port to different systems, but I would
    > not mind doing it now.
    >
    > So I downloaded and read SUS v2 and SUS v3 to look at the
    > opendir/readdir/closedir functions, but they only return char[] strings
    > for file names and say nothing about the encoding of the file names.
    >
    > A computer system may mount and/or access many kinds of file systems.
    > NTFS, as far as I know, is a Unicode file system (sorry, I do not know
    > about ufs or extfs). When mounting FAT file systems one can explicitly
    > specify a charset for all the file names.
    >
    > I have seen a _wreaddir function in some implementations, but is there a
    > portable way to get a file's name in UTF-8, or to get a file name in the
    > underlying encoding of its file system and to find out that encoding?
    >
    > Are POSIX implementations required to convert the file name returned by
    > readdir to the application's execution character set?


    A filename is just a NUL terminated string which is completely
    compatible with UTF-8 (and with most other character encodings).

    So if files are created in a UTF-8 locale, the filenames will be encoded
    already in UTF-8. If not, then use iconv (or something like it) to convert.
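    That conversion might look like the following minimal sketch (the
    helper name convert_name and its error handling are illustrative, not
    part of any standard API):

```c
#include <iconv.h>
#include <string.h>

/* Sketch: convert a filename's bytes from one encoding to another with
 * iconv(3).  Returns 0 on success, -1 on failure (e.g. when the input
 * is not a valid byte sequence in the source encoding). */
int convert_name(const char *from, const char *to,
                 const char *in, char *out, size_t outlen)
{
    iconv_t cd = iconv_open(to, from);   /* (to, from) - note the order */
    if (cd == (iconv_t)-1)
        return -1;

    char *inp = (char *)in;              /* iconv(3) takes non-const */
    size_t inleft = strlen(in);
    char *outp = out;
    size_t outleft = outlen - 1;         /* reserve room for the NUL */

    size_t r = iconv(cd, &inp, &inleft, &outp, &outleft);
    iconv_close(cd);
    if (r == (size_t)-1)
        return -1;
    *outp = '\0';
    return 0;
}
```

    For example, convert_name("ISO-8859-1", "UTF-8", name, buf, sizeof buf)
    turns a Latin-1 name into its UTF-8 form, and fails cleanly when the
    input bytes are not valid in the source encoding.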

    Robert



  3. Re: Any portable way get a filename in UTF-8 or to get the FS encoding?

    In article <470931e0$0$90263$14726298@news.sunsite.dk>,
    Timothy Madden wrote:
    >Hello,
    >
    >So I downloaded and read SUS v2 and SUS v3 to look at the
    >opendir/readdir/closedir functions, but they only return char[] strings
    >for file names and say nothing about the encoding of the file names.


    And they shouldn't! Filenames are made of bytes, not characters. 0x2f (the
    directory separator) and 0x00 (the string terminator) are special. The rest
    is just bytes.

    Think about what it would mean for filenames to be bound to a specific
    character set. open(), instead of being a plain syscall, would have to do
    character set translation. Ouch! Or you'd have to do translation in the
    kernel. Double ouch!

    readdir() returns the same bytes that were passed to creat(). You wanna know
    what the bytes mean? Ask the guy who named the file.
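    That round trip is easy to demonstrate (a sketch; the helper name and
    the two-byte file name are arbitrary illustrations):

```c
#include <dirent.h>
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

/* Sketch: create a file whose name is an arbitrary byte sequence, then
 * verify that readdir() hands back exactly the bytes given to creat().
 * Returns 0 if the name round-trips byte for byte, -1 otherwise. */
int name_roundtrip(const char *name)
{
    int fd = creat(name, 0666);
    if (fd < 0)
        return -1;
    close(fd);

    DIR *d = opendir(".");
    if (!d) {
        unlink(name);
        return -1;
    }

    int found = -1;
    struct dirent *e;
    while ((e = readdir(d)) != NULL)
        if (strcmp(e->d_name, name) == 0)
            found = 0;                   /* byte-for-byte identical */
    closedir(d);
    unlink(name);
    return found;
}
```

    Whatever bytes go into creat() come back out of readdir(), with no
    character set translation anywhere in between.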

    >
    >A computer system may mount and/or access many kinds of file systems.
    >NTFS, as far as I know, is a Unicode file system (sorry, I do not know
    >about ufs or extfs). When mounting FAT file systems one can explicitly
    >specify a charset for all the file names.


    When mounting non-unix filesystems, sometimes we emulate the brokenness of
    the creating OS, using ugly hacks like bloating the kernel with character
    translation tables and making open() reject perfectly legitimate filenames
    that contain bytes which would upset the poor, easily confused, non-unix OS.

    When unix is being itself, on its own well-designed filesystems, there's no
    need for such behavior. Latin-1 filenames can sit right next to UTF-8
    filenames and they don't bother each other, because the kernel doesn't care.

    If you perceive some benefit in knowing that all filenames in a directory are
    in a common character set, that can be achieved by agreement between you and
    the other users who put files in that directory. Much better than inserting a
    complicated translation mechanism into the various syscalls that deal with
    filenames.

    --
    Alan Curry
    pacman@world.std.com

  4. Re: Any portable way get a filename in UTF-8 or to get the FS encoding?

    Robert Harris wrote:
    > A filename is just a NUL terminated string which is completely
    > compatible with UTF-8 (and with most other character encodings).
    >
    > So if files are created in a UTF-8 locale, the filenames will be encoded
    > already in UTF-8. If not, then use iconv (or something like it) to convert.
    >


    How would I know if files are created in a UTF-8 locale?

    How would I know if readdir has converted the filename from its encoding
    in the filesystem to the application's execution character set, or has
    converted it to UTF-8, or has returned the filename in its native
    encoding?

    Thank you,
    Timothy Madden,
    Romania

  5. Re: Any portable way get a filename in UTF-8 or to get the FS encoding?

    Alan Curry wrote:
    > In article <470931e0$0$90263$14726298@news.sunsite.dk>,
    > Timothy Madden wrote:
    >> Hello,
    >>
    >> So I downloaded and read SUS v2 and SUS v3 to look at the
    >> opendir/readdir/closedir functions, but they only return char[] strings
    >> for file names and say nothing about the encoding of the file names.

    >
    > And they shouldn't! Filenames are made of bytes, not characters. 0x2f (the
    > directory separator) and 0x00 (the string terminator) are special. The rest
    > is just bytes.
    >


    Are you saying that filenames are binary data?
    Are you sure about that?
    Can I read that somewhere in SUS, or in a man page, or anywhere?

    And how does the OS convert that data to strings? I mean, every
    application that I have ever used, including the OS shell, displays
    filenames as text. How do all applications convert that binary data to
    text when they display file names? Do they just leave printf to use the
    encoding from the current locale on output?


    Thank you,
    Timothy Madden,
    Romania

  6. Re: Any portable way get a filename in UTF-8 or to get the FS encoding?

    In article <4709463A.3030207@gmail.com>,
    Timothy Madden wrote:
    >Alan Curry wrote:
    >> In article <470931e0$0$90263$14726298@news.sunsite.dk>,
    >>
    >> And they shouldn't! Filenames are made of bytes, not characters. 0x2f (the
    >> directory separator) and 0x00 (the string terminator) are special. The rest
    >> is just bytes.
    >>

    >
    >Are you saying that filenames are binary data ?


    Everything in computers is binary data.

    >Are you sure about that ?
    >Can I read that somewhere in SUS or in a man page or anything ?
    >
    >And how does the OS convert that data to strings ? I mean any


    Strings are binary data too. 0x2f is the slash character in ASCII, in case
    you didn't realize that the first time I mentioned it. The reason I called it
    0x2f instead of slash was to help make the point: the kernel understands the
    '/' character to be the directory separator, not because it looks like a
    diagonal line from top-right to bottom-left when printed on your terminal,
    but because it's 0x2f. If you wanted to use an exotic character set that was
    not a superset of ASCII, you could. But character 0x2f would still be the
    directory separator, so you couldn't use it in a filename.

    >application that I ever used, including the OS shell, displays filenames
    >as text. How do all applications convert that binary data to text when


    When the byte sequence 0x66 0x6f 0x6f 0x2f 0x62 0x61 0x72 is sent to your
    terminal, it looks like "foo/bar". There's no converting at all!

    When the byte sequence 0xc4 0xbf is sent to my terminal, it looks like a
    capital A with 2 dots on it followed by an upside-down question mark. That's
    how those bytes are rendered in Latin-1. If I was using a UTF-8 terminal, it
    would look like something else ("LATIN CAPITAL LETTER L WITH MIDDLE DOT" if
    I'm interpreting my Unicode correctly).

    If I now run this little program:

    #include <fcntl.h>
    int main(void) { creat("\xc4\xbf", 0666); return 0; }

    I'll have a file whose name is composed of those 2 bytes. If I run "ls" I'll
    see the A with 2 dots and upside down question mark. If you come along with a
    UTF-8 terminal and run "ls" in the same directory, you'll see that funky
    L-dot thing. Which one is correct? Both!

    The bytes being shown are the same. You can look at the contents of the
    file by typing "cat" followed by the filename on your terminal. (It'll
    probably be easier to cut and paste the weird character than to figure
    out how to type it.)
    I can likewise "cat" the file by pasting the 2 characters that were displayed
    by my "ls". Does it matter that we're not seeing the same graphical
    representation of the filename?

    If that does matter, the only way to fix it is for us to have an agreement on
    what character set is used for filenames. That agreement could be made by the
    person with the UTF-8 terminal to find the other person and yell "Upgrade
    your terminal and stop making those ugly non-UTF-8 filenames, you jerk!"
    while beating him on the head with a rolled-up newspaper. In the future, he
    can expect increased probability that readdir() will return UTF-8 names. This
    is something the OS does not need to know about.

    >they display file names? Do they just leave printf to use the encoding
    >from the current locale on output?


    I don't think printf does any conversions either. It's just a matter of the
    terminal (or graphical text widget) converting a sequence of bytes into a
    sequence of glyphs based on its configured character set. The filesystem
    doesn't know what character set that is. If it's not the same character set
    that was used by the person who named the file in the first place, it won't
    look the same.

    (If you're experimenting, note that "ls" may actually show question marks if
    it thinks your terminal won't recognize a filename as a printable character
    sequence. That's not because of any translation that the OS is doing. It's
    just "ls" trying to be friendly and not mess up your terminal with control
    codes.)

    --
    Alan Curry
    pacman@world.std.com

  7. Re: Any portable way get a filename in UTF-8 or to get the FS encoding?

    Alan Curry wrote:
    > In article <4709463A.3030207@gmail.com>,
    > Timothy Madden wrote:
    >> Alan Curry wrote:
    >>> In article <470931e0$0$90263$14726298@news.sunsite.dk>,
    >>>
    >>> And they shouldn't! Filenames are made of bytes, not characters. 0x2f (the
    >>> directory separator) and 0x00 (the string terminator) are special. The rest
    >>> is just bytes.
    >>>

    >> Are you saying that filenames are binary data ?

    >
    > Everything in computers is binary data.
    >
    >> Are you sure about that ?
    >> Can I read that somewhere in SUS or in a man page or anything ?
    >>
    >> And how does the OS convert that data to strings ? I mean any

    [...]
    >> application that I ever used, including the OS shell, displays filenames
    >> as text. How do all applications convert that binary data to text when

    >
    > When the byte sequence 0x66 0x6f 0x6f 0x2f 0x62 0x61 0x72 is sent to your
    > terminal, it looks like "foo/bar". There's no converting at all!
    >
    > When the byte sequence 0xc4 0xbf is sent to my terminal, it looks like a
    > capital A with 2 dots on it followed by an upside-down question mark. That's
    > how those bytes are rendered in Latin-1. If I was using a UTF-8 terminal, it
    > would look like something else ("LATIN CAPITAL LETTER L WITH MIDDLE DOT" if
    > I'm interpreting my Unicode correctly).
    >
    > If I now run this little program:
    >
    > #include <fcntl.h>
    > int main(void) { creat("\xc4\xbf", 0666); return 0; }
    >
    > I'll have a file whose name is composed of those 2 bytes. If I run "ls" I'll
    > see the A with 2 dots and upside down question mark. If you come along with a
    > UTF-8 terminal and run "ls" in the same directory, you'll see that funky
    > L-dot thing. Which one is correct? Both!


    I think this is a problem with the POSIX/SUS standard, as long as this
    behavior is required by the standard.

    I find it normal to see the same file name no matter what terminal I
    have (as long as it has the glyphs), no matter what computer I use to
    access the file system (as long as it has the proper software), and no
    matter what my current character encoding is on my system. Would you
    not like that?

    The current encoding on my computer is Latin-2 (for the Romanian
    language), yet my computer can display text encoded in Latin-1, UTF-8,
    UTF-16 and other encodings. So if I read messages in a newsgroup and I
    see a message written by a person from Japan, encoded in UTF-8, I can
    still see the same text the person wrote, even though UTF-8 is
    different from Latin-2. I just need to know the message is encoded in
    UTF-8, no matter what my current encoding is.

    The same thing should happen with filenames. I think filenames are text
    just as much as a word from that message is text. It is just the POSIX
    standard that thinks otherwise, and that should be fixed.

    Only some byte sequences encode characters in UTF-8; others are, for
    example, reserved for future code points in Unicode. This shows I could
    have POSIX filenames that cannot even be sent to that "UTF-8 terminal"
    you were talking about. Would you not like POSIX to fix the situation?

    Not to mention that various implementations already offer non-standard
    functions for reading and writing file names in UTF-16 (like _wopen and
    _wreaddir), which convert the names from some encoding to UTF-16 (if
    only by extending each char to wchar_t with leading zero bits, though I
    think they use something like mbtowc). Anyway, these non-standard
    functions show that various implementations treat filenames as text,
    unless you think _wreaddir for a file named AB returns 66*256 + 65 (the
    multi-byte character 'AB') instead of L"AB".

    Thank you,
    Timothy Madden,
    Romania

  8. Re: Any portable way get a filename in UTF-8 or to get the FS encoding?

    In article <4709820E.80206@gmail.com>,
    Timothy Madden wrote:
    >Alan Curry wrote:
    >>
    >> I'll have a file whose name is composed of those 2 bytes. If I run "ls" I'll
    >> see the A with 2 dots and upside down question mark. If you come along with a
    >> UTF-8 terminal and run "ls" in the same directory, you'll see that funky
    >> L-dot thing. Which one is correct? Both!

    >
    >I think this is a problem with the POSIX/SUS standard, as long as this
    >behavior is required by the standard.


    I can't see where it's required. It's reality though.

    >
    >I find it normal to see the same file name no matter what terminal I
    >have (as long as it has the glyphs), no matter what computer I use to
    >access the file system, as long as it has the proper software, no matter
    >what my current character encoding is on my system. Would you not like
    >that ?


    No, I actually prefer not to use a computer that displays characters I can't
    read.

    >
    >The current encoding on my computer is Latin-2 (for Romanian language),


    To pick an example from Latin-2, you may be happy to be able to create
    filenames containing an "OGONEK" (whatever that is) and see it displayed
    correctly - and there's no reason you shouldn't - but if I come across
    that file it'll be more useful to me to have that character displayed as
    \262. Unicode is full of characters I can't identify, can't reproduce
    from the keyboard, and in some cases can't even distinguish from each
    other. Displaying them to me would only be irritating (or dangerous, if
    a pair of identical-looking characters confuses me into rm'ing the
    wrong file).

    >still my computer can display text encoded in Latin-1, UTF-8, UTF-16 and
    > other encodings. So if I read messages in a newsgroup and I see a
    >message written by a person from japan, encoded in UTF-8, I can still
    >see the same text the person wrote, even if UTF-8 is different from
    >Latin-2. I just need to know the message is encoded in UTF-8, no matter
    >what my current encoding is.


    Yes, newsgroup messages have headers, in which you can find the character
    encoding. Filesystems don't.

    >
    >The same thing should happen with filenames. I think filenames are text
    >just as much as a word from that message is text. It is just the POSIX
    >standard that thinks otherwise, and that should be fixed.
    >
    >Only some byte sequences can encode characters in UTF-8. Others are for
    >example reserved for future code points in UNICODE. This shows I could
    >have POSIX filenames that can not even be sent to that "UTF-8 terminal"
    >you were talking about. Would you not like POSIX to fix the situation ?


    Absolutely not. Fixing that situation would mean that non-UTF-8-legal byte
    sequences become banned from filenames. open() acquires a new mode of failure
    that it didn't have before. The simple rule of "All these bytes are yours
    except 0x2f. Attempt no landing there" gets replaced with a complicated
    system in which the validity of a byte depends on what came before it.

    The way things are now, I can use whatever character set I like and you can
    use whatever character set you like. You want to impose a single character
    set on everybody. That's not nice.

    You're free to assume that all filenames on unix are encoded in UTF-8. That
    seems to be the consensus that's being built by all the Unicode advocates
    with their rolled-up newspapers. In fact I'm pretty sure that UTF-8 has been
    made the official character set of the Linux ext2fs filesystem.

    I can't find the announcement right now, but the cool thing about that change
    (and it was a change, because Latin-1 was far more likely to be used in the
    early days) is that it involved 0 bytes of new code. It is purely a social
    guideline with no software enforcement. Existing filesystems populated with
    Latin-2 and KOI-8 and SHIFT-JIS filenames didn't suddenly stop working. They
    became "incorrect" in some unimportant theoretical sense, but they still work
    fine because the kernel and libc - and pretty much everything that isn't in
    charge of displaying characters on screen or converting keypresses to
    characters - treats a filename as an opaque sequence of bytes.

    --
    Alan Curry
    pacman@world.std.com

  9. Re: Any portable way get a filename in UTF-8 or to get the FS encoding?

    Alan Curry wrote:
    > In article <4709820E.80206@gmail.com>,
    > Timothy Madden wrote:

    [...]
    >> Only some byte sequences can encode characters in UTF-8. Others are for
    >> example reserved for future code points in UNICODE. This shows I could
    >> have POSIX filenames that can not even be sent to that "UTF-8 terminal"
    >> you were talking about. Would you not like POSIX to fix the situation ?

    >
    > Absolutely not. Fixing that situation would mean that non-UTF-8-legal byte
    > sequences become banned from filenames. open() acquires a new mode of failure
    > that it didn't have before. The simple rule of "All these bytes are yours
    > except 0x2f. Attempt no landing there" gets replaced with a complicated
    > system in which the validity of a byte depends on what came before it.
    >
    > The way things are now, I can use whatever character set I like and you can
    > use whatever character set you like. You want to impose a single character
    > set on everybody. That's not nice.


    Everyone is free to use their character set. I just want a way to know
    that character set, so I can see the names the same way as you.

    What is bad about filesystems or filenames having a charset property?

    Old apps would then be free to ignore it, but new apps would know better.

    New apps might even choose to let the system transcode filenames on the
    fly if they do not want to deal with all the charset hassle.

    You said it yourself: you are pretty sure ext2fs adopted UTF-8 as the
    charset for filenames, and they did it without breaking compatibility
    for existing apps. I just want the same thing in the POSIX standard, for
    whatever charset is appropriate on a given implementation.

    Timothy Madden,
    Romania

  10. Re: Any portable way get a filename in UTF-8 or to get the FS encoding?

    On Sun, 07 Oct 2007 22:22:12 +0300, Timothy Madden wrote:

    > I have seen a _wreaddir function in some implementations, but is there a
    > portable way to get a file's name in UTF-8, or to get a file name in the
    > underlying encoding of its file system and to find out that encoding?
    >
    > Are POSIX implementations required to convert the file name returned by
    > readdir to the application's execution character set?


    The encoding used for file names on any given file system is never
    specified in a POSIX system, and a user is free to create file names
    using several different encodings even on the same file system. (I
    actually have such a file system myself, where most file names are
    encoded in UTF-8 but the file names in one directory are encoded in
    ISO-8859-1.)

    A process that wants to interpret the bytes that make up a file name
    must look at its environment for hints about which encoding the user
    wants those file names to be interpreted in (e.g. the LC_* environment
    variables). You can use the mbstowcs() library function to convert a
    string into a wide character string according to the encoding specified
    by the current environment.
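    A minimal sketch of that approach (decode_name is an illustrative
    helper, not a standard function):

```c
#include <locale.h>
#include <stdlib.h>

/* Sketch: interpret a filename's bytes in the encoding named by the
 * user's environment (LC_ALL / LC_CTYPE / LANG) by decoding them into
 * wide characters.  Returns the number of wide characters written, or
 * (size_t)-1 when the bytes are not valid in that locale's encoding. */
size_t decode_name(const char *name, wchar_t *out, size_t outlen)
{
    setlocale(LC_CTYPE, "");   /* adopt the locale from the environment */
    return mbstowcs(out, name, outlen);
}
```

    For names that are not valid in the current locale's encoding,
    mbstowcs() reports failure rather than guessing.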

    Cheers // Fredrik Roubert

    --
    Dyre Halses gate 10 | +47 73568556 / +47 41266295
    NO-7042 Trondheim | http://www.df.lth.se/~roubert/

  11. Re: Any portable way get a filename in UTF-8 or to get the FS encoding?

    Fredrik Roubert wrote:
    > On Sun, 07 Oct 2007 22:22:12 +0300, Timothy Madden wrote:

    [...]
    > A process that wants to interpret the bytes that make up a file name
    > must look at its environment for hints about which encoding the user
    > wants those file names to be interpreted in (e.g. the LC_* environment
    > variables). You can use the mbstowcs() library function to convert a
    > string into a wide character string according to the encoding specified
    > by the current environment.


    How about files from a remote file system? Then I am out of luck!

    At work I connect through a VPN to my client's LAN. They use Latin-1,
    I use Latin-2.

    How can I tell that programmatically and portably? My app has to work
    with files from both machines.

    I would like a standard way to get that encoding, and the file system
    should be the first to know about it.

    I guess I will just have to rely on the user passing the encoding for
    files whose names I process on the command line, or else assume the LC_*
    default.

    This is not always possible; for example, when simply browsing the FS
    (like the GUI shell does), you cannot ask the user for the encoding of
    files before browsing ...

    I would like POSIX to fix this problem.

    P.S. Fortunately my client and I have only used 7-bit ASCII characters
    in file names until now.

    Thank you,
    Timothy Madden,
    Romania

  12. Re: Any portable way get a filename in UTF-8 or to get the FS encoding?

    Timothy Madden wrote:
    > Fredrik Roubert wrote:
    >> [...]
    >
    > How about files from a remote file system? Then I am out of luck!
    >
    > [...]


    The only portable solution is to use UNICODE everywhere!

    Robert

  13. Re: Any portable way get a filename in UTF-8 or to get the FS encoding?

    Robert Harris wrote:
    > Timothy Madden wrote:
    >> Fredrik Roubert wrote:
    >>> On Sun, 07 Oct 2007 22:22:12 +0300, Timothy Madden wrote:

    >>> [...]
    >>
    >> How about files from a remote file system? Then I am out of luck!
    >>
    [...]
    >
    > The only portable solution is to use UNICODE everywhere!
    >
    > Robert


    Yes, well, my app only reads directories (to compare them), so even if I
    use Unicode, I still need the encoding that others used when they
    created the directories and files. I want my tool to work with all the
    files, everywhere. That is what portability is about. Unfortunately,
    POSIX only gives me a binary char[] array for the file name.

    Timothy Madden,
    Romania.

  14. Re: Any portable way get a filename in UTF-8 or to get the FS encoding?

    In article <4709F14D.3010301@gmail.com>,
    Timothy Madden wrote:
    >
    >Everyone is free to use their character set. I just want a way to know
    >that character set, so I can see the names the same way as you.
    >
    >What is bad about filesystems or filenames having a charset property ?


    Don't underestimate the inertia effect. All of the system interfaces that use
    filenames (open/creat, readdir, link, unlink, symlink, mkdir, mknod, etc.)
    have been around for a long time. They were "opaque char *" back when it was
    obvious that all strings were ASCII, and they haven't changed much since.

    Replacing them all with a new set of syscalls that associates a charset tag
    with each name would be a major effort. And who's going to bother, now that
    we're approaching a time when it will be obvious that all strings are UTF-8?

    You keep saying POSIX should "fix" this. Well, that's not how it works.
    Successful standards are the ones that codify existing practice. You need to
    show at least one working implementation of whatever interface you'd like to
    standardize. Otherwise you're just using the standard as a club to beat
    people with, and that doesn't make them eager to implement your idea for you.

    One implementation that could be easily done would be to add a "charset"
    mount option, and make the mount syscall ignore it. Anyone who's interested
    could look at the mount options with getmntent(). Add your own opendir()
    wrapper and you've got something. Of course it's only a per-mount tag, not
    per-directory or per-file, and there's nothing to prevent users from creating
    files with the "wrong" kind of names. But at least the implementation
    overhead is fairly low.
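    A sketch of that wrapper's plumbing; note that the "charset" mount
    option is purely hypothetical here - nothing sets it on a stock
    system - so this only shows how getmntent() and hasmntopt() would be
    used:

```c
#include <mntent.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical sketch: scan the mount table for a "charset=..." option
 * on the filesystem mounted at `mountpoint`.  No stock system sets such
 * an option for native filesystems; this only shows the plumbing an
 * opendir() wrapper could use.  Returns 1 and copies the value into
 * `buf` if found, 0 otherwise. */
int fs_charset(const char *mountpoint, char *buf, size_t buflen)
{
    FILE *m = setmntent("/proc/mounts", "r");
    if (!m)
        return 0;

    int found = 0;
    struct mntent *e;
    while ((e = getmntent(m)) != NULL) {
        if (strcmp(e->mnt_dir, mountpoint) != 0)
            continue;
        char *opt = hasmntopt(e, "charset");       /* e.g. "charset=utf8" */
        char *eq = opt ? strchr(opt, '=') : NULL;
        if (eq) {
            strncpy(buf, eq + 1, buflen - 1);
            buf[buflen - 1] = '\0';
            char *comma = strchr(buf, ',');        /* trim trailing options */
            if (comma)
                *comma = '\0';
            found = 1;
        }
    }
    endmntent(m);
    return found;
}
```

    Since the option is ignored by the mount syscall itself, the tag is
    purely advisory, exactly as described above.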

    --
    Alan Curry
    pacman@world.std.com

  15. Re: Any portable way get a filename in UTF-8 or to get the FS encoding?

    Alan Curry wrote:
    > In article <4709F14D.3010301@gmail.com>,
    > Timothy Madden wrote:
    >> Everyone is free to use their character set. I just want a way to know
    >> that character set, so I can see the names the same way as you.
    >>
    >> What is bad about filesystems or filenames having a charset property ?

    >
    > Don't underestimate the inertia effect. [...]
    >
    > One implementation that could be easily done would be to add a "charset"
    > mount option, and make the mount syscall ignore it. Anyone who's interested
    > could look at the mount options with getmntent(). [...]


    I know standards are meant for everyone, big or small. But standards
    should also offer directions for future development.

    I was thinking about a 'charset' option for mkfs, with mkfs taking the
    default from the LC_* variables, or some hard-coded value. I would
    allow per-directory charsets in the interface, just for future
    enhancements, but would implement the charset for the entire fs only.
    And mount would look for the FS-specific charset first, and if not
    present would take it from the mount options. Then all open/creat/...
    functions would ignore it, for compatibility with old applications,
    until the application makes some special syscall or uses some new flag
    for open; then the new feature gets activated, and the system
    transcodes filenames between the LC_* encoding and the FS encoding on
    the fly. Also wopen and wreaddir would know this charset and convert
    from it to UTF-16.

    Anyway, I give up. It is all a mess and it is not in my power to fix
    it. Not because of compatibility or technical reasons, but because
    people do not care. If I get negative feedback on the newsgroup, I
    think I would get even worse feedback from the POSIX group, even if
    someone could devise such a new feature and still keep compatibility
    with POSIX.3.

    Thank you for bearing with me up until now anyway.
    Timothy Madden,
    Romania

  16. Re: Any portable way get a filename in UTF-8 or to get the FS encoding ?

    Timothy Madden wrote:
    > Alan Curry wrote:
    > > In article <4709820E.80206@gmail.com>,
    > > Timothy Madden wrote:

    > [...]
    > >> Only some byte sequences can encode characters in UTF-8. Others are for
    > >> example reserved for future code points in UNICODE. This shows I could
    > >> have POSIX filenames that can not even be sent to that "UTF-8 terminal"
    > >> you were talking about. Would you not like POSIX to fix the situation ?

    > >
    > > Absolutely not. Fixing that situation would mean that non-UTF-8-legal byte
    > > sequences become banned from filenames. open() acquires a new mode of failure
    > > that it didn't have before. The simple rule of "All these bytes are yours
    > > except 0x2f. Attempt no landing there" gets replaced with a complicated
    > > system in which the validity of a byte depends on what came before it.
    > >
    > > The way things are now, I can use whatever character set I like and you can
    > > use whatever character set you like. You want to impose a single character
    > > set on everybody. That's not nice.


    > Everyone is free to use their character set. I just want a way to know
    > that character set, so I can see the names the same way as you.


    What if the "character set" is actually a special binary 3D object
    description for use in some new visualization application, and your
    terminal or file manager doesn't have a prayer of supporting it in
    your lifetime? (Also, who says file names are meant to be read by
    humans? A filesystem is a database like any other, unless you cripple
    it with provincial features.)

    The fact that NTFS, Win32, and Java adopted UTF-16 as their "character
    set" encoding, and the subsequent issues that arose when Unicode
    evolved and trashed almost all the presumed benefits of conflating
    textual data (such as many filesystem object identifiers, i.e. file
    names) with its textual representation, shows how misguided this
    notion is.

    Just basic engineering and historical sense suggests that the kernel and
    low-level system libraries should keep arms-length from these things.

    Notwithstanding the unfortunate limbo that various locales have found
    themselves in because of half-measures like ISO-8859 and ISO-2022, the
    entire software industry is converging on UTF-8 (not least because of
    its disposition with regard to ASCII and 8-bit opaque data).
    Unsurprisingly, Unix dodged another bullet by keeping its nose out of
    application developers' faces, and letting them fight it out in due
    course.

    Disregarding some minor anachronisms, Unix has treated file names and file
    content as opaque bytes. As well it should continue to do so. It might not
    be the best way, but all the other bright ideas have inevitably crashed and
    burned. These arguments you make can find no currency with people who have
    watched the industry evolve. They're short-sighted, and don't even
    satisfactorily solve the problems at hand.

  17. Re: Any portable way get a filename in UTF-8 or to get the FS encoding ?

    Timothy Madden wrote:

    > I know standards are meant for everyone, big or small. But standards
    > should also offer directions for future development.


    Exactly. And there's no better standard for future development than
    treating file names as opaque bytes. That is exactly what people have
    been trying to explain.

    Adding an external character set identifier--let's call it "meta
    data"--does not fit with the sensibilities of many or even most
    developers. Not that they think such meta data isn't useful, but that
    they'd prefer the meta data to be in the actual file data (or in some
    file data, not necessarily the same file, or applied as a matter of
    policy). Why? Because the needs and modes of these things are
    constantly evolving (the concept of a character set is not immune to
    this process). The filesystem provides a very primitive interface.
    And time has taught that it's more flexible and economical to keep the
    interface primitive and give developers the freedom to build on top of
    it, rather than forcing them to deal with excess baggage which they
    may or may not make use of.

    It makes your life harder, for sure, but it makes life easier for many more,
    now and in the future (perhaps yourself). This is an area of software
    architecture where people are justifiably conservative. Maybe because they're
    all stupid, or maybe because it's not a terribly bad idea and nothing better
    has come along.


  18. Re: Any portable way get a filename in UTF-8 or to get the FS encoding?

    Timothy Madden wrote:
    > How can I tell that programmatically and portably ? My app has to work
    > with files from both machines.
    >
    > I would like a standard way to get that encoding, and the file system
    > should be the first to know about it.


    If the filesystem has an encoding set for it, how do you expect multiuser
    systems to work? Or do you want to make it impossible to support different
    encodings for different users? If user 'bob' wants to use US-ASCII in
    /home/bob and user 'andre' wants to use Italian in /home/andre, and both
    /home/bob and /home/andre are on the same filesystem, your system where
    the filesystem knows (and enforces) the proper encoding for everybody
    would make this impossible.

    Essentially, putting the encoding into the filesystem makes it have a
    global scope. It makes encoding into a global variable that only root
    can set. Is that what you really want? Is it really cleaner? I would
    argue that it's much less flexible for the user this way.

    Note that this is certainly not a hypothetical situation. I have myself
    been system administrator at a site where many users on the same system
    had different native languages and preferred to use a different encoding
    from each other. They should be allowed to choose the encoding for their
    filenames as well.

    - Logan

  19. Re: Any portable way get a filename in UTF-8 or to get the FS encoding ?

    On Oct 8, 4:15 am, Timothy Madden wrote:
    > Fredrik Roubert wrote:
    > > On Sun, 07 Oct 2007 22:22:12 +0300, Timothy Madden wrote:

    > [...]
    > > A process that wants to interpret the bytes that make up a file name
    > > must look at its environment for hints about which encoding the user
    > > wants those file names to be interpreted as (eg. the LC_* environment
    > > variables). You can use the mbstowcs() library function to automatically
    > > convert a string into a wide character string according to the encoding
    > > specified by the current environment.

    >
    > How about files from a remote file system ? Then I am out of luck !
    >
    > I usually connect through VPN, at work, to my client's LAN. They use
    > Latin-1, I use Latin-2.
    >
    > How can I tell that programmatically and portably ? My app has to work
    > with files from both machines.
    >
    > I would like a standard way to get that encoding, and the file system
    > should be the first to know about it.
    >
    > I guess I will just have to rely on the user passing the encoding for
    > files whose names I process on the command line, or else assume the LC_*
    > default.


    You could adopt a convention where the encoding is contained in the
    filename itself. There's a scheme like this for email subject lines.
    For example, I have a piece of spam in my inbox with a subject of
    =?ISO-2022-JP?B?GyRCMnEwd0ApNVUxZyU1JSQbKEI=?= which I presume a
    smart enough mail client would display as Japanese text. (Mine
    doesn't, but I don't care because it's spam and I can't read Japanese
    anyway.)


  20. Any portable way get a filename in UTF-8 or to get the FS encoding ?

    TM> So if I read messages in a newsgroup and I see a message
    TM> written by a person from japan, encoded in UTF-8, I can still
    TM> see the same text the person wrote, even if UTF-8 is different
    TM> from Latin-2. I just need to know the message is encoded in
    TM> UTF-8, no matter what my current encoding is. The same
    TM> thing should happen with filenames.

    AC> Yes, newsgroup messages have headers, in which you can
    AC> find the character encoding. Filesystems don't.

    Wrong. The filesystem formats that two messages ago you characterized
    as "well-designed" may not support such metadata. But the filesystem
    formats that you associated with "brokenness", ugliness, and easy
    confusion, most certainly do. HPFS, for example, has a code page
    (index) field in its data structures for directory entries,
    immediately preceding the name field.

    AC> open() acquires a new mode of failure that it didn't have before.
    AC> The simple rule of "All these bytes are yours except 0x2f.
    AC> Attempt no landing there" gets replaced with a complicated
    AC> system in which the validity of a byte depends on what
    AC> came before it.

    ... or it gets replaced with the equally simple rule of "All these
    codepoints are yours except for U+0000 and U+002F," and a syscall
    interface that uses UTF16.

    Of course, it is false that the rule actually _is_ as simple in
    practice as "All these bytes are yours except 0x2f." in the first
    place. The "complicated system in which the validity of a byte
    depends on what came before it" already exists and is what is actually
    enforced right now, because the on-disc data structures for many
    filesystem formats don't use octets for storing filenames. NTFS, HFS
    +, and FAT all use UTF16, for example. Thus an operating system
    kernel that uses octet strings in its system call interface _already_
    has to impose multi-byte encoding rules on those strings, because they
    have to convert cleanly to UTF16 in order to be valid filenames.

    These rules have nothing to do with "brokenness", "ugliness", or
    "confusion". Those filesystem formats pretty much (glossing over
    issues such as decomposition) have the UTF16 equivalent of the simple
    rule mentioned above when it comes to the on-disc data structures, and
    as a result when employed by operating systems that have a UTF16
    native system API have the very same elegance that you are discussing
    for the 8-bit world. Blaming this on the "poor non-Unix operating
    systems" is to not understand the actual issue at all. The issue that
    mandates these rules has nothing whatsoever to do with operating
    systems not being Unix, and everything to do with the mechanics of
    converting between 8-bit character strings and 16-bit character
    strings. One faces the stark choice between having 16-bit character
    strings that cannot be represented as 8-bit character strings, i.e. an
    8-bit system where some of the on-disc filenames created by 16-bit
    systems are inaccessible; and having 8-bit character strings that have
    no mapping to 16-bit character strings, i.e. an 8-bit system where
    some 8-bit filenames are invalid because the multi-byte encoding is
    incorrect.

