Any portable way get a filename in UTF-8 or to get the FS encoding? - Unix

Thread: Any portable way get a filename in UTF-8 or to get the FS encoding?

  1. Re: Any portable way get a filename in UTF-8 or to get the FS encoding?

    William Ahern wrote:
    > Timothy Madden wrote:
    >> Alan Curry wrote:
    >>> In article <4709820E.80206@gmail.com>,
    >>> Timothy Madden wrote:

    >> [...]
    >>>> Only some byte sequences can encode characters in UTF-8. Others are for
    >>>> example reserved for future code points in UNICODE. This shows I could
    >>>> have POSIX filenames that can not even be sent to that "UTF-8 terminal"
    >>>> you were talking about. Would you not like POSIX to fix the situation ?
    >>> Absolutely not. Fixing that situation would mean that non-UTF-8-legal byte
    >>> sequences become banned from filenames. open() acquires a new mode of failure
    >>> that it didn't have before. The simple rule of "All these bytes are yours
    >>> except 0x2f. Attempt no landing there" gets replaced with a complicated
    >>> system in which the validity of a byte depends on what came before it.
    >>>
    >>> The way things are now, I can use whatever character set I like and you can
    >>> use whatever character set you like. You want to impose a single character
    >>> set on everybody. That's not nice.

    >
    >> Everyone is free to use their character set. I just want a way to know
    >> that character set, so I can see the names the same way as you.

    >
    > What if the "character set" is actually a special binary 3D object
    > description for use in some new visualization application? And your terminal
    > or file manager doesn't have a prayer of supporting it in your life time?
    > (Also, who says file names are meant to be read by humans? A filesystem is a
    > database like any other, unless you cripple it with provincial features.)
    >

    [...]
    >
    > Disregarding some minor anachronisms, Unix has treated file names and file
    > content as opaque bytes. As well it should continue to do so. It might not
    > be the best way, but all the other bright ideas have inevitably crashed and
    > burned. These arguments you make can find no currency with people who have
    > watched the industry evolve. They're short-sighted, and don't even
    > satisfactorily solve the problems at hand.


    What do you mean "all the other bright ideas have crashed and burned" ?
    They are still living, _wreaddir functions are perfectly working, and
    WinNT+ is all UNICODE (ANSI functions are wrappers around the UNICODE ones).

    It is POSIX who keeps doing things the old way. But even standards
    evolve, so all they need now is finding a standard, interoperable way to
    somehow include the charset in the filesystem interfaces.

    And since cd, ls and cat are user commands, file names are clearly meant
    to be read and written by humans. Even though the file system is a
    database like any other.

    It is ok if file names have been created on systems with special
    encodings and can only be displayed there, this is happening all the
    time, but at least now I would have a way to know about it, instead of
    seeing a different file name and taking it at face value, when it most
    likely looks like "garbage", as some put it.

    Timothy Madden,
    Romania
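The two rules being contrasted in the quoted exchange can be put side by side. This is only a rough sketch in Python: the helper names `is_valid_utf8` and `is_legal_posix_component` are made up for illustration, and the second check assumes the usual rule that a filename component may hold any byte except `/` (0x2f) and NUL.

```python
def is_valid_utf8(name: bytes) -> bool:
    """Return True if the raw filename bytes decode cleanly as UTF-8."""
    try:
        name.decode("utf-8")  # strict decoding rejects malformed sequences
    except UnicodeDecodeError:
        return False
    return True


def is_legal_posix_component(name: bytes) -> bool:
    """Check the byte-level rule only: non-empty, no '/' and no NUL."""
    return len(name) > 0 and b"/" not in name and b"\x00" not in name


# The same visible name stored under two different encodings.
latin1_name = "café".encode("latin-1")  # b'caf\xe9'
utf8_name = "café".encode("utf-8")      # b'caf\xc3\xa9'

for raw in (latin1_name, utf8_name):
    print(raw, is_legal_posix_component(raw), is_valid_utf8(raw))
```

Both names pass the byte-level rule, but only the second would survive a "UTF-8 only" rule, which is the new failure mode Alan Curry objects to.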

  2. Re: Any portable way get a filename in UTF-8 or to get the FS encoding?

    William Ahern wrote:
    > Timothy Madden wrote:
    >
    >> I know standards are meant for everyone, big or small. But standards
    >> should also offer directions for future development.

    >
    > Exactly. And there's no better standard for future development than treating
    > the file names as opaque bytes. That is exactly what people have been
    > trying to explain.
    >
    > Adding an external character set identifier--let's call it "meta data"--does
    > not fit with the sensibilities of many or even most developers. Not that
    > they think such meta data isn't useful, but that they'd prefer the
    > meta data to be in the actual file data (or in some file data, not
    > necessarily in the same file, or executed as a matter of policy). Why?
    > Because the needs and modes of these things are constantly evolving (the
    > concept of character set is not immune to this process). The filesystem
    > provides a very primitive interface. And time has taught that it's more
    > flexible and economical to keep the interface primitive and allow
    > developers more freedom to build on top of this, rather than forcing them
    > to deal with excess baggage which they may or may not make use of.
    >
    > It makes your life harder, for sure, but it makes life easier for many more,
    > now and in the future (perhaps yourself). This is an area of software
    > architecture where people are justifiably conservative. Maybe because they're
    > all stupid, or maybe because it's not a terribly bad idea and nothing better
    > has come along.
    >


    I only want an optional feature. I am also conservative and I value
    compatibility before new features.

    Any applications, including the existing ones, can ignore any charset
    values and work as before. But I want the option of letting the system
    transcode filenames for me, or just let me know the charset and then I
    will deal with it.

    I know the best solutions are often the simple ones. But sometimes you
    have to work if you want to make things right.

    Thank you,
    Timothy Madden,
    Romania

  3. Re: Any portable way get a filename in UTF-8 or to get the FS encoding?

    Timothy Madden wrote:
    ....
    > Are POSIX implementations required to convert the file name returned by
    > readdir to the application's execution character set ?


    A reasonable convention to use (hard to enforce) is that all file names
    be stored in a normalized utf-8. This is similar to the Windows
    solution of storing all file names in utf-16. The question is what to
    do when a process's character set cannot represent a name converted
    from utf-8. There are two solutions - keep file names in utf-8 and
    display them in utf-8, or convert the entire application to use utf-8.

    The third solution is to only use a subset of utf-8 - ascii, for file names.
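To make "normalized utf-8" concrete, here is a small sketch. The post does not say which normalization form is meant, so NFC is an assumption here, and `normalized_utf8` is a hypothetical helper name.

```python
import unicodedata


def normalized_utf8(name: str, form: str = "NFC") -> bytes:
    """Normalize a name to one Unicode form, then encode it as UTF-8."""
    return unicodedata.normalize(form, name).encode("utf-8")


precomposed = "caf\u00e9"   # 'é' as a single code point
decomposed = "cafe\u0301"   # 'e' followed by a combining acute accent

# Without normalization the two spellings are different byte strings,
# so they would be two different file names.
print(precomposed.encode("utf-8") == decomposed.encode("utf-8"))

# After normalization they map to the same bytes.
print(normalized_utf8(precomposed) == normalized_utf8(decomposed))
```

Names restricted to ASCII come out byte-for-byte unchanged, which is why the ASCII-only option sidesteps the whole question.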

  4. Re: Any portable way get a filename in UTF-8 or to get the FS encoding?

    Logan Shaw wrote:
    > Timothy Madden wrote:
    >> How can I tell that programmatically and portably ? My app has to work
    >> with files from both machines.
    >>
    >> I would like a standard way to get that encoding, and the file system
    >> should be the first to know about it.

    >
    > If the filesystem has an encoding set for it, how do you expect multiuser
    > systems to work?


    It is easy. Use UTF-8 or UTF-16 as the file system encoding and let the
    system re-encode names between UTF and the user's current LC_* encoding
    on the fly.

    So every user effectively sees the file system in its own LC_* encoding.

    Even more, if the user has two apps, one that only knows SHIFT_JIS and
    one that only knows ANSI, the user just needs to arrange that current
    locale for the first app is SHIFT_JIS, and the current locale for the
    second app is ANSI. And suddenly the same filesystem appears all in
    SHIFT_JIS to app1, and all in ANSI to app2. Even at the same time.

    Since not all SHIFT_JIS names can be re-encoded to ANSI without question
    mark characters it is still better for the user to use unicode
    applications. Which is true anyway, with or without filesystem/filename
    charsets.

    Timothy Madden,
    Romania
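A rough sketch of the per-process view described above, assuming names are stored as UTF-8. The function name `view_name` is invented for illustration, and "ANSI" is taken to mean Windows-1252 (`cp1252`), which is only one possible reading of that term.

```python
def view_name(stored_utf8: bytes, process_encoding: str) -> bytes:
    """Re-encode a name stored as UTF-8 into the encoding a process expects.

    Characters the target encoding cannot represent become '?'.
    """
    text = stored_utf8.decode("utf-8")
    return text.encode(process_encoding, errors="replace")


stored = "日本語.txt".encode("utf-8")

as_sjis = view_name(stored, "shift_jis")   # what the SHIFT_JIS app would see
as_cp1252 = view_name(stored, "cp1252")    # what the "ANSI" app would see

print(as_sjis)
print(as_cp1252)
```

The second view is lossy: every name made of three unrepresentable characters plus `.txt` collapses to the same bytes, so an app working in that encoding cannot tell such files apart or open them by the name it was shown.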

  5. Re: Any portable way get a filename in UTF-8 or to get the FS encoding?

    fjblurt@yahoo.com wrote:
    > On Oct 8, 4:15 am, Timothy Madden wrote:
    >> Fredrik Roubert wrote:
    >>> On Sun, 07 Oct 2007 22:22:12 +0300, Timothy Madden wrote:

    >> [...]
    >>> A process that wants to interpret the bytes that makes up a file name
    >>> must look at its environment for hints about which encoding the user
    >>> wants those file names to be interpreted as (eg. the LC_* environment
    >>> variables). You can use the mbstowcs() library function to automatically
    >>> convert a string into a wide character string according to the encoding
    >>> specified by the current environment.

    >> How about files from a remote file system ? Then I am out of luck !
    >>

    [...]
    >
    > You could adopt a convention where the encoding is contained in the
    > filename itself. There's a scheme like this for email subject lines.
    > For example I have a piece of spam in my inbox with a subject of
    > =?ISO-2022-JP?B?GyRCMnEwd0ApNVUxZyU1JSQbKEI=?= which I presume a smart
    > enough mail client would display as Japanese text. (Mine doesn't, but
    > I don't care cause it's spam and I can't read Japanese anyway.)
    >


    This problem is about the POSIX standard or interoperability or the
    entire world if you want.

    However the encoding is stored in the file system is the decision of the
    FS implementation and I am sure there are many possibilities to choose from.

    Timothy Madden,
    Romania.
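The email scheme mentioned in the quote is the RFC 2047 "encoded-word" form, `=?charset?encoding?payload?=`. As a sketch of how a name that carries its own charset could be decoded, Python's standard `email.header.decode_header` can be applied to the subject quoted above; `decode_encoded_word` is a hypothetical wrapper and handles only a single encoded word.

```python
from email.header import decode_header


def decode_encoded_word(value: str) -> tuple:
    """Decode one RFC 2047 encoded word; return (text, charset)."""
    raw, charset = decode_header(value)[0]
    if isinstance(raw, str):
        # No encoded word was present: the value is plain text already.
        return raw, "us-ascii"
    charset = charset or "us-ascii"
    return raw.decode(charset), charset


subject = "=?ISO-2022-JP?B?GyRCMnEwd0ApNVUxZyU1JSQbKEI=?="
text, charset = decode_encoded_word(subject)
print(charset)
print(text)
```

The cost of such a convention for file names is that every program showing a name has to know about it, otherwise users see the raw `=?...?=` form.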

  6. Re: Any portable way get a filename in UTF-8 or to get the FS encoding?

    Gianni Mariani wrote:
    > Timothy Madden wrote:
    > ...
    >> Are POSIX implementations required to convert the file name returned by
    >> readdir to the application's execution character set ?

    >
    > A reasonable convention to use (hard to enforce) is that all file names
    > be stored in a normalized utf-8. This is similar to the Windows
    > solution of storing all file names in utf-16. The question is what to
    > do when a process's character set cannot represent a name converted
    > from utf-8. There are two solutions - keep file names in utf-8 and
    > display them in utf-8, or convert the entire application to use utf-8.
    >
    > The third solution is to only use a subset of utf-8 - ascii, for file
    > names.


    Every one is free to use their character set. POSIX is a standard, and
    it is meant for every one. If everything your application knows is
    EBCDIC, just keep using EBCDIC, UTF-8 should be just one of the options.

    If you meant internal storage, that is the decision of the file system
    implementation only.

    The question of names using characters outside the application's
    character set is still a difficult one for me. I guess the system could
    use some escape mechanism in the file names for such characters, like
    uri-encoding them, and at the same time set some variable to let the
    requesting application know about what happened.

    Timothy Madden,
    Romania
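One way the escape idea could look, as a sketch only: characters the application's charset cannot represent are written as percent-escapes of their UTF-8 bytes, and a flag reports that escaping happened. The name `escape_for_charset` is invented, the charset is assumed to be ASCII-compatible (so not EBCDIC), and a literal `%` is escaped as well so the result stays unambiguous.

```python
def escape_for_charset(name: str, charset: str) -> tuple:
    """Encode name into charset, percent-escaping what does not fit.

    Returns (encoded_bytes, escaped), where escaped is True if at least
    one character had to be replaced by an escape sequence.
    """
    out = bytearray()
    escaped = False
    for ch in name:
        if ch == "%":
            out += b"%25"  # keep literal percent signs distinguishable
            continue
        try:
            out += ch.encode(charset)
        except UnicodeEncodeError:
            # Fall back to the character's UTF-8 bytes, written as %XX.
            for byte in ch.encode("utf-8"):
                out += "%{:02X}".format(byte).encode("ascii")
            escaped = True
    return bytes(out), escaped


print(escape_for_charset("naïve.txt", "latin-1"))
print(escape_for_charset("ł.txt", "latin-1"))
```

The returned flag plays the role of the "variable" in the post: the application learns that the name it received is not the name as stored.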

  7. Re: Any portable way get a filename in UTF-8 or to get the FS encoding?

    Timothy Madden wrote:
    > Gianni Mariani wrote:
    >> Timothy Madden wrote:
    >> ...
    >>> Are POSIX implementations required to convert the file name returned by
    >>> readdir to the application's execution character set ?

    >>
    >> A reasonable convention to use (hard to enforce) is that all file
    >> names be stored in a normalized utf-8. This is similar to the Windows
    >> solution of storing all file names in utf-16. The question is what to
    >> do when a process's character set cannot represent a name converted
    >> from utf-8. There are two solutions - keep file names in utf-8 and
    >> display them in utf-8, or convert the entire application to use utf-8.
    >>
    >> The third solution is to only use a subset of utf-8 - ascii, for file
    >> names.

    >
    > Every one is free to use their character set. POSIX is a standard, and
    > it is meant for every one. If everything your application knows is
    > EBCDIC, just keep using EBCDIC, UTF-8 should be just one of the options.
    >
    > If you meant internal storage, that is the decision of the file system
    > implementation only.
    >
    > The question of names using characters outside the application's
    > character set is still a difficult one for me. I guess the system could
    > use some escape mechanism in the file names for such characters, like
    > uri-encoding them, and at the same time set some variable to let the
    > requesting application know about what happened.


    If you want interoperability then a very good solution is to use a
    common base. Unicode is designed to accommodate that common base for
    languages. You don't *have* to use it and you can go and whack your
    head against a brick wall if you really want to.

    It gets to the point that once you have decided you need to have
    multiple processes with different locale encodings to talk to each other
    (which is the inevitable problem with file names), then using a common
    encoding like utf-8 and deprecating all other encodings becomes an
    interesting solution.

    It will take a while still before it is ubiquitous; however, many
    web-based documents are utf-8, and many applications communicate in
    utf-8 or utf-16. Most of the recent web browsers work very well
    multilingually, the tools are there, the problems are solved. There
    are a plethora of multilingual documents on the web today.

    See if this works below :-

    س اスセソタチツテ لاБ Г Д من 1441

  8. Re: Any portable way get a filename in UTF-8 or to get the FS encoding?

    On Mon, 08 Oct 2007 14:15:22 +0300, Timothy Madden wrote:

    > How about files from a remote file system ? Then I am out of luck !
    >
    > I usually connect through VPN, at work, to my client's LAN. They use
    > Latin-1, I use Latin-2.
    >
    > How can I tell that programmatically and portably ? My app has to work
    > with files from both machines.


    When mounting your remote file system you should specify the character
    set conversion you would like to get done in order to get the file names
    in the encoding that you want your process to receive.

    If you have file systems with latin1 file names and file systems with
    latin2 file names that you want to access from the same process, then I
    suggest that you mount them with character set conversion to UTF-8 and
    then run your process in a UTF-8 locale.

    Cheers // Fredrik Roubert

    --
    Dyre Halses gate 10 | +47 73568556 / +47 41266295
    NO-7042 Trondheim | http://www.df.lth.se/~roubert/
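If the conversion cannot be done at mount time, the same idea can be approximated in the application, provided it is told which encoding each mount uses. The table below is purely hypothetical (mount points and encodings are made up), and `to_utf8` is an illustrative helper, not an existing API.

```python
# Hypothetical table: the encoding in which each mount stores its names.
MOUNT_ENCODINGS = {
    b"/mnt/client": "iso-8859-1",  # the client's Latin-1 machine
    b"/mnt/home": "iso-8859-2",    # the local Latin-2 machine
}


def to_utf8(path: bytes) -> bytes:
    """Convert a raw path to UTF-8 using the encoding of its mount."""
    for root, encoding in MOUNT_ENCODINGS.items():
        if path == root or path.startswith(root + b"/"):
            return path.decode(encoding).encode("utf-8")
    return path  # unknown mount: leave the bytes untouched


# The same four bytes mean different characters on the two mounts.
raw_name = b"\xb3\xf3d\xbc.txt"
print(to_utf8(b"/mnt/home/" + raw_name).decode("utf-8"))
print(to_utf8(b"/mnt/client/" + raw_name).decode("utf-8"))
```

Note that the encoding still has to come from outside, here a hand-written table; nothing in the bytes themselves says which one is right, which is the original complaint in this thread.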

  9. Re: Any portable way get a filename in UTF-8 or to get the FS encoding?

    On Mon, 08 Oct 2007 11:58:53 +0300, Timothy Madden wrote:

    > What is bad about filesystems or filenames having a charset property ?


    That would make it necessary for all files on any given filesystem to
    have their names encoded in the same character set. This would prevent,
    say, one user from encoding his file names in ISO-8859-1 and another
    user to encode his file names in GB2312.

    Many other systems work this way, but from the Unix point of view, every
    single process should be able to run in its own locale. In many kinds of
    larger and distributed systems, this is a really good idea.

    Cheers // Fredrik Roubert

    --
    Dyre Halses gate 10 | +47 73568556 / +47 41266295
    NO-7042 Trondheim | http://www.df.lth.se/~roubert/

  10. Any portable way get a filename in UTF-8 or to get the FS encoding ?

    TM> What is bad about filesystems or filenames having a charset
    TM> property ?

    FR> That would make it necessary for all files on any given filesystem
    FR> to have their names encoded in the same character set.

    _This is actually already the case_ for the FAT, NTFS, and HFS+
    filesystem formats. It's a requirement of the filesystem formats.

    FR> This would prevent, say, one user from encoding his file names in
    FR> ISO-8859-1 and another user to encode his file names in GB2312.

    Wrong. That is _not_ a consequence of filenames having a character
    set property. If filenames had a character set property -- as _they
    have_ on HPFS -- then one user could use one character set for xyr
    file names and another user could use another character set for xyr
    file names. And, indeed, on those operating systems that support this
    facility of HPFS, that is exactly what they do.

    FR> Many other systems work this way, but from the Unix point of
    FR> view, every single process should be able to run in its own
    FR> locale.

    This is irrelevant to the issue. If the system API were UTF16, for
    example, translation between UTF16 and an 8-bit character set would be
    done in application-mode code, and would use the process' current
    locale. Thus the 8-bit character set would be locale-dependent and
    per-process, as desired. This is exactly how those "other systems"
    actually work.

    http://reactos.org./generated/doxygen/d4/d47/dll_2win32_2kernel32_2file_2find_8c.html#a10
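The layering described above can be sketched as follows. Everything here is illustrative: `native_find` stands in for a hypothetical UTF-16 system interface and just returns two fixed names, and `find_narrow` is the application-mode wrapper that converts to and from the process's current locale encoding.

```python
import locale


def native_find(pattern_utf16: bytes) -> list:
    """Stand-in for a UTF-16 system call; returns fixed UTF-16 names."""
    return [name.encode("utf-16-le") for name in ("report.txt", "résumé.txt")]


def find_narrow(pattern: bytes, codeset: str = None) -> list:
    """8-bit wrapper: translate in user space, per process."""
    # Default to the codeset of the process's current locale.
    codeset = codeset or locale.getpreferredencoding(False)
    wide_pattern = pattern.decode(codeset).encode("utf-16-le")
    return [
        name.decode("utf-16-le").encode(codeset, errors="replace")
        for name in native_find(wide_pattern)
    ]


print(find_narrow(b"*.txt", "latin-1"))
print(find_narrow(b"*.txt", "ascii"))
```

Two processes with different locales get different byte strings for the same stored name, and the translation never touches the kernel side.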


  11. Re: Any portable way get a filename in UTF-8 or to get the FS encoding ?

    J de Boyne Pollard writes:
    > TM> What is bad about filesystems or filenames having a charset
    > TM> property ?
    >
    > FR> That would make it necessary for all files on any given filesystem
    > FR> to have their names encoded in the same character set.
    >
    > _This is actually already the case_ for the FAT, NTFS, and HFS+
    > filesystem formats. It's a requirement of the filesystem formats.


    In other words, DOS/Windows and Mac OS behave differently in this
    respect.

    > FR> This would prevent, say, one user from encoding his file names in
    > FR> ISO-8859-1 and another user to encode his file names in GB2312.
    >
    > Wrong. That is _not_ a consequence of filenames having a character
    > set property.


    If a filename has a 'character set property', there is obviously a
    character set attached to the filename. The same goes for a filesystem
    with a character set property: It would have one.

    > If filenames had a character set property -- as _they have_ on HPFS
    > -- then one user could use one character set for xyr file names and
    > another user could use another character set for xyr file names.


    That's a non-sequitur.

    > FR> Many other systems work this way, but from the Unix point of
    > FR> view, every single process should be able to run in its own
    > FR> locale.
    >
    > This is irrelevant to the issue. If the system API were UTF16, for
    > example, translation between UTF16 and an 8-bit character set would be
    > done in application-mode code, and would use the process' current
    > locale.


    If the filesystem actually used an 'encoding' internally, instead
    of just using the supplied bytestring, applications would be required
    to translate to and from that encoding in case the application would
    want to use a different encoding. More generally put, if the kernel
    had a 'default policy', applications would need to work around that if
    the user (for some reason) would like to use a different policy.

    I do not quite understand why this would be an argument for or against
    anything.

  12. Re: Any portable way get a filename in UTF-8 or to get the FS encoding ?

    On Wed, 10 Oct 2007 05:05:10 -0700, J de Boyne Pollard wrote:

    > > This would prevent, say, one user from encoding his file names in
    > > ISO-8859-1 and another user to encode his file names in GB2312.

    >
    > Wrong. That is _not_ a consequence of filenames having a character
    > set property.


    Of course that's not a consequence of filenames having a character set
    property. If you read a bit more carefully, you'll realize that I was
    referring to filesystems having a character set property.

    Cheers // Fredrik Roubert

    --
    Dyre Halses gate 10 | +47 73568556 / +47 41266295
    NO-7042 Trondheim | http://www.df.lth.se/~roubert/

  14. Any portable way get a filename in UTF-8 or to get the FS encoding ?

    TM> What is bad about filesystems or filenames having a
    TM> charset property ?

    FR> This would prevent, say, one user from encoding his file names in
    FR> ISO-8859-1 and another user to encode his file names in GB2312.

    JdeBP> Wrong. That is _not_ a consequence of filenames having
    JdeBP> a character set property.

    FR> Of course that's not a consequence of filenames having a
    FR> character set property. If you read a bit more carefully,
    FR> you'll realize that I was referring to filesystems having a
    FR> character set property.

    False. You were answering the question that is quoted above, which
    talks about filenames having a character set property. And nowhere
    did you write anything to indicate that this was not what you were
    talking about.

