Discussion:
Unicode Filenames in Archives
SrinTuar
2008-06-13 15:32:23 UTC
Using some fairly recent OSes, such as Fedora Core 8 and Windows XP,
I seem to have no way to move a bunch of files from one to the other
while preserving the nice Unicode filenames I have.

Specifically, the files were created on the FC8 system (a few thousand
of them).

Putting them together in a zip file works fine FC8 -> FC8, but fails
miserably when I try to unzip it on Windows.

A bit of searching shows this:
http://www.pkware.com/documents/casestudies/APPNOTE.TXT

PKWARE has apparently declared a flag bit to mean all filenames are UTF-8.

But at the same time, the developers of info-zip say this:
http://www.info-zip.org/FAQ.html

Basically, UTF-8 support is nowhere on their radar.
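
For reference, the flag in question is bit 11 (0x0800) of the general
purpose bit flag; when set, it declares the entry's filename and
comment to be UTF-8. A quick sketch of how one could check which
entries carry it, using Python's standard zipfile module ("archive.zip"
is just a placeholder name):

    import zipfile

    with zipfile.ZipFile("archive.zip") as zf:
        for info in zf.infolist():
            # bit 11 of the general purpose flag marks UTF-8 names
            if info.flag_bits & 0x0800:
                print("UTF-8 flagged:", info.filename)
            else:
                print("no UTF-8 flag:", info.filename)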

Things work poorly in the opposite direction, for zip files created on
Windows, as well: sometimes I can guess the original encoding and
reverse the damage, other times I cannot; perhaps the software that
made the archive has already trashed the filenames.
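
One way to attempt that reversal, sketched in Python: zip tools that
follow the spec treat un-flagged names as CP437 (Python's zipfile
does), so the original bytes can often be recovered by re-encoding as
CP437 and then trying a few candidate encodings. The candidate list and
archive name below are only placeholders:

    import zipfile

    CANDIDATES = ["cp1251", "cp1252", "cp936", "shift_jis"]  # guesses

    def guess_names(path):
        # Print plausible decodings of entries lacking the UTF-8 flag.
        with zipfile.ZipFile(path) as zf:
            for info in zf.infolist():
                if info.flag_bits & 0x0800:
                    continue  # name is already declared UTF-8
                raw = info.filename.encode("cp437")  # undo CP437 decode
                for enc in CANDIDATES:
                    try:
                        print(enc, "->", raw.decode(enc))
                    except UnicodeDecodeError:
                        pass

    guess_names("from_windows.zip")  # placeholder archive name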

I've also given tarballs a shot for this task, but sadly Cygwin is
ASCII-only.

Because it works Linux to Linux, or at least Fedora to Fedora, and that
is really good enough for me, it's not a major issue. But I'm curious
to know whether others have run into this cross-platform problem, and
how they resolved it for themselves. That is, if anyone still reads
this list.

How do you go about making a basic archive containing non-ASCII
filenames that you can be confident will unpack well on most operating
systems?
Pádraig Brady
2008-06-13 16:25:38 UTC
Post by SrinTuar
I've also given tarballs a shot for this task, but sadly Cygwin is
ASCII-only.
Don't WinZip, WinRAR et al. support tarballs?
I haven't used Windows in years, so I'm just guessing.

Pádraig.
Simos Xenitellis
2008-06-13 18:03:24 UTC
Post by SrinTuar
[...]
How do you go about making a basic archive containing non-ASCII
filenames that you can be confident will unpack well on most operating
systems?
If you check the list archives, you will notice a discussion from a few
years back. One of the outcomes was that it's a bit messy to use ZIP
with filenames in an encoding other than ASCII.

I would suggest that you tar and gzip (or bzip2) your archives. Will
these work on Windows? Try extracting the files with 7-Zip; I would
appreciate it if you could report back on this.

Speaking of 7-Zip, the 7z format is another option as well.
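
If you do try tar, note that the POSIX pax format stores non-ASCII
member names as UTF-8 in extended headers, while the older v7/ustar
formats just write whatever bytes the filesystem had. A minimal sketch
with Python's tarfile module (the directory name is a placeholder, and
I have not verified how the various Windows extractors handle pax
headers):

    import tarfile

    # pax extended headers carry member names as UTF-8, unlike ustar/v7
    with tarfile.open("files.tar.gz", "w:gz",
                      format=tarfile.PAX_FORMAT) as tf:
        tf.add("my_unicode_dir")  # placeholder directory name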

Simos
SrinTuar
2008-06-13 18:26:03 UTC
Thanks Simos.

I gave the 7za command-line tool a try, and it seems to work quite
well; it handily solves the cross-platform Unicode filename problem.
(It either correctly figured out from the locale that my ext3 filenames
were UTF-8, or it simply assumed UTF-8, period. I'm fine with either
case; I didn't review the source to see which it was...)
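
For anyone who wants to script it, a minimal sketch of driving 7za from
Python, assuming 7za is on the PATH and using placeholder names (as far
as I know the 7z container stores filenames as UTF-16 internally, which
is why no encoding flags need juggling):

    import subprocess

    # "a" = add files to an archive; adjust the paths as needed
    subprocess.check_call(["7za", "a", "files.7z", "my_unicode_dir"])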

I suppose the answer, then, is to steer clear of the most popular
archive formats when you want to release an archive containing Unicode
filenames to a multi-platform audience. Alternatives such as 7-Zip
appear to work correctly.

I found this on Wikipedia:
http://en.wikipedia.org/wiki/Comparison_of_file_archivers

So we have a handful of valid choices after all.
t***@towo.net
2008-06-13 17:33:07 UTC
Post by SrinTuar
I've also given tarballs a shot for this task, but sadly Cygwin is
ASCII-only.
To quickly respond to this side topic: it is possible to enable UTF-8
in Cygwin in a limited way, although Cygwin has only bogus locale
support. Some applications, however, are able to support UTF-8 without
locale support:

* xterm works nicely in UTF-8 mode if configured properly
* rxvt-unicode can be patched to support UTF-8 (the package includes
  my patch)
* my editor mined supports UTF-8 if it finds the terminal to be
  running in UTF-8 mode
Post by SrinTuar
Because it works Linux to Linux, or at least Fedora to Fedora, and that
is really good enough for me, it's not a major issue. But I'm curious
to know whether others have run into this cross-platform problem, and
how they resolved it for themselves. That is, if anyone still reads
this list.
How do you go about making a basic archive containing non-ASCII
filenames that you can be confident will unpack well on most operating
systems?
I have been irritated by this as well, but I have no general solution.
Maybe we can find one together. I'll do some testing...

Kind regards,
Thomas Wolff
Markus Kuhn
2008-06-14 18:07:30 UTC
Post by t***@towo.net
Post by SrinTuar
I've also given tarballs a shot for this task, but sadly Cygwin is
ASCII-only.
To quickly respond to this side topic: it is possible to enable UTF-8
in Cygwin in a limited way, although Cygwin has only bogus locale
support. Some applications, however, are able to support UTF-8 without
locale support:
* xterm works nicely in UTF-8 mode if configured properly
* rxvt-unicode can be patched to support UTF-8 (the package includes
  my patch)
* my editor mined supports UTF-8 if it finds the terminal to be
  running in UTF-8 mode
Cygwin is a Windows DLL that provides Windows C applications with a
POSIX API very similar to that available under Linux. Under Linux, if
you open a file with a UTF-8 filename and the file is located on a VFAT
or NTFS filesystem, then it is the job of the kernel file-system driver
to convert between the UTF-8 encoding used in the open() system call
and the UTF-16 encoding used on Microsoft's file systems. Under Linux,
this works nicely if the utf8 option is passed to the ntfs driver by
mount from /etc/fstab. This is now done by default in all recent (i.e.,
post-2005) major Linux distributions.
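
A sketch of what such an /etc/fstab entry can look like (device names
and mount points are placeholders; the exact option name depends on the
driver, e.g. the old in-kernel ntfs driver takes nls=utf8 while vfat
takes utf8):

    # /etc/fstab (sketch; adjust devices, mount points and options)
    /dev/sda1   /mnt/windows   ntfs   ro,nls=utf8   0  0
    /dev/sda5   /mnt/usbstick  vfat   utf8          0  0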

So one option for doing the file transfer is to mount the relevant NTFS
partition under Linux and then use any standard Linux file copy tool
(cp, tar, rsync, etc.) to do the job. This is trivial if the Linux and
NTFS partitions reside on the same system; otherwise, either (a)
connect the NTFS hard disk to the Linux computer, or (b) boot the PC
that contains the NTFS partition temporarily with one of the many
Live-CD Linux distributions (Knoppix, etc.) from CD-R.

The question with regard to Cygwin is not what locales it has, but
whether it translates a UTF-8 string provided to it in a POSIX system
call, such as open(), into a UTF-16 string before passing the data on to
the equivalent Win32 system call, and vice versa.

Markus
--
Markus Kuhn, Computer Laboratory, University of Cambridge
http://www.cl.cam.ac.uk/~mgk25/ || CB3 0FD, Great Britain