garbled file names on a linux/windows volume

Discussion:

Ray Chuan

2008-10-31 17:51:42 UTC

Hi,

Recently I used Ubuntu to mount a Windows FAT32 volume and copy files
to another FAT volume with a simple "cp -r". Now file names containing
diacritics or non-Latin scripts do not show up properly. I only
discovered the problem when I was about to use the files.

using an edonkey client, which has a function to convert file names to
url-friendly strings (aka ed2k links), i was able to see that "é"
showed up as %C3%83%C2%A9, while the more complex "专辑"
(专辑) would be %C3%A4%C2%B8%C2%93%C3%A8%C2%BE%C2%91.

is it possible to fix this quickly using iconv? or perhaps someone
could suggest an algorithm t

Andries E. Brouwer

2008-10-31 20:31:53 UTC

Post by Ray Chuan
using an edonkey client, which has a function to convert file names to
url-friendly strings (aka ed2k links), i was able to see that "é"
showed up as %C3%83%C2%A9, while the more complex "专辑"
(专辑) would be %C3%A4%C2%B8%C2%93%C3%A8%C2%BE%C2%91.

You converted twice to UTF-8, so have to go back once.

(é is U+00e9 which is 11000011 10101001 in UTF-8, but if you read
the latter as Latin-1 and convert once more to UTF-8 you get
11000011 10000011 11000010 10101001, that is, %C3%83%C2%A9 as you reported)
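
To make the byte arithmetic above concrete, here is a minimal Python 3
sketch that reproduces the double encoding. It is not part of the original
post, and the helper name double_encode is made up for illustration.

```python
from urllib.parse import quote

def double_encode(text: str) -> bytes:
    """Encode to UTF-8, misread the bytes as Latin-1, encode to UTF-8 again."""
    once = text.encode("utf-8")          # the correct UTF-8 bytes
    misread = once.decode("iso-8859-1")  # each byte wrongly taken as one Latin-1 character
    return misread.encode("utf-8")       # the second, unwanted UTF-8 encoding

# Print the %-escaped form of the doubly encoded bytes for both samples.
for sample in ("é", "专辑"):
    print(quote(double_encode(sample)))
```

The two printed strings should match the %-escaped values reported in the
question, which is consistent with the "encoded twice" diagnosis.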

Ben Wiley Sittler

2008-10-31 22:49:07 UTC

if you need to fix a lot of these automatically from a shell script,
you might consider something like this:

python -c 'import sys, urllib; print urllib.unquote(
" ".join(sys.argv[1:])).decode("utf-8").encode("iso-8859-1")' \
'%C3%83%C2%A9' \
'%C3%A4%C2%B8%C2%93%C3%A8%C2%BE%C2%91'

é 专辑

it works like "echo", but decodes the %-escaping and one of the levels
of utf-8 encoding.
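
The one-liner above is written for Python 2. Under Python 3 the same idea
could be sketched roughly as follows. This is an illustration under stated
assumptions, not a tested recovery tool: the names undo_double_utf8,
decode_escaped and plan_renames are hypothetical, and plan_renames assumes
the file system returns names as UTF-8 and that the damage is exactly one
extra Latin-1 to UTF-8 round trip.

```python
import os
from urllib.parse import unquote_to_bytes

def undo_double_utf8(text: str) -> str:
    """Remove one surplus level of UTF-8 encoding from a string."""
    # Map each character back to the single byte it was misread from,
    # then decode those bytes as the UTF-8 they originally were.
    return text.encode("iso-8859-1").decode("utf-8")

def decode_escaped(escaped: str) -> str:
    """Decode %-escapes, then undo the double encoding."""
    return undo_double_utf8(unquote_to_bytes(escaped).decode("utf-8"))

def plan_renames(root: str):
    """Yield (old_path, new_path) for names that look double-encoded."""
    # Walk bottom-up so entries inside a directory come before the directory itself.
    for dirpath, dirnames, filenames in os.walk(root, topdown=False):
        for name in dirnames + filenames:
            try:
                fixed = undo_double_utf8(name)
            except (UnicodeEncodeError, UnicodeDecodeError):
                continue  # not cleanly double-encoded; leave it alone
            if fixed != name:
                yield os.path.join(dirpath, name), os.path.join(dirpath, fixed)

# prints: é 专辑
print(decode_escaped("%C3%83%C2%A9"),
      decode_escaped("%C3%A4%C2%B8%C2%93%C3%A8%C2%BE%C2%91"))
```

plan_renames only reports candidate pairs and changes nothing itself. It
would be prudent to review its output on a copy of the data before passing
any pair to os.rename. Names that are plain ASCII or already correct are
skipped, because they either come back unchanged or fail the decode step.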


Ray Chuan

2008-11-02 01:26:51 UTC

thanks, that worked.

--
Linux-UTF8: i18n of Linux on all levels
Archive: http://mail.nl.linux.org/linux-utf8/