How to enter accented UTF-8 character on GNOME terminal

Discussion:

Colin Paul Adams

2007-03-17 07:05:01 UTC

I can't find this in the GNOME help, so I thought I'd try asking here.

I want to rename a file so it has an a-umlaut (lower case) in the
name.

My LANG is en_GB.UTF-8.

I don't know how to type the accented character.

--
Colin Adams
Preston Lancashire

Rich Felker

2007-03-17 07:53:17 UTC

Post by Colin Paul Adams
I can't find this in the GNOME help, so I thought I'd try asking here.
I want to rename a file so it has an a-umlaut (lower case) in the
name.
My LANG is en_GB.UTF-8.
I don't know how to type the accented character.

One sure way is to copy-and-paste it from a file already containing
the character. I keep around a copy of UnicodeData.txt with the
literal UTF-8 character added to each line for exactly this purpose.

Another method that might work is the ISO 14755 entry method, holding
control and shift and typing the character number in hex. Not sure if
GNOME terminal supports this or not. On the Linux console, if you have
an appropriate keymap loaded, holding AltGr and typing the character
number will do the same.
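
The "character number in hex" that this entry method expects is the
Unicode code point of the character. A minimal Python sketch of how one
might look that number up for a-umlaut, and see which bytes end up in a
UTF-8 file name (Python is used purely for illustration here; nothing in
it is specific to GNOME terminal):

```python
import unicodedata

ch = "\u00e4"  # LATIN SMALL LETTER A WITH DIAERESIS (a-umlaut)

code_point = format(ord(ch), "04X")    # the hex number to type
utf8_bytes = ch.encode("utf-8").hex()  # bytes stored in a UTF-8 file name

print(unicodedata.name(ch))  # LATIN SMALL LETTER A WITH DIAERESIS
print(code_point)            # 00E4
print(utf8_bytes)            # c3a4
```

So for this character one would hold control and shift and type e4.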

Of course for characters that you want to enter often, all of these
methods are rather inconvenient. For this purpose you can customize
the X keyboard tables with xkb or xmodmap. I have xkb configured so
that capslock toggles between two mappings. Then I run the command:
setxkbmap us,xx with xx replaced with whatever secondary mapping I
want to use. If you just want accented characters though you probably
don't need a whole secondary mapping; just enabling 'dead keys' or
setting up altgr+something to enter the characters you need is
probably sufficient.

Rich

Colin Paul Adams

2007-03-17 08:25:43 UTC

Rich> On Sat, Mar 17, 2007 at 07:05:01AM +0000, Colin Paul Adams

Post by Colin Paul Adams
I can't find this in the GNOME help, so I thought I'd try
asking here.
I want to rename a file so it has an a-umlaut (lower case)
in the name.
My LANG is en_GB.UTF-8.
I don't know how to type the accented character.

Rich> One sure way is to copy-and-paste it from a file already
Rich> containing the character. I keep around a copy of
Rich> UnicodeData.txt with the literal UTF-8 character added to
Rich> each line for exactly this purpose.

Rich> Another method that might work is the ISO 14755 entry
Rich> method, holding control and shift and typing the character
Rich> number in hex. Not sure if GNOME terminal supports this

Trial says it does.
Thank you.

Now my real problem is somewhat more interesting, and relevant to this
list, I think:

I am the author of an XSLT 2.0 interpreter, and a member of the W3C
XSLT WG. As such, I have access to the XSLT 2.0 test suite
(unfortunately not publicly distributed now).

One of the tests involves evaluation of the following expression:

document('xgespr%C3%A4ch.xml')

According to the rules of the language, the argument to document() is
of type xs:anyURI. The percent-encoding must be interpreted as a UTF-8
byte sequence representing the Unicode characters.
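
As a rough illustration of that rule, here is a minimal Python sketch
that decodes the percent-escapes first to bytes and then to characters.
It uses the standard urllib.parse module and is not the resolver
described in this message:

```python
from urllib.parse import unquote, unquote_to_bytes

uri = "xgespr%C3%A4ch.xml"

# Percent-escapes are first turned back into raw bytes ...
raw = unquote_to_bytes(uri)

# ... and those bytes are then interpreted as UTF-8 to get the characters.
name = unquote(uri, encoding="utf-8")

print(raw)   # b'xgespr\xc3\xa4ch.xml'
print(name)  # xgespräch.xml
```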

Now this is where it gets interesting.
My URI resolver translates the file name (the URI is relative to a
base file: URI) into a UTF-8 byte sequence which gets passed to the
fopen call (the program is supposed to work on other O/Ses too, not
just Linux, but I'll worry about that later).

The test suite is currently distributed as a zip file. It so happens
that the file concerned is named using ISO-8859-1 on the distributor's
system. On my system, doing ls from the GNOME console shows the name
as xgespr?ch.xml. Whereas Emacs dired shows the name as
xgespräch.xml.

I'm not sure exactly how fopen is supposed to handle the situation.

Anyway, the test failed - not surprisingly.
I looked at the unzip man page, to see if there was any filename
translation option. I couldn't find one.

So I tried unzipping the distribution afresh, but this time with
LANG=en_GB.

Emacs still showed the same name, ls however showed a completely
different character (it looked like it might be Arabic to me - I don't
know).

The test still failed.

So I went back to LANG=en_GB.UTF-8, unzipped the distribution again,
and re-named the file, thanks to your help.

ls now shows the correct file name. Emacs shows
xgesprÃ¤ch.xml. And the test works.
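
The two displays can be reproduced away from any filesystem. The
following Python sketch assumes only what is stated above: the same name
stored once as ISO-8859-1 bytes and once as UTF-8 bytes, and a reader
that guesses the other encoding. The variable names are made up for the
illustration:

```python
name = "xgespr\u00e4ch.xml"

latin1_bytes = name.encode("latin-1")  # as stored by the ISO-8859-1 system
utf8_bytes = name.encode("utf-8")      # as produced from the percent-encoded URI

# A UTF-8 reader cannot decode the lone 0xE4 byte and substitutes a
# replacement character, much as ls printed "?" in that position.
as_utf8_reader = latin1_bytes.decode("utf-8", errors="replace")

# A Latin-1 reader turns the two UTF-8 bytes C3 A4 into two characters.
as_latin1_reader = utf8_bytes.decode("latin-1")

print(latin1_bytes == utf8_bytes)  # False
print(as_latin1_reader)            # xgesprÃ¤ch.xml
```

Since the two byte strings differ, a lookup for the UTF-8 form cannot
find a file stored under the ISO-8859-1 form, which would be consistent
with the failing test.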

Has anyone any illuminating comments to make? I'm particularly
interested in the distribution problem.

--
Colin Adams
Preston Lancashire

ＳｒｉｎＴｕａｒ

2007-03-17 15:18:56 UTC

Post by Colin Paul Adams
The test suite is currently distributed as a zip file. It so happens
that the file concerned is named using ISO-8859-1 on the distributor's
system. On my system, doing ls from the GNOME console shows the name
as xgespr?ch.xml. Whereas Emacs dired shows the name as
xgespräch.xml.

Zip files treat filenames as byte arrays, so zip tends to be clumsy
when you get zipfiles created on legacy systems. It's compatible with
utf-8 at least, so zipfiles you make yourself should have no problems.
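
A minimal sketch of working around this with Python's zipfile module,
which decodes member names as cp437 when the archive does not flag them
as UTF-8. The helper name recover_name is hypothetical, and the legacy
encoding is an assumption the caller has to supply, since the archive
does not record it:

```python
import zipfile

UTF8_FLAG = 0x800  # general-purpose bit 11: name is stored as UTF-8


def recover_name(info: zipfile.ZipInfo, legacy_encoding: str = "latin-1") -> str:
    """Return the member name, re-decoding it if it was not flagged as UTF-8."""
    if info.flag_bits & UTF8_FLAG:
        return info.filename
    # zipfile decoded the raw bytes as cp437; undo that to get the
    # original bytes back, then decode them with the assumed encoding.
    raw = info.filename.encode("cp437")
    return raw.decode(legacy_encoding)
```

With an open archive this could be applied as
[recover_name(i) for i in zf.infolist()] before extracting each member
under the recovered name.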

Post by Colin Paul Adams
So I went back to LANG=en_GB.UTF-8, unzipped the distribution again,
and re-named the file, thanks to your help.
ls now shows the correct file name. Emacs shows
xgesprÃ¤ch.xml. And the test works.

I tried emacs and saw the same problem you did. vim seems to work
correctly with locales.
Although advising a switch to vim is probably more responsible,
a quick search revealed this link: http://linux.seindal.dk/item32.html

Post by Colin Paul Adams
Has anyone any illuminating comments to make? I'm particularly
interested in the distribution problem.

You could have the distributor change his locale to utf-8 and rename the files
on his filesyst

Ben Wiley Sittler

2007-03-17 16:51:53 UTC

emacs seems not to handle utf-8 filenames at all, regardless of locale.

Post by ＳｒｉｎＴｕａｒ

Post by Colin Paul Adams
Has anyone any illuminating comments to make? I'm particularly
interested in the distribution problem.

You could have the distributor change his locale to utf-8 and rename the file

Rich Felker

2007-03-18 01:40:59 UTC

Post by Ben Wiley Sittler
emacs seems not to handle utf-8 filenames at all, regardless of locale.

(setq file-name-coding-system 'utf-8)

~Rich

Colin Paul Adams

2007-03-18 03:13:26 UTC

Rich> On Sat, Mar 17, 2007 at 09:51:53AM -0700, Ben Wiley Sittler

Post by Ben Wiley Sittler
emacs seems not to handle utf-8 filenames at all, regardless of locale.

Rich> (setq file-name-coding-system 'utf-8)

Thank you.

--
Colin Adams
Preston Lancashire

Ben Wiley Sittler

2007-03-18 15:41:48 UTC

awesome, and thank you! however, utf-8 filenames given on the command
line still do not work... they get turned into iso-8859-1, which is
then utf-8 encoded before saving (?!)

here's my (partial) utf-8 workaround for emacs so far:

(if (string-match "XEmacs\\|Lucid" emacs-version)
    nil
  ;; GNU Emacs only: switch everything to UTF-8 when the locale name
  ;; ends in ".UTF-8", ".utf8" or the like.
  (condition-case nil
      (eval
       (if (string-match "\\.\\(UTF\\|utf\\)-?8$"
                         (or (getenv "LC_CTYPE")
                             (or (getenv "LC_ALL")
                                 (or (getenv "LANG")
                                     "C"))))
           '(concat (set-terminal-coding-system 'utf-8)
                    (set-keyboard-coding-system 'utf-8)
                    (set-default-coding-systems 'utf-8)
                    (setq file-name-coding-system 'utf-8)
                    (set-language-environment "UTF-8"))))
    ((error "Language environment not defined: \"UTF-8\"") nil)))

--
Linux-UTF8: i18n of Linux on all levels
Archive: http://mail.nl.linux.org/linux-utf8/

Rich Felker

2007-03-18 18:17:32 UTC

Post by Ben Wiley Sittler
awesome, and thank you! however, utf-8 filenames given on the command
line still do not work... they get turned into iso-8859-1, which is
then utf-8 encoded before saving (?!)
(if (string-match "XEmacs\\|Lucid" emacs-version)
    nil
  ;; GNU Emacs only: switch everything to UTF-8 when the locale name
  ;; ends in ".UTF-8", ".utf8" or the like.
  (condition-case nil
      (eval
       (if (string-match "\\.\\(UTF\\|utf\\)-?8$"
                         (or (getenv "LC_CTYPE")
                             (or (getenv "LC_ALL")
                                 (or (getenv "LANG")
                                     "C"))))
           '(concat (set-terminal-coding-system 'utf-8)
                    (set-keyboard-coding-system 'utf-8)
                    (set-default-coding-systems 'utf-8)
                    (setq file-name-coding-system 'utf-8)
                    (set-language-environment "UTF-8"))))
    ((error "Language environment not defined: \"UTF-8\"") nil)))

Here are all my relevant emacs settings. They work in at least
emacs-21 and later; however, emacs-21 seems to be having trouble with
UTF-8 on the command line and I don’t know any way around that.

; Force unix and utf-8
(setq inhibit-eol-conversion t)
(prefer-coding-system 'utf-8)
(setq locale-coding-system 'utf-8)
(set-terminal-coding-system 'utf-8)
(set-keyboard-coding-system 'utf-8)
(set-selection-coding-system 'utf-8)
(setq file-name-coding-system 'utf-8)
(setq coding-system-for-read 'utf-8)
(setq coding-system-for-write 'utf-8)

Note that the last two may be undesirable; they force ALL files to be
treated as UTF-8, skipping any detection. This allows me to edit files
which may have invalid sequences in them (like Kuhn’s decoder test
file) or which are a mix of binary data and UTF-8.

I use the experimental unicode-2 branch of GNU emacs, and with it,
forcing UTF-8 does not corrupt non-UTF-8 files. The invalid sequences
are simply shown as octal byte codes and saved back to the file as
they were in the source. I cannot confirm that this will not corrupt
files on earlier versions of GNU emacs, however, and XEmacs ALWAYS
corrupts files visited as UTF-8 (it converts any unicode character for
which it does not have a corresponding emacs-mule character into a
replacement character) so it’s entirely unsuitable for use with UTF-8
until that’s fixed (still broken in latest cvs as of a few months
ago..).

BTW looking for “UTF-8” in the locale string is a bad idea since UTF-8
is not necessarily a “special” encoding but may be the “native”
encoding for the selected language. nl_langinfo(CODESET) is the only
reliable determination and I doubt emacs provides any direct way of
accessing it. :(
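
For comparison, this is roughly what asking nl_langinfo(CODESET) looks
like from a language that does expose it. A minimal Python sketch,
assuming a POSIX system (locale.nl_langinfo is not available
everywhere); the function name current_codeset is made up for the
example:

```python
import locale


def current_codeset() -> str:
    """Return the character encoding of the active LC_CTYPE locale."""
    try:
        # Adopt the locale named by the environment (LC_ALL, LC_CTYPE, LANG).
        locale.setlocale(locale.LC_CTYPE, "")
    except locale.Error:
        pass  # unsupported locale: keep whatever locale is already active
    return locale.nl_langinfo(locale.CODESET)


print(current_codeset())
```

The point is that the answer comes from the C library's view of the
locale rather than from pattern-matching on the locale's name.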

~Rich

Ben Wiley Sittler

2007-03-18 20:21:53 UTC

yeah, using the newer 'emacs-snapshot' (GNU Emacs 22.0.91.1) here on
ubuntu feisty solves most of the UTF-8 related problems in emacs,
including command line argument encoding. since i deal with some data
in non-utf-8 encodings (iso-2022, iso-2022-jp, iso-8859-x, etc.) and
interact with other X11 applications that use compound-text in their
selections, i do not think some of those settings would work for me.

i agree that looking for a particular substring in the locale name is
the wrong approach. on a linux system i should perhaps base this on
the output of the "locale charmap" command instead, but my rusty elisp
is not up to that task at the moment. fortunately the UTF-8 locales
all seem to end with ".UTF-8" on this system.

Jan Larres

2007-03-19 15:27:54 UTC

Post by Rich Felker
BTW looking for “UTF-8” in the locale string is a bad idea since UTF-8
is not necessarily a “special” encoding but may be the “native”
encoding for the selected language. nl_langinfo(CODESET) is the only
reliable determination and I doubt emacs provides any direct way of
accessing it. :(

AFAIK 'locale charmap' should do the same thing. At least that's what the
man page of nl_langinfo states.

Jan

--
OpenPGP Key-ID: CF1635D4
"The most exciting phrase to hear in science, the one that heralds new
discoveries, is not "Eureka!" (I found it!) but "That's funny ..."" --
Isaac Asimov

James Cloos

2007-03-18 15:12:39 UTC

Ben> emacs seems not to handle utf-8 filenames at all, regardless of locale.

That is dependent on the version of emacs. I can imagine GNU Emacs 21
may have problems, and I expect earlier will. But 22 and later should
work well. (Some configuration may be required, though, for 22.)

I don't know about XEmacs.

-JimC

--
James Cloos <***@jhcloos.com> OpenPGP: 1024D/ED7DAEA6

Rich Felker

2007-03-18 01:44:58 UTC

Post by Colin Paul Adams
Now this is where it gets interesting.
My URI resolver translates the file name (the URI is relative to a
base file: URI) into a UTF-8 byte sequence which gets passed to the
fopen call (the program is supposed to work on other O/Ses too, not
just Linux, but I'll worry about that later).
The test suite is currently distributed as a zip file. It so happens
that the file concerned is named using ISO-8859-1 on the distributor's
system. On my system, doing ls from the GNOME console shows the name
as xgespr?ch.xml. Whereas Emacs dired shows the name as
xgespräch.xml.
I'm not sure exactly how fopen is supposed to handle the situation.

It's not. You should not create files in your filesystem with the
wrong encoding. If you do, then the only way to access them is via
whatever the (invalid) byte sequence is.

Post by Colin Paul Adams
Anyway, the test failed - not surprisingly.
I looked at the unzip man page, to see if there was any filename
translation option. I couldn't find one.

Yes, the problem here is the unzip command. It should provide a way to
translate filenames...

Post by Colin Paul Adams
So I tried unzipping the distribution afresh, but this time with
LANG=en_GB.

That won't help. You can't mix encodings in the filesystem and expect
any reasonable behavior.

Post by Colin Paul Adams
Emacs still showed the same name, ls however showed a completely
different character (it looked like it might be Arabic to me - I don't
know).
The test still failed.
So I went back to LANG=en_GB.UTF-8, unzipped the distribution again,
and re-named the file, thanks to your help.

Yep, this is the only reasonable fix until the unzip command is fixed
to handle foreign encodings.

Post by Colin Paul Adams
ls now shows the correct file name. Emacs shows
xgesprÃ¤ch.xml. And the test works.

(setq file-name-coding-system 'utf-8)

~Rich

Colin Paul Adams

2007-03-18 03:17:00 UTC

Post by Colin Paul Adams
So I went back to LANG=en_GB.UTF-8, unzipped the distribution
again, and re-named the file, thanks to your help.

Rich> Yep, this is the only reasonable fix until the unzip command
Rich> is fixed to handle foreign encodings.

Thanks for this confirmation.

I just took a look at the tar man page. I can't see anything here either.

--
Colin Adams
Preston Lancashire

Jan Larres

2007-03-19 15:40:35 UTC

If you have more files with the wrong encoding, you can rename them all
at once with convmv:
http://freshmeat.net/projects/convmv/
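
To show the idea behind such a bulk rename, here is a minimal Python
sketch. It is not convmv and does not claim to mirror its options; the
function names are hypothetical, and the source encoding is an
assumption you must supply. It works on byte-string paths so that names
which are invalid in the current locale are handled as-is:

```python
import os


def recode_name(raw: bytes, src: str = "latin-1", dst: str = "utf-8") -> bytes:
    """Re-encode one file name from the src encoding to the dst encoding."""
    return raw.decode(src).encode(dst)


def rename_tree(root: bytes, src: str = "latin-1", dry_run: bool = True) -> None:
    """Rename every entry under root whose name changes when re-encoded."""
    # Walk bottom-up so a directory is renamed only after its contents.
    for dirpath, dirnames, filenames in os.walk(root, topdown=False):
        for old in dirnames + filenames:
            new = recode_name(old, src)
            if new == old:
                continue  # pure ASCII name, nothing to do
            print(f"{old!r} -> {new!r}")
            if not dry_run:
                os.rename(os.path.join(dirpath, old), os.path.join(dirpath, new))
```

Note that this sketch does not check whether a name is already valid
UTF-8, so running it twice over the same tree would mangle the names; the
dry_run default is there so the planned renames can be inspected first.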

Jan

--
OpenPGP Key-ID: CF1635D4
"The difference between fiction and reality? Fiction has to make sense."
-- Tom Clancy

Colin Paul Adams

2007-03-19 15:54:37 UTC

Post by Colin Paul Adams
So I went back to LANG=en_GB.UTF-8, unzipped the distribution
again, and re-named the file, thanks to your help.
ls now shows the correct file name. Emacs shows
xgesprÃ¤ch.xml. And the test works.

Jan> If you have more files with the wrong encoding, you can
Jan> rename them all at once with convmv:
Jan> http://freshmeat.net/projects/convmv/

I don't know that I do, but this sounds like a useful utility to have.

Thank you.

--
Colin Adams
Preston Lancashire

Pádraig Brady

2007-03-20 09:56:50 UTC

Post by Colin Paul Adams

Post by Colin Paul Adams
So I went back to LANG=en_GB.UTF-8, unzipped the distribution
again, and re-named the file, thanks to your help.
ls now shows the correct file name. Emacs shows
xgesprÃ¤ch.xml. And the test works.

Jan> If you have more files with the wrong encoding, you can
Jan> rename them all at once with convmv:
Jan> http://freshmeat.net/projects/convmv/
I don't know that I do, but this sounds like a useful utility to have.

If you want to find out, have a look at the "Bad names" functionality
in FSlint which is in fedora extras and debian, or otherwise:
http://www.pixelbeat.org/fslint/
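
For the single case discussed in this thread (names that are not valid
UTF-8), a check can be sketched in a few lines of Python. This is only
an illustration with made-up function names; it makes no claim about
which checks FSlint itself performs:

```python
import os
from typing import Iterator


def is_valid_utf8(raw: bytes) -> bool:
    """Return True if the byte string decodes cleanly as UTF-8."""
    try:
        raw.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


def bad_names(root: bytes = b".") -> Iterator[bytes]:
    """Yield the path of every entry under root whose name is not UTF-8."""
    # A bytes root makes os.walk return raw byte names, undecoded.
    for dirpath, dirnames, filenames in os.walk(root):
        for name in dirnames + filenames:
            if not is_valid_utf8(name):
                yield os.path.join(dirpath, name)
```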

cheers,
Pádraig.

Larry Wall

2007-03-24 20:45:35 UTC

On Sat, Mar 17, 2007 at 02:53:17AM -0500, Rich Felker wrote:
: On Sat, Mar 17, 2007 at 07:05:01AM +0000, Colin Paul Adams wrote:
: > I can't find this in the GNOME help, so I thought I'd try asking here.
: >
: > I want to rename a file so it has an a-umlaut (lower case) in the
: > name.
: >
: > My LANG is en_GB.UTF-8.
: >
: > I don't know how to type the accented character.
:
: One sure way is to copy-and-paste it from a file already containing
: the character. I keep around a copy of UnicodeData.txt with the
: literal UTF-8 character added to each line for exactly this purpose.

Here's a handy program to grep out names from the unicode database.
I call it "uni".

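#!/usr/bin/perl -C

binmode STDOUT, ":utf8";
# The pattern is everything on the command line, joined by spaces.
$pat = "@ARGV";
# If the argument starts with a non-ASCII character, search for that
# character's code point at the start of the line instead.
if (ord $pat > 256) {
    $pat = sprintf("%04x", ord $pat);
    print "That's $pat...\n";
    $pat = '^' . $pat;
}
elsif (ord $pat > 128) {    # arg in sneaky UTF-8
    $pat = sprintf("%04x", unpack("U0U",$pat));
    print "That's $pat...\n";
    $pat = '^' . $pat;
}

# Prefix each line of the name table with the character itself, then
# print the lines that match the pattern (case-insensitively).
@names = split /^/, do 'unicore/Name.pl';
for my $line (@names) {
    $hex = hex($line);
    $_ = chr($hex)."\t".$line;
    if (/$pat/io) {
        print;
    }
}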

For example, typing "uni ing face" produces:

☹ 2639 WHITE FROWNING FACE
☺ 263A WHITE SMILING FACE
☻ 263B BLACK SMILING FACE
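
For readers without that Perl setup, a rough Python counterpart can be
sketched with the standard unicodedata module. This is not a port of the
script above: it takes its names from whatever Unicode version the
Python build ships, and the limit parameter is an addition for the
example:

```python
import re
import sys
import unicodedata


def uni(pattern: str, limit: int = sys.maxunicode + 1):
    """Yield (char, hex code point, name) for entries matching pattern."""
    regex = re.compile(pattern, re.IGNORECASE)
    for cp in range(limit):
        ch = chr(cp)
        name = unicodedata.name(ch, "")  # "" for code points without a name
        # Match against "HEX<TAB>NAME", so a hex code can be searched too.
        if name and regex.search(f"{cp:04X}\t{name}"):
            yield ch, f"{cp:04X}", name


# Restrict the demo to the Basic Multilingual Plane to keep it quick.
for ch, code, name in uni("ing face", limit=0x10000):
    print(ch, code, name, sep="\t")
```

Dropping the limit argument searches all of Unicode, which on a current
Python will turn up many more faces than the three listed above.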

Larry

Jan Willem Stumpel

2007-03-26 18:29:16 UTC

Post by Larry Wall
Here's a handy program to grep out names from the unicode
database. I call it "uni". [..]

This is really neat. I didn't know that perl has such extensive
utf-8 support now (including the whole unicode database). I know
perl only from the "Llama book" (Learning Perl, 1997 edition)
which neither mentions unicode, nor explains the ord function.

BTW uni might be improved by making its input case-insensitive.

Regards, Jan

Daniel Glassey

2007-03-24 13:35:43 UTC

On 17 Mar 2007 07:05:01 +0000, Colin Paul Adams

Hope you have got it working already :) but just for info ...

Afair the quickest way to enter a character that you don't know how to
type is to use the gnome 'Character Map'. You will find a-umlaut
(latin small letter a with diaeresis) in the Latin block. Just copy
the character and paste it into the terminal.

If you are using accented characters a lot then there are better ways,
especially for European languages that use them (but I don't use them
often so I don't know).

Regards,
Daniel

William J Poser

2007-03-24 17:03:11 UTC

For entering non-ascii characters, I use three techniques:

(a) when the characters are part of a set used routinely, e.g.
the alphabet of French, install a keyboard map specifically
for that language (or, e.g., for ISO-8859-1, which includes it);

(b) at the other extreme, when the character is some random character
for which I have a one time need, use gucharmap, or, what is
often quicker, look it up in my copy of the Unicode Consortium
file NamesList.txt (http://unicode.org/Public/UNIDATA/NamesList.txt)
and enter the character via its hex code using any of several
methods depending on where I want to put it.

(c) for the intermediate case, of characters that I use with some
frequency but that aren't part of some language's writing
system or where it isn't convenient to switch to a separate
keyboard, I use a character entry utility of my own, available
at: http://billposer.org/Software/CharEntry.html
This works something like gucharmap, but instead of presenting
all of Unicode it provides clickable charts of selected sets of
characters: (a) the consonants of the International Phonetic
Alphabet; (b) the IPA vowels; (c) a large set of roman letters with
diacritics; and (d) a set of combining diacritics. There is also
a widget that accepts hex codes. You can also define custom
clickable character charts by reading a definition from a simple
text file (basically each line consists of the hex code and
the gloss to appear in the tool tip).

Bill

Simos Xenitellis

2007-03-25 16:47:58 UTC

Since you use GNOME, you can either enable a keyboard layout that has
those characters (such as US International),
http://ubuntuguide.org/wiki/Ubuntu_Edgy#How_to_type_extended_characters
or use compose sequences (no need to enable a special keyboard layout),
http://ubuntuguide.org/wiki/Ubuntu_Edgy#How_to_set_the_Compose_key_to_type_special_characters

Simos

Jan Willem Stumpel

2007-03-24 18:45:51 UTC

[..] I don't know how to type the accented character.[..]

The Compose key is very useful for occasional entry of non-ASCII
characters; see http://en.wikipedia.org/wiki/Compose_key for a
description. E.g. Compose=c becomes €, Compose"a becomes ä.
Hundreds of such "compose sequences" are available; the wikipedia
article links to a list.

To try, e.g., the right "windows" key as the Compose key, type
setxkbmap -option compose:rwin

You make this permanent by means of an entry in the keyboard
section of the /etc/X11/xorg.conf file:

Option "XkbOptions" "compose:rwin"

You can also choose some other key to be the Compose, like lwin,
menu, etc.

Many other methods are available for entering non-ASCII Unicode
(in all kinds of languages) on English keyboards, but I think the
Compose method is the simplest for occasional use.

Regards, Jan

Pádraig Brady

2007-03-26 08:46:20 UTC

Post by Daniel Glassey
Afair the quickest way to enter a character that you don't know how to
type is to use the gnome 'Character Map'. You will find a-umlaut
(latin small letter a with diaeresis) in the Latin block. Just copy
the character and paste it into the terminal.
If you are using accented characters a lot then there are better ways,
especially for European languages that use them (but I don't use them
often so I don't know).

You might find this info useful then:
http://www.pixelbeat.org/docs/xkeyboard/

Pádraig.