Discussion:
wcwidth and locale
SrinTuar
2007-04-09 16:26:51 UTC
Just a question:

Does anyone know of locales where ambiguous char-cell width
characters, such as ※☠☢☣☤ ♀♂★☆ are treated as double width rather than
single width?

It seems they are double width in most fonts, but on my systems, even
in East Asian locales, they still return widths of 1 (so I get funny
overlaps in my terminals).

I'm probably not running the very latest glibc yet, so I was wondering
if someone else knows or could test i
Egmont Koblinger
2007-04-10 10:36:28 UTC
Post by SrinTuar
It seems they are double width in most fonts, but on my systems, even
in East Asian locales, they still return widths of 1 (so I get funny
overlaps in my terminals).
Though I cannot answer your original question, I recently found that
glibc's wcwidth database suffers from problems. There are a lot of letters
or letter-like symbols that are unprintable according to glibc (wcwidth
returns -1, iswprint returns 0). For example U+0221 (latin small letter d
with curl) is the first such character. I think we should submit a bug report
for glibc...

I don't know whether the width info varies or should vary between different
utf-8 locales.
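
As a cross-check on the U+0221 example, here is a minimal Python sketch that asks Python's bundled Unicode database, rather than glibc, how it classifies the codepoint. The helper name `describe` is made up for illustration, and the sketch says nothing about what a particular libc's wcwidth or iswprint returns, since that depends on the libc's own locale data.

```python
import unicodedata

def describe(codepoint):
    """Return (name, general category) for a codepoint from Python's Unicode data."""
    ch = chr(codepoint)
    return unicodedata.name(ch, "<unnamed>"), unicodedata.category(ch)

name, category = describe(0x0221)
print(f"U+0221: {name}, category {category}")
```

A lowercase-letter category here supports the point that the character is an ordinary printable letter as far as Unicode is concerned.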
--
Egmont
Rich Felker
2007-04-10 16:17:31 UTC
Post by Egmont Koblinger
Though I cannot answer your original question, I recently found that
glibc's wcwidth database suffers from problems. There are a lot of letters
or letter-like symbols that are unprintable according to glibc (wcwidth
returns -1, iswprint returns 0). For example U+0221 (latin small letter d
with curl) is the first such character. I think we should submit a bug report
for glibc...
Indeed, glibc's character data is horribly outdated and incorrect.
There are plenty of unsupported nonspacing characters, even characters
that were present in Unicode 4.0. It also considers nonspacing letters
to be non-alphabetic, which is a real problem for users of languages
which utilize nonspacing letters.

As for wcwidth and iswprint, I recently changed my libc implementation
to consider all Unicode codepoints except illegal/noncharacter/control
codepoints as printable, with a wcwidth of 1 for the BMP and plane 1,
and a wcwidth of 2 for planes 2 and 3. While this is still imperfect
(it won't account for added characters with width 0, for example), it
at least makes it so users with outdated libc/locale data can use the
new characters they might need in a minimal sort of way. I would
recommend that the glibc maintainers do something similar.
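
The fallback policy described above can be sketched roughly as follows. This is only an illustration in Python, not the actual libc code: the function name `fallback_wcwidth` is hypothetical, the treatment of planes 4 and up (width 1) is an assumption the post does not spell out, and a real implementation would consult its known width tables first and use this only as the default.

```python
def fallback_wcwidth(cp):
    """Default width for a codepoint: -1 if non-printable, 0 for NUL, else 1 or 2."""
    if cp == 0:
        return 0                       # wcwidth of the null character is 0
    if cp < 0 or cp > 0x10FFFF:
        return -1                      # outside the Unicode codespace
    if 0xD800 <= cp <= 0xDFFF:
        return -1                      # surrogates are not legal scalar values
    if cp < 0x20 or 0x7F <= cp <= 0x9F:
        return -1                      # C0 controls, DEL, and C1 controls
    if 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFE) == 0xFFFE:
        return -1                      # noncharacters (U+FDD0..FDEF, U+xxFFFE/FFFF)
    if (cp >> 16) in (2, 3):
        return 2                       # planes 2 and 3: treated as wide
    return 1                           # BMP, plane 1, and (assumed) everything else

for cp in (0x41, 0x0221, 0x1D11E, 0x20000, 0x07, 0xFFFE):
    print(f"U+{cp:04X} -> {fallback_wcwidth(cp)}")
```

As noted in the post, a default like this cannot know about newly added zero-width characters; it only keeps unknown codepoints usable.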
Post by Egmont Koblinger
I don't know whether the width info varies or should vary between different
utf-8 locales.
The ambiguous characters are wide in CJK locales and narrow in others.
This is probably annoying for some CJK users since the characters
(such as Greek and Cyrillic) obviously should be narrow
typographically; they're wide only for the sake of old programs and
ascii-art type stuff which were designed for legacy charsets. IMO they
should be made narrow by default in all locales with a modifier like
"@wide" or something for the users who actually need them wide.
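
To make the "narrow by default, wide on request" idea concrete, here is a small Python sketch. It assumes the East_Asian_Width property from Python's `unicodedata` module as the data source; the function name `cell_width` and the `ambiguous_wide` flag (standing in for a locale modifier like "@wide") are hypothetical, and combining or control characters are not handled.

```python
import unicodedata

def cell_width(ch, ambiguous_wide=False):
    """Column width of one spacing character under a chosen ambiguous-width policy."""
    eaw = unicodedata.east_asian_width(ch)
    if eaw in ("W", "F"):
        return 2                            # Wide and Fullwidth: always two cells
    if eaw == "A":
        return 2 if ambiguous_wide else 1   # Ambiguous: depends on the policy
    return 1                                # Narrow, Halfwidth, Neutral

for ch in "aαя日":
    print(ch, cell_width(ch), cell_width(ch, ambiguous_wide=True))
```

With the flag off, Greek and Cyrillic letters take one cell; with it on, they take two, which is the legacy CJK behavior.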

~Rich
Abel Cheung
2007-04-16 16:11:12 UTC
Post by Rich Felker
Indeed, glibc's character data is horribly outdated and incorrect.
There are plenty of unsupported nonspacing characters, even characters
that were present in Unicode 4.0. It also considers nonspacing letters
to be non-alphabetic, which is a real problem for users of languages
which utilize nonspacing letters.
AFAIK Pablo Saratxaga has done something about it [1], though I
didn't intend to dig deeper and check what has been done.

[1] http://sourceware.org/bugzilla/show_bug.cgi?id=3885
Post by Rich Felker
The ambiguous characters are wide in CJK locales and narrow in others.
This is probably annoying for some CJK users since the characters
(such as Greek and Cyrillic) obviously should be narrow
typographically; they're wide only for the sake of old programs and
ascii-art type stuff which were designed for legacy charsets. IMO they
should be made narrow by default in all locales with a modifier like
It really depends on the intended audience of the fonts. The original
intention behind those double-width Greek and Cyrillic characters was to
make them align nicely with all the other CJK characters. But there is
no such thing as a wide Greek/Cyrillic character, or a wide version of
some other symbols, in Unicode, so font designers in Asia are forced
to make the glyphs wide and map them to the narrow codepoints, since they
must support legacy encodings for commercial or other reasons.
They do this because they have no choice (other than discarding those
glyphs, which would offend other users).

I'm also bitten by this issue -- PUA codepoints always have wcwidth=1,
which makes CJK fonts suck again because characters keep
overlapping each other. Yes, PUA usage should be avoided
whenever possible, but we will still see legacy systems in the
near future. Not to mention some characters would never have the
chance to enter Unicode.

Abel
Post by Rich Felker
~Rich
--
Linux-UTF8: i18n of Linux on all levels
Archive: http://mail.nl.linux.org/linux-utf8/
--
Abel Cheung (GPG Key: 0xC67186FF)
Key fingerprint: 671C C7AE EFB5 110C D6D1 41EE 4152 E1F1 C671 86FF
--------------------------------------------------------------------
* GNOME Hong Kong - http://www.gnome.hk/
* Opensource Application Knowledge Assoc. - http://oaka.org/
* My own cave: http://me.abelcheung.org/
Rich Felker
2007-04-16 17:07:15 UTC
Post by Abel Cheung
Post by Rich Felker
Indeed, glibc's character data is horribly outdated and incorrect.
There are plenty of unsupported nonspacing characters, even characters
that were present in Unicode 4.0. It also considers nonspacing letters
to be non-alphabetic, which is a real problem for users of languages
which utilize nonspacing letters.
AFAIK Pablo Saratxaga has done something about it [1], though I
didn't intend to dig deeper and check what has been done.
[1] http://sourceware.org/bugzilla/show_bug.cgi?id=3885
This works, but UGH, it's so disgusting. Someday people need to realize
that the POSIX charmap/localedef format is utterly broken for use with
Unicode and replace it with something reasonable that doesn't take 200
megs of core..
Post by Abel Cheung
It really depends on the intended audience of the fonts. The original
intention behind those double-width Greek and Cyrillic characters was to
make them align nicely with all the other CJK characters. But there is
no such thing as a wide Greek/Cyrillic character, or a wide version of
some other symbols, in Unicode, so font designers in Asia are forced
to make the glyphs wide and map them to the narrow codepoints, since they
must support legacy encodings for commercial or other reasons.
They do this because they have no choice (other than discarding those
glyphs, which would offend other users).
This is only an issue on character-cell devices which use wcwidth. For
GUI applications, the metrics of the font will govern layout and
alignment, so either can be used. I don't think it's such a big deal
to say these fonts with wide Greek, Cyrillic, etc. aren't suitable for
terminals. In fact they could be automatically used just by squeezing
the glyph horizontally and cropping off the excess spacing.
Post by Abel Cheung
I'm also bitten by this issue -- PUA codepoints always have wcwidth=1,
which makes CJK fonts suck again because characters keep
overlapping each other. Yes, PUA usage should be avoided
whenever possible, but we will still see legacy systems in the
near future.
Yes, PUA is very bad. I wouldn't be opposed to designating a certain
portion of the PUA as "wide", but I question whether using the PUA on
charcell devices is even needed.
Post by Abel Cheung
Not to mention some characters would never have the
chance to enter Unicode.
We can debate whether things like the Apple™® symbol are characters or
not all we like, but can you come up with things that should
legitimately be wide (i.e. ideographs) which have no chance to enter
Unicode?

Rich
Abel Cheung
2007-04-16 18:04:32 UTC
Post by Rich Felker
Post by Abel Cheung
It really depends on the intended audience of the fonts. The original
intention behind those double-width Greek and Cyrillic characters was to
make them align nicely with all the other CJK characters. But there is
no such thing as a wide Greek/Cyrillic character, or a wide version of
some other symbols, in Unicode, so font designers in Asia are forced
to make the glyphs wide and map them to the narrow codepoints, since they
must support legacy encodings for commercial or other reasons.
This is only an issue on character-cell devices which use wcwidth.
Those are exactly the apps I'm talking about, like terminals.
Post by Rich Felker
Yes, PUA is very bad. I wouldn't be opposed to designating a certain
portion of the PUA as "wide", but I question whether using the PUA on
charcell devices is even needed.
No need to question that: it is ALWAYS needed; it's just that its usage
is bad for quite a lot of people.

Think about languages that have not fully migrated to Unicode yet.
Without the PUA, people simply dump Unicode and use whatever works
for them. And think about backward compatibility.
Post by Rich Felker
Post by Abel Cheung
Not to mention some characters would never have the
chance to enter Unicode.
We can debate whether things like the Apple™® symbol are characters or
not all we like, but can you come up with things that should
legitimately be wide (i.e. ideographs) which have no chance to enter
Unicode?
Certainly there are, say some belonging to Taiwan CNS11643, which
are regarded as variations of existing characters in Unicode. And there
are other symbols and characters not accepted into Unicode, not
necessarily wide. Though I must admit usage of those would certainly
be quite rare.

Abel
Rich Felker
2007-04-16 21:13:07 UTC
Post by Abel Cheung
Post by Rich Felker
This is only an issue on character-cell devices which use wcwidth.
Those are exactly the apps I'm talking about, like terminals.
Given how utterly abysmal current terminals' Unicode support is, this
seems like a relatively minor issue. I don't want to disparage concern
about getting it right, but rather investigate where we're at now and
what needs to be done. Along those lines, I recently evaluated some
terminals with the following results:


Konsole and Xfce terminal: no support for nonspacing characters;
unsure about whether CJK wide characters are right.

Gnome Terminal: I assume it's the same since Xfce uses the same
widget. Please correct me if I'm mistaken since I didn't try it.

urxvt and xterm: CJK and nonspacing character widths are correct, but
rendering is minimal overstrike for nonspacing characters. No bidi or
complex script support. xterm default of only 1 combining character
per cell is horribly deficient for any language that doesn't just use
precomposed characters anyway.

aterm/rxvt/Eterm/etc.: unmaintained; no UTF-8 support at all.

mlterm: CJK and nonspacing character widths are correct, bidi is
available (not sure how well it works) with correct Arabic shaping,
and Indic reordering/shaping is available but as a special case (not
sure how well it works either). Also, cursor position becomes
nonsensical (font-dependent too) with Indic shaping, making
screen-mode (my terminology, as opposed to line-mode) apps difficult
to use.

uuterm (experimental; by me): CJK and nonspacing character widths are
correct. Shaping/ligatures are supported and sufficient for all
scripts afaik, but using a nonstandard font system (ucf). Bidi and
reordering (for Indic vowel marks on left) are not available.


So as of now, here is the status of support for particular languages
I'm aware of:


European-script langs using precomposed forms only: any terminal
except legacy stuff lacking UTF-8 support should be fine.

European-script languages with multiple decomposed accents: uuterm is
probably the only one that works.

Languages of India: mlterm and some old, unmaintained Indic-specific
terminals (pre-Unicode I think) are the only ones that work.

CJK, Thai, Lao: urxvt, xterm, mlterm, and uuterm all work. uuterm is
the only one that supports decomposed Korean (Hangul Jamo) though.

Tibetan: uuterm is the only terminal that works correctly, but a
minimal degree of legibility can be obtained with an ugly tailored
font that does not require shaping, so that urxvt, xterm, and mlterm
are usable.

Burmese: not supported by anything.

Arabic and Hebrew: mlterm and perhaps some rtl-specific terminal
emulators I'm not aware of..?

Mongolian: unknown; probably only mlterm and I'm unsure whether it
even works acceptably well.


One additional issue I have not tested is support for characters
outside the BMP. I know GNU screen totally lacks support for these,
and I suspect many terminal emulators have the same problem.


~Rich
Egmont Koblinger
2007-04-17 09:08:44 UTC
Post by Rich Felker
Konsole and Xfce terminal: no support for nonspacing characters;
unsure about whether CJK wide characters are right.
CJK is fine in them AFAIK.
Post by Rich Felker
Gnome Terminal: I assume it's the same since Xfce uses the same
widget. Please correct me if I'm mistaken since I didn't try it.
That's right. They both use a widget named "vte", which is the actual
terminal emulator. Both gnome-terminal and Xfce Terminal are just simple
GTK+ stuff (menus etc.) around it.
--
Egmont
Rich Felker
2007-04-17 14:58:10 UTC
Post by Egmont Koblinger
Post by Rich Felker
Konsole and Xfce terminal: no support for nonspacing characters;
unsure about whether CJK wide characters are right.
CJK is fine in them AFAIK.
What is the output of:
echo -e '日本語\b\bhello'

It should be: “日本hello” and not “日hello”. I’m not sure which it
does. Also try with explicit cursor positioning escapes.

(I’d appreciate it if you could try and report since I no longer have
them installed and forgot to check this while I did.)
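
For readers who want to see why “日本hello” is the expected result, here is a toy one-line terminal model in Python in which '\b' moves the cursor left by one column. It is only a sketch under stated assumptions: `render` and `char_width` are hypothetical names, widths come from Python's East_Asian_Width data, combining characters and escape sequences are ignored, and overwriting half of a wide character blanks the whole character (one of several behaviors a real terminal might choose).

```python
import unicodedata

def char_width(ch):
    """2 for East Asian Wide/Fullwidth characters, else 1 (no combining support)."""
    return 2 if unicodedata.east_asian_width(ch) in ("W", "F") else 1

def render(data, columns=20):
    """Feed `data` to a one-line toy terminal and return the visible text."""
    cells = [" "] * columns   # each cell: a character, " " (blank), or None
    cursor = 0                # None marks the right half of a wide character

    def erase(col):
        """Blank the cell at `col`, plus the other half if it held a wide char."""
        if cells[col] is None:                      # right half: blank the left too
            cells[col - 1] = " "
        elif col + 1 < columns and cells[col + 1] is None:
            cells[col + 1] = " "                    # left half: blank the right too
        cells[col] = " "

    for ch in data:
        if ch == "\b":
            cursor = max(0, cursor - 1)             # backspace: one column left
            continue
        width = char_width(ch)
        if cursor + width > columns:
            break                                   # toy model: no line wrapping
        for col in range(cursor, cursor + width):
            erase(col)
        cells[cursor] = ch
        if width == 2:
            cells[cursor + 1] = None
        cursor += width
    return "".join(c for c in cells if c is not None).rstrip()

print(render("日本語\b\bhello"))
```

The three ideographs fill columns 0 to 5, two backspaces put the cursor at column 4 (the start of 語), and "hello" then overwrites from there. A terminal that stepped back by whole characters instead would end up with “日hello”.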

Rich
Marcin 'Qrczak' Kowalczyk
2007-04-17 16:29:48 UTC
Post by Rich Felker
echo -e '日本語\b\bhello'
It should be: “日本hello” and not “日hello”.
“日本hello” on gnome-terminal, konsole, xterm, and urxvt.
--
__("< Marcin Kowalczyk
\__/ ***@knm.org.pl
^^ http://qrnik.knm.org.pl/~qrczak/
Abel Cheung
2007-04-22 16:16:29 UTC
Post by Rich Felker
echo -e '日本語\b\bhello'
Wait. Quick question: how much should '\b' backstep when wide characters are
encountered?

- a whole wide character?
- a single byte?
- half of a wide character?

Which is considered 'correct'?

Abel
Egmont Koblinger
2007-04-23 07:34:19 UTC
Post by Abel Cheung
Wait. Quick question: how much should '\b' backstep when wide characters are
encountered?
It should move the cursor one single-width cell to the left. It has already
been discussed on this list, see here:
http://mail.nl.linux.org/linux-utf8/2005-03/msg00021.html
Post by Abel Cheung
Archive: http://mail.nl.linux.org/linux-utf8/
P.S. Does anyone know what's wrong with the main page of the archive? It
seems that the HTML file starts "in medias res" with a link to Oct 2001,
without any <html> or similar stuff. Pretty weird for a (most likely)
auto-generated file :)
--
Egmont
Roger Leigh
2007-04-23 18:17:44 UTC
Post by Egmont Koblinger
Post by Abel Cheung
Wait. Quick question: how much should '\b' backstep when wide characters are
encountered?
It should move the cursor one single-width cell to the left. It has already
http://mail.nl.linux.org/linux-utf8/2005-03/msg00021.html
Surely this is terminal-specific? For example, if I have an ECMA-48
terminal and I am moving in the data component instead of the
presentation component, I can move one complete character instead of
one unit cell in the presentation component.


Regards,
Roger
--
.''`. Roger Leigh
: :' : Debian GNU/Linux http://people.debian.org/~rleigh/
`. `' Printing on GNU/Linux? http://gutenprint.sourceforge.net/
`- GPG Public Key: 0x25BFB848 Please GPG sign your mail.
Rich Felker
2007-04-24 15:37:55 UTC
Post by Abel Cheung
Post by Rich Felker
echo -e '日本語\b\bhello'
Wait. Quick question: how much should '\b' backstep when wide characters are
encountered?
- a whole wide character?
- a single byte?
- half of a wide character?
One byte is obviously nonsense since the screen contents are not bytes
but characters. Between the other two options, there's always a
tradeoff: if you want to move by character positions and \b works in
columns or vice versa, then you need to know the width (wcwidth) of
the character you're moving over. However..
Post by Abel Cheung
Which is considered 'correct'?
Columns is considered the correct behavior. Otherwise it would be
impossible to position the cursor to a particular visual location
without already knowing the contents of the screen, which a program
might not even know. On the other hand, if you're moving by
characters, then presumably the program knows what the characters on
the screen are, so it can compute widths.
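
A program that wants to back up by characters on a column-based terminal can do the conversion itself, roughly as in the Python sketch below. The names `char_width` and `backspaces_for` are hypothetical, "character" here means one codepoint, and the width rule (0 for combining marks, 2 for East Asian Wide/Fullwidth, otherwise 1) is a simplification of what a real wcwidth table would provide.

```python
import unicodedata

def char_width(ch):
    """Columns occupied by `ch`: 0 for combining marks, 2 for wide, else 1."""
    if unicodedata.combining(ch) or unicodedata.category(ch) in ("Mn", "Me"):
        return 0
    if unicodedata.east_asian_width(ch) in ("W", "F"):
        return 2
    return 1

def backspaces_for(text, n_chars):
    """Return the backspace string that moves back over the last n_chars characters."""
    tail = text[-n_chars:] if n_chars > 0 else ""
    return "\b" * sum(char_width(ch) for ch in tail)

# Moving back over one ideograph takes two column-wise backspaces.
print(len(backspaces_for("日本語", 1)))
```

The reverse direction, placing the cursor at a given column without knowing the screen contents, needs no such table, which is the argument for column-based movement.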

Some terminals (Apple's Terminal.app, I believe) allow you to select
the behavior. This has the benefit of allowing programs which are not
aware of wcwidth to function somewhat usably with wide and/or
nonspacing characters, but at the expense of trashing the column
alignment and visual layout of correct programs. It will also likely
cause serious problems if used with GNU screen, which is width-aware.

One slightly problematic issue is what happens if you position the
cursor 'in the middle' of a double width character and then overwrite
the second column of it. In general the results could be anything
bogus, but good terminals will either erase the character or just
leave half of it there.

uuterm does not yet handle this case, and by chance it will end up
looking for a double-width glyph for the newly written character
(which might exist, depending on the font). This behavior of course
should not be relied upon...

Rich

Rich Felker
2007-04-16 21:14:04 UTC
Post by Abel Cheung
Post by Rich Felker
not all we like, but can you come up with things that should
legitimately be wide (i.e. ideographs) which have no chance to enter
Unicode?
Certainly there are, say some belonging to Taiwan CNS11643, which
are regarded as variations of existing characters in Unicode. And there
If they're needed for round trip compatibility with a legacy charset,
it should be possible to encode them in one of the CJK compatibility
sections. Are there still characters missing?
Post by Abel Cheung
are other symbols and characters not accepted into Unicode, not
necessarily wide. Though I must admit usage of those would certainly
be quite rare.
If they're not wide then the default wcwidth of 1 is ok, no?

~Rich
Rich Felker
2007-04-10 13:51:30 UTC
Post by SrinTuar
Does anyone know of locales where ambiguous char-cell width
characters, such as ※☠☢☣☤ ♀♂★☆ are treated as double width rather than
single width?
Ambiguous width from a Unicode perspective means just that the
characters did not exist in legacy CJK encodings, or that they were
wide in legacy CJK encodings but narrow in others (and should be
narrow), such as Greek.
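
One way to see which characters Unicode actually marks as ambiguous is to look up their East_Asian_Width property. The Python sketch below does that with the standard `unicodedata` module; the helper name `width_classes` is made up for illustration, and the results depend on the Unicode version bundled with the Python in use. Some of the symbols from the question may come back as neutral ("N") rather than ambiguous ("A").

```python
import unicodedata

def width_classes(text):
    """Map each non-space character in text to its East_Asian_Width property value."""
    return {ch: unicodedata.east_asian_width(ch) for ch in text if not ch.isspace()}

for ch, eaw in width_classes("※☠☢☣☤ ♀♂★☆ αя 日 a").items():
    print(f"U+{ord(ch):04X} {ch} {eaw}")
```

Values of "W" or "F" are always wide, "A" is the ambiguous class discussed here, and "N", "Na", and "H" are narrow.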
Post by SrinTuar
It seems they are double width in most fonts, but on my systems, even
in East Asian locales, they still return widths of 1 (so I get funny
overlaps in my terminals).
I think this is a problem with the fonts. There’s no reason a
character like ♀ should be double-width. A few of the examples you
gave are hard to make look nice at 8x16 and could benefit from a
double-width cell, but all of them are legible and distinguishable at
8x16. If you’re using a smaller font size you shouldn’t expect
non-Latin characters to be particularly legible.

At times I’ve thought it would be beneficial to update and standardize
the wcwidth table to make certain characters wide, such as the em
dash and various letters in certain Indic and other scripts which
cannot adequately be represented in a single cell due to their
proportions and level of detail. But I’m not entirely sure how this
should be done, and even if it were done, I don’t think dingbats are
appropriate candidates.

~Rich