Indic scripts and wcwidth: comments?

Discussion:

Rich Felker

2006-08-18 04:41:59 UTC

Working on my character cell font/terminal problem, I've been doing
some research on Devanagari and other Indic scripts and the way they
handle consonant clusters. Unlike Tibetan which naturally fits
Unicode's combining character semantics and POSIX's wcwidth(), Indic
scripts are unfortunately very unfriendly to character cell devices,
at least in the existing width interpretation, which I will call
"WI1" for "Width Interpretation #1".

The obvious small problem is the left-combining "i" vowel mark, which
(as I was not previously aware) combines at the left of the whole
cluster, not just the preceding character. This can be handled in WI1,
but requires substitution rules which can span arbitrary numbers of
cells, making update awkward. Also my understanding is that the
superscript stroke of the "i" is supposed to span the whole syllable,
making things more tricky for these arbitrary-width clusters.

A more fundamental problem, which may even be seen as a cause of the
former problem, is the combining/ligature nature of clusters. Many
clusters such as "kka" or anything involving "ra" _should_ occupy
fewer columns than the number of letters in the cluster. With 'dead'
"ra" characters in the cluster it is particularly bad for them to
occupy a column of their own since they actually become combining
marks on another glyph cell (which may be of arbitrary distance from
the "ra" character).

Under WI1, each consonant has a wcwidth of 1. Thus, the only way to
handle character-cell Indic scripts under WI1 is to have the dead "ra"
turn into a blank space, which will look very odd.
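As a rough illustration of the WI1 behaviour described above, the following Python sketch models a wcwidth()-style rule in which nonspacing marks (including the virama) take 0 columns and every other character takes 1. The names wi1_width and wi1_columns are hypothetical, and the rule is an assumed simplification, not the wcwidth() of any particular libc.

```python
import unicodedata

KA, RA, VIRAMA = "\u0915", "\u0930", "\u094D"

def wi1_width(ch):
    """Assumed WI1-style width: nonspacing/enclosing/format chars 0, all else 1."""
    return 0 if unicodedata.category(ch) in ("Mn", "Me", "Cf") else 1

def wi1_columns(text):
    """Total columns a string occupies under the assumed WI1 rule."""
    return sum(wi1_width(ch) for ch in text)

# "rka": dead RA + KA. Typeset, the RA becomes a mark on the KA glyph,
# yet this model still reserves a column of its own for the dead RA.
print(wi1_columns(RA + VIRAMA + KA))  # 2
```

Under this model the cluster is two columns wide even though a properly shaped rendering needs only one base glyph, which is where the blank cell comes from.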

What I'd like to propose is a new width interpretation for Indic
scripts, "WI2". Under WI2:

- All independent vowels and consonants have a width of 2.
- The virama has a width of -2 and makes the previous character part
  of a combining stack with the character that follows.
- Each double-width character cell contains an entire consonant
  cluster, not just a single glyph. Much like Hangul Jamo.
- All dependent vowel marks are simple width-0 nonspacing characters
  which apply to the whole cluster. Also like Hangul Jamo. Note that
  this includes the left-combining "i" vowel which just appears at the
  left of the character cell and whose superscript stroke is of fixed
  width.
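The WI2 rules listed above can be sketched as a per-character width function. This is only a minimal model for the Devanagari block, using Unicode general categories to tell letters from marks; the names wi2_width and wi2_columns are hypothetical, and the width of 1 for digits and punctuation is an assumption the proposal does not address.

```python
import unicodedata

KA, VIRAMA, VOWEL_I = "\u0915", "\u094D", "\u093F"

def wi2_width(ch):
    """Sketch of the proposed WI2 width for one Devanagari character."""
    if not ("\u0900" <= ch <= "\u097F"):
        raise ValueError("this sketch covers the Devanagari block only")
    if ch == VIRAMA:
        return -2                 # cancels the width of the preceding letter
    cat = unicodedata.category(ch)
    if cat == "Lo":
        return 2                  # independent vowels and consonants
    if cat in ("Mn", "Mc"):
        return 0                  # dependent vowel signs and other marks
    return 1                      # digits, danda, etc. (assumption)

def wi2_columns(text):
    """Sum of per-character WI2 widths."""
    return sum(wi2_width(ch) for ch in text)

print(wi2_columns(KA + VIRAMA + KA))  # "kka" cluster: 2
print(wi2_columns(KA + VOWEL_I))      # "ki": 2
print(wi2_columns(KA + VIRAMA))       # isolated dead letter: 0
```

The last line shows a weak point of a purely additive scheme: a dead consonant with nothing following it sums to zero columns.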

According to casual examination of many Devanagari clusters, they
appear to fit nicely into double-width cells, with complexity/density
similar to CJK glyphs. Simple fonts without complex ligatures could
just squeeze the individual nominal glyphs into the cell and render
them with overstrike (using some simple context rules for the
positioning) while nice fonts would use either dedicated ligature
glyphs or a mixture of dedicated ligatures with overstriking glyphs
that result in the correct ligatures.

Please note that all of the above applies only to character cell
displays and wcwidth. Naturally it should be ignored and existing
systems used for elegant Indic script layout in variable-width fonts,
but I believe that the system WI2 (or a variant on it) provides much
more reasonable, workable Indic script support than WI1.

If anyone could provide comments on the following issues, I would much
appreciate it:

1. Does any existing character cell application (terminal emulator)
both display correctly-rendered Indic text and conform to WI1, i.e.
does it update column position according to wcwidth() and not the
OpenType-rendered width of the text string? I suspect not. RTFS'ing
mlterm it seems like it does not. I can't find any good info on
ncst-term.

2. Are there serious limitations of WI2 that make it impossible to
display [legibly] certain consonant clusters? Can the ZWJ/ZWNJ
semantics be satisfied correctly?

3. Other comments?

Rich

rajeev joseph sebastian

2006-08-18 10:39:17 UTC

Hello Rich Felker,

---- start quote ----
1. Does any existing character cell application (terminal emulator)
both display correctly-rendered Indic text and conform to WI1, i.e.
does it update column position according to wcwidth() and not the
OpenType-rendered width of the text string? I suspect not. RTFS'ing
mlterm it seems like it does not. I can't find any good info on
ncst-term.

2. Are there serious limitations of WI2 that make it impossible to
display [legibly] certain consonant clusters? Can the ZWJ/ZWNJ
semantics be satisfied correctly?

3. Other comments?
---- end quote ----

I have a question on this. By "single width" and "double width", do you mean a global width constant, or a width that can be specified by the font?

Either way, Indic texts on a console would look really bad and be practically unusable if glyphs had to be put into a specified width: there would be too much spacing. Indic texts by their nature are most suited to variable-widths.

Regards,
Rajeev J Sebastian

Rich Felker

2006-08-18 14:51:51 UTC

Post by rajeev joseph sebastian
Hello Rich Felker,
---- start quote ----
1. Does any existing character cell application (terminal emulator)
both display correctly-rendered Indic text and conform to WI1, i.e.
does it update column position according to wcwidth() and not the
OpenType-rendered width of the text string? I suspect not. RTFS'ing
mlterm it seems like it does not. I can't find any good info on
ncst-term.
2. Are there serious limitations of WI2 that make it impossible to
display [legibly] certain consonant clusters? Can the ZWJ/ZWNJ
semantics be satisfied correctly?
3. Other comments?
---- end quote ----
I have a question on this. By "single width" and "double width", do
you mean a global width constant, or a width that can be specified by
the font?

Width specified by font is simply not possible, regardless of how nice
it would look or how bad the alternatives would look. The most complex
program that will work correctly with such a system is "cat". Anything
more complex, be it a tabular message list in mutt, the text you're
editing in a text editor, or a single-line input field, will corrupt
the display horribly as soon as the presentation width disagrees with
the logical wcwidth() width. As bad as too much or too little spacing
looks, having the whole terminal corrupt and leave 'droppings' all
over the place when you move the cursor looks much worse...
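To make the corruption concrete, here is a small sketch of how an application pads a table field using its logical width model, and how far the next field lands from its intended column if the terminal draws the text in a different number of columns. All names are hypothetical, and logical_width is an assumed wcwidth()-like rule (nonspacing marks 0, everything else 1), not a real libc call.

```python
import unicodedata

KA, RA, VIRAMA = "\u0915", "\u0930", "\u094D"

def logical_width(ch):
    """Assumed application-side width model, in the style of wcwidth()."""
    return 0 if unicodedata.category(ch) in ("Mn", "Me", "Cf") else 1

def pad_to(text, cols):
    """Pad text with spaces so it fills `cols` columns per the logical model."""
    used = sum(logical_width(ch) for ch in text)
    return text + " " * max(cols - used, 0)

def drift(text, presented_cols, field_cols):
    """Offset of the next field from its intended column, assuming the
    terminal draws `text` in `presented_cols` columns and text fits the field."""
    spaces = len(pad_to(text, field_cols)) - len(text)
    return presented_cols + spaces - field_cols

rka = RA + VIRAMA + KA
print(drift(rka, 2, 10))  # terminal agrees with the model: 0
print(drift(rka, 1, 10))  # font-driven single-cell ligature: -1
```

Every such field on a line adds its own offset, so the application's idea of the cursor position diverges further with each cluster.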

There is the possibility within POSIX to use the wcswidth function
instead of wcwidth, which in theory could accommodate
context-sensitive widths. Whether this is considered conformant I
don't know, but I do know that presently few apps support this and
that most apps would require significant rewrites to do so and major
additional complexity.
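A string-level width function in the spirit of wcswidth() could count whole clusters instead of summing per-character widths. The sketch below is one hypothetical way to do that for text consisting only of Devanagari letters and marks: a letter starts a new cluster unless it directly follows a virama, and each cluster is given a fixed number of cells. The names cluster_count and indic_swidth are illustrative, and ZWJ/ZWNJ handling is deliberately left out.

```python
import unicodedata

KA, VIRAMA, VOWEL_I = "\u0915", "\u094D", "\u093F"

def cluster_count(text):
    """Count consonant clusters: a letter (category Lo) begins a new
    cluster unless the previous character was a virama."""
    clusters = 0
    prev = ""
    for ch in text:
        if unicodedata.category(ch) == "Lo" and prev != VIRAMA:
            clusters += 1
        prev = ch
    return clusters

def indic_swidth(text, cell=2):
    """Context-sensitive string width: `cell` columns per cluster."""
    return cell * cluster_count(text)

print(indic_swidth(KA + VIRAMA + KA))   # one cluster: 2
print(indic_swidth(KA + VIRAMA))        # isolated dead letter: 2
print(indic_swidth(KA + VOWEL_I + KA))  # two clusters: 4
```

Because no character ever contributes a negative width here, an isolated dead letter still gets a full cell, but the result is no longer the sum of independent per-character values, which is what existing wcwidth()-based applications assume.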

My proposed WI2 was to treat consonant clusters, rather than
individual consonants, as the element with a fixed width and assign
them the width of 2 (same as CJK ideographs and Hangul Jamo, the
latter of which seems to be the well-handled script with the most in
common with Indic consonant clusters). I'm fairly ignorant about nice
Indic typesetting, but my casual observations found all the common
clusters I could find fitting reasonably into a double-width cell. On
the other hand I'm worried that the "-2 width" for the virama would
confuse applications hopelessly, and that isolated dead letters would
have the wrong width.

Since you seem to be familiar with the matter, perhaps you could
comment on whether displaying text in fixed one-cell-per-character
form without width-altering ligatures is considered acceptable. My
impression is that it would be mostly acceptable in Devanagari except
for the behavior of "ra", but might be significantly worse in other
scripts (Kannada?) which seem to make more use of vertical combining.

Post by rajeev joseph sebastian
Either way, Indic texts on a console would look really bad and be
practically unusable if glyphs had to be put into a specified width:
there would be too much spacing. Indic texts by their nature are
most suited to variable-widths.

As far as I can tell they're presently unusable. I'm just trying to
find a way to make them usable and hopefully not make them ugly in the
process. If there are any working implementations already (in your
opinion) I'd be happy to hear about how they work.

Rich

Werner LEMBERG

2006-08-18 12:29:34 UTC

Post by Rich Felker
What I'd like to propose is a new width interpretation for Indic
scripts, "WI2". [...]

Since I have no idea about Indic scripts, I won't and can't give a
comment. I just want to note that Emacs supports Devanagari with
single, double, and triple width glyphs (IIRC); you may have a look
how they've done it -- from a technical point, not from an encoding
point.

Werner

Andries Brouwer

2006-08-18 18:10:14 UTC

Post by Werner LEMBERG
Since I have no idea about Indic scripts, I won't and can't give a
comment. I just want to note that Emacs supports Devanagari with
single, double, and triple width glyphs (IIRC); you may have a look
how they've done it -- from a technical point, not from an encoding
point.

There is a bug (that I have not investigated) in the use of
"emacs -nw" on a uxterm. When symbols occur on the line
of which no glyph is available, then emacs and uxterm have different
ideas about the width of displayed strings, and corruption results.

Andries

Werner LEMBERG

2006-08-18 19:38:10 UTC

Post by Werner LEMBERG
Since I have no idea about Indic scripts, I won't and can't give a
comment. I just want to note that Emacs supports Devanagari with
single, double, and triple width glyphs (IIRC); you may have a
look how they've done it -- from a technical point, not from an
encoding point.

There is a bug (that I have not investigated) in the use of "emacs
-nw" on a uxterm.

Please report this to HANDA Ken'ichi or to the emacs-devel list in
case it happens with a snapshot of the current CVS.

Actually, I'm surprised that Devanagari works at all with `emacs -nw'
-- I basically meant to look at the Emacs implementation within an X
Windows frame. Sorry for the confusion.

Werner

Rich Felker

2006-08-18 22:32:55 UTC

Post by Andries Brouwer

This is exactly why it's a bug for the terminal emulator to use the
font's idea of glyph width at all. The only correct implementation
is for the terminal emulator to use wcwidth() and demand that the font
match it (and in fact to render individual glyphs in cells rather than
use string-rendering functions).
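The cell-by-cell approach described above might look roughly like the following sketch, in which column positions come only from a width function and never from the font. The names cell_width and layout are hypothetical, cell_width is an assumed stand-in for wcwidth() (nonspacing marks 0, everything else 1), and giving a mark with no base a column of its own is likewise an assumption.

```python
import unicodedata

KA, VIRAMA = "\u0915", "\u094D"

def cell_width(ch):
    """Assumed stand-in for wcwidth(): nonspacing marks 0, all else 1."""
    return 0 if unicodedata.category(ch) in ("Mn", "Me", "Cf") else 1

def layout(text):
    """Return (column, cell_contents) pairs. Zero-width characters are
    attached to the preceding cell; the font never affects the columns."""
    cells = []
    col = 0
    for ch in text:
        w = cell_width(ch)
        if w == 0 and cells:
            start, contents = cells[-1]
            cells[-1] = (start, contents + ch)
        else:
            cells.append((col, ch))
            col += max(w, 1)  # a mark with no base gets its own column
    return cells

print(layout("a\u0301b"))          # [(0, 'á'), (1, 'b')]
print(layout(KA + VIRAMA + KA))   # dead KA in column 0, KA in column 1
```

Each cell's contents would then be drawn at its own column, so a glyph that the font would like to make wider or narrower cannot shift anything after it.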

But yes, Werner was talking about Emacs GUI, not -nw. I doubt the GUI
is really relevant to character-cell stuff though unless they found a
way to coerce Indic scripts into character cells well...

Rich