Rich Felker
2006-08-18 04:41:59 UTC
Working on my character cell font/terminal problem, I've been doing
some research on Devanagari and other Indic scripts and the way they
handle consonant clusters. Unlike Tibetan, which naturally fits
Unicode's combining character semantics and POSIX's wcwidth(), Indic
scripts are unfortunately very unfriendly to character cell devices,
at least in the existing width interpretation, which I will call
"WI1" for "Width Interpretation #1".
The obvious small problem is the left-combining "i" vowel mark, which
(I was not previously aware of this) combines at the left of the whole
cluster, not just the preceding character. This can be handled in WI1,
but requires substitution rules which can span arbitrary numbers of
cells, making update awkward. Also my understanding is that the
superscript stroke of the "i" is supposed to span the whole syllable,
making things more tricky for these arbitrary-width clusters.
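To make the "arbitrary numbers of cells" point concrete, here is a rough sketch in Python of finding where a left-combining "i" would have to be drawn. The helper names (is_consonant, cluster_start) are made up for illustration, only the basic Devanagari consonant range U+0915..U+0939 is recognised, and nukta and ZWJ/ZWNJ are ignored, so treat it as a simplification rather than a real shaping rule:

```python
VIRAMA = "\u094D"        # DEVANAGARI SIGN VIRAMA
VOWEL_SIGN_I = "\u093F"  # DEVANAGARI VOWEL SIGN I (the left-combining "i")

def is_consonant(ch):
    # Assumption: only the basic consonant range, no nukta forms.
    return 0x0915 <= ord(ch) <= 0x0939

def cluster_start(text, pos):
    """Index of the first consonant of the cluster that the mark at
    text[pos] attaches to, or pos itself if no consonant precedes it."""
    i = pos - 1
    if i < 0 or not is_consonant(text[i]):
        return pos
    # Step left over (consonant, virama) pairs joined to the cluster.
    while i >= 2 and text[i - 1] == VIRAMA and is_consonant(text[i - 2]):
        i -= 2
    return i

# "stri": sa + virama + ta + virama + ra + vowel sign i
stri = "\u0938\u094D\u0924\u094D\u0930" + VOWEL_SIGN_I
start = cluster_start(stri, len(stri) - 1)
cells = sum(1 for ch in stri[start:] if is_consonant(ch))
print(start, cells)
```

With one cell per consonant, the "i" in this example has to move left across three cells, and the distance grows with the length of the cluster.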
A more fundamental problem, which may even be seen as a cause of the
former problem, is the combining/ligature nature of clusters. Many
clusters such as "kka" or anything involving "ra" _should_ occupy
fewer columns than the number of letters in the cluster. With 'dead'
"ra" characters in the cluster it is particularly bad for them to
occupy a column of their own since they actually become combining
marks on another glyph cell (which may be of arbitrary distance from
the "ra" character).
Under WI1, each consonant has a wcwidth of 1. Thus, the only way to
handle character-cell Indic scripts under WI1 is to have the dead "ra"
turn into a blank space, which will look very odd.
What I'd like to propose is a new width interpretation for Indic
scripts, "WI2". Under WI2:
- All independent vowels and consonants have a width of 2.
- The virama has a width of -2 and makes the previous character part
of a combining stack with the character that follows.
- Each double-width character cell contains an entire consonant
cluster, not just a single glyph. Much like Hangul Jamo.
- All dependent vowel marks are simple width-0 nonspacing characters
which apply to the whole cluster. Also like Hangul Jamo. Note that
this includes the left-combining "i" vowel which just appears at the
left of the character cell and whose superscript stroke is of fixed
width.
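As a rough illustration of the rules above, the column arithmetic could be sketched like this in Python. The function names (wi2_width, wi2_columns) are hypothetical, the code point ranges are my reading of the Devanagari block, and the fallback width of 1 for everything else is an assumption not covered by the proposal:

```python
VIRAMA = "\u094D"  # DEVANAGARI SIGN VIRAMA

def is_base_letter(ch):
    # Independent vowels and consonants, U+0904..U+0939.
    return 0x0904 <= ord(ch) <= 0x0939

def is_dependent_vowel(ch):
    # Dependent vowel signs, U+093E..U+094C (includes the "i" at U+093F).
    return 0x093E <= ord(ch) <= 0x094C

def wi2_width(ch):
    """Per-character width under the proposed WI2 rules."""
    if ch == VIRAMA:
        return -2   # cancels the cell of the preceding letter
    if is_base_letter(ch):
        return 2    # double-width cell
    if is_dependent_vowel(ch):
        return 0    # nonspacing, applies to the whole cluster
    return 1        # assumption: anything else is single-width

def wi2_columns(text):
    """Total columns occupied by text under WI2."""
    return sum(wi2_width(ch) for ch in text)

kka = "\u0915\u094D\u0915"               # ka + virama + ka
stra = "\u0938\u094D\u0924\u094D\u0930"  # sa + virama + ta + virama + ra
ki = "\u0915\u093F"                      # ka + vowel sign i
print(wi2_columns(kka), wi2_columns(stra), wi2_columns(ki))
```

Each of these clusters comes out as 2 columns, i.e. one double-width cell, whereas one column per consonant would give "stra" three. Read literally, the rules also give a cluster that ends in a virama a net width of 0, a case the list above does not address.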
From a casual examination of many Devanagari clusters, they
appear to fit nicely into double-width cells, with complexity/density
similar to CJK glyphs. Simple fonts without complex ligatures could
just squeeze the individual nominal glyphs into the cell and render
them with overstrike (using some simple context rules for the
positioning) while nice fonts would use either dedicated ligature
glyphs or a mixture of dedicated ligatures with overstriking glyphs
that result in the correct ligatures.
Please note that all of the above applies only to character cell
displays and wcwidth. Naturally, variable-width fonts should ignore it
and keep using existing systems for elegant Indic script layout, but I
believe that WI2 (or a variant of it) provides much more reasonable,
workable Indic script support on character cell displays than WI1.
If anyone could provide comments on the following issues, I would much
appreciate it:
1. Does any existing character cell application (terminal emulator)
both display correctly-rendered Indic text and conform to WI1, i.e.
does it update column position according to wcwidth() and not the
OpenType-rendered width of the text string? I suspect not. From reading
the mlterm source, it seems that it does not. I can't find any good info on
ncst-term.
2. Are there serious limitations of WI2 that make it impossible to
display [legibly] certain consonant clusters? Can the ZWJ/ZWNJ
semantics be satisfied correctly?
3. Other comments?
Rich