Proposed fix for Malayalam (& other Indic?) chars and wcwidth

Discussion:

Rich Felker

2006-10-14 04:22:31 UTC

Working on uuterm[1], I've run into a problem with the characters
U+0D4A-U+0D4C and possibly others like them, with regard to wcwidth(3)
behavior. These characters are combining marks that attach on both
sides of a cluster, and have canonical equivalence to the two separate
pieces from which they are built, yet Markus' wcwidth
implementation and GNU libc assign them a width of 1. It appears very
obvious to me that there's no hope of rendering both of these parts
using only 1 character cell on a character cell device, and even if it
were possible, it also seems horribly wrong for canonically equivalent
strings to have different widths.

I propose amending the wcwidth definitions to assign these characters
(and any like them) a width of 2. Furthermore, I would suggest that
any characters with canonical decompositions be assigned a width that
is the sum of the widths of the decomposition into NFD. This would
avoid similar unfortunate situations in the future that might not yet
have been found. It may also be desirable to do this for compatibility
decompositions (like "dz", etc.); however I suspect it's unlikely that
anyone would use such characters in non-legacy data anyway.
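
The rule proposed above (width of a character = sum of the widths of its NFD decomposition) can be sketched roughly as follows. This is only an illustration: base_width is a hypothetical and deliberately crude stand-in for a real wcwidth table, not Markus' or glibc's data, and nfd_width is a name invented for this sketch.

```python
import unicodedata

def base_width(ch):
    """Hypothetical per-character width, standing in for a wcwidth table."""
    if unicodedata.category(ch) in ("Mn", "Me"):
        return 0  # nonspacing and enclosing marks occupy no cell
    if unicodedata.east_asian_width(ch) in ("W", "F"):
        return 2  # wide and fullwidth characters occupy two cells
    return 1

def nfd_width(text):
    """Sum base_width over the canonical decomposition (NFD) of text."""
    return sum(base_width(ch) for ch in unicodedata.normalize("NFD", text))

precomposed = "\u0d30\u0d4a"       # RA + VOWEL SIGN O
decomposed = "\u0d30\u0d46\u0d3e"  # RA + VOWEL SIGN E + VOWEL SIGN AA
print(nfd_width("\u0d4a"), nfd_width(precomposed), nfd_width(decomposed))
```

Because both spellings normalize to the same NFD string, they necessarily get the same width under this rule, whatever table base_width is replaced with.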

BTW I don't think there's any harm here in breaking compatibility with
existing practice, since obviously no one is using the results of
wcwidth on these characters or they would already have run into this
problem.

Rich

[1] http://svn.mplayerhq.hu/uuterm/

Bruno Haible

2006-10-16 16:13:58 UTC

Hello Rich,

Post by Rich Felker
These characters are combining marks that attach on both
sides of a cluster, and have canonical equivalence to the two separate
pieces from which they are built, yet Markus' wcwidth
implementation and GNU libc assign them a width of 1. It appears very
obvious to me that there's no hope of rendering both of these parts
using only 1 character cell on a character cell device, and even if it
were possible, it also seems horribly wrong for canonically equivalent
strings to have different widths.

What rendering do other terminal emulators produce for these characters,
especially the ones from GNOME, KDE, Apple, and mlterm? I cannot submit
a patch to glibc based on the data of just 1 terminal emulator.

Bruno

Ben Wiley Sittler

2006-10-17 00:38:45 UTC

just tried this in a few terminals, here are the results:

GNOME Terminal 2.16.1:
U+0D30 U+0D4A displayed with width 3
U+0D30 U+0D46 U+0D3E displayed with width 3
NOTE: displays very differently in each case

Konsole 1.6.5:
U+0D30 U+0D4A displayed with width 3
U+0D30 U+0D46 U+0D3E displayed with width 4
NOTE: displays very differently in each case

mlterm 2.9.3:
U+0D30 U+0D4A displayed with width 2
U+0D30 U+0D46 U+0D3E displayed with width 2
NOTE: displays identically in each case

--
Linux-UTF8: i18n of Linux on all levels
Archive: http://mail.nl.linux.org/linux-utf8/

Rich Felker

2006-10-17 01:40:00 UTC

Sorry, I originally replied off-list to Bruno because the list mail was
slow coming through and I thought he was just mailing me in private.

Post by Ben Wiley Sittler
GNOME Terminal 2.16.1:
U+0D30 U+0D4A displayed with width 3
U+0D30 U+0D46 U+0D3E displayed with width 3
NOTE: displays very differently in each case
Konsole 1.6.5:
U+0D30 U+0D4A displayed with width 3
U+0D30 U+0D46 U+0D3E displayed with width 4
NOTE: displays very differently in each case
mlterm 2.9.3:
U+0D30 U+0D4A displayed with width 2
U+0D30 U+0D46 U+0D3E displayed with width 2
NOTE: displays identically in each case

As we can see, _none_ of these agrees with the current wcwidth
implementation. In fact I'm pretty sure they all ignore wcwidth and
use their own (possibly font-specific) interpretation of width, which
fundamentally dooms the terminal from being able to be used for
anything with columns or cursor positioning.

If they don't even agree with the current wcwidth, and the current
wcwidth cannot reasonably be used for Indic scripts, I see no good
reason why wcwidth tables shouldn't be fixed to at least match values
that _could_ be used for reasonable rendering...
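
The comparison behind this argument can be written out as a small check. The observed figures are the ones Ben reported; the expected per-character sums assume that every code point involved (including the base letter U+0D30) currently has wcwidth 1, which is an assumption of this sketch rather than something measured here. The names observed, wcwidth_sum and report are made up for the example.

```python
# Widths reported in the thread, as (U+0D30 U+0D4A, U+0D30 U+0D46 U+0D3E).
observed = {
    "GNOME Terminal 2.16.1": (3, 3),
    "Konsole 1.6.5": (3, 4),
    "mlterm 2.9.3": (2, 2),
}

# Assumption: each code point has wcwidth 1, so the sums are 2 and 3.
wcwidth_sum = (2, 3)

def report(observed, expected):
    """Return (name, matches expected sums, same width for both forms)."""
    rows = []
    for name, widths in observed.items():
        rows.append((name, widths == expected, widths[0] == widths[1]))
    return rows

for name, matches, consistent in report(observed, wcwidth_sum):
    print(f"{name}: matches wcwidth sums={matches}, "
          f"equal for both forms={consistent}")
```

Under that assumption no terminal matches the per-character sums, and only Konsole gives the two canonically equivalent spellings different widths.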

Post by Ben Wiley Sittler

Post by Bruno Haible
What rendering do other terminal emulators produce for these characters,
especially the ones from GNOME, KDE, Apple, and mlterm? I cannot submit
a patch to glibc based on the data of just 1 terminal emulator.

As I commented in private to Bruno, Apple's Terminal.app even has
broken cursor positioning behavior for CJK and nonspacing characters,
so I think it's hopeless to try to use it for Indic scripts...

Rich

Rich Felker

2006-10-29 21:55:30 UTC

In addition to the issues I raised before about consistency of width
under canonical equivalence, I've found additional problems in the
width definitions which are not technical issues like before, but just
feasibility-of-presentation issues. Specifically, several Indic
scripts including Kannada and Malayalam have several characters which
require 6 or 7 vertical strokes for their standard presentation
glyphs, and numerous characters that require 4 or 5. Moreover, the
standard glyph shapes for these characters are roughly twice as wide
(sometimes more than twice) as they are tall.

This puts their horizontal complexity on par with most ideographic
characters, and makes it impossible to render them legibly in a single
character cell without a huge font size. The possible courses of action
are:

1. Leave them with wcwidth of 1 anyway and assume everyone will use
huge font sizes or else put up with completely illegible glyphs.

2. Assign a global wcwidth of 2 to the affected scripts.

3. Perform "a careful analysis not only of each Unicode character,
but also of each presentation form", as Markus suggested in his
wcwidth.c comments, assigning width of 1/2[/3??] on a per-character
basis.

IMO course 1 is ridiculous. The only argument for it is compatibility,
but obviously no one has ever tried using wcwidth with these scripts
since it just plain doesn't work.

Course 3 is difficult but might give the most visually pleasing
results. On the other hand, it may tend to lock one into a particular
style of presentation forms. If preferred glyph forms change due to
"reforms" or just stylistic preferences, users could be left with a
mess. Part of the analysis for #3 would have to include making sure
that the width assignments could remain reasonable under such
variations, as opposed to being font-specific, but this is probably
not infeasible as long as the number of "width>1" characters is kept
to a minimum.

Finally there's course 2. In a way it's sort of a cop-out, taking the
easy approach of "fixed width", but that's what character cell widths
have done ever since "i" and "m" received the same width of 1 column.
It's font-independent and ensures that text in a single script can
align well in columns regardless of which characters are used.
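
As a rough illustration of course 2, a script-wide width rule might look like the sketch below. Several things here are assumptions and not part of the proposal: "affected scripts" is approximated by whole Unicode blocks, nonspacing marks are kept at width 0, and spacing combining marks are left at width 1 because the message does not say how they should be treated. script_width and WIDE_BLOCKS are hypothetical names.

```python
import unicodedata

# Assumption: the affected scripts are approximated by their Unicode blocks.
WIDE_BLOCKS = (
    (0x0C80, 0x0CFF),  # Kannada block
    (0x0D00, 0x0D7F),  # Malayalam block
)

def script_width(ch):
    """Width 2 for non-mark characters in the listed blocks, else 0 or 1."""
    cat = unicodedata.category(ch)
    if cat in ("Mn", "Me"):
        return 0  # assumption: nonspacing marks stay at width 0
    in_wide_block = any(lo <= ord(ch) <= hi for lo, hi in WIDE_BLOCKS)
    if in_wide_block and not cat.startswith("M"):
        return 2  # letters, digits and signs of the script
    return 1      # everything else, including spacing marks

# Malayalam RA, Kannada KA, Latin m
print([script_width(c) for c in "\u0d30\u0c95m"])
```

The table is font-independent by construction, which is the property the paragraph above argues for; how it should interact with the canonical-equivalence rule for U+0D4A-U+0D4C is left open here.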

I can prepare example bitmaps if anyone is interested in seeing what
the choices might look like, and probably will do this soon anyway.
Again, my goal is revising the wcwidth data (which Markus labelled as
incomplete in the original version) to account for scripts for which
it is not currently being used and for which it does not currently
provide reasonable results. But it's useless for me to just say what I
think it should be. There should be some sort of sane process here, by
which we arrive at a de facto standard which glibc and other
implementations can adopt.

Rich

rajeev joseph sebastian

2006-10-30 12:17:54 UTC

Hello Rich Felker,

It is impossible to fit Malayalam "glyphs" into a given width class, if you want even barely aesthetic text. This is because a given sequence of Unicode characters may map into somewhat different conjunct styles depending on the font: either proper top to bottom (subjoining), or left to right (adjoining) or something in between as well :)

Regards,
Rajeev J Sebastian

PS: Sorry for the top post; Yahoo forces me to do this.


Rich Felker

2006-10-30 17:02:04 UTC

Post by rajeev joseph sebastian
Hello Rich Felker,
It is impossible to fit Malayalam "glyphs" into a given width class,
if you want even barely aesthetic text. This is because a given
sequence of Unicode characters may map into somewhat different
conjunct styles depending on the font: either proper top to bottom
(subjoining), or left to right (adjoining) or something in between
as well :)

Yes, I'm aware of the aesthetic considerations but between the choice
of seeing nothing at all and seeing something with excessive spacing
(still correctly subjoining, but with extra width/spacing to make up
for the second character not using horizontal space), wouldn't the
latter be preferable? I don't claim it will be pretty but I believe
one can put together something which at least avoids being hideously
ugly. I also don't mean to insult your script by presenting it in an
ugly way (even having "i" and "m" the same width is ugly although much
less severely so), but a terminal and the apps that can be run on it
are quite useful IMO and it seems a shame for many people to be unable
to use them on account of language.

BTW the situation for Kannada seems much less severe... do you know
enough about the script to confirm this?

Thanks for the comments.

Rich

P.S. There's also the possibility of treating syllable clusters as the
fundamental unit of display and requiring a context-sensitive function
rather than wcwidth to measure width; however from my experience
getting application maintainers just to fix their handling of
nonspacing characters is difficult enough without asking them to add
script-specific processing. Also the curses library (which is a bad
library anyway but many apps use it) doesn't support this model. :(
IMO the best long-term solution is to support both, with a terminal
escape to switch the terminal between "dumb" wcwidth-based spacing for
compatibility with apps that are not specifically Indic-script aware,
and "smart" context-sensitive spacing.
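
For comparison, a context-sensitive measurement of the kind mentioned in this P.S. could be sketched as below. The clustering rule is a naive assumption made for illustration (a base character, any following combining marks, and a Malayalam letter joined by a preceding virama); it is not a real Indic syllable segmenter, and the per-cluster measure function is supplied by the caller because nothing in the thread fixes what it should return.

```python
import unicodedata

VIRAMA = "\u0d4d"  # MALAYALAM SIGN VIRAMA

def is_malayalam_letter(ch):
    return 0x0D00 <= ord(ch) <= 0x0D7F and unicodedata.category(ch) == "Lo"

def clusters(text):
    """Naively split text into display clusters (see assumptions above)."""
    out = []
    for ch in text:
        is_mark = unicodedata.category(ch).startswith("M")
        after_virama = bool(out) and out[-1].endswith(VIRAMA)
        if out and (is_mark or (after_virama and is_malayalam_letter(ch))):
            out[-1] += ch  # extend the current cluster
        else:
            out.append(ch)  # start a new cluster
    return out

def text_width(text, measure):
    """Sum a caller-supplied per-cluster width over all clusters."""
    return sum(measure(c) for c in clusters(text))

# Hypothetical measure: every cluster gets two cells.
print(clusters("\u0d15\u0d4d\u0d15\u0d30"), text_width("\u0d30\u0d4a", lambda c: 2))
```

Unlike wcwidth, text_width cannot be evaluated one character at a time, which is exactly the burden on application maintainers that the P.S. is worried about.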

rajeev joseph sebastian

2006-10-31 17:37:34 UTC

Hi Rich Felker,

I find your work to provide support for Indic text on console/terminal to be admirable, and yes, any kind of display is far better than none at all (and I do not consider your statement insulting) :)

What I was referring to was a comment along the lines of "... have a set of wcwidth classes (say, 1, 2, and 3) and assign - glyphs - to one of those classes... ". (Please forgive me if I misunderstood the last few posts.) The word to note is "glyph". What I'm saying is you cannot in advance specify the width of any given conjunct. It may be different in different fonts.

I suppose, we need to develop console specific fonts which can make proper use of the available width classes (or the structure you propose), however, I don't think any research has occurred in this regard.

Malayalam typography died in the 70s as a result of disastrous script reforms (The peak was the SPCS press, which produced many beautiful types for its publications - SPCS btw is supposed to be the world's first co-operative of authors). Most artists/graphic designers do not use the stock fonts for any kind of artistic work, other than in running text where they have no choice. A "theory of style" doesn't exist for Malayalam (or afaik for any Indic language).

So, a proper answer to your question: how many width classes, really needs a lot of work both artistic as well as technical. (Malayalam has about 950 conjuncts, so it has to be seen how they can fit into those classes).

Speaking to my older colleague who is a linguist and lexicologist in Dravidian languages, Kannada has pretty much the same structure as Malayalam with regard to conjuncts.

Speaking of curses, doesn't Debian/(K)ubuntu use curses for its installer? I remember telling the Kubuntu devels to remove Hindi from the list of languages, because the rendering looks really horrible (misplaced vowels, and so many other things, unrelated to spacing/width).

It is unfortunate that many developers think that using wide strings for each character is equivalent to supporting all languages under Unicode. I guess some even think that the dotted circle is a part of the script ;)

Regards,
Rajeev J Sebastian


Rich Felker

2006-10-31 20:32:29 UTC

Post by rajeev joseph sebastian
Hi Rich Felker,
I find your work to provide support for Indic text on
console/terminal to be admirable, and yes, any kind of display is
far better than none at all (and I do not consider your statement
insulting) :)
What I was referring to was a comment along the lines of "... have a
set of wcwidth classes (say, 1, 2, and 3) and assign - glyphs - to
one of those classes... ". (Please forgive me if I misunderstood the
last few posts.) The word to note is "glyph". What I'm saying is you
cannot in advance specify the width of any given conjunct. It may be
different in different fonts.

Yes, my use of the word character rather than glyph was intentional
however. I know that the typographically correct way to do spacing
would be to measure the width of glyphs, but for better or worse the
only standardized api (wcwidth) works in terms of characters, and
terminals work in terms of characters. Sometimes this has benefits;
for example it makes it so you can highlight text that was printed to
the terminal and paste it into other apps or back into the terminal,
with exact results which are suitable for filenames and such. This
might not be possible if the app running in the terminal had converted
the text to a glyph representation. So in a way it's nice that the
character->glyph conversion is done at the last step, in the terminal,
since it keeps the data in the logical representation instead of the
presentation form. Of course it also has downsides too as I'm sure
we're all aware.

The other issue here is that there's no standard for glyph numbering,
and Unicode doesn't represent glyphs, so there's really no way an
application running on a terminal could directly print glyphs. Even if
it could, just "cat file_with_indic_text.txt" on the terminal, or
something simple like "ls", would probably not work as expected.

My hope is to work out a set of width assignments for characters so
that reasonable glyph presentations of the character sequence always
fit in the spacing provided by the sum of the "character widths".
Unfortunately this may result in excess spacing in some (many?) cases,
but I hope it can be made usable if not elegant. My (naive)
understanding is that Kannada conjuncts take place mostly as a
"subscript" to the bottom-right of the initial consonant and vowel
mark, so perhaps they'll look fairly proper in such a scheme.
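
The requirement stated here, that a rendered sequence must fit within the sum of its character widths, can be checked mechanically. In the sketch below the two samples reuse the GNOME Terminal figures Ben reported earlier in the thread; the two width functions are simplified stand-ins (all 1s for the current tables, and U+0D4A raised to 2 for the proposed change), and overflowing is a hypothetical helper name.

```python
def overflowing(samples, width_of):
    """Return the sequences whose rendered cell count exceeds their allotment."""
    return [seq for seq, cells in samples
            if cells > sum(width_of(ch) for ch in seq)]

# (sequence, cells observed in GNOME Terminal 2.16.1, from the thread)
samples = [
    ("\u0d30\u0d4a", 3),
    ("\u0d30\u0d46\u0d3e", 3),
]

def current(ch):
    return 1  # assumption: every code point involved has width 1

def proposed(ch):
    return 2 if ch == "\u0d4a" else 1  # U+0D4A widened as proposed

print(len(overflowing(samples, current)), len(overflowing(samples, proposed)))
```

With the simplified current widths the precomposed form overflows its allotment; with U+0D4A at width 2 both samples fit.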

Post by rajeev joseph sebastian
I suppose, we need to develop console specific fonts which can make
proper use of the available width classes (or the structure you
propose), however, I don't think any research has occurred in this
regard.

Well, as long as a reasonable font size were chosen, any font that
fits into the (possibly excessive) width allocation could be used in
principle. For uuterm I'm working on 8x16-cell (and later other larger
sizes) bitmap fonts, which I find much more usable, but there's no
reason other terminal emulators like mlterm couldn't use truetype
fonts in this framework.

Post by rajeev joseph sebastian
So, a proper answer to your question: how many width classes, really
needs a lot of work both artistic as well as technical. (Malayalam
has about 950 conjuncts, so it has to be seen how they can fit into
those classes).

Well my question is much simpler I think: given a character, what's
the "most space" it can take up in any conjunct it forms?

Post by rajeev joseph sebastian
Speaking of curses, doesn't Debian/(K)ubuntu use curses for its
installer? I remember telling the Kubuntu devels to remove Hindi
from the list of languages, because looking at the rendering is
really horrible (misplaced vowels, and so many other things,
unrelated to spacing/width).

Yes.. it's not really a curses problem though. As long as the terminal
supports reordering and ligatures, using curses should not be much of
a problem. I still need to write the reordering stuff for uuterm
though.

Post by rajeev joseph sebastian
It is unfortunate that many developers think that using wide
strings for each character is equivalent to supporting all
languages under Unicode. I guess some even think that the
dotted circle is a part of the script ;)

Haha yeah. I still can't believe Roman Czyborra drew the original GNU
Unifont with those hideous dotted circles in it... (Yes he knew they
weren't part of the script, but...) My hope is to make it so that
using multibyte char functions + wcwidth is sufficient for _usable_
support for all langs in apps that run on terminals. Then, as more
users of these langs use the apps in question, hopefully other things
(like line folding in scripts without word spacing, better spacing,
integration with input methods, etc.) will come. Unlike most of the
GUI projects working on these issues my goal is not to put
word-processor-type layout in every app, just to fix what's broken and
make them usable with more languages.

Rich

Christopher Fynn

2006-11-01 07:34:14 UTC

Post by rajeev joseph sebastian
Hi Rich Felker,
I find your work to provide support for Indic text on console/terminal to be admirable, and yes, any kind of display is far better than none at all (and I do not consider your statement insulting) :)
What I was referring to was a comment along the lines of "... have a set of wcwidth classes (say, 1, 2, and 3) and assign - glyphs - to one of those classes... ". (Please forgive me if I misunderstood the last few posts.) The word to note is "glyph". What I'm saying is you cannot in advance specify the width of any given conjunct. It may be different in different fonts.
I suppose, we need to develop console specific fonts which can make proper use of the available width classes (or the structure you propose), however, I don't think any research has occurred in this regard.

Yes, Indic scripts like Malayalam need specific console fonts. I think
for console applications legibility is more important than beauty.

Why not use the typefaces used in old-fashioned Indian typewriters as a
starting point? Most of the popular monowidth fonts for Latin (Courier
etc.) are based on typewriter faces.

Manual mechanical typewriters had a fixed advance width and the
"resolution" was fairly low - a lot of care and expertise went into
designing typefaces that were legible within these constraints.

I know typewriters made by companies like Remington were manufactured
for most Indian scripts - and I suspect a lot of these machines are
still around - so it shouldn't be too hard to come up with some type
samples to use as a starting point.

- Chris

Rich Felker

2006-11-02 21:07:19 UTC

Post by Christopher Fynn
Yes, Indic scripts like Malayalam need specific console fonts. I think
for console applications legibility is more important than beauty.
Why not use the typefaces used in old-fashioned Indian typewriters as a
starting point? Most of the popular monowidth fonts for Latin (Courier
etc.) are based on typewriter faces.
Manual mechanical typewriters had a fixed advance width and the
"resolution" was fairly low - a lot of care and expertise went into
designing typefaces that were legible within these constraints.

Thanks for the constructive ideas. Of course you're totally right,
this approach makes sense. There is still the character/glyph issue
with regard to width, since typewriters of course work with glyphs
rather than characters, but that's unavoidable.

Post by Christopher Fynn
I know typewriters made by companies like Remington were manufactured
for most Indian scripts - and I suspect a lot of these machines are
still around - so it shouldn't be too hard to come up with some type
samples to use as a starting point.

Yes, I'm sure they are. I suppose now I just have to find someone who
has one and who can explain it well.

Rich

rajeev joseph sebastian

2006-11-05 20:26:22 UTC

Using typewriters is an extremely bad idea. Typewriters made by Remington were (please excuse me when I say this) *retarded* (at least for Malayalam).

In Malayalam, typewriter glyphs are really hated. They are illegible, and totally unacceptable for *any* kind of text. Typewriters in Malayalam were always a hack. You should understand some things: each glyph on the typewriter may be highly tuned and whatever for legibility, but even barely readable text requires far more glyphs than what was available on the typewriter. There is a lot of history, usability and understanding of language/script use behind this, which I can elaborate to interested readers.

In any case, as a technical solution, I would very strongly recommend against a "Typewriter" like console.

Regards,
Rajeev J Sebastian


rajeev joseph sebastian

2006-11-05 20:59:03 UTC

Sorry, Yahoo only allows me to top-post, since it doesn't properly quote the previous message. But I have tried to put my message appropriately.

----- Original Message ----
From: Rich Felker <***@aerifal.cx>
To: linux-***@nl.linux.org
Sent: Tuesday, October 31, 2006 10:32:29 PM
Subject: Re: Proposed fix for Malayalam (& other Indic?) chars and wcwidth

Post by rajeev joseph sebastian
Hi Rich Felker,
I find your work to provide support for Indic text on
console/terminal to be admirable, and yes, any kind of display is
far better than none at all (and I do not consider your statement
insulting) :)
What I was referring to was a comment along the lines of "... have a
set of wcwidth classes (say, 1, 2, and 3) and assign - glyphs - to
one of those classes... ". (Please forgive me if I misunderstood the
last few posts.) The word to note is "glyph". What I'm saying is you
cannot in advance specify the width of any given conjunct. It may be
different in different fonts.

Yes, my use of the word character rather than glyph was intentional
however. I know that the typographically correct way to do spacing
would be to measure the width of glyphs, but for better or worse the
only standardized api (wcwidth) works in terms of characters, and
terminals work in terms of characters. Sometimes this has benefits;
for example it makes it so you can highlight text that was printed to
the terminal and paste it into other apps or back into the terminal,
with exact results which are suitable for filenames and such. This
might not be possible if the app running in the terminal had converted
the text to a glyph representation. So in a way it's nice that the
character->glyph conversion is done at the last step, in the terminal,
since it keeps the data in the logical representation instead of the
presentation form. Of course it also has downsides too as I'm sure
we're all aware.

----------
Well, most correctly implemented Unicode-aware applications do this also:
they have two backing stores, one for text and the other for glyphs. The glyph representation is used for display. When a selection is made, the map between the two stores is used to derive the correct text for the selected glyphs.

CTL script implementation has a concept of Logical Cluster which is used for this purpose. Basically, text is divided into logical clusters (generally mapping to one or more glyphs), which allows text to be selected correctly, both programmatically and visually by the user.

This is also useful in the case of Latin text!

Currently, most apps I have seen use the precomposed Latin characters, which is allowed only because of the stability policy. Most apps do not implement complex layout of Latin glyphs, which causes no end of problems for Latin transliterations of Indic/other text. Although most of the required characters for Indic transliteration are already available precomposed, the policy of Unicode and the combining mark model do not allow the rest to be encoded. Hence the proliferation of PUA codepoints for this purpose. (I hope the situation changes for GNU/Linux, but I think it is unlikely).
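
A minimal sketch of the two-store arrangement and the cluster map described above, under simplifying assumptions: the Cluster record, its field names and the glyph IDs are all hypothetical, and clusters are assumed to appear in the same order in both stores (no visual reordering).

```python
from dataclasses import dataclass

@dataclass
class Cluster:
    start: int    # index of the first code point in the text store
    end: int      # one past the last code point
    glyphs: list  # hypothetical glyph IDs shaped from this cluster

def selected_text(text, clusters, first_glyph, last_glyph):
    """Map an inclusive glyph-index range back to the underlying text.

    Any cluster touched by the range is included whole, so a selection
    can never split a cluster.
    """
    pos = 0
    lo = hi = None
    for c in clusters:
        g_start, g_end = pos, pos + len(c.glyphs)
        pos = g_end
        if g_start <= last_glyph and first_glyph < g_end:
            lo = c.start if lo is None else lo
            hi = c.end
    return "" if lo is None else text[lo:hi]

text = "\u0d30\u0d4a\u0d15"
# Hypothetical shaping result: the first cluster yields three glyphs.
shaped = [Cluster(0, 2, [10, 11, 12]), Cluster(2, 3, [20])]
print(len(selected_text(text, shaped, 1, 1)), len(selected_text(text, shaped, 2, 3)))
```

Selecting only the middle glyph of the first cluster still returns both of its code points, which is the behaviour the logical-cluster concept is meant to guarantee.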
----------

The other issue here is that there's no standard for glyph numbering,
and Unicode doesn't represent glyphs, so there's really no way an
application running on a terminal could directly print glyphs. Even if
it could, just "cat file_with_indic_text.txt" on the terminal, or
something simple like "ls", would probably not work as expected.

------------
There is no need for glyph numbers, and that is one of the strong points of Unicode. I would strongly suggest looking over the HarfBuzz library, which is slowly evolving and which will allow you to use the work of the best minds in the community. It will transform codepoints into glyphs, which you can then use. (You can also use Pango if need be).
------------

My hope is to work out a set of width assignments for characters so
that reasonable glyph presentations of the character sequence always
fit in the spacing provided by the sum of the "character widths".
Unfortunately this may result in excess spacing in some (many?) cases,
but I hope it can be made usable if not elegant. My (naive)
understanding is that Kannada conjuncts take place mostly as a
"subscript" to the bottom-right of the initial consonant and vowel
mark, so perhaps they'll look fairly proper in such a scheme.

-------
This is not always true. For Kannada, I will try to confirm that. For Malayalam, it is most certainly not true. In fact, for Malayalam, you cannot even be sure at any point whether a particular sequence of characters maps to only one glyph or more than one glyph; for different fonts, the number of conjuncts may be different, and thus the very same sequence of characters may map to a single glyph in one font, multiple glyphs in another, or a different number of glyphs in a third font.
-------

Post by rajeev joseph sebastian
I suppose, we need to develop console specific fonts which can make
proper use of the available width classes (or the structure you
propose), however, I don't think any research has occurred in this
regard.

Well, as long as a reasonable font size were chosen, any font that
fits into the (possibly excessive) width allocation could be used in
principle. For uuterm I'm working on 8x16-cell (and later other larger
sizes) bitmap fonts, which I find much more usable, but there's no
reason other terminal emulators like mlterm couldn't use truetype
fonts in this framework.

Post by rajeev joseph sebastian
So, a proper answer to your question: how many width classes, really
needs a lot of work both artistic as well as technical. (Malayalam
has about 950 conjuncts, so it has to be seen how they can fit into
those classes).

Well my question is much simpler I think: given a character, what's
the "most space" it can take up in any conjunct it forms?

------------
If you mean to say that each logical cluster will be allocated enough width equal to the sum of the widths of each character in that cluster, then I think you will allocate much too much space :)
------------

Post by rajeev joseph sebastian
Speaking of curses, doesn't Debian/(K)ubuntu use curses for its
installer ? I remember telling the Kubuntu devels to remove Hindi
from the list of languages, because looking at the rendering is
really horrible (misplaced vowels, and so many other things,
unrelated to spacing/width).

Yes.. it's not really a curses problem though. As long as the terminal
supports reordering and ligatures, using curses should not be much of
a problem. I still need to write the reordering stuff for uuterm
though.

----------
I strongly suggest looking over the HarfBuzz library. Could you post a link to the uuterm development website?
----------

Regards,
Rajeev J Sebastian

Rich Felker

2006-11-06 08:19:36 UTC

Post by rajeev joseph sebastian
have 2 backing stores, one for text and the other for glyphs. Use
the glyph representation for display. When a selection is done, the
map between the 2 stores is used to derive the correct text for the
selected glyphs.

Yes, this is roughly what uuterm does (except it doesn't keep a glyph
representation, it just dynamically-generates it). However
applications running on the terminal don't have any way to know about
glyphs; all they can access are the characters.
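
A minimal sketch of the text/glyph cluster map quoted above, written in Python purely for illustration. It is not uuterm's code: the Cluster type, its fields, the glyph names, and text_for_selection are all hypothetical, and the cluster boundaries are supplied by hand instead of coming from a real shaping engine.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Cluster:
    text: str          # logical characters belonging to this cluster
    glyphs: List[str]  # glyphs drawn for it (names are made up here)

def text_for_selection(clusters, first_glyph, last_glyph):
    """Map an inclusive range of glyph indices back to logical text.

    A cluster is taken as a whole as soon as any one of its glyphs
    falls inside the selected range.
    """
    out = []
    pos = 0  # index of the first glyph of the current cluster
    for c in clusters:
        start, end = pos, pos + len(c.glyphs) - 1
        if c.glyphs and start <= last_glyph and end >= first_glyph:
            out.append(c.text)
        pos += len(c.glyphs)
    return "".join(out)

# Hand-built example line: KA, then KA + VOWEL SIGN O (which draws on
# both sides of the consonant), then a Latin "a".
line = [
    Cluster("\u0d15", ["ka"]),
    Cluster("\u0d15\u0d4a", ["e-sign", "ka", "aa-sign"]),
    Cluster("a", ["a"]),
]

# Selecting only the middle glyph of the second cluster still yields
# the full two-character sequence for that cluster.
selected = text_for_selection(line, 2, 2)
```

The point of the design is that the selection always returns characters, never glyphs, so pasted text stays in its logical representation.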

Post by rajeev joseph sebastian
Currently, most apps I have seen use the precomposed Latin
characters, which is allowed only because of the stability policy.
Most apps do not implement complex layout of latin glyphs which
causes no-end of problems for Latin transliterations of Indic/other
text. Although most of the required characters for Indic
transliteration are already available precomposed, the policy of
Unicode and the combining mark model do not allow the rest to be
encoded. Hence the proliferation of PUA codepoints for this purpose.
(I hope the situation changes for GNU/Linux, but I think it is
unlikely).

uuterm already has full support for combining marks, including varied
placement of the diacritics. It doesn't use precomposed glyphs even if
they're available; it always decomposes to NFD (with some additional
decompositions necessary because of stupid Unicode policies) for
rendering.
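
For readers who want to see what the NFD step does to the characters this thread is about, here is a small Python sketch using the standard unicodedata module. It only shows the standard canonical decomposition; the "additional decompositions" mentioned above are specific to uuterm and are not reproduced here. The helper name nfd_codepoints is made up for this example.

```python
import unicodedata

def nfd_codepoints(s):
    """Return the NFD form of s as a list of U+XXXX strings."""
    return [f"U+{ord(c):04X}" for c in unicodedata.normalize("NFD", s)]

# The two-part Malayalam vowel signs split into a left part and a right part.
for cp in (0x0D4A, 0x0D4B, 0x0D4C):
    print(f"U+{cp:04X} ->", " ".join(nfd_codepoints(chr(cp))))

# A precomposed Latin letter splits into base letter + combining mark.
print("U+00E9 ->", " ".join(nfd_codepoints("\u00e9")))
```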

Post by rajeev joseph sebastian
----------
The other issue here is that there's no standard for glyph numbering,
and Unicode doesn't represent glyphs, so there's really no way an
application running on a terminal could directly print glyphs. Even if
it could, just "cat file_with_indic_text.txt" on the terminal, or
something simple like "ls", would probably not work as expected.
------------
There is no need for glyph numbers, and that is one of the strong points
of Unicode.

I agree totally. However it does mean that applications running on a
terminal don't have any way to operate in terms of glyphs. Everything
they do must be in terms of characters. This is why we're only able to
consider character width and not glyph width for the purposes of
spacing.

Post by rajeev joseph sebastian
I would strongly suggest to look over the HarfBuzz
library which is slowly evolving which will allow you to use the
work of the best minds in the community. It will transform
codepoints into glyphs, which you can then use. (You can also use
Pango if need be).

uuterm is based entirely on bitmap fonts, so these are not appropriate
solutions for it and probably not for kernel-level console drivers
either. However, any character-width tables agreed upon should be able
to be used reasonably with OpenType fonts too of course. It would be
silly to try to adopt a standard that excludes a popular modern
technology. However just like with Latin, fonts whose metrics don't
fit well with the cell widths wouldn't look very good in a terminal
emulator.

IMO, in a way this is part of an argument for the "excessive" spacing
too -- if there's extra space you can fit almost any font in there...
and optionally scale it to try to fill up the space if desired, or
distribute the extra spacing equally spread-out, etc.

Post by rajeev joseph sebastian
My (naive)
understanding is that Kannada conjuncts take place mostly as a
"subscript" to the bottom-right of the initial consonant and vowel
mark, so perhaps they'll look fairly proper in such a scheme.
-------
This is not always true. For Kannada, I will try to confirm that.

I have a friend I can check with too, but going from the sparse
information in the Unicode specs and sites like Omniglot and
Wikipedia, it seems to be true that even 'subjunct' conjunct
characters use some of their own horizontal space. Sometimes
characters that would definitely need 2 cells on their own are simple
enough to fit in one cell when they are a subjunct character though,
so spacing is not entirely ideal, but the glyphs I experimented with
drawing seemed to fit legibly anyway. I can send you the xbm files if
you're interested in seeing. (They're not hideously ugly like the
ascii art below.. :)

Post by rajeev joseph sebastian
If you mean to say that each logical cluster will be allocated
enough width equal to the sum of the widths of each character in
that cluster, then I think you will allocate much too much space :)

Yes, I know. :) But given the choice between too much and not enough,
too much is better.

Can I ask you if something like the following (aside from the bad
ascii art :) is horribly offensive:

pa:
#
#
## #
# # #
# # #
#####

ppa:
#
## #
# # #
# # #
###########
#
## #
# # #
##########
(became wide because it was allocated 2 spaces due to two "pa"
characters..)

Hopefully these pictures explain a bit of one way that excess space
could be filled up. Whether it looks reasonable or not, I don't know,
but I suspect it's better than leaving empty space.

Post by rajeev joseph sebastian
Yes.. it's not really a curses problem though. As long as the terminal
supports reordering and ligatures, using curses should not be much of
a problem. I still need to write the reordering stuff for uuterm
though.
----------
I strongly suggest to look over HarfBuzz library.

I've looked at it before, but much like uuterm it's hardly documented.
A bit of RTFS'ing suggested that it's also excessively complex in
terms of the data structures it uses. :(

In any case, while the HarfBuzz library can handle glyph selection for
a program using OpenType fonts, and likewise the spacing, there's
nothing it can do to solve the problem of spacing on a terminal. This
is because the metrics returned by font libraries are inherently
font-specific, whereas the spacing on a terminal must be
font-independent (since the application attached to the terminal has
no knowledge of the font being used).

I'm not sure if I'll be able to work out any kind of presentation
scheme you'll find acceptable. If not, I'm sorry, but I simply don't
have the time or resources to rewrite the display handling of every
single application which runs on a terminal to make them all aware of
complex spacing interactions, and even if I did, I don't think anyone
has any idea what the _right_ system for this would be.

What I have already been able to do is make a lot of languages which
were previously unusable on terminals usable through simple but
powerful context-sensitive shaping. This is a much easier problem to
solve than context-sensitive spacing. What I can (and hope to)
continue to do is find ways that additional languages/scripts can be
supported without any unreasonable degree of ugliness. It looks like
Kannada will fit pretty well into this system, and Hindi fits ok aside
from the excessive space left when "ra" becomes a nonspacing mark. If
other Indic and Indic-derived scripts work, great! If Burmese
(supposedly very difficult) manages to work that will make me very
happy. Regardless of whether it's ugly or not, though, I think it
would be nice (and beneficial to some users at least) to have
Malayalam supported at least minimally.

Post by rajeev joseph sebastian
Could you post a
link to uuterm development website ?

These are the various relevant links:

http://svn.mplayerhq.hu/uuterm/trunk/
svn://svn.mplayerhq.hu/uuterm/trunk/
http://brightrain.aerifal.cx/~dalias/uuterm/screenshots/
http://brightrain.aerifal.cx/~dalias/ucf/fonts/

Sorry the documentation is so sparse. I'm presently working on getting
nice character coverage in the default distribution so that I can
promote uuterm without potential users saying "wtf how am I supposed
to use this when there are no fonts?!"

Rich

rajeev joseph sebastian

2006-11-06 18:14:20 UTC

I can say that you have done a good job. My point has so far been that some kind of special font system should be created. In any case, the use of straight TTF or OTF is not possible (is it?). In that case, it may be worthwhile to investigate a kind of OpenType Bitmap font :)

In this case, since each designer will know exactly how much space is available, he can *design* conjuncts to fill as much space as possible. I can talk to the typographer who makes Malayalam fonts for us on this matter, whether he can think about the problem.

Regards,
Rajeev J Sebastian

--
Linux-UTF8: i18n of Linux on all levels
Archive: http://mail.nl.linux.org/linux-utf8/

Rich Felker

2006-11-06 20:22:23 UTC

Post by rajeev joseph sebastian
I can say that you have done a good job. My point has so far been
that some kind of special font system should be created. In any
case, the use of straight TTF or OTF is not possible. (is it?). in
that case, it may be worthwhile to investigate a kind of OpenType
Bitmap font :)

It's not a question of the font system not being powerful enough. It's
a question of font-specific spacing not being available. It's much
more fundamental, the information just isn't there. If I do:

cat foo.txt

on a terminal, how does the text file query a font and decide how to
align itself? It's not a program. Even if it were a program, for
example ls, the columnar output would only be correct for one run. If
you did:

ls -C > listing.txt

should ls adapt its output to the current terminal and font it's
running on? What if you then do

cat listing.txt

on a different terminal or with a different font? This is why the
notion of column width must be font-independent. If you're talking
about making a system where spacing is font-dependent, that's
something you can do, but it's a sort of graphic layout language and
not a character-cell terminal anymore, and it won't be useful for
running any existing terminal apps (their output will corrupt,
especially if it causes automargins to wrap in unexpected places) and
loses many of the nice properties of a terminal.
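
To make the font-independence argument concrete, here is a rough Python sketch of how a program such as ls can lay out columns knowing nothing but the characters. The char_width function is only a crude stand-in for wcwidth() (zero for combining and format characters, two for East Asian wide characters, one otherwise) and is an assumption of this example, not a proposed width table; columns and pad are likewise made-up helper names.

```python
import unicodedata

def char_width(ch):
    """Crude wcwidth()-like width that depends only on the character."""
    if unicodedata.category(ch) in ("Mn", "Me", "Cf"):
        return 0  # combining marks and format characters take no cell
    if unicodedata.east_asian_width(ch) in ("W", "F"):
        return 2  # wide and fullwidth characters take two cells
    return 1

def str_width(s):
    """Number of cells the string occupies: the sum of its character widths."""
    return sum(char_width(c) for c in s)

def pad(s, cols):
    """Pad s with spaces so that it occupies cols cells."""
    return s + " " * max(0, cols - str_width(s))

def columns(names, gap=2):
    """Lay names out in one row of equally wide columns."""
    widest = max(str_width(n) for n in names)
    return "".join(pad(n, widest + gap) for n in names).rstrip()

row = columns(["ab", "\u65e5\u672c\u8a9e", "c"])
```

Because nothing here asks the terminal or the font anything, the same bytes written to listing.txt line up identically on any terminal that uses the same width table.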

Note that this is an entirely separate issue from the "excessive
spacing" issue. Correction for excessive spacing (with an api more
powerful than wcwidth() that takes context into consideration) is one
possible design direction for a terminal, but the width would still
have to be specified in a font-independent manner.

BTW there are also lots of nice things that can be done to get rid of
the excessive space "problem", for example pushing all the space
forward to the next place where two or more consecutive spaces, or a
tab, or end of line occurs. This can be done entirely at display time
so that it does not desynchronize with the application's idea of the
terminal contents or lead to corruption. The only important thing is
to maintain a concept of cells containing characters, without which
character-based applications cannot work (and I already explained in
the last email why any application running on a terminal must be
character-based and not glyph-based).
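
One way the space-pushing idea could work at display time is sketched below in Python. This is an illustration under simplifying assumptions, not uuterm's implementation: each cluster is given as an ASCII stand-in glyph whose length is the number of cells it really needs, together with the number of cells allocated to it; glyphs never need more than they were allocated; and only runs of spaces and end of line are treated as places to absorb slack (tabs are left out). The function names are made up.

```python
def render_naive(clusters):
    """Draw every glyph padded to its allocated width."""
    return "".join(glyph.ljust(cells) for glyph, cells in clusters)

def render_pushed(clusters):
    """Draw glyphs tightly and push unused cells forward.

    The slack collected so far is released at the next run of two or
    more spaces, or at end of line, so everything after that point is
    back at the column the application expects.
    """
    out = []
    slack = 0  # allocated-but-unused cells carried forward
    i = 0
    while i < len(clusters):
        # Measure the run of single-cell spaces starting at i.
        j = i
        while j < len(clusters) and clusters[j] == (" ", 1):
            j += 1
        if j - i >= 2:
            out.append(" " * (j - i + slack))  # absorb slack in the gap
            slack = 0
            i = j
            continue
        glyph, cells = clusters[i]
        out.append(glyph)
        slack += cells - len(glyph)
        i += 1
    out.append(" " * slack)  # end of line absorbs whatever is left
    return "".join(out)

example = [("kk", 3), ("a", 1), (" ", 1), (" ", 1), ("pp", 4), ("b", 1)]
naive = render_naive(example)
pushed = render_pushed(example)
```

In this example both renderings are the same total width and the text after the double space starts in the same column, while the gap inside the first word is gone in the pushed version. The cell contents that the application sees are untouched; only the drawing changes.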

Post by rajeev joseph sebastian
In this case, since each designer will know exactly how much space
is available, he can *design* conjuncts to fill as much space as
possible. I can talk to the typographer who makes Malayalam fonts
for us on this matter, whether he can think about the problem.

Last time I checked even typographers for Latin fonts weren't very
fond of character cell terminals... :(

Rich

rajeev joseph sebastian

2006-11-07 09:13:24 UTC

Well, I think I misunderstood ...

-----------

In the first para, I asked whether it was possible to use TrueType in the terminal. If we cannot, then we need to use some hybrid of bitmap fonts and OT fonts, such that the OT features can be used (at least the GSUB if nothing else) and the Bitmap features can be used (i.e., using a bitmap instead of outlines).

-----------

In the last para, I said that I would try (or rather the Typographer and I could try) the following:
1) Since you are assigning widths to characters, and since each logical cluster would get a width = sum of the widths of the characters in that cluster, ...
2) ... all we need to do is design the font in such a way that, the glyph corresponding to a logical cluster would use as much space as available to it.

E.g.,

kra cluster consists of ka + chandrakkala + ra
so, when a software (say ls or cat) outputs a sequence ka + chandrakkala + ra, the kra logical cluster will get widthC = width(ka) + width(chandrakkala) + width(ra) allocated to it. In the font, we make sure that the kra *glyph* which corresponds to the kra *logical cluster* uses as much as possible of widthC.

With this, characters have a width specification, and glyphs can be moulded to use as much of the space as possible/necessary as per the widths assigned to each *character*.
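
The kra example can be written out as a few lines of Python. The widths in the table are invented for the sake of the example (no width assignments have been agreed in this thread), and WIDTH, cluster_width, and spare_cells are hypothetical names.

```python
# Hypothetical per-character cell widths, for illustration only.
WIDTH = {
    "\u0d15": 2,  # MALAYALAM LETTER KA
    "\u0d4d": 0,  # MALAYALAM SIGN VIRAMA (chandrakkala)
    "\u0d30": 1,  # MALAYALAM LETTER RA
}

def cluster_width(cluster):
    """Cells allocated to a logical cluster: the sum of its character widths."""
    return sum(WIDTH[ch] for ch in cluster)

def spare_cells(cluster, glyph_cells):
    """Cells left unused when a glyph of glyph_cells cells is drawn."""
    allocated = cluster_width(cluster)
    if glyph_cells > allocated:
        raise ValueError("glyph does not fit in the allocated cells")
    return allocated - glyph_cells

kra = "\u0d15\u0d4d\u0d30"   # ka + chandrakkala + ra
width_c = cluster_width(kra)  # widthC in the notation above
```

A font designer would then aim for a kra glyph with spare_cells(kra, n) equal to 0, i.e. one that fills all of widthC.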

----------

I hope I have set things right ?

Regards,
Rajeev J Sebastian

Rich Felker

2006-11-09 21:28:21 UTC

Post by rajeev joseph sebastian
Well, I think I misunderstood ...

No problem.

Post by rajeev joseph sebastian
-----------
In the first para, I asked whether it was possible to use TrueType
in the terminal. If we cannot, then we need to use some hybrid of
bitmap fonts and OT fonts, such that, the OT features can be used
(at least the GSUB if nothing else) and the Bitmap features can be
used (i.e., using a bitmap instead of outlines).

Yes, UCF also solves the problem of character->glyph mapping in a way
that's more cell-oriented, but an application (e.g. mlterm) using
OpenType fonts could use the OT tables instead and get the same
effect.

Post by rajeev joseph sebastian
-----------
In the last para, I said that I would try (or rather the Typographer
and I could try) the following:
1) Since you are assigning widths to characters, and since each
logical cluster would get a width = sum of the widths of the
characters in that cluster, ...
2) ... all we need to do is design the font in such a way that, the
glyph corresponding to a logical cluster would use as much space as
available to it.
E.g.,
kra cluster consists of ka + chandrakkala + ra
so, when a software (say ls or cat) outputs a sequence ka +
chandrakkala + ra, the kra logical cluster will get widthC =
width(ka) + width(chandrakkala) + width(ra) allocated to it. In the
font, we make sure that the kra *glyph* which corresponds to the kra
*logical cluster* uses as much as possible of widthC.
With this, characters have a width specification, and glyphs can be
moulded to use as much of the space as possible/necessary as per the
widths assigned to each *character*.
----------
I hope I have set things right ?

Yep, this is right! Maybe you or your typographer friend could try
sketching out a few glyphs and see if it seems to work out well or not
(and what character width assignments would be required). The
character cell size I'm working with for my font with widespread
coverage of lots of scripts is 8x16, but larger or smaller font sizes
could of course be made too. In assigning widths, my inclination is
never to assume that more than 3 (or 4?) vertical strokes can fit in a
single cell, since 3 is the number in the latin characters "m" and "w"
and since a cell size too small to represent latin characters is
probably not useful anywhere.

In terms of simplifying font design, it helps if conjunct forms can be
reduced as much as possible to 'glueing together pieces'. UCF allows
the shape of the pieces to vary depending on the adjacent pieces. For
example a latin "fi" ligature is made not by creating a single wide
"fi" glyph but instead a special glyph for "f when it is followed by
i" and a special glyph for "i when it follows f". In conjunct
formation for many scripts (including diacritic placement for western
scripts, stacking for Tibetan, and various others) this model works
out nicer and greatly reduces the number of glyphs needed (and the
amount of maintenance/font design work). However, if needed, it's
possible to convert whole predrawn "conjunct glyphs" to the UCF rules
format -- it just might require a lot of glyphs. For Malayalam, a mix
of the two approaches is probably appropriate, depending on whether
the particular conjunct is formed by putting together 'reusable' parts
or whether it's highly unique to the character sequence it represents.
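
The "glueing together pieces" model can be illustrated with a toy contextual lookup in Python. This is not the UCF rules format, which is not described in this thread; the rule table layout, the glyph names, and select_glyphs are all invented for the example. The property it is meant to show is that every character still maps to exactly one glyph in its own cell, and only the shape chosen depends on the neighbours.

```python
# Each rule: (character, required previous char, required next char, glyph).
# None means "don't care". All names are made up for this sketch.
RULES = [
    ("f", None, "i", "f.before_i"),
    ("i", "f", None, "i.after_f"),
]

def select_glyphs(text):
    """Pick one glyph per character, looking at the adjacent characters."""
    glyphs = []
    for k, ch in enumerate(text):
        prev = text[k - 1] if k > 0 else ""
        nxt = text[k + 1] if k + 1 < len(text) else ""
        name = ch  # default: the plain glyph for the character
        for c, want_prev, want_next, glyph in RULES:
            if (c == ch
                    and (want_prev is None or want_prev == prev)
                    and (want_next is None or want_next == nxt)):
                name = glyph
                break
        glyphs.append(name)
    return glyphs

fig = select_glyphs("fig")
```

Since the number of glyphs always equals the number of characters, the cell count is unaffected by which contextual forms are chosen.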

Hopefully this information is helpful to you or anyone else thinking
about designing fonts.

Rich

rajeev joseph sebastian

2006-11-11 09:05:40 UTC

Thanks for the info. I will try something out ...

Regards,
Rajeev

----- Original Message ----
From: Rich Felker <***@aerifal.cx>
To: linux-***@nl.linux.org
Sent: Thursday, November 9, 2006 11:28:21 PM
Subject: Re: Proposed fix for Malayalam (& other Indic?) chars and wcwidth

Post by rajeev joseph sebastian
Well, I think I misunderstood ...

No problem.

Post by rajeev joseph sebastian
-----------
In the first para, I asked whether it was possible to use TrueType
in the terminal. If we cannot, then we need to use some hybrid of
bitmap fonts and OT fonts, such that, the OT features can be used
(at least the GSUB if nothing else) and the Bitmap features can be
used (i.e., using a bitmap instead of outlines).

Yes, UCF also solves the problem of character->glyph mapping in a way
that's more cell-oriented, but an application (e.g. mlterm) using
OpenType fonts could use the OT tables instead and get the same
effect.

Post by rajeev joseph sebastian
-----------
In the last para, I said that I would try (or rather the Typographer
and I could try) the following:
1) Since you are assigning widths to characters, and since each
logical cluster would get a width = sum of the widths of the
characters in that cluster, ...
2) ... all we need to do is design the font in such a way that, the
glyph corresponding to a logical cluster would use as much space as
available to it.
E.g.,
kra cluster consists of ka + chandrakkala + ra
so, when a software (say ls or cat) outputs a sequence ka +
chandrakkala + ra, the kra logical cluster will get widthC =
width(ka) + width(chandrakkala) + width(ra) allocated to it. In the
font, we make sure that the kra *glyph* which corresponds to the kra
*logical cluster* uses as much as possible of widthC.
With this, characters have a width specification, and glyphs can be
moulded to use as much of the space as possible/necessary as per the
widths assigned to each *character*.
----------
I hope I have set things right ?

Yep, this is right! Maybe you or your typographer friend could try
sketching out a few glyphs and see if it seems to work out well or not
(and what character width assignments would be required). The
character cell size I'm working with for my font with widespread
coverage of lots of scripts is 8x16, but larger or smaller font sizes
could of course be made too. In assigning widths, my inclination is
never to assume that more than 3 (or 4?) vertical strokes can fit in a
single cell, since 3 is the number in the Latin characters "m" and "w"
and since a cell size too small to represent Latin characters is
probably not useful anywhere.
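
The width rule confirmed above (a logical cluster's width is the sum of
the widths of its characters) can be sketched in a few lines of Python.
The width table here is purely an assumption for illustration; which
values wcwidth should really return for these characters is exactly
what this thread is trying to settle, and cluster_width is a
hypothetical helper, not part of any library.

```python
# Hypothetical per-character cell widths. These numbers are placeholders,
# not values taken from any existing wcwidth implementation.
ASSUMED_WIDTH = {
    "\u0d15": 2,  # MALAYALAM LETTER KA
    "\u0d4d": 0,  # MALAYALAM SIGN VIRAMA (chandrakkala)
    "\u0d30": 2,  # MALAYALAM LETTER RA
}


def cluster_width(cluster: str) -> int:
    """Return the cells allocated to a cluster: the sum of its characters' widths."""
    return sum(ASSUMED_WIDTH[ch] for ch in cluster)


# kra = ka + chandrakkala + ra
kra = "\u0d15\u0d4d\u0d30"
print(cluster_width(kra))  # prints 4 with the assumed table above
```

With these assumed widths the kra glyph would have 4 cells to fill; a
different table changes the number but not the rule.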

In terms of simplifying font design, it helps if conjunct forms can be
reduced as much as possible to 'gluing together pieces'. UCF allows
the shape of the pieces to vary depending on the adjacent pieces. For
example a Latin "fi" ligature is made not by creating a single wide
"fi" glyph but instead a special glyph for "f when it is followed by
i" and a special glyph for "i when it follows f". In conjunct
formation for many scripts (including diacritic placement for western
scripts, stacking for Tibetan, and various others) this model works
out nicer and greatly reduces the number of glyphs needed (and the
amount of maintenance/font design work). However, if needed, it's
possible to convert whole predrawn "conjunct glyphs" to the UCF rules
format -- it just might require a lot of glyphs. For Malayalam, a mix
of the two approaches is probably appropriate, depending on whether
the particular conjunct is formed by putting together 'reusable' parts
or whether it's highly unique to the character sequence it represents.
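
To make the 'pieces' model concrete, here is a minimal Python sketch of
the "fi" example: every character keeps its own cell, but the glyph
chosen for that cell depends on the neighbouring character. The glyph
names and the select_glyphs function are hypothetical, and this is not
the actual UCF rules format, only an illustration of the idea.

```python
def select_glyphs(text: str) -> list[str]:
    """Pick one glyph name per character, varying it by adjacent characters."""
    glyphs = []
    for i, ch in enumerate(text):
        prev_ch = text[i - 1] if i > 0 else ""
        next_ch = text[i + 1] if i + 1 < len(text) else ""
        if ch == "f" and next_ch == "i":
            glyphs.append("f.before_i")  # hypothetical name: f when followed by i
        elif ch == "i" and prev_ch == "f":
            glyphs.append("i.after_f")   # hypothetical name: i when it follows f
        else:
            glyphs.append(ch)            # default glyph, no contextual variant
    return glyphs


print(select_glyphs("fin"))  # prints ['f.before_i', 'i.after_f', 'n']
```

The output always has one glyph per input character, so the cell layout
is unchanged; only the shapes drawn in the cells differ.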

Hopefully this information is helpful to you or anyone else thinking
about designing fonts.

Rich

--
Linux-UTF8: i18n of Linux on all levels
Archive: http://mail.nl.linux.org/linux-utf8/
