Discussion: Next Generation Console Font?
Rich Felker
2006-08-02 02:53:50 UTC
To Markus et al.:

I read in the ancient archives for this list some ideas regarding a
so-called next generation console font, supporting Unicode level 3
combining in a character cell environment. I'm presently working on a
new terminal emulator called uuterm (think of the uu as µ-ucs or
something) with the goal of extremely efficient but complete Unicode
support, and I'm at the point of needing a font format. Using OpenType
fonts is not an option since I want it to run on small systems and
possibly even serve as a basis for integrating into Linux as a
replacement console device (the current design is set up to run both
on the Linux fb from userspace and as an X client, but is adaptable to
various display and input devices).

Was any progress ever made towards specifying the format or
requirements for NCF? Here are some ideas I have, quoting Markus's
original wish list with my comments interleaved:
- space for up to 1 million glyphs
I see no realistic need for this many but placing arbitrary limits
doesn't help us anyway.
- efficient access path to these glyphs (i.e., something
better than linear search, that can be accessed from a
memory mapped file)
Absolutely. See below for my (albeit possibly overcomplex) proposal.
- support for glyph variant options (i.e., the user can activate
or deactivate some style options, which will influence the
character/glyph mapping)
- glyph variations could include bold/italic/wide/etc., they could also
I question the necessity. Character cell devices do not lend
themselves well to italics unless you're going to allow glyphs to
extend outside their cells, and then you're getting into the realm of
advanced pretty-typesetting where you should probably be using
OpenType fonts and mlterm. Bold is a nice option, but the vast
majority of scripts simply do not admit legible bold glyphs at
reasonable character-cell sizes. (Arguably even Latin does not, due to
the letter m.)
include CJK style variations, as well as more mundane things
CJK variations are definitely a consideration. My inclination is to
take the Unicode approach and assume that the user will choose a font
suited to the form of the glyphs they're most comfortable with.
However, I've also seen cases where traditional Chinese text looks very
strange when the characters that also exist in Japanese are shown in a
Japanese style while the rest (which have no Japanese variant) are
shown in traditional style. Unfortunately I see no easy way for a
character cell device to handle this anyway, short of adding attribute
modes for CJK variant or interpreting the nonspacing Unicode
characters for variant selection. Still, neither of these will be
present when you run "ls" and some of your filenames are Chinese while
others are Japanese, unless you fill your filenames with escape
sequences or something disgusting like that...
such as whether you want to have a slash through a zero or
whether you want to have visible codes for the many different
Unicode space characters (for debugging)
Feature creep... Wouldn't it make more sense to load a different font,
or to use something like 'sed -n l'? By the way, my proposed design below
lends itself fairly well to 'stacking' multiple fonts if you want to
use a specialized supplemental font for some characters and fall back
to another font for the rest. "Debug glyphs" seems like a case where
an approach like that might be appropriate.
- support for ligature substitution (for languages that depend
on spacing combining characters); this means that a sequence of
several Unicode characters can be replaced by a single wide glyph.
You can look at it that way, or you can look at it as "f gets
displayed as [lefthand part of fi ligature] if following character is
i". While the former is maybe slightly more natural for designing a
font, the latter is much more natural for a character cell device IMO.
During any ligature substitution, the number of cells must always be
invariant anyway (even if this looks ugly); otherwise applications
cannot rely on column formatting without knowing detailed information
about the script (beyond what wcwidth() reports) and the terminal
contents will get corrupted (and look much uglier than overwide
ligatures).

[Yes I know that my "fi" example is idiotic on a character cell device
but I didn't want to try to type Arabic characters which I don't have
any clue about.. :) ]
- support for glyph variations such as smaller uppercase character,
which have enough space on top to fit a combining character over them.
This can again be a contextual variation, like ligatures: "A gets
displayed as [stubby A] if followed by [list of all combining marks
that go above Latin characters]".

Now, a question: should the font have to include such context
definitions inline, or should the font format specify predefined
classes like "superscribed combining marks"?
- support for combining character, not only by simple overstriking,
but also by allowing some offset. In other words, I want the
diaeresis over the "a" to be 2 pixels lower than over the "A",
therefore there should be a way to add some "combining_shift(0,-2)"
attribute to the "a" character to make this happen.
Offsets are not necessary; they complicate rendering and can be
implemented equally well with ligature substitution: "[diaeresis]
gets displayed as [low diaeresis] if attached to a lowercase Latin
letter".

To qualify this claim of non-necessity: my main goal is character cell
support for Tibetan script, which is probably the most complicated
stacking script in Unicode. I've successfully produced a set of glyphs
that stack and attach to one another correctly using only simple
ligature-context-style rules, without any offsets.
So an NCF (Next generation Console Font) file would contain
a set of glyphs, each of which comes with some attributes.
These attributes could be
- the sets of Unicode characters that this glyph can represent
under the condition that some style variant has been activated
I strongly _like_ this approach of defining glyphs and listing the
characters they can represent, rather than indexing glyphs by
character, at least as a way of thinking. Whether it's efficient to
store them that way remains to be seen. It may be ideal to have a
"source" format for fonts that works along these lines coupled with a
binary format that's optimized for looking up characters.
- the set of Unicode sequences that this glyph can
represent (ligature substitution), again perhaps conditioned
by style variants
I would prefer contexts to sequences.
- shift offsets for various types of combining characters
Again, IMO not needed. Overstrike handles it all as long as you have
ligature substitutions.
- etc.
:)

Now some ideas of mine for design of a file format. This is not likely
a good source format for editing, but is designed for runtime use by
applications (and possibly the kernel) which want to keep memory usage
and cpu expenditure low. It's also intended that it should be possible
to convert back and forth to the source format without losing any
information about the relationship between glyphs and characters.

So, here goes an idea:

----------------------------------------------------------------------

The file begins with a header consisting of a magic number, the file
format version, the character cell width and height, a search depth
for the binary lookup tree (see below), and an offset to the start of
the tree.

Next (actually the location does not matter) is the binary lookup
tree. Each tree node consists of three numbers: the pivot codepoint
and offsets to the left and right branches. All codepoints less
than the pivot are to the left; codepoints greater than or equal to
the pivot are to the right.

After walking exactly /depth/ steps down the tree (exactness
requirement allows performance-oriented implementations to perform the
walk branchlessly using an unrolled loop and does no harm to other
implementations; typical depth will be 4-6 depending on presence or
absence of cjk characters), a block header is found at the resulting
file offset.
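
Here's a rough C sketch of the lookup walk, just to make the idea
concrete (the struct layout and the read_node decoder are invented for
illustration; the real encoding is byte-oriented as discussed below):

  #include <stdint.h>

  /* Illustrative only: one node of the first-draft lookup tree. */
  struct node {
      uint32_t pivot;       /* first codepoint routed to the right */
      uint32_t left, right; /* file offsets of the two subtrees */
  };

  struct node read_node(const unsigned char *p); /* hypothetical decoder */

  /* Walk exactly 'depth' steps; the fixed depth is what lets a
   * performance-oriented implementation unroll this loop and run
   * it branchlessly. Returns the offset of the block header. */
  uint32_t tree_lookup(const unsigned char *file, uint32_t root,
                       int depth, uint32_t c)
  {
      uint32_t off = root;
      while (depth--) {
          struct node n = read_node(file + off);
          off = (c < n.pivot) ? n.left : n.right;
      }
      return off;
  }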

Blocks group characters and associated glyphs, allowing more compact
data representations. After an initial header, blocks contain a list
of offsets to the character data for each character in the block.

In their least sophisticated form, blocks could simply correspond to
roughly contiguous portions of the UCS character space in order to
avoid having thousands of table entries for unassigned characters. A
depth of zero is even possible, indicating no binary tree and just one
block with a million character table pointers in it. However, much more
sophisticated and efficient storage is possible; in order to see how
blocks might be used, I first detail the character/glyph structure and
the relationship between characters and glyphs.

Characters are essentially (hybrid linked/sequential) lists of glyph
substitution rules. In many cases the rule is simply a single inline
glyph. Character data begins with a control code indicating a
substitution rule. If the substitution rule matches, the immediately
following glyph data is used. It may be an inline glyph or a global
offset to any other glyph in the file, for characters which share
glyphs. If the substitution rule fails, the corresponding data is
skipped and another control code is read at the new position.

Glyph data consists of the glyph height (in scanlines), the row number
of the first scanline of the glyph, and the byte-aligned scanline
bitmaps. Width in character cells (single or double) is implicit from
the codepoint (is this reasonable?).
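
Rendering a glyph is then nothing more than blitting those scanlines
at the stored row offset. A sketch, assuming an 8-pixel-wide cell (one
byte per scanline) and a hypothetical put_pixel output routine:

  void put_pixel(int x, int y); /* hypothetical device store */

  /* Sketch: draw 'height' scanlines starting at scanline 'row' of
   * the cell; 'bits' is the byte-aligned bitmap, one byte per row
   * for an 8-pixel-wide cell. */
  void draw_glyph(int row, int height, const unsigned char *bits)
  {
      int x, y;
      for (y = 0; y < height; y++)
          for (x = 0; x < 8; x++)
              if (bits[y] & (0x80 >> x))
                  put_pixel(x, row + y);
  }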

Any or all of these properties can be block-global. If all characters
in the block have only a single glyph, then a flag can be set
indicating that the character pointers point to glyphs directly (no
control code). There is also a flag indicating that height and row
offset are shared among all glyphs. Regardless of which properties are
shared by all characters/glyphs in the block, a fixed offset increment
between characters may be specified in lieu of a table of offsets.
This is particularly useful in minimizing the size of huge CJK blocks.

----------------------------------------------------------------------

Comments?

The main questions I have are:

What format to use for numbers? Some numbers have fixed range
depending on, for example, the character cell width and height. Glyph
height and row offset are 4 bits each for height=16 and could be
packed into one byte when they fit. Other numbers can widely vary in
range.

As long as a portable, non-host-specific byte order is used I see no
reason to worry about alignment or using special word sizes. Numbers
will need to be read byte-by-byte anyway. They could be stored in a
vlc (variable length coding) format where one bit of each byte is
stolen for use as a continuation bit, or the common length in bytes
could be predeclared in an appropriate context.
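
For example, a decoder for the continuation-bit variant is only a few
lines (a sketch; I'm assuming here that the high bit marks
continuation and that the most significant 7-bit group comes first,
which the spec would have to pin down):

  /* Sketch: read a vlc whose high bit means "another byte follows";
   * the low 7 bits of each byte carry the value, MSB group first.
   * Advances *p past the vlc. */
  unsigned long read_vlc(const unsigned char **p)
  {
      unsigned long v = 0;
      unsigned char c;
      do {
          c = *(*p)++;
          v = (v << 7) | (c & 0x7f);
      } while (c & 0x80);
      return v;
  }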

The other major issue is glyph substitution (ligature) context rules.
What rules are needed to accommodate all scripts? Here are some ideas:

1. following [characters]
2. base character in [characters]
3. attached to [characters]
4. has [characters] attached
5. followed by [characters]

where each occurrence of [characters] means a bracket-expression-like
pattern of character codepoints to match. What I wonder is if this
list can be compressed down to just two types of rules, one for
previous character (previous cell if the character is spacing, and
previous character in combining stack if character is nonspacing) and
another for the next character (next cell or next in stack). This
two-rule scheme seems sufficient for the most complicated stacking
script, and the one I'm familiar with, Tibetan, but I'm unsure whether
it's sufficient for
everything one would need to do with ligatures. Particularly of
interest is what's needed for vowel-placement-swapping in Indic
languages and everything to do with Arabic ligatures.
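
To illustrate what the two-rule reduction might look like in code
(data structures invented purely for the sake of the example):

  #include <stdint.h>

  /* Sketch: a context as two bracket-expression-like sets of
   * codepoint ranges, one tested against the previous character
   * (cell or combining stack) and one against the next. An empty
   * set matches anything. */
  struct range { uint32_t lo, hi; };
  struct ctx {
      const struct range *prev, *next;
      int nprev, nnext;
  };

  static int in_set(const struct range *r, int n, uint32_t c)
  {
      int i;
      if (!n) return 1;
      for (i = 0; i < n; i++)
          if (c >= r[i].lo && c <= r[i].hi) return 1;
      return 0;
  }

  int ctx_match(const struct ctx *cx, uint32_t prev, uint32_t next)
  {
      return in_set(cx->prev, cx->nprev, prev)
          && in_set(cx->next, cx->nnext, next);
  }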

Experts, please step up and make suggestions. I expect this to go
through various stages of revision and I will probably implement and
use drafts in uuterm in the meantime, but I'd like to reach a good
stable final spec, especially if there's hope of finally getting
proper Unicode support into the Linux console.

Rich
Rich Felker
2006-08-02 16:03:03 UTC
A revised, simplified file format proposal based on my original
sketch, some of Markus's ideas for "NCF", and an evaluation of which
optimizations were likely to benefit actual font data.

Definitions:

All numeric fields are variable length coded, using the high bit of
each byte as a continuation bit. This is something like UTF-8 except
more efficient since we don't need synchronization or ASCII-compat
properties.

Many fields are such that large values would be nonsense anyway, but
everything is vlc just for consistency. I have marked the fields which
should not be more than 1 byte as such.

All offsets are displacements relative to the byte immediately
following the vlc in which the offset is stored. Thus an offset of 0
means the data being pointed to immediately follows the vlc.

File header:
- magic number: 8 bytes
- version: vlc
- cell width: vlc (1 byte)
- cell height: vlc (1 byte)
- tree depth: vlc (1 byte)
- tree offset: vlc
- context count: vlc
- context spacing: vlc (1 byte)
- offsets to individual context structures, padded with junk if
necessary so that successive offset vlcs are spaced context_spacing
bytes apart from one another.
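
Reading the fixed part of the header is then just a sequence of vlc
reads (a sketch; read_vlc is the decoder sketched in my first
message, and the struct is only for illustration):

  unsigned long read_vlc(const unsigned char **p); /* as sketched before */

  struct header {
      unsigned long version, cell_w, cell_h, depth;
      unsigned long tree_off, nctx, ctx_spacing;
  };

  /* Sketch: 'p' points just past the 8-byte magic number. The
   * context offset table, one vlc every ctx_spacing bytes, follows
   * immediately after the fields read here. */
  void read_header(const unsigned char *p, struct header *h)
  {
      h->version     = read_vlc(&p);
      h->cell_w      = read_vlc(&p);
      h->cell_h      = read_vlc(&p);
      h->depth       = read_vlc(&p);
      h->tree_off    = read_vlc(&p); /* relative to byte after its vlc */
      h->nctx        = read_vlc(&p);
      h->ctx_spacing = read_vlc(&p);
  }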

Tree nodes:
- pivot: vlc
- right offset: vlc

The left child of each node begins immediately following the last
field of the node structure. The right child is indexed by the right
offset vlc. Pivot is the first codepoint that falls on the right
child, expressed relative to the parent node's pivot element (which is
implicitly 0 at the root).

After following the full depth of the tree as specified in the header
(which may be 0, i.e. degenerate), the resulting position contains a
block. Blocks represent (hopefully) contiguous ranges of (hopefully)
uniform-width characters. The block structure itself has exactly one
field, a vlc "stride" indicating the difference between successive
character record positions in the block. The first character of the
block begins immediately after the stride vlc. In particular, if
stride is 0, all characters in the block are represented by a single
record. This is particularly useful for representing ranges of
unassigned code points or characters not covered by the font.
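
In code, the whole path from codepoint to character record might look
like this (a sketch; note how the relative-offset rule makes the left
child implicit):

  unsigned long read_vlc(const unsigned char **p); /* as before */

  /* Sketch: walk the tree and locate a character record. Pivots
   * are stored relative to the parent's pivot (0 at the root);
   * each offset is relative to the byte just past its own vlc. */
  const unsigned char *find_record(const unsigned char *p,
                                   int depth, unsigned long c)
  {
      unsigned long base = 0;  /* parent pivot */
      unsigned long first = 0; /* lowest codepoint routed here */
      unsigned long stride;
      while (depth--) {
          unsigned long pivot = base + read_vlc(&p);
          unsigned long roff  = read_vlc(&p);
          if (c >= pivot) {
              p += roff;       /* right child */
              first = pivot;
          }                    /* else: left child follows in place */
          base = pivot;
      }
      stride = read_vlc(&p);   /* block header: one field */
      return p + stride * (c - first); /* stride 0: one shared record */
  }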

A character record begins with an opcode vlc. The following values are
special:
- 0: a glyph to be used immediately follows the vlc
- 1: a vlc offset immediately follows the vlc. if the offset is 0 then
no glyph is available for the character. otherwise process the
character record at the specified offset.
- 2-3: reserved
- 4-...: context code from the context table, followed by a vlc
offset. if the character's context matches the context code, then
the glyph to be used is located at the specified offset. otherwise,
process the character record immediately following the offset vlc.

In all cases, glyph data is just the raw scanline bits.
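
So the full per-character lookup is a short loop over opcodes (a
sketch; context_matches is a stand-in for whatever test the context
table ends up defining):

  unsigned long read_vlc(const unsigned char **p); /* as before */
  int context_matches(unsigned long ctx); /* hypothetical context test */

  /* Sketch: resolve a character record to glyph data, or return 0
   * if the character is not covered by the font. */
  const unsigned char *find_glyph(const unsigned char *p)
  {
      for (;;) {
          unsigned long op = read_vlc(&p), off;
          if (op == 0) return p;      /* inline glyph follows */
          if (op == 1) {
              off = read_vlc(&p);
              if (!off) return 0;     /* no glyph available */
              p += off;               /* chain; forward-only, so no
                                       * infinite loops are possible */
              continue;
          }
          if (op < 4) return 0;       /* reserved opcodes */
          off = read_vlc(&p);
          if (context_matches(op - 4))
              return p + off;         /* glyph for this context */
          /* no match: next rule begins right after the offset vlc */
      }
  }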


In general, a well-formed file looks like:

1. Header

2. Context tables

3. Binary tree with 4-8 levels to avoid the need for a huge (million
entry) table of character record pointers, weed out empty
codepoints and group double-width characters together.

4. Within each tree leaf, either fixed-size records for each character
or a list of pointers to the actual character records. The program
writing the file can select which works best for its purposes.

5. Character records which select a glyph based on the table of
contexts. These can be essentially degenerate for characters
represented by exactly one glyph.

6. Glyphs can be stored alongside the last of the character records
they service (which helps save some space on offsets) or at the end
of the file, or anywhere else you can manage to put them.

Nice properties:

Computational overhead is very low. Finding a glyph to match a
character and context requires one conditional and "depth" memory
accesses to reach the block head, a small amount of arithmetic to reach
the character record, and then whatever conditionals the context rules
require (just one for ligature-free characters). Cost rises somewhat
if you want to tolerate corrupt font files gracefully but remains
inexpensive.

Memory overhead is low. In addition, while not required it is
certainly possible and encouraged that related glyphs be kept close
together. This will promote efficient demand-paging so that scripts
not in use do not consume memory. The format was designed so that it
can be parsed on demand while mmapped, without the need to predecode
and cache large tables in process-local memory. Embedded systems may
use a compressed filesystem or emulate something similar in userspace.
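
Concretely, loading a font is nothing but an open and an mmap (a
sketch, with error handling kept minimal):

  #include <fcntl.h>
  #include <sys/mman.h>
  #include <sys/stat.h>
  #include <unistd.h>

  /* Sketch: map the font read-only. Pages fault in on demand as
   * glyphs are used, and can be shared by every process using the
   * same font (or dropped by the kernel) for free. */
  const unsigned char *font_open(const char *path, size_t *size)
  {
      struct stat st;
      void *map;
      int fd = open(path, O_RDONLY);
      if (fd < 0) return 0;
      if (fstat(fd, &st) < 0) { close(fd); return 0; }
      map = mmap(0, st.st_size, PROT_READ, MAP_SHARED, fd, 0);
      close(fd); /* the mapping outlives the descriptor */
      if (map == MAP_FAILED) return 0;
      *size = st.st_size;
      return map;
  }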

Assuming a source format of the form "collection of glyphs with each
glyph listing which characters it can represent and in what contexts
it can do so with priorities assigned to each context to resolve
conflicts", the above mentioned binary format can be transformed back
and forth to and from the source format without loss of important
information. In particular, the notion that a single glyph is
representing several characters need not be lost. Other attributes
such as a descriptive name for the glyph or comments may of course be
lost however.

Expected size for a complete BMP font at 8x16 is 1.7-2 megs. Omitting
Traditional Chinese could reduce this (how much? what portion of the
CJK characters are Traditional-only?) and leave most users happy.
Omitting CJK entirely leaves the expected size around 200-250k.

Known Limitations:

Glyph width is stored nowhere in the font file. This is intentional
because the font does not get to decide on character width; it must
match wcwidth(). Yet it does require the user to have a working
wcwidth implementation supporting any character that will be used. In
the absence of such a routine, the font file is still parsable and
valid but display will be incorrect. One possible "fix" for this is to
use separate left-half and right-half "glyphs" for double-width
characters and define contexts of "left half" and "right half". This
approach may more accurately reflect the needs of applications anyway.
It's also possible that full-width versions of ordinary narrow
combining marks may be needed for use on CJK characters...? Can anyone
comment on this?
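
In other words the renderer always asks the C library, never the font,
how many cells a character occupies (a trivial sketch):

  #define _XOPEN_SOURCE 600
  #include <wchar.h>

  /* Sketch: cell count comes from wcwidth(), so the font and the
   * applications using the terminal can never disagree. */
  int cells_for(wchar_t c)
  {
      int w = wcwidth(c);   /* 0 = combining, 1 or 2 = spacing */
      return w < 0 ? 1 : w; /* render nonprintables in one cell */
  }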

Variable-width fonts are not supported. Honestly I am not interested
in them, since variable-width bitmap fonts have no practical modern
applications; printing, and pretty layout like a web browser might do,
both require high-quality scalable fonts. I do have some ideas for how
variable-width fonts could be supported if anyone really cares for
them.

References to common glyphs can only be forward references, not back
references. If this is a problem a signed vlc could be used for the
glyph offset. The chaining "opcode 1" is intentionally unsigned
however to make infinite loops impossible so that an implementation
does not have to waste complexity trying to detect them. It's always
possible to store the glyphs such that forward references are
sufficient, but for the sake of optimizing the number of pages mapped
into ram, it may be desirable to locate a shared glyph with the more
common range it is used in. However it's unlikely that an automated
font compiler would know how to do this anyway.

Context format for ligature rules has not been specified, pending
expert advice on the matter. :)

Rich
Werner LEMBERG
2006-08-02 22:21:56 UTC
Post by Rich Felker
A revised, simplified file format proposal based on my original
sketch, some of Markus's ideas for "NCF", and an evaluation of which
optimizations were likely to benefit actual font data.
What about using bitmap-only TrueType fonts, as planned by the X
Windows people? This has the huge advantage that both FreeType and
FontForge (and X Windows too in the not too distant future) already
support them.

I can't see an immediate advantage of a new bitmap format.


Werner
Rich Felker
2006-08-03 04:28:13 UTC
Post by Werner LEMBERG
Post by Rich Felker
A revised, simplified file format proposal based on my original
sketch, some of Markus's ideas for "NCF", and an evaluation of which
optimizations were likely to benefit actual font data.
What about using bitmap-only TrueType fonts, as planned by the X
Windows people?
Could you direct me to good information? I have serious doubts but I'd
at least like to read what they have to say.
Post by Werner LEMBERG
This has the huge advantage that both FreeType and
FontForge (and X Windows too in the not too distant future) already
support them.
Quite frankly it doesn't matter if FontForge supports a bitmap font
format because "xbitmap" is the ideal tool for making bitmap fonts.
However as long as you can import and export glyphs I see no reason
FontForge couldn't be used for this too.

I also get the impression from some Apple papers I was browsing
recently that TTF/OpenType put the burden of knowing how to stack
combining characters and produce ligatures onto the software rather
than the font. Under such a system, applications will never support
all scripts unless they use one of the unwieldy libraries with all of
this taken care of...
Post by Werner LEMBERG
I can't see an immediate advantage of a new bitmap format.
...on the other hand, at least for bitmap fonts, simple rule-based
substitutions set up by the font designer can easily provide the
needed functionality with less than 5kb of code doing all the glyph
processing.

Right now we're at an unfortunate point where the core X font system
has been deprecated, but there is nothing suitable in its place.
Moreover non-X unix consoles are essentially deprecated as well since
they lack all but some patronizing Euro-centric 512-glyph "Unicode"
support. Do you think someone is going to integrate FreeType into
Linux anytime soon? :)

All problem solving is about choosing the right tool for the job.
Storing bitmap fonts in the TTF/OpenType framework is like using a
nuclear missile to toast fruit flies, or like driving an SUV to
commute to the office...

When it comes to character cell fonts (which is an even narrower
problem field than bitmap fonts), the goal is something that can
provide the baseline support for readable and correct display of any
script and that can work in any environment needing character cell
display, from embedded systems to unix console drivers to 'poweruser'
X sessions with 50 terminals open across 8 virtual desktops and half
of them scrolling text constantly... What is NOT needed is more
substanceless eyecandy that takes 500 megs of ram and 3ghz to run
smoothly. Doesn't anyone find it a bit ironic that you can get
translations of the featureless GNOME and KDE applets (which are about
as bare and useless as the MS Windows "accessories") into almost any
language and that the widgets display all the scripts correctly, but
then when you go try to USE your language for any serious work on unix
you find that everything displays bogus when you type "ls" in your
shell, that most of the powerful text editors have no idea about
something as basic as nonspacing characters, that ELinks still insists
on dumbing-down the perfect UTF-8 it received from the web into a
legacy codepage before converting it back to UTF-8 to display on your
terminal, etc.?

There's a severe gap in where the focus on m17n and i18n is being
placed and where it's needed, and IMO a huge part of that comes from
the fact that most competent unix users _scorn_ m17n and i18n because
of the perception that it's inherently bloated. I've met plenty of
people whose knee-jerk reaction after typing ./configure --help is to
--disable any i18n-related option they see out of fear that it will
fill up their disk with unwanted crap, introduce security vulns, or
just make the program use 3-10x the memory it should.. Fears like this
are compounded by the fact that, at present, the user is forced to
make a choice between "lightweight configuration with incorrect or no
m17n support" and "bloated configuration that pulls in Pango, glib,
fontconfig, Xft, Xrender, ..." just to be able to get "odd scripts I
don't really care about" to display properly. This sort of dichotomy
of course perpetuates the lack of good support. GNOME coders who have
no features in their apps except for the beautiful behavior of the gui
widgets are happy to spend effort (and tens of megabytes) linking to
the behemoth libs and patting themselves on the back for being
"internationally friendly", while developers working on long-standing
projects where most of the substance lies somewhere other than the gui
presentation are baffled by these libs for which they understand
neither the necessity nor the implementation. And unlike rad gui
monkeys who are happy to copy-and-paste-and-glue whatever libs someone
throws at them, people working on mature projects with actual features
are wary of including support for anything they don't understand. The
perpetuation of the myth that "correct rendering of all the world's
scripts is difficult and requires advanced software" of course makes
them even more wary.

I could go on and on for years and years about this for sure. I'm
extremely bitter about the sad state of m17n on unix and the fact that
there is not even one working Unicode terminal with simultaneous
support for all scripts.

So with that said, I'll continue on with my draft bitmap font format
(which already has a lot more simplifications -- remember, a work of
art is only complete when you can't find anything left to _remove_
from it), write my 5kb of code, integrate it into uuterm, and
somewhere in the next few months aim to have the first working Unicode
terminal emulator... in a 50kb static binary.

So much for "m17n is bloated crap"...

Rich


P.S. At the same time I'd be very happy to discuss bitmap and
character cell fonts, and other users' and developers' requirements for
them (particularly for scripts I'm not familiar with). Also I have a
tendency to flame, especially when I'm bitter about something. Please
don't take anything I said too seriously except for that overall
thesis that things are in a bad state and lightweight m17n support is
critically needed in order to enable use and development of m17n in
serious apps.
Russell Shaw
2006-08-03 05:07:09 UTC
Rich Felker wrote:

... snip long stuff

I agree on the total crappiness of current "mainstream" GUI implementations.

A decent X GUI application should run blazingly fast on a 66MHz 486, and only
be a few tens of kB in size. I'd love to have X apps run well on old laptops
with 4MB video ram.

When "they" say core X fonts are deprecated, "they" do not represent anyone but
their own agendas.

You might like to know that STSF (http://stsf.sourceforge.net/about.html) has
been unmaintained for a couple of years, but is being resurrected with a recent
patch or two. It does server-side rendering and unicode.

http://developers.sun.com/dev/gadc/technicalpublications/presentations/iuc22-stsf.pdf

I don't know how stsf handles exotic scripts (still learning stuff like that myself).

The other way is to use freetype2 and Xrender. This will also be quite compact in size,
but you'll have to add all the glyph formatting scripts for different languages if required.

STSF is designed for pluggable rendering engines without needing to recompile X iirc.
Werner LEMBERG
2006-08-03 06:44:38 UTC
Post by Russell Shaw
You might like to know that STSF
(http://stsf.sourceforge.net/about.html) has been unmaintained for a
couple of years, but is being resurrected with a recent patch or
two. It does server-side rendering and unicode.
What about the stuff on http://www.m17n.org? This looks promising too.
Post by Russell Shaw
The other way is to use freetype2 and Xrender. This will also be
quite compact in size, but you'll have to add all the glyph
formatting scripts for different languages if required.
I don't understand exactly what you mean. Please give an example.


Werner
Russell Shaw
2006-08-03 07:18:11 UTC
Post by Werner LEMBERG
Post by Russell Shaw
You might like to know that STSF
(http://stsf.sourceforge.net/about.html) has been unmaintained for a
couple of years, but is being resurrected with a recent patch or
two. It does server-side rendering and unicode.
What about the stuff on http://www.m17n.org? This looks promising too.
Interesting. I'll study it more.
Post by Werner LEMBERG
Post by Russell Shaw
The other way is to use freetype2 and Xrender. This will also be
quite compact in size, but you'll have to add all the glyph
formatting scripts for different languages if required.
I don't understand exactly what you mean. Please give an example.
I mean that when you enter characters from the keyboard, they are
not one-to-one with the glyphs to print, so you need to build grapheme
clusters and all that, and map them to glyphs according to the language,
as you probably know (it's in the big Unicode Standard book).
Rich Felker
2006-08-03 16:21:49 UTC
Post by Russell Shaw
... snip long stuff
I agree on the total crappiness of current "mainstream" GUI implementations.
Thanks. It's refreshing to have some support from the non-bloat crowd
in m17n issues. Usually there's the standard dichotomy where
bloat==m17n and lean==latin1...
Post by Russell Shaw
A decent X GUI application should run blazingly fast on a 66MHz 486, and only
be a few tens of kB in size. I'd love to have X apps run well on old laptops
with 4MB video ram.
For anyone laughing at your 486, remember: if your text rendering
system takes 300 mhz for responsive realtime rendering, that's 300 mhz
lost that could be spent on real processing. For instance (since I
come from MPlayer I'll use a video example) decoding the movie rather
than rendering the subtitles.
Post by Russell Shaw
When "they" say core X fonts are deprecated, "they" do not represent anyone but
their own agendas.
Here's a great example of the propaganda machine:
http://www.unifont.org/iuc27/html/img21.html

Still, I think it is correct to say that the X core font system is
deprecated, or at least the traditional use of it where fonts are
encoded in character sets is deprecated. The only way I see for using
it efficiently is to split a font into many glyphsets using
fontspecific encoding, then access the glyphsets you need and
translate test to glyphs before passing it to X. The problem is that
this translation is fontspecific and the font is serverside while the
information is needed clientside..

One possible approach I've considered is having the client application
provide an X font server to serve its own fonts, the sole purpose
being to allow them to be cached on the server side. The same thing
can be done with serverside bitmaps/pixmaps however and it's probably
less disgusting that way.
Post by Russell Shaw
You might like to know that STSF (http://stsf.sourceforge.net/about.html) has
been unmaintained for a couple of years, but is being resurrected with a recent
patch or two. It does server-side rendering and unicode.
I am interested although I think it's both outside the scope of and
unusable for what I'm doing right now. A character-cell font system
needs to work in cells, not entire lines/words/strings. This is needed
both for efficient update and for making sure the positions are
correct. Also I need support for non-X devices.

HOWEVER: since uuterm's display handling is modular and can easily be
swapped for another implementation, I would be interested in having an
STSF-based display module for it in the future.
Post by Russell Shaw
I don't know how stsf handles exotic scripts (still learning stuff like that myself).
My threshold of what's "exotic" seems to be much higher than most
people's by the way.
Post by Russell Shaw
The other way is to use freetype2 and Xrender. This will also be quite compact in size,
but you'll have to add all the glyph formatting scripts for different
languages if required.
Xrender is unfortunately very slow, and for my present needs (a
terminal emulator) truetype/opentype is really not feasible. You
simply can't get good 8x16 glyphs out of an outline font... and if
you're just going to have bitmap why bother with the whole big
truetype framework?
Post by Russell Shaw
STSF is designed for pluggable rendering engines without needing to recompile X iirc.
It would be better if it didn't even need pluggable engines, just some
simple tables or interpreted bytecode. Honestly I wouldn't feel very
comfortable loading new script modules written by m17n coders into the
X server to run as root, given the quality of code I've seen from this
scene in the past.

Rich
Russell Shaw
2006-08-03 17:46:29 UTC
Post by Rich Felker
Post by Russell Shaw
... snip long stuff
I agree on the total crappiness of current "mainstream" GUI implementations.
Thanks. It's refreshing to have some support from the non-bloat crowd
in m17n issues. Usually there's the standard dichotomy where
bloat==m17n and lean==latin1...
Post by Russell Shaw
A decent X GUI application should run blazingly fast on a 66MHz 486, and only
be a few tens of kB in size. I'd love to have X apps run well on old laptops
with 4MB video ram.
For anyone laughing at your 486, remember: if your text rendering
system takes 300 mhz for responsive realtime rendering, that's 300 mhz
lost that could be spent on real processing. For instance (since I
come from MPlayer I'll use a video example) decoding the movie rather
than rendering the subtitles.
Post by Russell Shaw
When "they" say core X fonts are deprecated, "they" do not represent anyone but
their own agendas.
http://www.unifont.org/iuc27/html/img21.html
Still, I think it is correct to say that the X core font system is
deprecated, or at least the traditional use of it where fonts are
encoded in character sets is deprecated. The only way I see for using
it efficiently is to split a font into many glyphsets using
fontspecific encoding, then access the glyphsets you need and
translate test to glyphs before passing it to X. The problem is that
this translation is fontspecific and the font is serverside while the
information is needed clientside..
One possible approach I've considered is having the client application
provide an X font server to serve its own fonts, the sole purpose
being to allow them to be cached on the server side. The same thing
can be done with serverside bitmaps/pixmaps however and it's probably
less disgusting that way.
If the X font system had some extra features added, server-side fonts
would probably be ok.

All that's needed is extra protocols for loading extra font info from the
server to the client on an as-needed per-glyph basis. All the details
required of a font need to be included, such as bezier curve coefficients
or bitmaps, kerning tables, bounding boxes, horizontal/vertical bearing and
advance etc. Unlike the current setup, it wouldn't load a whole font file
in one go.

I still haven't made up my mind if server-side fonts are really needed.
They'd be needed for users running X clients on a remote machine they
don't have an account on, but want to be able to set their own fonts.

It is somewhat appealing to have the X server be more of a dumb rasterizing
device that doesn't even have draw_line() or draw_text() functions, but
just draws primitive objects such as polygons, with all the intelligent
drawing done in the client libraries. Something more efficient than polygons
is needed, to avoid large amounts of network traffic.
Post by Rich Felker
Post by Russell Shaw
You might like to know that STSF (http://stsf.sourceforge.net/about.html) has
been unmaintained for a couple of years, but is being resurrected with a recent
patch or two. It does server-side rendering and unicode.
I am interested although I think it's both outside the scope of and
unusable for what I'm doing right now. A character-cell font system
needs to work in cells, not entire lines/words/strings. This is needed
both for efficient update and for making sure the positions are
correct. Also I need support for non-X devices.
HOWEVER: since uuterm's display handling is modular and can easily be
swapped for another implementation, I would be interested in having an
STSF-based display module for it in the future.
Post by Russell Shaw
I don't know how stsf handles exotic scripts (still learning stuff like that myself).
My threshold of what's "exotic" seems to be much higher than most
people's by the way.
Post by Russell Shaw
The other way is to use freetype2 and Xrender. This will also be quite
compact in size, but you'll have to add all the glyph formatting scripts
for different languages if required.
Xrender is unfortunately very slow, and for my present needs (a
terminal emulator) truetype/opentype is really not feasible. You
simply can't get good 8x16 glyphs out of an outline font... and if
you're just going to have bitmaps, why bother with the whole big
truetype framework?
Xrender can be very slow when it is used for things it is unsuited for.
This includes anything with curves such as circles or arcs. Tons of
polygons are needed to get a smooth edge. You need higher order primitives
without flat edges if you don't want to flood your network with small flat-edged
polygons.

Xrender can hold lists of bitmap glyphs in the server, which means you just send
a glyph index and x/y coordinate to render one. I think that's quite fast. You'd
easily get the impression that Xrender is very slow from the typical bloated and
slow apps that use/abuse it for eye candy.
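
For instance, the drawing side is roughly this (a sketch from memory,
untested; it assumes the glyph bitmaps were uploaded into the server
earlier with XRenderAddGlyphs):

  #include <X11/Xlib.h>
  #include <X11/extensions/Xrender.h>

  /* Sketch: with the bitmaps cached server-side in a GlyphSet,
   * drawing text is one small request carrying only glyph indices
   * and a position; no image data crosses the wire. */
  void draw_glyphs(Display *dpy, Picture src, Picture dst,
                   GlyphSet gs, int x, int y,
                   const char *indices, int n)
  {
      XRenderCompositeString8(dpy, PictOpOver, src, dst,
                              0 /* maskFormat */, gs,
                              0, 0, x, y, indices, n);
  }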

Freetype2 can be compiled in pieces, so you could leave out the truetype
part, I think.

However, when you render a glyph with freetype into a bitmap for Xrender, you
can do sub-pixel rgb rendering that gives the effect of anti-aliasing that is
three times sharper than pixel-level antialiasing. That gives quite smooth curves
that could look ok, making mono-space truetype fonts usable for terminal stuff.

I noticed that to do kerning, you have to place each glyph individually in
Xrender, even though Xrender can take strings of glyphs. It would be easy
to modify it so that you could get per-glyph kerning on whole strings. I think
it was intentionally crippled because of political BS as evidenced in the last
paragraph: http://lists.freedesktop.org/archives/cairo/2003-July/000253.html
Post by Rich Felker
Post by Russell Shaw
STSF is designed for pluggable rendering engines without needing to recompile X iirc.
It would be better if it didn't even need pluggable engines, just some
simple tables or interpreted bytecode. Honestly I wouldn't feel very
comfortable loading new script modules written by m17n coders into the
X server to run as root, given the quality of code I've seen from this
scene in the past.
By rendering engines, I meant things like freetype that do the actual
bitmap generation.
Rich Felker
2006-08-03 19:17:33 UTC
Post by Russell Shaw
Post by Rich Felker
One possible approach I've considered is having the client application
provide an X font server to serve its own fonts, the sole purpose
being to allow them to be cached on the server side. The same thing
can be done with serverside bitmaps/pixmaps however and it's probably
less disgusting that way.
If the X font system had some extra features added, server-side fonts
would probably be ok.
All that's needed is extra protocols for loading extra font info from the
server to the client on an as-needed per-glyph basis. All the details
Eh? The goal is to move things to the server, not back to the client.
Post by Russell Shaw
required of a font need to be included, such as bezier curve coefficients
or bitmaps, kerning tables, bounding boxes, horizontal/vertical bearing and
advance etc. Unlike the current setup, it wouldn't load a whole font file
in one go.
So the client downloads the font from the server, renders it, then
uploads the bitmaps (or other rendering primitives) back? This sounds
really inefficient. Or do you have the words client and server
reversed here?
Post by Russell Shaw
I still haven't made up my mind if server-side fonts are really needed.
They'd be needed for users running X clients on a remote machine they
don't have an account on, but want to be able to set their own fonts.
Basically there are two things programs use fonts for: interface and
documents/data. Interface fonts belong in the server and should be
rendered entirely by the server. Document fonts (fonts used in your
word processor, ps/pdf renderer, graphic design program, paint
program, ...) need to be under the control of the client at least to
some extent. When the client's output files depend on the particular
font rendering (e.g. gimp) then all the font handling and rendering
really needs to be client-side, like all other raster operations. When
the fonts are just needed for a wysiwyg preview (e.g. a word
processor) it's acceptable to have the server render the fonts as long
as the client can establish that the server has the right fonts.
Post by Russell Shaw
Xrender can hold lists of bitmap glyphs in the server, which means you just send
a glyph index and x/y coordinate to render one. I think that's quite fast. You'd
easily get the impression that Xrender is very slow from the typical bloated and
slow apps that use/abuse it for eye candy.
But why use xrender for this? The core X protocol can do the same and
it will work on any X server, not just new ones with an extension,
using plain X bitmaps.
Post by Russell Shaw
However, when you render a glyph with freetype into a bitmap for Xrender, you
can do sub-pixel rgb rendering that gives the effect of anti-aliasing that is
three times sharper than pixel-level antialising. That gives quite smooth curves
that could look ok, making mono-space truetype fonts useable for terminal stuff.
Subpixel only works on LCDs, which produce ugly output. Also it does
not give more vertical resolution which is actually what's needed in
certain scripts for small fonts to look good.
Post by Russell Shaw
I noticed that to do kerning, you have to place each glyph individually in
Xrender, even though Xrender can take strings of glyphs. It would be easy
to modify it so that you could get per-glyph kerning on whole strings. I think
it was intentionally crippled because of political BS as evidenced in the last
paragraph: http://lists.freedesktop.org/archives/cairo/2003-July/000253.html
I'm so sick of intentional crippling. It HURTS us in the fight against
patents rather than helping us...

Rich
Russell Shaw
2006-08-04 04:05:00 UTC
Post by Rich Felker
Post by Russell Shaw
Post by Rich Felker
One possible approach I've considered is having the client application
provide an X font server to serve its own fonts, the sole purpose
being to allow them to be cached on the server side. The same thing
can be done with serverside bitmaps/pixmaps however and it's probably
less disgusting that way.
If the X font system had some extra features added, server-side fonts
would probably be ok.
All that's needed is extra protocols for loading extra font info from the
server to the client on an as-needed per-glyph basis. All the details
Eh? The goal is to move things to the server, not back to the client.
The only advantage of server-side fonts is that the user has control over
them on their own machine. Apart from that, everything that you can do with
client-side fonts should be doable with server-side fonts.

Server-side font mode is simply what internet web browsers do. You look
at web pages from a machine you have no account on. The only way to change
the fonts displayed is to change the font settings of your web browser.
Post by Rich Felker
Post by Russell Shaw
required of a font need to be included, such as bezier curve coefficients
or bitmaps, kerning tables, bounding boxes, horizontal/vertical bearing and
advance etc. Unlike the current setup, it wouldn't load a whole font file
in one go.
So the client downloads the font from the server, renders it, then
uploads the bitmaps (or other rendering primitives) back? This sounds
really inefficient. Or do you have the words client and server
reversed here?
This is how Xrender glyph rendering currently works, except that the
font is initially loaded on the client side.

The server-side setup could be done two ways:

1)In normal use, the client can just send glyph positioning coordinates
to the server, and the server does the bitmap creation and draws glyphs.
If the client needs to do extra stuff with glyphs, it can load information
about specific glyphs if needed.

2)The client downloads the font from the server, renders it, then uploads
the bitmaps back, where they're cached in the server. Each glyph would
get rendered and cached as needed, to avoid large delays of processing
a whole font file at once.

By server-side fonts, i simply mean that fonts can be loaded from the X server,
but the bitmap creation isn't necessarily done by the server. It can still be
done on the client side, so you get all the advantages of client-side fonts.
Post by Rich Felker
Post by Russell Shaw
I still haven't made up my mind if server-side fonts are really needed.
They'd be needed for users running X clients on a remote machine they
don't have an account on, but want to be able to set their own fonts.
Basically there are two things programs use fonts for: interface and
documents/data. Interface fonts belong in the server and should be
rendered entirely by the server. Document fonts (fonts used in your
word processor, ps/pdf renderer, graphic design program, paint
program, ...) need to be under the control of the client at least to
some extent. When the client's output files depend on the particular
font rendering (e.g. gimp) then all the font handling and rendering
really needs to be client-side, like all other raster operations. When
the fonts are just needed for a wysiwyg preview (e.g. a word
processor) it's acceptable to have the server render the fonts as long
as the client can establish that the server has the right fonts.
Post by Russell Shaw
Xrender can hold lists of bitmap glyphs in the server, which means you just send
a glyph index and x/y coordinate to render one. I think that's quite fast. You'd
easily get the impression that Xrender is very slow from the typical bloated and
slow apps that use/abuse it for eye candy.
But why use xrender for this? The core X protocol can do the same and
it will work on any X server, not just new ones with an extension,
using plain X bitmaps.
The core thing loads whole font files if you query it iirc. Also, you can't
get more specific things such as the raw bitmap or bezier curve data.
There's some back-and-forth networking required when you want to calculate
the extents of words frequently, which I didn't like. A server-loadable
font setup should be as fast as a client-side font setup, which means
having access to all the font data from the server.

I really hate the grey blurry anti-aliased fonts. I've tried configuring
fontconfig and freetype and installing the latest freetype etc. With Xrender,
I can optimize the bitmap rendering myself with freetype.
Post by Rich Felker
Post by Russell Shaw
However, when you render a glyph with freetype into a bitmap for Xrender, you
can do sub-pixel rgb rendering that gives the effect of anti-aliasing that is
three times sharper than pixel-level antialising. That gives quite smooth curves
that could look ok, making mono-space truetype fonts useable for terminal stuff.
Subpixel only works on LCDs, which produce ugly output.
I think sub-pixel rendering also works for a crt, but a sudden change
in pixel value (such as the edge of a black square on a white background)
is smeared (convolved with the step response of analog electronics bandwidth)
into a few pixels on the crt. It shouldn't make sharpness any worse.
Post by Rich Felker
Also it does
not give more vertical resolution which is actually what's needed in
certain scripts for small fonts to look good.
It does give the effect of extra vertical resolution. The effect is that
of a small amount of sub-pixel antialiasing, making sloping lines look
less jagged. With a black edge on a white background, sub-pixel rendering
makes the individual r/g/b sub-pixels go from 100%(white)->67%->33%->0%(black).
Full-pixel anti-aliasing is what really wrecks the sharpness and darkness
of a glyph.
Post by Rich Felker
Post by Russell Shaw
I noticed that to do kerning, you have to place each glyph individually in
Xrender, even though Xrender can take strings of glyphs. It would be easy
to modify it so that you could get per-glyph kerning on whole strings. I think
it was intentionally crippled because of political BS as evidenced in the last
paragraph: http://lists.freedesktop.org/archives/cairo/2003-July/000253.html
I'm so sick of intentional crippling. It HURTS us in the fight against
patents rather than helping us...
Rich Felker
2006-08-05 00:42:15 UTC
Post by Russell Shaw
Post by Rich Felker
Subpixel only works on LCDs, which produce ugly output.
I think sub-pixel rendering also works for a crt, but a sudden change
in pixel value (such as the edge of a black square on a white background)
is smeared (convolved with the step response of analog electronics bandwidth)
into a few pixels on the crt. It shouldn't make sharpness any worse.
No, this is absolutely incorrect. Subpixel is fundamentally impossible
on a CRT because the CRT's rgb cells have nothing to do with the video
card's idea of "pixel". You can enable it and the degree to which it
looks bad will depend on a lot of factors, but it's most certainly not
doing what subpixel is intended to do.
Post by Russell Shaw
Post by Rich Felker
Also it does
not give more vertical resolution which is actually what's needed in
certain scripts for small fonts to look good.
It does give the effect of extra vertical resolution.
No it does not. If you claim this you should back it up with an
explanation.
Post by Russell Shaw
The effect is that
of a small amount of sub-pixel antialiasing, making sloping lines look
less jagged.
Yes, sloping lines. This is because the increase in resolution is
horizontal, not vertical.
Post by Russell Shaw
With a black edge on a white background, sub-pixel rendering
makes the individual r/g/b sub-pixels go from
100%(white)->67%->33%->0%(black).
Full-pixel anti-aliasing is what really wrecks the sharpness and darkness
of a glyph.
It sounds like everything you're saying comes from a very vague
understanding from a user standpoint rather than knowing what the
terms actually mean and what they do..

Rich
Russell Shaw
2006-08-05 03:56:11 UTC
Post by Rich Felker
Post by Russell Shaw
Post by Rich Felker
Subpixel only works on LCDs, which produce ugly output.
I think sub-pixel rendering also works for a crt, but a sudden change
in pixel value (such as the edge of a black square on a white background)
is smeared (convolved with the step response of analog electronics
bandwidth) into a few pixels on the crt. It shouldn't make sharpness any
worse.
No, this is absolutely incorrect. Subpixel is fundamentally impossible
on a CRT because the CRT's rgb cells have nothing to do with the video
card's idea of "pixel". You can enable it and the degree to which it
looks bad will depend on a lot of factors, but it's most certainly not
doing what subpixel is intended to do.
Not so. The video card DAC is putting out an analog signal that is a
readout of consecutive pixel values from the video ram.
The analog signal passes through various stages that cause the voltage
from a single pixel to contribute to other pixels next to it, so that
all video-ram pixels end up contributing overlapping values in a horizontal
direction on the crt. The amount of overlap depends on the analog bandwidth
of all the amplifiers and cable between the video DAC and electron gun.
(The crt gun output is the convolution of the DAC output and intermediate
channel impulse response).
Post by Rich Felker
Post by Russell Shaw
Post by Rich Felker
Also it does not give more vertical resolution which is actually what's
needed in certain scripts for small fonts to look good.
It does give the effect of extra vertical resolution.
No it does not. If you claim this you should back it up with an
explanation.
I have. There is effectively antialiasing with a granularity of single r/g/b
cells instead of rgb triplets (or whole pixels).
Post by Rich Felker
Post by Russell Shaw
The effect is that of a small amount of sub-pixel antialiasing, making
sloping lines look less jagged.
Yes, sloping lines. This is because the increase in resolution is
horizontal, not vertical.
What I mean is that even though the increase in resolution is horizontal,
it still makes sloping lines less jagged.
Post by Rich Felker
Post by Russell Shaw
With a black edge on a white background, sub-pixel rendering makes the
individual r/g/b sub-pixels go from 100%(white)->67%->33%->0%(black).
Full-pixel anti-aliasing is what really wrecks the sharpness and darkness
of a glyph.
It sounds like everything you're saying comes from a very vague
understanding from a user standpoint rather than knowing what the
terms actually mean and what they do..
Well, having designed analog and radio-frequency electronics since I was
10, and still doing it daily, I have a somewhat detailed understanding of it all.
--
Russell Shaw, B.Eng, M.Eng(Research)
Werner LEMBERG
2006-08-03 23:33:36 UTC
Post by Rich Felker
One possible approach I've considered is having the client
application provide an X font server to serve its own fonts, the
sole purpose being to allow them to be cached on the server side.
The same thing can be done with serverside bitmaps/pixmaps however
and it's probably less disgusting that way.
This is the X Rendering Extension using the Xft library, AFAIK.
Post by Rich Felker
You simply can't get good 8x16 glyphs out of an outline font... and
if you're just going to have bitmap why bother with the whole big
truetype framework?
Not the TrueType framework, but the SFNT container format is useful
even for bitmapped fonts.


Werner
Werner LEMBERG
2006-08-03 06:41:35 UTC
Post by Rich Felker
Post by Werner LEMBERG
What about using bitmap-only TrueType fonts, as planned by the X
Windows people?
Could you direct me to good information? I have serious doubts but
I'd at least like to read what they have to say.
http://www.pps.jussieu.fr/~jch/software/xfree86-bitmap-fonts.html

I don't know the current status of such fonts w.r.t. X Windows.
Post by Rich Felker
Quite frankly it doesn't matter if FontForge supports a bitmap font
format because "xbitmap" is the ideal tool for making bitmap fonts.
Please give a URL. Another good bitmap font editor is xmbdfed from
Mark Leisher.
Post by Rich Felker
I also get the impression from some Apple papers I was browsing
recently that TTF/OpenType put the burden of knowing how to stack
combining characters and produce ligatures onto the software rather
than the font. Under such a system, applications will never support
all scripts unless they use one of the unwieldy libraries with all
of this taken care of...
This is the wrong impression. What you probably mean is that some
language data needs to be preprocessed into a normalized form before
it is fed into the font, for example Indic and Arabic scripts.
However, it is possible to add arbitrary tables to the font (which is
another advantage of the SFNT format) which could move this
preprocessing into the font.
Post by Rich Felker
...on the other hand, at least for bitmap fonts, simple rule-based
substitutions set up by the font designer can easily provide the
needed functionality with less than 5kb of code doing all the glyph
processing.
This is handled by the GSUB table. There are many different formats,
beginning with simple glyph replacing and ending with complex
contextual glyph substitutions.
Post by Rich Felker
Right now we're at an unfortunate point where the core X font system
has been deprecated, but there is nothing suitable in its place.
You should contact Keith Packard regarding this issue. I think there
is just some delay in the conversion of PCFs to SFNT due to more
important problems.
Post by Rich Felker
Moreover non-X unix consoles are essentially deprecated as well
since they lack all but some patronizing Euro-centric 512-glyph
"Unicode" support. Do you think someone is going to integrate
FreeType into Linux anytime soon? :)
Why not? FreeType is very modular by design; it would be possible to
remove almost everything but bitmap-only SFNT handling. Note,
however, that this library doesn't interpret GSUB and other advanced
OpenType tables by itself. You need Pango or something similar for
this.
Post by Rich Felker
All problem solving is about choosing the right tool for the job.
Storing bitmap fonts in the TTF/OpenType framework is like using a
nuclear missile to toast fruit flies, or like driving an SUV to
commute to the office...
You are underestimating the problem, I think. The proper bitmap
format is the least important thing, and the compact SFNT bitmap
formats are not a bad choice IMHO. Much more important is the ability
to store the glyph substitution tables efficiently.
Post by Rich Felker
When it comes to character cell fonts (which is an even narrower
problem field than bitmap fonts), the goal is something that can
provide the baseline support for readable and correct display of any
script
What about top-to-bottom scripts like Mongolian, which can't be written
horizontally? So I repeat my question: Which scripts do you imagine
to support?
Post by Rich Felker
[...] I'm extremely bitter about the sad state of m17n on unix and
the fact that there is not even one working Unicode terminal with
simultaneous support for all scripts.
There is a simple reason for this: What you want to do is impossible.
There will never be a program which supports `all' scripts. Just
think of Urdu, a special variant of Arabic, which isn't just an R2L
script: It actually has this writing direction:

/ / /
/ / /
/ / /

The longer a word, the greater its vertical height.
Post by Rich Felker
So with that said, I'll continue on with my draft bitmap font format
(which already has a lot more simplifications -- remember, a work of
art is only complete when you can't find anything left to _remove_
from it), write my 5kb of code, integrate it into uuterm, and
somewhere in the next few months aim to have the first working
Unicode terminal emulator... in a 50kb static binary.
Good luck in handling Arabic and Indic scripts -- and Mongolian :-)


Werner
Rich Felker
2006-08-03 15:06:40 UTC
Permalink
Post by Werner LEMBERG
Post by Rich Felker
Post by Werner LEMBERG
What about using bitmap-only TrueType fonts, as planned by the X
Windows people?
Could you direct me to good information? I have serious doubts but
I'd at least like to read what they have to say.
http://www.pps.jussieu.fr/~jch/software/xfree86-bitmap-fonts.html
I don't know the current status of such fonts w.r.t. X Windows.
Wow, I had no idea that XF86 was so stupid as to gzip a file format that
was meant to be mmapped. That saves some disk space (dirt cheap) at
the expense of lots of load time and memory usage (expensive). If disk
space is really scarce a compressed fs should be used instead so mmap
is still available.
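
To illustrate what mmap buys here, a minimal sketch (error handling
pared down, and the assumption that the whole font is one flat
mappable image is mine):

  #include <fcntl.h>
  #include <stddef.h>
  #include <sys/mman.h>
  #include <sys/stat.h>
  #include <unistd.h>

  static const void *map_font(const char *path, size_t *len)
  {
      struct stat st;
      int fd = open(path, O_RDONLY);
      if (fd < 0) return 0;
      if (fstat(fd, &st) < 0) { close(fd); return 0; }
      void *p = mmap(0, st.st_size, PROT_READ, MAP_SHARED, fd, 0);
      close(fd);                /* the mapping stays valid after close */
      if (p == MAP_FAILED) return 0;
      *len = st.st_size;
      return p;                 /* glyph data is read in place, no
                                   decompression pass, pages shared
                                   between all processes using it */
  }

Gzip the file and all of this is lost: you must allocate and fill
private memory before the first glyph can be drawn.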
Post by Werner LEMBERG
Post by Rich Felker
Quite frankly it doesn't matter whether FontForge supports a bitmap font
format because "xbitmap" is the ideal tool for making bitmap fonts.
Please give an URL. Another good bitmap font editor is xmbdfed from
Mark Leisher.
Oops, I meant "bitmap". It's a trivial Xaw app that's been included
with X since near the beginning. I have xmbdfed too but the "Xm" part
of it makes it rather painful to use. Maybe it would be better if I
upgraded lesstif but I generally dislike motif anyway, and since the
BDF model identifies characters with glyphs (which, I think it's now
well established, is a very bad idea) I'd rather use an editor that
just treats bitmaps as bitmaps without trying to treat them as fonts.
Post by Werner LEMBERG
Post by Rich Felker
I also get the impression from some Apple papers I was browsing
recently that TTF/OpenType put the burden of knowing how to stack
combining characters and produce ligatures onto the software rather
than the font. Under such a system, applications will never support
all scripts unless they use one of the unwieldy libraries with all
of this taken care of...
This is the wrong impression. What you probably mean is that some
language data needs to be preprocessed into a normalized form before
it is fed into the font, for example Indic and Arabic scripts.
What sort of preprocessing? Reordering vowels? Replacement of Arabic
characters with the appropriate presentation forms?
Post by Werner LEMBERG
However, it is possible to add arbitrary tables to the font (which is
another advantage of the SFNT format) which could move this
preprocessing into the font.
Are there any papers on the SFNT format and its table language?
Post by Werner LEMBERG
Post by Rich Felker
...on the other hand, at least for bitmap fonts, simple rule-based
substitutions set up by the font designer can easily provide the
needed functionality with less than 5kb of code doing all the glyph
processing.
This is handled by the GSUB table. There are many different formats,
beginning with simple glyph replacing and ending with complex
contextual glyph substitutions.
I found some docs on the format from MS, but they were hopelessly
poorly written and contained no information on how the font represents
the conditions under which the substitution should be performed.
Post by Werner LEMBERG
Post by Rich Felker
Right now we're at an unfortunate point where the core X font system
has been deprecated, but there is nothing suitable in its place.
You should contact Keith Packard regarding this issue. I think there
is just some delay in the conversion of PCFs to SFNT due to more
important problems.
How will this solve anything? The core protocol is still unacceptable
because all the glyph info has to be transmitted to the client, and
this info is way too big. The core protocol also seems unable to
perform any sort of nontrivial character->glyph mapping. Must every
application have font-specific information on how to do this, even
though the fonts are located on the server side and thus inaccessible
to the app? Or am I missing something?
Post by Werner LEMBERG
Post by Rich Felker
Moreover non-X unix consoles are essentially deprecated as well
since they lack all but some patronizing Euro-centric 512-glyph
"Unicode" support. Do you think someone is going to integrate
FreeType into Linux anytime soon? :)
Why not? FreeType is very modular by design; it would be possible to
remove almost everything but bitmap-only SFNT handling. Note,
however, that this library doesn't interpret GSUB and other advanced
OpenType tables by itself. You need Pango or something similar for
this.
As far as I can tell, if it's not doing outline rendering and not
using GSUB, etc. then FreeType isn't really doing anything except
parsing the file format and looking up glyphs. I don't see how this
would merit including FreeType at all; a trivial ~200-line
implementation should be able to do the same unless the file format is
hopelessly painful to work with.
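
For the record, the core of such a parser really is tiny. A sketch of
the SFNT table-directory walk (fields are big-endian per the published
spec; bounds checks omitted for brevity):

  #include <stdint.h>
  #include <string.h>

  /* Read a big-endian 32-bit value. */
  static uint32_t be32(const unsigned char *p)
  {
      return (uint32_t)p[0] << 24 | (uint32_t)p[1] << 16
           | (uint32_t)p[2] << 8  | (uint32_t)p[3];
  }

  /* Locate a table ("cmap", "glyf", ...) in a memory image of an
     SFNT file: a 12-byte header with numTables at offset 4, then
     16-byte records of (tag, checksum, offset, length). */
  static const unsigned char *
  sfnt_find(const unsigned char *font, const char tag[4])
  {
      unsigned ntab = font[4] << 8 | font[5];
      const unsigned char *rec = font + 12;
      for (unsigned i = 0; i < ntab; i++, rec += 16)
          if (!memcmp(rec, tag, 4))
              return font + be32(rec + 8);  /* the offset field */
      return 0;
  }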
Post by Werner LEMBERG
Post by Rich Felker
All problem solving is about choosing the right tool for the job.
Storing bitmap fonts in the TTF/OpenType framework is like using a
nuclear missile to toast fruit flies, or like driving an SUV to
commute to the office...
You are underestimating the problem, I think.
The only part I'm potentially underestimating is the extent of context
information needed to choose a glyph. I'm aware that in extremely nice
rendering of script-style fonts you can often need context several
characters away, but as far as I know all scripts can be rendered in
their basic "print" form with only nearest-neighbor context. What I'm
unsure of is whether nearest-neighbor should mean character neighbors
only, or all character-CELL neighbors (which could be many more with
combining). I suspect it's the latter.
Post by Werner LEMBERG
The proper bitmap
format is the least important thing, and the compact SFNT bitmap
formats are not a bad choice IMHO. Much more important is the ability
to store the glyph substitution tables efficiently.
What I mean by bitmap font format is the character->glyph mapping
system. Obviously the format of the actual glyph bitmaps is simple.

Is 3-4 bytes per potential substitution inefficient? That's what I'm
looking at. This is not counting context definitions specifying when
the substitution would be applied, but these definitions can often be
reused by many glyphs in the same script. As a simple example all the
Latin capital letters can share the "if superscribed combining mark is
attached" context. As a more nontrivial example, in the Tibetan font
I've drawn, there are many characters that share "subjoined below ra"
and "subjoined below sa" rules in order to have the appropriate size
and placement.
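
To make the 3-4 byte figure concrete, here is one hypothetical
encoding (not my actual draft format) in which a rule is nothing but
a shared-context index plus a replacement glyph:

  #include <stdint.h>

  /* Hypothetical 3-byte rule: `ctx' indexes a table of shared
     context definitions ("superscribed mark attached", "subjoined
     below ra", ...); the glyph index is the variant to use when
     that context matches. Split into bytes to avoid padding. */
  struct rule {
      uint8_t ctx;
      uint8_t glyph_hi, glyph_lo;   /* 16-bit glyph index */
  };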
Post by Werner LEMBERG
Post by Rich Felker
When it comes to character cell fonts (which is an even narrower
problem field than bitmap fonts), the goal is something that can
provide the baseline support for readable and correct display of any
script
What about top-to-bottom scripts like Mongolian, which can't be written
horizontally? So I repeat my question: Which scripts do you imagine
to support?
Mongolian can be and is written horizontally as well. Certainly you
can write vertical Mongolian in a Mongolian-only editor, or in a
top-down context in some sort of higher level word processor or markup
file, but the idea that you should see Mongolian filenames vertically
when you type "ls" somehow mixed in with other filenames in horizontal
orientation is hopeless.

mlterm (which I find works very poorly but seems to be the only
implementation of a multilingual terminal) does support vertical
orientation, but only if you run it in vertical mode, in which case
everything (including Latin) is printed vertically. IMO this is the
only sane approach. Especially since Mongolian _can_ be written
horizontally, you need to treat horizontal versus vertical orientation
as a localization or user preference applying to the system as a
whole (in the absence of higher level markup), not as a property of
the script.
Post by Werner LEMBERG
Post by Rich Felker
[...] I'm extremely bitter about the sad state of m17n on unix and
the fact that there is not even one working Unicode terminal with
simultaneous support for all scripts.
There is a simple reason for this: What you want to do is impossible.
Maybe so, but the state is also much sadder than I made it sound.
Basically there's only one terminal that supports much more than
Western, CJK, Thai, Hebrew, and maybe a few other scripts. That one,
mlterm, lacks the ability to use information in the fonts for correct
combining and only supports Indic languages because it uses special
script-specific libraries.
Post by Werner LEMBERG
There will never be a program which supports `all' scripts. Just
think of Urdu, a special variant of Arabic, which isn't just an R2L
script: It actually has this writing direction:
/ / /
/ / /
/ / /
The longer a word, the greater its vertical height.
I'm told there's also a script that runs R2L and L2R alternating on
successive rows, i.e. snakes back and forth, though I've never
actually been told what it is so perhaps it's a myth. Whether it's
possible or reasonable to support such things remains to be seen. IMO
like Mongolian some of these issues need to be treated as a locale
issue or user preference issue instead of a necessity of the script.

Honestly if I used a R2L language, I would be much happier having the
whole terminal run right-to-left (including the Latin text used for
unix commands.. possibly with glyphs mirrored..?) than having to deal
with the headache of bidirectional text and my language being treated
as a second-class part of the interface. But then again there's
numerals and all kinds of other mess to screw it up. :)

Anyway the question with stuff like Urdu is whether it's imperative to
typeset the text in its standard written form or whether a 'computer
style' line-based form or something is acceptable. Keep in mind that
even Latin is written differently on a terminal than it is when
written by hand or in print; the "i" is as wide as the "m", for
instance. I'm not saying that it's justifiable to have crap support
for languages or scripts, just that sometimes a language has to adapt
and develop alternate presentation forms that _will_ work with
technology, or risk becoming irrelevant as technology becomes more
important in society.
Post by Werner LEMBERG
Post by Rich Felker
So with that said, I'll continue on with my draft bitmap font format
(which already has a lot more simplifications -- remember, a work of
art is only complete when you can't find anything left to _remove_
from it), write my 5kb of code, integrate it into uuterm, and
somewhere in the next few months aim to have the first working
Unicode terminal emulator... in a 50kb static binary.
Good luck in handling Arabic and Indic scripts -- and Mongolian :-)
Indic is easy. Actually this is the part I'm most bitter about --
people treating something that should be easy as if it were a huge
unsolved problem and then not supporting it.. Mongolian too as long as
you follow the outline above.

Rich
Werner LEMBERG
2006-08-03 23:30:34 UTC
Permalink
I have xmbdfed too but the "Xm" part of it makes it rather painful
to use.
At least for Linux Mark provides a binary with a statically linked
Motif library, AFAIK.
What you probably mean is that some language data needs to be
preprocessed into a normalized form before it is fed into the
font, for example Indic and Arabic scripts.
What sort of preprocessing? Reordering vowels? Replacement of Arabic
characters with the appropriate presentation forms?
Arabic needs tagging of glyphs as being `initial', `medial', `final',
and `isolated', as specified in the Unicode book. Since this is
identical for all fonts, the OpenType designers have decided not to
make this information part of the font itself. In the long run,
this makes the fonts smaller. Something similar is done for Indic --
on the OpenType list you can right now find a discussion about a
reimplementation of Indic font handling.
However, it is possible to add arbitrary tables to the font (which
is another advantage of the SFNT format) which could move this
preprocessing into the font.
Are there any papers on the SFNT format and its table language?
Here are the two main references.

http://www.microsoft.com/typography/SpecificationsOverview.mspx
http://developer.apple.com/textfonts/TTRefMan/
This is handled by the GSUB table. There are many different
formats, beginning with simple glyph replacing and ending with
complex contextual glyph substitutions.
I found some docs on the format from MS, but they were hopelessly
poorly written and contained no information on how the font
represents the conditions under which the substitution should be
performed.
The process is simple (at least in theory -- there are many tricky
details): A font contains a number of `features' like `use small caps',
`use old ligatures', or `use a different set of digits'. Each
feature consists of an ordered set of `lookups'.

Having a string of input character codes, you apply the first lookup
table, then you start again and process the next one, and so on until
all lookup tables have been applied.
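
In rough C, the control flow is something like this (the types here
are toy placeholders, not the actual OpenType structures, and real
lookups can also form ligatures or substitute contextually):

  #include <stddef.h>

  typedef struct { int from, to; } Lookup;  /* toy 1:1 substitution */

  /* Apply each lookup, in order, to the whole glyph string before
     moving on to the next one; only the control flow is shown. */
  static void apply_feature(int *glyphs, size_t n,
                            const Lookup *lookups, size_t nlookups)
  {
      for (size_t i = 0; i < nlookups; i++)       /* ordered lookups */
          for (size_t pos = 0; pos < n; pos++)    /* full rescan each */
              if (glyphs[pos] == lookups[i].from)
                  glyphs[pos] = lookups[i].to;
  }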
How will this solve anything? The core protocol is still
unacceptable because all the glyph info has to be transmitted to the
client, and this info is way too big.
AFAIK it is possible to have fonts on the client side, avoiding the
overhead of transmitting fonts.
The core protocol also seems unable to perform any sort of
nontrivial character->glyph mapping. Must every application have
font-specific information on how to do this, even though the fonts
are located on the server side and thus inaccessible to the app? Or
am I missing something?
Please read this:

http://keithp.com/~keithp/talks/usenix2001/xrender/

It discusses the X Rendering Extension, which has since become
standard, I think.
As far as I can tell, if it's not doing outline rendering and not
using GSUB, etc. then FreeType isn't really doing anything except
parsing the file format and looking up glyphs. I don't see how this
would merit including FreeType at all; a trivial ~200-line
implementation should be able to do the same unless the file format is
hopelessly painful to work with.
You still need code to handle the SFNT format. As mentioned in
another mail, you can compile FreeType without any support for outline
formats, using SFNT bitmap fonts only.
What I mean by bitmap font format is the character->glyph mapping
system.
I doubt that you'll find something really better than the abilities of
GSUB and GPOS tables.
Is 3-4 bytes per potential substitution inefficient? That's what I'm
looking at. This is not counting context definitions specifying when
the substitution would be applied, but these definitions can often
be reused by many glyphs in the same script. As a simple example all
the Latin capital letters can share the "if superscribed combining
mark is attached" context.
In OpenType parlance this is called a `glyph class', defined in the
GDEF table.
Mongolian can be and is written horizontally as well.
Using Cyrillic, yes, but not the traditional script, AFAIK.
Certainly you can write vertical Mongolian in a Mongolian-only
editor, or in a top-down context in some sort of higher level word
processor or markup file, but the idea that you should see Mongolian
filenames vertically when you type "ls" somehow mixed in with other
filenames in horizontal orientation is hopeless.
Exactly. We are again at the point where we have to define which
scripts should be supported...
I'm told there's also a script that runs R2L and L2R alternating on
successive rows, i.e. snakes back and forth, though I've never
actually been told what it is so perhaps it's a myth.
This is called `boustrophedon'. Ancient Greek used it, and so does
Rongorongo (the undeciphered script from Easter Island).
Whether it's possible or reasonable to support such things remains
to be seen.
It's not reasonable IMHO. Another (quite natural) limitation of the
scripts to support.
Anyway the question with stuff like Urdu is whether it's imperative
to typeset the text in its standard written form or whether a
'computer style' line-based form or something is acceptable.
Ah, this is similar to the discussion of whether it is acceptable to
represent the German `ü', `ä', and `ö' with `ue', `ae', and `oe',
respectively. For me as a native German speaker, this is extremely
ugly, and yet a lot of computerized systems used in, say, public
transport facilities still display this.

So my answer is: No, this is not acceptable.
I'm not saying that it's justifiable to have crap support for
languages or scripts, just that sometimes a language has to adapt
and develop alternate presentation forms that _will_ work with
technology, or risk becoming irrelevant as technology becomes more
important in society.
I'm quite conservative here: It's a very bad idea to adapt a language
or script to the computer. It should be the opposite.


Werner
Andries Brouwer
2006-08-04 00:36:44 UTC
Permalink
Post by Werner LEMBERG
Arabic needs tagging of glyphs as being `initial', `medial', `final',
and `isolated', as specified in the Unicode book. Since this is
identical for all fonts, the OpenType designers have decided not to
make this information part of the font itself.
I just had to struggle with this a little.

The ARABIC LETTER HEH (U+0647) is a letter with 4 glyph forms.
In Kurdish (written in the Sorani, essentially Arabic, alphabet)
one has two letters (let me call them Kurdish H and Kurdish E)
and these 4 glyph forms become the two forms of Kurdish H
and the two forms of Kurdish E.
Now these four glyphs are tagged with `initial', `medial', `final',
and `isolated', and that is correct if the glyphs are used to write
Arabic, but incorrect when precisely the same glyphs are used
to write Kurdish.

I wonder what the correct way is to write Kurdish in Unicode
(without using language tagging).
Are new Unicode code points needed? Do these exist already?

Andries
Werner LEMBERG
2006-08-04 06:39:59 UTC
Permalink
Post by Andries Brouwer
I wonder what the correct way is to write Kurdish in Unicode
(without using language tagging).
Are new Unicode code points needed? Do these exist already?
Sorry, I have no idea. This is the ideal question for the Unicode list
(if you don't mind getting some, hmm, lengthy answers from people who
like to talk more than they should :-)


Werner
Mark Leisher
2006-08-13 14:48:08 UTC
Permalink
Post by Andries Brouwer
Post by Werner LEMBERG
Arabic needs tagging of glyphs as being `initial', `medial', `final',
and `isolated', as specified in the Unicode book. Since this is
identical for all fonts, the OpenType designers have decided not to
make this information part of the font itself.
I just had to struggle with this a little.
The ARABIC LETTER HEH (U+0647) is a letter with 4 glyph forms.
In Kurdish (written in the Sorani, essentially Arabic, alphabet)
one has two letters (let me call them Kurdish H and Kurdish E)
and these 4 glyph forms become the two forms of Kurdish H
and the two forms of Kurdish E.
Now these four glyphs are tagged with `initial', `medial', `final',
and `isolated', and that is correct if the glyphs are used to write
Arabic, but incorrect when precisely the same glyphs are used
to write Kurdish.
I wonder what the correct way is to write Kurdish in Unicode
(without using language tagging).
Are new Unicode code points needed? Do these exist already?
A similar situation exists in Uyghur. The way it is solved currently is
to use the characters in the Arabic Presentation Form blocks of Unicode.
I will have to check if it is possible to use Unicode to write about
Uyghur in Kurdish, or about Kurdish in Uyghur without language tags.

But this is not relevant to the console font topic. All I can say to
Rich is that his best course of action will be to go off and implement
his vision. There is no better way to understand why the font situation
is the way it is today.

I have believed for many years that software in general is getting
hideously complicated and clumsy. I don't complain about it much any
more because I know the effort required to find simpler solutions. And
simpler solutions are often ignored in favor of something that works.
--
---------------------------------------------------------------------------
Mark Leisher
Computing Research Lab, New Mexico State University
Box 30001, MSC 3CRL, Las Cruces, NM 88003

Nowadays, the common wisdom is to celebrate diversity - as long as
you don't point out that people are different. -- Colin Quinn
Rich Felker
2006-08-04 01:32:12 UTC
Permalink
Post by Werner LEMBERG
Post by Rich Felker
What you probably mean is that some language data needs to be
proprocessed into a normalized form before it is fed into the
font, for example Indic and Arabic scripts.
What sort of preprocessing? Reordering vowels? Replacement of Arabic
characters with the appropriate presentation forms?
Arabic needs tagging of glyphs as being `initial', `medial', `final',
and `isolated', as specified in the Unicode book. Since this is
identical for all fonts, the OpenType designers have decided not to
make this information part of the font itself. In the long run,
this makes the fonts smaller.
With my proposed context system, leaving it out saves only a few bytes
total in the font file, since the context rules can be shared by all
the characters that need them.
Post by Werner LEMBERG
The process is simple (at least in theory -- there are many tricky
details): A font contains a number of `features' like `use small caps',
`use old ligatures', or `use a different set of digits'. Each
feature consists of an ordered set of `lookups'.
Having a string of input character codes, you apply the first lookup
table, then you start again and process the next one, and so on until
all lookup tables have been applied.
Wow, what a horribly bad design. No wonder including Arabic
initial/medial/final information would make the font so big.
Post by Werner LEMBERG
Post by Rich Felker
How will this solve anything? The core protocol is still
unacceptable because all the glyph info has to be transmitted to the
client, and this info is way too big.
AFAIK it is possible to have fonts on the client side, avoiding the
overhead of transmitting fonts.
1. If the fonts are client side you're no longer using the core font
system which was the topic under discussion: whether or not the
core font system is usable.
2. If the fonts are client side you must transmit large amounts of
data to the server for rendering.
Post by Werner LEMBERG
http://keithp.com/~keithp/talks/usenix2001/xrender/
It discusses the X Rendering Extension which has become standard
meanwhile, I think.
Again this has nothing to do with the original X font system.
Post by Werner LEMBERG
You still need code to handle the SFNT format. As mentioned in
another mail, you can compile FreeType without any support for outline
formats, using SFNT bitmap fonts only.
Why would I want to use this format after you explained above how
stupid its substitution system is? I designed something much better on
the very first attempt and have since refined it much more. I'll post
the new ideas soon.
Post by Werner LEMBERG
Post by Rich Felker
What I mean by bitmap font format is the character->glyph mapping
system.
I doubt that you'll find something really better than the abilities of
GSUB
Already have.
Post by Werner LEMBERG
and GPOS tables.
GPOS is undesirable for character cell glyphs. It makes much more
sense just to include variants with the position pre-applied. The only
advantage of GPOS is for doing arbitrarily-long combining/stacking,
which has no hope of working except with variable-size fonts that can
extend outside of their original bounding boxes.
Post by Werner LEMBERG
Post by Rich Felker
Is 3-4 bytes per potential substitution inefficient? That's what I'm
looking at. This is not counting context definitions specifying when
the substitution would be applied, but these definitions can often
be reused by many glyphs in the same script. As a simple example all
the Latin capital letters can share the "if superscribed combining
mark is attached" context.
In OpenType parlance this is called a `glyph class', defined in the
GDEF table.
Post by Rich Felker
Mongolian can be and is written horizontally as well.
Using Cyrillic, yes, but not the traditional script, AFAIK.
No, in the traditional script.
Post by Werner LEMBERG
Post by Rich Felker
Certainly you can write vertical Mongolian in a Mongolian-only
editor, or in a top-down context in some sort of higher level word
processor or markup file, but the idea that you should see Mongolian
filenames vertically when you type "ls" somehow mixed in with other
filenames in horizontal orientation is hopeless.
Exactly. We are again at the point where we have to define which
scripts should be supported...
No. Like I said Mongolian can be and is written horizontally (L2R I
believe even though the original vertical version was written from
right to left). I believe the Unicode standard even mentions this
somewhere though I may be mistaken. If you want me to check again I
can ask friends who are familiar with the language.
Post by Werner LEMBERG
Post by Rich Felker
I'm told there's also a script that runs R2L and L2R alternating on
successive rows, i.e. snakes back and forth, though I've never
actually been told what it is so perhaps it's a myth.
This is called `boustrophedon'. Ancient Greek used it, and so does
Rongorongo (the undeciphered script from Easter Island).
Well, as long as it's dead languages only I won't lose any sleep over
not supporting it. The dead can rise and complain to me if they care.
Post by Werner LEMBERG
Post by Rich Felker
Whether it's possible or reasonable to support such things remains
to be seen.
It's not reasonable IMHO. Another (quite natural) limitation of the
scripts to support.
Another viewpoint is that the script is obviously readable in both
directions and that printing it in either is equally correct.
Alternating orientation on successive lines would then be seen as a
local stylistic preference of the people who used those scripts rather
than a necessity of the script, and would be available as an option in
applications supporting it.
Post by Werner LEMBERG
Post by Rich Felker
Anyway the question with stuff like Urdu is whether it's imperative
to typeset the text in its standard written form or whether a
'computer style' line-based form or something is acceptable.
Ah, this is similar to the discussion of whether it is acceptable to
represent the German `ü', `ä', and `ö' with `ue', `ae', and `oe',
respectively.
How is it similar at all? One is a question of using the correct
characters (even the correct number of characters!) while the other is
a matter of spacing and layout. You have a justification to be mad
when "ü" is written as "ue" because it's a result of English-centric
legacy systems. There's nothing _fundamentally_ difficult about
displaying a "ü".

On the other hand, ...

[Now before we go on, there's a difference between a word processor and
a text editor or command line. This needs to be abundantly clear
because what I'm about to say of course does not apply to a word
processor or quality typesetting system intended to produce
publications for print.]

...a speaker of Urdu is (IMO) not justified in being mad about the
lack of support for this diagonal layout. All the glyphs are correct.
They're all in the right order and orientation. Etc. etc. etc.

Ask yourself this: what would a speaker of Urdu do if they needed to
write a message and the only paper they had was barely tall enough for
one handwritten letter? If your answer is "write it all on one line"
or even "write it essentially on one line with each word slanted
slightly diagonal" then there's absolutely no reason the same can't be
done on a computer terminal.
Post by Werner LEMBERG
For me as a native German speaker, this is extremely
ugly, and yet a lot of computerized systems used in, say, public
transport facilities still display this.
I agree, I just don't find it relevant.
Post by Werner LEMBERG
Post by Rich Felker
I'm not saying that it's justifiable to have crap support for
languages or scripts, just that sometimes a language has to adapt
and develop alternate presentation forms that _will_ work with
technology, or risk becoming irrelevant as technology becomes more
important in society.
I'm quite conservative here: It's a very bad idea to adapt a language
or script to the computer. It should be the opposite.
You've hit the nail on the head with regard to why so much of m17n and
i18n software is bloated to hell: this is the exact philosophy of
bloatware! Rather than thinking like a computer and using a computer
in the natural way a computer works, the bloatware philosophy is to
stop at nothing until the computer has been beaten into submission and
forced to think like a human. This is why we have abominations like
Lisp, Java, Perl, garbage collection, MS Office assistant, "Wizards",
'auto-correct', ... (and yet they all still fail to think like a
human... ;)

If you insisted on this kind of inefficiency with any other kind of
technology, people would call you mad. It's like spending billions of
dollars to create a fertile paradise in the middle of a desert (hey
they did it in Dubai...) rather than building your home in a sane
place to begin with.

I have no interest in telling people to revise their languages to make
them conform to legacy ASCII limitations, nor to make them conform to
international expectations or any other imperial bullshit. However I
do believe very strongly that all people (including English speakers)
should tolerate having their language displayed in a form that
respects both the intended look of the script and the nature of the
medium on which it's displayed. Most languages already have many
different ways of being written depending on whether their use is in a
book, on a street sign, on the sign for a place of business (often
vertical even in non-vertical-script languages), on a poster or flyer,
etc. Computers (and particularly on-topic now, terminals) are yet
another such medium.

Rich
Werner LEMBERG
2006-08-04 07:04:43 UTC
Permalink
Post by Rich Felker
Post by Werner LEMBERG
Arabic needs tagging of glyphs as being `initial', `medial',
`final', and `isolated', as specified in the Unicode book. Since
this is identical for all fonts, the OpenType designers have
decided not to make this information part of the font itself. In
the long run, this makes the fonts smaller.
With my proposed context system, leaving it out saves only a few
bytes total in the font file, since the context rules can be shared
by all the characters that need them.
Details, please.
Post by Rich Felker
Post by Werner LEMBERG
Having a string of input character codes, you apply the first
lookup table, then you start again and process the next one, and
so on until all lookup tables have been applied.
Wow, what a horribly bad design. No wonder including Arabic
initial/medial/final information would make the font so big.
Why do you think that it is bad design? How would you activate and
deactivate typographical features? This goes far beyond `console
fonts', so it is of course more `complicated' than you expect.
However, it works reasonably well, and no one asks you to use more than
a single lookup.

You might compare this with AAT from Apple (the `morx' table as
documented in the URL I posted in a previous mail). This is something
similar but far more complicated to debug since it uses automatons
which can have almost infinite states.
Post by Rich Felker
Why would I want to use this format after you explained above how
stupid its substitution system is? I designed something much better
on the very first attempt and have since refined it much more. I'll
post the new ideas soon.
Aah, you will receive the Nobel prize for this. What you've done is
apparently better than man-years of work done by font experts.

Be serious! I want to see not only ideas but a complete
specification. THEN I'll believe you. And there is still the question
of who is going to implement this.
Post by Rich Felker
GPOS is undesirable for character cell glyphs.
Not at all. It handles accent stacking, and it can be even used for
fixed-width fonts (which simplifies the tables enormously).
Post by Rich Felker
Post by Werner LEMBERG
Post by Rich Felker
Mongolian can be and is written horizontally as well.
Using Cyrillic, yes, but not the traditional script, AFAIK.
No, in the traditional script.
I stand corrected; I didn't know that. Can you give a URL?
Post by Rich Felker
Ask yourself this: what would a speaker of Urdu do if they needed to
write a message and the only paper they had was barely tall enough
for one handwritten letter? If your answer is "write it all on one
line" or even "write it essentially on one line with each word
slanted slightly diagonal" then there's absolutely no reason the
same can't be done on a computer terminal.
Uh, oh, I can also write German with uppercase letters only if there
isn't enough room for descenders. Is this a solution?
Post by Rich Felker
Post by Werner LEMBERG
I'm quite conservative here: It's a very bad idea to adapt a language
or script to the computer. It should be the opposite.
You've hit the nail on the head with regard to why so much of m17n
and i18n software is bloated to hell: this is the exact philosophy
of bloatware!
Please stop with these useless rhetorical explosions. They don't help
to solve the very problems we have.
Post by Rich Felker
[...] I do believe very strongly that all people (including English
speakers) should tolerate having their language displayed in a form
that respects both the intended look of the script and the nature of
the medium on which it's displayed.
Of course! I've asked you already *which scripts* you want to
support, and you still haven't answered. Just say that the normal
Urdu style of writing isn't going to be supported; they have to use
standard Arabic writing (using their extended character set, of
course).


Werner
Rich Felker
2006-08-05 07:34:03 UTC
Permalink
Post by Werner LEMBERG
Post by Rich Felker
With my proposed context system, leaving it out saves only a few
bytes total in the font file, since the context rules can be shared
by all the characters that need them.
Details, please.
I've got an email I was preparing to send with more details. Got
interrupted earlier and then my screen session crashed but I recovered
it and will send soon, I expect.
Post by Werner LEMBERG
Post by Rich Felker
Post by Werner LEMBERG
Having a string of input character codes, you apply the first
lookup table, then you start again and process the next one, and
so on until all lookup tables have been applied.
Wow, what a horribly bad design. No wonder including Arabic
initial/medial/final information would make the font so big.
Why do you think that it is bad design?
Let's say you have 40 characters that all need to change glyph when
they are preceded by any one of 30 other characters. This makes 30*40
substitution rules! If instead you could express the "preceded by one
of those 30 characters" as a single context definition, then you would
just need one context definition and 40 rules, one for each character
to tell the alternate glyph to use under this context.
Post by Werner LEMBERG
How would you activate and
deactivate typographical features? This goes far beyond `console
fonts', so it is of course more `complicated' than you expect.
Yes it's outside the scope of what I'm considering; however it might
be worth making a spec that could be used for high quality scalable
fonts too.
Post by Werner LEMBERG
However, it works reasonably well, and no one asks you to use more than
a single lookup.
What do you mean by this "single lookup"?

BTW another issue with the substitution rules is that, as far as I can
tell, they can delete or insert extra glyphs arbitrarily. From a
character cell perspective that's very bad, since it makes it possible
that the font represents things that cannot be displayed consistently
in the cells.
Post by Werner LEMBERG
You might compare this with AAT from Apple (the `morx' table as
documented in the URL I posted in a previous mail). This is something
similar but far more complicated to debug since it uses automatons
which can have almost infinite states.
Yes I read a little bit about it and I agree. In a way it's more like
what I want, but overly complex. My proposed bytecode (or rather
vlc-code, i.e. variable-length code) system intentionally lacks all
constructs that can lead to
loops and near-infinite state spaces.
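
To give a flavor of what I mean, a toy sketch (the opcodes and
encoding are hypothetical, not my actual draft): the program counter
only ever moves forward, so execution time is bounded by code length.

  #include <stddef.h>
  #include <stdint.h>

  /* Toy forward-only interpreter: no backward jumps, so no loops. */
  enum { OP_END, OP_MATCH_PREV, OP_EMIT, OP_SKIP };

  static int run(const uint8_t *code, size_t len,
                 uint8_t prev_char, uint8_t *out_glyph)
  {
      size_t pc = 0;
      while (pc < len) {
          switch (code[pc++]) {
          case OP_END:
              return 0;
          case OP_MATCH_PREV:           /* operand: char to match */
              if (prev_char != code[pc])
                  pc += 2;              /* skip the 2-byte op after it */
              pc += 1;                  /* consume the operand */
              break;
          case OP_EMIT:                 /* operand: glyph to output */
              *out_glyph = code[pc];
              return 1;
          case OP_SKIP:                 /* operand: bytes to jump over */
              pc += code[pc] + 1;       /* strictly forward */
              break;
          default:
              return 0;
          }
      }
      return 0;
  }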
Post by Werner LEMBERG
Aah, you will receive the Nobel prize for this. What you've done is
apparently better than man-years of work done by font experts.
Font experts are not machine experts.
Post by Werner LEMBERG
Be serious! I want to see not only ideas but a complete
specification. THEN I'll believe you.
You probably won't be satisfied yet but I'll post anyway so you can
see how it's progressing.
Post by Werner LEMBERG
And there is still the question
of who is going to implement this.
Anyone can since the spec is trivial to implement. With all but the
context-matcher implemented so far, my implementation compiles to 528
bytes of i386 code.
Post by Werner LEMBERG
Post by Rich Felker
GPOS is undesirable for character cell glyphs.
Not at all. It handles accent stacking, and it can be even used for
fixed-width fonts (which simplifies the tables enormously).
Well, it does reduce the number of glyph variants needed for accent
marks, but at the expense of allowing the font to specify something
that cannot be represented in the character cells, and of complicating
the rendering implementation.
Post by Werner LEMBERG
Post by Rich Felker
Post by Werner LEMBERG
Post by Rich Felker
Mongolian can be and is written horizontally as well.
Using Cyrillic, yes, but not the traditional script, AFAIK.
No, in the traditional script.
I stand corrected; I didn't know that. Can you give a URL?
I didn't find any better references searching Google than you would.
It seems to be a new invention, and the glyphs are rotated 90 degrees
from their vertical presentation in order to combine nicely. I don't
know what the people's attitude towards this style is.
Post by Werner LEMBERG
Post by Rich Felker
Ask yourself this: what would a speaker of Urdu do if they needed to
write a message and the only paper they had was barely tall enough
for one handwritten letter? If your answer is "write it all on one
line" or even "write it essentially on one line with each word
slanted slightly diagonal" then there's absolutely no reason the
same can't be done on a computer terminal.
Uh, oh, I can also write German with uppercase letters only if there
isn't enough room for descenders. Is this a solution?
Very different issue. The Urdu system requires unbounded height for a
single line of text. Surely you can shrink your text by a few percent
to fit descenders, but can you shrink it by an unbounded factor? :)
Post by Werner LEMBERG
Post by Rich Felker
[...] I do believe very strongly that all people (including English
speakers) should tolerate having their language displayed in a form
that respects both the intended look of the script and the nature of
the medium on which it's displayed.
Of course! I've asked you already *which scripts* you want to
support, and you still haven't answered.
In terms of what I "want" (for my own use): Latin, Tibetan, Japanese,
and mathematical notations.

In terms of what I demand that my system support, the answer should be
"everything", modulo strange line flow conventions that are
incompatible with the notion of a character cell terminal. At this
time I don't know enough to make a good bidi terminal implementation,
so we'll say the initial release of uuterm will probably only support
l2r scripts properly. However this will not be a limitation of the
glyph system, just a limitation of my terminal emulation which will be
remedied either as I learn more about bidi, or as someone else
volunteers to write the support. :)
Post by Werner LEMBERG
Just say that the normal
Urdu style of writing isn't going to be supported; they have to use
standard Arabic writing (using their extended character set, of
course).
OK, we're just arguing over definitions. I say Urdu script is
supported, but Urdu line layout style is not. You can say this however
you like.

Rich
Werner LEMBERG
2006-08-05 09:11:02 UTC
Permalink
Post by Rich Felker
Post by Werner LEMBERG
Why do you think that it is bad design?
Let's say you have 40 characters that all need to change glyph when
they are preceded by any one of 30 other characters. This makes
30*40 substitution rules! If instead you could express the
"preceded by one of those 30 characters" as a single context
definition, then you would just need one context definition and 40
rules, one for each character to tell the alternate glyph to use
under this context.
This is GSUB format 5 (context substitution) or format 6 (context
substitution with the ability to look backward and forward). Exactly one
substitution, provided it comes from a single lookup, which is rather
likely.
Post by Rich Felker
Post by Werner LEMBERG
However, it works reasonably well, and no one asks you to use more than
a single lookup.
What do you mean by this "single lookup"?
This is a technical term of the OpenType specification. As mentioned
earlier, a `feature' consists of an arbitrary number of lookups
(normally, there is only one lookup table in a feature), and a single
lookup means that you apply its substitution rule to a string of
glyphs. Please check the GSUB table information in the OpenType
specification for more details.
Post by Rich Felker
BTW another issue of the substitution rules is that, as far as I can
tell, they can delete or insert extra glyphs arbitrarily.
Of course. How would you handle a ligature? `f' + `l' = `fl' -- this
means that a character has been deleted.
Post by Rich Felker
From a character cell perspective that's very bad, since it makes it
possible that the font represents things that cannot be displayed
consistently in the cells.
The font designer has to be careful to get this right.
Post by Rich Felker
Post by Werner LEMBERG
And there is still the question of who is going to implement this.
Anyone can since the spec is trivial to implement. With all but the
context-matcher implemented so far, my implementation compiles to 528
bytes of i386 code.
I'm *really* interested to see this :-)
Post by Rich Felker
Post by Werner LEMBERG
Mongolian can be and is written horizontally as well.
I didn't find any better references searching Google than you would.
It seems to be a new invention, and the glyphs are rotated 90
degrees from their vertical presentation in order to combine nicely.
I rather think that this is an invention to overcome the complications
with computers. I'll ask a friend who is an expert on Mongolian.
Post by Rich Felker
In terms of what I "want" (for my own use): Latin, Tibetan, Japanese,
and mathematical notations.
Hmm. Mathematical notation is two-dimensional by its very nature.
Please elaborate.
Rich Felker
2006-08-05 09:44:53 UTC
Permalink
Post by Werner LEMBERG
Post by Rich Felker
BTW another issue of the substitution rules is that, as far as I can
tell, they can delete or insert extra glyphs arbitrarily.
Of course. How would you handle a ligature? `f' + `l' = `fl' -- this
means that a character has been deleted.
That's my exact point: this type of substitution is NOT possible in a
character cell environment. You can still make a ligature out of them,
but it will necessarily have to be two cells wide. This is the small
price you pay in 'prettiness', and the huge reward you receive in
simplicity, for using a character cell device. The spacing of text does
not vary depending on font (which the application knows nothing
about); it only depends on assumed-constant properties of the
_characters_ involved.

Disadvantages:
- sometimes a bit ugly; good fonts designed to look nice in a
character cell environment make the situation a lot better.

Advantages:
- extremely low bandwidth for remote access since no font metric
information or glyphs need be exchanged.
- simple application implementation.
- high performance screen updates.
- compatible with a huge amount of existing software.
- makes adding unicode support to existing software much easier.

IMO users got spoiled in legacy 8bit environments with fancy text
rendering, which was fast and easy with just 256 glyphs but much
harder to make efficient with all of unicode. The nice property of
character cell environments is that they don't lose performance or
massively grow in complexity when you add 60-100k characters.
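
To illustrate the two-cell ligature point above, a minimal sketch
(the glyph numbers are made up):

  #include <stddef.h>
  #include <stdint.h>

  /* Hypothetical glyph ids: a character-cell "fl" ligature keeps
     its two cells; the pair maps to a left half and a right half
     instead of deleting a character. */
  enum { GLYPH_f = 102, GLYPH_l = 108,
         GLYPH_fl_LEFT = 900, GLYPH_fl_RIGHT = 901 };

  static void shape_fl(uint16_t *cells, size_t n)
  {
      for (size_t i = 0; i + 1 < n; i++)
          if (cells[i] == GLYPH_f && cells[i + 1] == GLYPH_l) {
              cells[i]     = GLYPH_fl_LEFT;   /* cell count unchanged */
              cells[i + 1] = GLYPH_fl_RIGHT;  /* so column math holds */
          }
  }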
Post by Werner LEMBERG
Post by Rich Felker
Post by Werner LEMBERG
And there is still the question of who is going to implement this.
Anyone can since the spec is trivial to implement. With all but the
context-matcher implemented so far, my implementation compiles to 528
bytes of i386 code.
I'm *really* interested to see this :-)
OK but it's nothing fancy, just a tiny interpreter for the sort of
language I described in my other post (actually for an earlier version
of the same idea). I'll post the code to the new one once I get a
little more of it done. Most of the fancy code will be in the font
compiler that builds the interpreted code from the list of glyphs and
the characters/contexts they correspond to.
Post by Werner LEMBERG
Post by Rich Felker
Post by Werner LEMBERG
Mongolian can be and is written horizontally as well.
I didn't find any better references searching Google than you would.
It seems to be a new invention, and the glyphs are rotated 90
degrees from their vertical presentation in order to combine nicely.
I rather think that this is an invention to overcome the complications
with computers. I'll ask a friend who is an expert on Mongolian.
From what I could gather there are maybe two styles. One is a form
that was essentially a hack to write your document on the computer
horizontally, then rotate the paper after you print it... :P Nasty
hack, eh? I think this form was written right-to-left to make the
printing come out right.

There also seems to be a form that's meant to be read as-is and mixed
with left-to-right horizontal text in other scripts and languages.
Unless this is highly offensive to many Mongolian speakers (which I
kinda doubt since they're probably used to using Cyrillic quite a
bit..) I think it's reasonable to believe that this is the preferred
form for use in multilingual (as opposed to localized Mongolian)
computers except when preparing traditional-style typeset output.
Post by Werner LEMBERG
Post by Rich Felker
In terms of what I "want" (for my own use): Latin, Tibetan, Japanese,
and mathematical notations.
Hmm. Mathematical notation is two-dimensional by its very nature.
Please elaborate.
Obviously if you want anything fancy you should be using LaTeX, but if
you're just trying to express yourself coherently in an email, having
a huge collection of mathematical characters not available in ASCII
can be very helpful. Also, LaTeX source files could be a lot more
compact and legible to non-experts if they contained, for example, the
Greek character alpha instead of \alpha for each occurrence. I'm not
sure on the status of support for things like this, but I've seen some
LaTeX packages for using UTF-8 in source files.
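
For instance, a minimal pdfLaTeX sketch using inputenc (the
\DeclareUnicodeCharacter line is the whole trick; one such line is
needed per character you want to use directly):

  \documentclass{article}
  \usepackage[utf8]{inputenc}
  % map the UTF-8 character U+03B1 to the usual math command
  \DeclareUnicodeCharacter{03B1}{\ensuremath{\alpha}}
  \begin{document}
  The constant $α$ can now appear directly in the source.
  \end{document}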

Rich
Behdad Esfahbod
2006-08-03 15:53:33 UTC
Permalink
(Chances are very high that my message doesn't make it to the list)
Post by Werner LEMBERG
Post by Rich Felker
...on the other hand, at least for bitmap fonts, simple rule-based
substitutions set up by the font designer can easily provide the
needed functionality with less than 5kb of code doing all the glyph
processing.
This is handled by the GSUB table. There are many different formats,
beginning with simple glyph replacing and ending with complex
contextual glyph substitutions.
Post by Rich Felker
Right now we're at an unfortunate point where the core X font system
has been deprecated, but there is nothing suitable in its place.
You should contact Keith Packard regarding this issue. I think there
is just some delay in the conversion of PCFs to SFNT due to more
important problems.
Post by Rich Felker
Moreover non-X unix consoles are essentially deprecated as well
since they lack all but some patronizing Euro-centric 512-glyph
"Unicode" support. Do you think someone is going to integrate
FreeType into Linux anytime soon? :)
Why not? FreeType is very modular by design; it would be possible to
remove almost everything but bitmap-only SFNT handling. Note,
however, that this library doesn't interpret GSUB and other advanced
OpenType tables by itself. You need Pango or something similar for
this.
Or just the OpenType Layout code, which can be found in the HarfBuzz
module:

http://www.freedesktop.org/wiki/Software_2fHarfBuzz

However, OpenType Layout is complex, and the binary takes 100kb.
Post by Werner LEMBERG
Post by Rich Felker
All problem solving is about choosing the right tool for the job.
Storing bitmap fonts in the TTF/OpenType framework is like using a
nuclear missile to toast fruit flies, or like driving an SUV to
commute to the office...
You are underestimating the problem, I think. The proper bitmap
format is the least important thing, and the compact SFNT bitmap
formats are not a bad choice IMHO. Much more important is the ability
to store the glyph substitution tables efficiently.
Not even that. The entire font format thing is of the least importance.
The Unicode algorithms and how you want to interact with applications are
the hard part, and why I gave up working on bidirectional terminals.
--
behdad
http://behdad.org/

"Commandment Three says Do Not Kill, Amendment Two says Blood Will Spill"
-- Dan Bern, "New American Language"
Mark Leisher
2006-08-13 14:23:30 UTC
Permalink
Post by Werner LEMBERG
Post by Rich Felker
Post by Werner LEMBERG
What about using bitmap-only TrueType fonts, as planned by the X
Windows people?
Could you direct me to good information? I have serious doubts but
I'd at least like to read what they have to say.
http://www.pps.jussieu.fr/~jch/software/xfree86-bitmap-fonts.html
I don't know the current status of such fonts w.r.t. X Windows.
Post by Rich Felker
Quite frankly it doesn't matter whether FontForge supports a bitmap font
format because "xbitmap" is the ideal tool for making bitmap fonts.
Please give an URL. Another good bitmap font editor is xmbdfed from
Mark Leisher.
Actually xmbdfed (Motif based) is no longer maintained, but it still
works. The latest version is http://crl.nmsu.edu/~mleisher/gbdfed.html
(GTK+ based). I plan on adding the ability to create bitmap-only OTF
fonts to gbdfed, but haven't had time and it looks like I won't have
time for a few months yet.
--
---------------------------------------------------------------------------
Mark Leisher
Computing Research Lab, New Mexico State University
Box 30001, MSC 3CRL, Las Cruces, NM 88003

Nowadays, the common wisdom is to celebrate diversity - as long as
you don't point out that people are different. -- Colin Quinn
George W Gerrity
2006-08-03 05:40:29 UTC
Permalink
Please. Let's not have yet another *NIX font encoding and presentation
scheme! Why don't you set up a team to rationalise the existing
encodings and presentation methods? The biggest headache in *NIX
(with the exception of Mac OS X's underlying version) is the
haphazard way that handling of non-ASCII characters and I18n has
developed. It is especially grotty at the system level, and as you
commented below, one of the reasons is that (English-only speaking)
*NIX systems people think that handling of non-ASCII charsets should
somehow be trivial and not bulky in code.

I am no longer up-to-date with kernel and system details in *NIX, and
am not a developer — perhaps an interested bystander is where I fit
in — but I used to do a lot of coding in that area, so I know how
difficult it can be. My view is that what is needed is a modular (and
unified) way of slotting in support for handling various alphabets
and languages, based on Unicode categories, that can be easily set up
at system build time. Moreover, *NIX is greatly in need of a way of
unifying all the various ways for formatting and representing
characters at all levels, using system-level code. This may even imply
some minor tweaking of the POSIX standard.

I know that a real-life problem (with a deadline?) has got you
energised to tackle this can of worms, but a quick fix or
re-invention of the wheel is just not the way to go. Someone with energy
and know-how has got to get a team together and fix what is broken in
the guts of *NIX so that it presents a good, clean interface for I18n
and multiple character set representation.

George
------
Dr George W Gerrity
GWG Associates
P O Box 229, Harden, NSW 2587, AUSTRALIA
Ph: +61 2 6386 3431    Fax: +61 2 6386 4431    Time: +10 hours (ref GMT)
PGP RSA Public Key Fingerprint: 73EF 318A DFF5 EB8A 6810 49AC 0763 AF07
------
Post by Rich Felker
Post by Werner LEMBERG
Post by Rich Felker
A revised, simplified file format proposal based on my original
sketch, some of Markus's ideas for "NCF", and an evaluation of which
optimizations were likely to benefit actual font data.
What about using bitmap-only TrueType fonts, as planned by the X
Windows people?
Could you direct me to good information? I have serious doubts but I'd
at least like to read what they have to say.
Post by Werner LEMBERG
This has the huge advantage that both FreeType and
FontForge (and X Windows too in the not too distant future) already
support them.
Quite frankly it doesn't matter whether FontForge supports a bitmap font
format because "xbitmap" is the ideal tool for making bitmap fonts.
However as long as you can import and export glyphs I see no reason
FontForge couldn't be used for this too.
I also get the impression from some Apple papers I was browsing
recently that TTF/OpenType put the burden of knowing how to stack
combining characters and produce ligatures onto the software rather
than the font. Under such a system, applications will never support
all scripts unless they use one of the unwieldy libraries with all of
this taken care of...
Post by Werner LEMBERG
I can't see an immediate advantage of a new bitmap format.
...on the other hand, at least for bitmap fonts, simple rule-based
substitutions set up by the font designer can easily provide the
needed functionality with less than 5kb of code doing all the glyph
processing.
Right now we're at an unfortunate point where the core X font system
has been deprecated, but there is nothing suitable in its place.
Moreover non-X unix consoles are essentially deprecated as well since
they lack all but some patronizing Euro-centric 512-glyph "Unicode"
support. Do you think someone is going to integrate FreeType into
Linux anytime soon? :)
All problem solving is about choosing the right tool for the job.
Storing bitmap fonts in the TTF/OpenType framework is like using a
nuclear missile to toast fruit flies, or like driving an SUV to
commute to the office...
When it comes to character cell fonts (which is an even narrower
problem field than bitmap fonts), the goal is something that can
provide the baseline support for readable and correct display of any
script and that can work in any environment needing character cell
display, from embedded systems to unix console drivers to 'poweruser'
X sessions with 50 terminals open across 8 virtual desktops and half
of them scrolling text constantly... What is NOT needed is more
substanceless eyecandy that takes 500 megs of ram and 3ghz to run
smoothly. Doesn't anyone find it a bit ironic that you can get
translations of the featureless GNOME and KDE applets (which are about
as bare and useless as the MS Windows "accessories") into almost any
language and that the widgets display all the scripts correctly, but
then when you go try to USE your language for any serious work on unix
you find that everything displays bogus when you type "ls" in your
shell, that most of the powerful text editors have no idea about
something as basic as nonspacing characters, that ELinks still insists
on dumbing-down the perfect UTF-8 it received from the web into a
legacy codepage before converting it back to UTF-8 to display on your
terminal, etc.?
There's a severe gap between where the focus on m17n and i18n is being
placed and where it's needed, and IMO a huge part of that comes from
the fact that most competent unix users _scorn_ m17n and i18n because
of the perception that it's inherently bloated. I've met plenty of
people whose knee-jerk reaction after typing ./configure --help is to
--disable any i18n-related option they see out of fear that it will
fill up their disk with unwanted crap, introduce security vulns, or
just make the program use 3-10x the memory it should.. Fears like this
are compounded by the fact that, at present, the user is forced to
make a choice between "lightweight configuration with incorrect or no
m17n support" and "bloated configuration that pulls in Pango, glib,
fontconfig, Xft, Xrender, ..." just to be able to get "odd scripts I
don't really care about" to display properly. This sort of dichotomy
of course perpetuates the lack of good support. GNOME coders who have
no features in their apps except for the beautiful behavior of the gui
widgets are happy to spend effort (and tens of megabytes) linking to
the behemoth libs and patting themselves on the back for being
"internationally friendly", while developers working on long-standing
projects where most of the substance lies somewhere other than the gui
presentation are baffled by these libs for which they understand
neither the necessity nor the implementation. And unlike rad gui
monkeys who are happy to copy-and-paste-and-glue whatever libs someone
throws at them, people working on mature projects with actual features
are wary of including support for anything they don't understand. The
perpetuation of the myth that "correct rendering of all the world's
scripts is difficult and requires advanced software" of course makes
them even more wary.
I could go on and on for years and years about this for sure. I'm
extremely bitter about the sad state of m17n on unix and the fact that
there is not even one working Unicode terminal with simultaneous
support for all scripts.
So with that said, I'll continue on with my draft bitmap font format
(which already has a lot more simplifications -- remember, a work of
art is only complete when you can't find anything left to _remove_
from it), write my 5kb of code, integrate it into uuterm, and
somewhere in the next few months aim to have the first working Unicode
terminal emulator... in a 50kb static binary.
So much for "m17n is bloated crap"...
Rich
P.S. At the same time I'd be very happy to discuss bitmap and
character cell fonts, other users' and developers' requirements for
them (particularly for scripts I'm not familiar with). Also I have a
tendency to flame, especially when I'm bitter about something. Please
don't take anything I said too seriously except for that overall
thesis that things are in a bad state and lightweight m17n support is
critically needed in order to enable use and development of m17n in
serious apps.
--
Linux-UTF8: i18n of Linux on all levels
Archive: http://mail.nl.linux.org/linux-utf8/
Rich Felker
2006-08-03 15:58:38 UTC
Permalink
Post by George W Gerrity
Please. Let's not have yet another *NIX font encoding and presenting
scheme! Why don't you set up a team to rationalise the existing
encodings and presentation methods.
This is the sort of mentality that sickens me. "Please oh please don't
make something good because there's so much crap out there that you
should fix instead!" This is the sort of mentality that led to
abominations like BIND and Sendmail surviving as long as they did,
OpenSSH (in all its glory of vulnerabilities) being forked from the
old SSH code instead of rewritten from scratch, etc.
Post by George W Gerrity
The biggest headache in *NIX
(with the exception of Mac OS X's underlying version) is the
haphazard way that handling of non-ASCII characters and the I18n has
developed. It is especially grotty at the system level, and as you
The system level has nothing to do with fonts... Until you get to
fonts and rendering, m17n and i18n are extremely trivial.
Post by George W Gerrity
commented below, one of the reasons is that (English-only speaking)
*NIX systems people think that handling of non-ASCII charsets should
somehow be trivial and not bulky in code.
I'm not English-only speaking yet I'm quite confident that it should
be trivial and not bulky in code, and that applications should not
even have to think about it.

The difference between your approach (and the approach of people who
have written most of the existing applications with extensive script
support) and mine is exactly the same as the difference between the
early efforts at converting to Unicode (especially by MS) and UTF-8:
The MS/Unicode approach was to pull the rug out from under everyone
and force them to drop C, drop UNIX, drop all existing internet
protocols, and store text as 16bit characters in UCS-2. The UTF-8
approach on the other hand recognizes that most of the time when
programs are dealing with text they don't care about the encoding or
meaning of the text at all. At most they care about some codepoints in
the first 128 positions that have special meaning to the software.
Thus Unicode can be supported _without_ any special effort from
developers.

The obvious exception to this comes when it's time to display the
text on a visual device for the user. :) Terminals, if they work
correctly with the necessary scripts, provide a very clean solution to
the problem because the application doesn't have to think about the
presentation of the text. Historically it meant the application could
just assume 1 byte == 1 character position for non-control characters.
Now, the same requires mbtowc/wcwidth, but it's not any huge burden.
Surely a lot less burden than doing the text rendering yourself.
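To make that concrete, here is a minimal sketch of the whole burden
(assuming a UTF-8 locale; str_columns is just an illustrative name):

#define _XOPEN_SOURCE 600  /* for wcwidth() */
#include <stdio.h>
#include <string.h>
#include <locale.h>
#include <wchar.h>

/* Count the terminal columns a multibyte string occupies: the modern
 * replacement for the old "1 byte == 1 column" assumption. */
int str_columns(const char *s)
{
    mbstate_t st;
    size_t n = strlen(s);
    int cols = 0;
    memset(&st, 0, sizeof st);
    while (n) {
        wchar_t wc;
        size_t k = mbrtowc(&wc, s, n, &st);
        if (k == (size_t)-1 || k == (size_t)-2)
            return -1;              /* invalid or truncated sequence */
        int w = wcwidth(wc);
        if (w > 0)
            cols += w;              /* width 0 = combining, adds nothing */
        s += k;
        n -= k;
    }
    return cols;
}

int main(void)
{
    setlocale(LC_ALL, "");          /* pick up the user's UTF-8 locale */
    printf("%d\n", str_columns("abc"));
    return 0;
}
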

But what about applications that _do_ want/need to do the text
rendering themselves? This must include at least the terminal
emulator, :) and also imaging programs, presentation apps,
visually-oriented web browsers, ... As long as the program does not
need to do its _own_ text display it may be able to rely on a widget
set, which basically gives all the same advantages as using a terminal
with regard to implementation simplicity. (However now we need to add
widget sets to the list of things that need to do text rendering..)

This whole line of questioning raises a lot more questions than it
answers and I'm going to easily get sidetracked if I continue...
Post by George W Gerrity
I am no longer up-to-date with kernel and system details in *NIX, and
am not a developer — perhaps an interested bystander is where I fit
in — but I used to do a lot of coding in that area, so I know how
difficult it can be. My view is that what is needed is a modular (and
Why modular? "Modular" is the magic panacea word among people writing
this bloatware, and all it does is massively increase memory
requirements and complexity.
Post by George W Gerrity
unified) way of slotting in support for handling various alphabets
and languages,
The view that supporting a new alphabet or language requires a new
module is fundamentally wrong. All it should require is proper
information in the font.
Post by George W Gerrity
based on Unicode categories, that can be easily set up
at system build time.
So at build time you either choose "bloatware with m17n" or "legacy
ascii/latin1 crap"? Sounds like the current problem we're stuck with.
The bloatware distros will mostly choose the former and the ones
targeting more advanced users who dislike bloat will choose the
latter, perpetuating the problem that competent developers despise
m17n and i18n and therefore do not include support in their programs.
Post by George W Gerrity
Moreover, *NIX is greatly in need of a way of
unifying all the various ways for formatting and representing
characters at all levels, using system-level code.
Huh? What does this even mean? Are you confusing glyphs with
characters? Representing characters is trivial.
Post by George W Gerrity
This may even imply
some minor tweaking of the POSIX standard.
.....
Post by George W Gerrity
I know that a real-life problem (with a deadline?) has got you
No deadline except being tired of having a legacy system.
Post by George W Gerrity
energised to tackle this can of worms, but a quick fix or re-
invention of the wheel is just not the way to go.
Someone once said: "when the wheel is square you need to reinvent it".
Post by George W Gerrity
Someone with energy
and know-how has got to get a team together and fix what is broken in
the guts of *NIX so that it presents a good, clean interface for I18n
and multiple character set representation.
Absolutely not. This is the bloatware doctrine, that new interfaces
and libs are a panacea, that they're best designed by teams and
committees, etc. What's needed is _simplicity_. When you have
simplicity everything else follows.

There is a possibility here to solve a simple, almost-trivial unsolved
problem. What you propose is abandoning the simple problem and trying
to solve much more difficult problems instead, many of which will not
be solved anytime in the near future due as much to personal and
political reasons as to technical ones. Moreover, even if the more
difficult problem is solved, the solution will not be useful to anyone
except the ./configure --enable-bloat crowd (myself included). Why
would I want to abandon a real, solvable problem in order to attempt
to solve a problem that's uninteresting to me?

Rich
George W Gerrity
2006-08-04 04:16:04 UTC
Permalink
Post by Rich Felker
Post by George W Gerrity
Please. Let's not have yet another *NIX font encoding and
presenting scheme! Why don't you set up a team to rationalise the
existing encodings and presentation methods.
This is the sort of mentality that sickens me. "Please oh please
don't make something good because there's so much crap out there
that you should fix instead!" This is the sort of mentality that
led to abominations like BIND and Sendmail surviving as long as
they did, OpenSSH (in all its glory of vulnerabilities) being
forked from the old SSH code instead of rewritten from scratch, etc.
Actually, that is what I was opposing. But any solution to console
representation has to handle three things together — localisation,
internationalisation, and multilingualisation — or there will still
be the mess where these things are dealt with inconsistently and in
multiple separate places in existing *NIX systems, and even in
the POSIX standard.

The font encoding is incidental unless it is too simple to provide
the rendering required for complex script systems. Moreover, the
problem has nothing to do with font encoding, except that the
decoding and rendering are done in so many different places in a *NIX
system.

If you are spending your effort on a new (compact) glyph
representation to use at the console to avoid bloat or proprietary
software, then you are wasting your time. A font requires more than
the encoding of glyph representation if it is to be compact: there
must be some way to combine simple glyphs to form a more complex
glyph before rendering as a glyph image. Experts in font encoding
have spent years in developing their encoding methods to be both
efficient in time and in space, while at the same time enabling the
encoding to handle fonts for _any_ script system: I doubt that you
can improve on them, but go ahead and try, keeping in mind that
support for describing some glyphs in complex fonts is still not
fully specified even in Unicode, much less in the font encodings.

And having done that, you still have to fix L10n, I18n, and m17n so
that they are handled properly at the console level and so that the
routines for these features are implemented in one place and don't have
to be replicated by every application and/or in different interfaces.
Post by Rich Felker
Post by George W Gerrity
The biggest headache in *NIX (with the exception of Mac OS X's
underlying version) is the haphazard way that handling of non-
ASCII characters and the I18n has developed. It is especially
grotty at the system level, and as you ...
The system level has nothing to do with fonts... Until you get to
fonts and rendering, m17n and i18n are extremely trivial.
It depends on how character strings are handled before they get to
the console application. In some *NIX systems, this is handled in the
kernel, mixed up with I/O handling. This was done for efficient I/O
handling, including efficient buffering. As I said in my first email,
I am no longer cognisant of how this sort of code is handled, but
when I was working on *NIX, I had to rewrite a lot of that code to
remove assumptions about what a word was, what a char was, what a
byte was. I know that this has been cleaned up since, but I would be
surprised if all the dependencies of low-lying data handling have
been removed.

The other point is that rendering _is_ required at the console level
for more complex script systems: you cannot special-case consoles to
fixed width and avoid rendering problems in the _majority_ of non-
Latin scripts.

With proper m17n, L10n and I18n, someone speaking Hindi, for instance
(more of them than English speakers!) should be able to boot into
single user with both prompts and commands using the appropriate
script for Hindi. Correct rendering of Indic scripts is _not_ trivial
(and therefore the code is bulky). At the single user level, many
implementations of *NIX incorporate this terminal rendering code
either into the kernel or as system code.

Naturally, one wouldn't want this code for rendering every script
system incorporated into each and every system, and therefore,
modularity makes sense: you put in what you need at build time.
That's done all the time with *NIX implementations. Moreover, with
proper (modular) design, you can build in your m17n one script system
and its supported languages at a time, releasing new code as you go,
and not having to rewrite what you started with every time you add.
Post by Rich Felker
Post by George W Gerrity
commented below, one of the reasons is that (English-only
speaking) *NIX systems people think that handling of non-ASCII
charsets should somehow be trivial and not bulky in code.
I'm not English-only speaking yet I'm quite confident that it
should be trivial and not bulky in code, and that applications
should not even have to think about it.
But do your linguistic skills extend to a language using a non-latin
script — or, more relevant — a language that uses a complex script
system?
Post by Rich Felker
The difference between your approach (and the approach of people
who have written most of the existing applications with extensive
script support) and mine is exactly the same as the difference
between the early efforts at converting to Unicode (especially by
MS) and UTF-8: The MS/Unicode approach was to pull the rug out from
under everyone and force them to drop C, drop UNIX, drop all
existing internet protocols, and store text as 16bit characters in
UCS-2.
Please don't imply that I would support MS in any way. They have used
every trick in the trade to lock users into their products, and all
of their products started as spaghetti code written in assembler.
They eventually switched to C and then to their brand of C++, but
still produced monolithic code to control their APIs and keep most of
them hidden from independent developers. In any case, they had no
choice but to pull the rug out from under anyone (including their own
people) because there was no other way to upgrade to Unicode from the
crap API they had.

Modular approaches to programming have nothing whatever to do with
bloat and everything to do with maintainability, understandability,
orthogonality, and extensibility. Indeed, some *NIX systems (such as
Mach) are modular and are _not_ bloated and are highly efficient. A
good argument can be made that Mach is more easily maintained than
say, Linux, which has a huge, non-modular kernel.
Post by Rich Felker
The UTF-8 approach on the other hand recognizes that most of the
time when programs are dealing with text they don't care about the
encoding or meaning of the text at all. At most they care about
some codepoints in the first 128 positions that have special
meaning to the software. Thus Unicode can be supported _without_
any special effort from developers.
Yep. As long as all code developers are required to learn English to
an acceptable level: very Anglo-centric. I applaud the extension of C/
C++, etc, to use and represent variable names and commands in scripts
other than latin-1.
Post by Rich Felker
The obvious exception to this comes when it's time to display the
text on a visual device for the user. :) Terminals, if they work
correctly with the necessary scripts, provide a very clean solution
to the problem because the application doesn't have to think about
the presentation of the text. Historically it meant the application
could just assume 1 byte == 1 character position for non-control
characters.
Obviously, we are talking at cross purposes. You seem to be agreeing
that text rendering for multiple scripts needs to be available for
any application (including the login process and "sh" and its
derivatives?). That requires delving into pretty low-level and pretty
ancient code bits, some of which _may_ require a change to the
interface, i.e., the API, and that would mean that the POSIX standard
would be breached unless and until it was altered. It also means a
complete rewrite of this low-level code.

I am asking for this complete rewrite, as opposed to quick fixes. If
that is what you are actually doing, I support it. But, it is a big
job, and would benefit from some help and consultation with like-
minded individuals. I am _not_ suggesting an IBM/MS-type code team:
efficiency can be — and often is — achieved by a small team of
experienced and dedicated programmers working together closely.

And I repeat, writing rendering code is _not_ trivial, and it _is_
bulky, and it _is_ necessary at the console level.
Post by Rich Felker
Now, the same requires mbtowc/wcwidth, but it's not any huge
burden. Surely a lot less burden than doing the text rendering
yourself.
If you have to map code to representation, then you are doing
rendering. Rendering in some scripts maps multiple code points to one
glyph position. For instance, Vietnamese, which uses the Latin
alphabet, can have up to five accents applied to a basic Latin
character, to present one glyph fitting into one (wide or narrow)
character position. Representing Vietnamese in a fixed-width simple
terminal emulator requires considerable rendering code, even though
most of the required accents and all of the alphabet are found in ascii.
Post by Rich Felker
But what about applications that _do_ want/need to do the text
rendering themselves? This must include at least the terminal
emulator, :) and also imaging programs, presentation apps, visually-
oriented web browsers, ... As long as the program does not need to
do its _own_ text display it may be able to rely on a widget set,
which basically gives all the same advantages as using a terminal
with regard to implementation simplicity. (However now we need to
add widget sets to the list of things that need to do text
rendering..)
Agreed.
Post by Rich Felker
This whole line of questioning raises a lot more questions than it
answers and I'm going to easily get sidetracked if I continue...
If you don't get sidetracked enough to deal with difficult scripts —
at least at the design level — your solution will be yet another
inadequate kludge: it won't be flexible enough to add support for
more difficult scripts.
Post by Rich Felker
Post by George W Gerrity
I am no longer up-to-date with kernel and system details in *NIX,
and am not a developer — perhaps an interested bystander is where
I fit in — but I used to do a lot of coding in that area, so I
know how difficult it can be. My view is that what is needed is a
modular (and ...
Why modular? "Modular" is the magic panacea word among people
writing this bloatware, and all it does is massively increase
memory requirements and complexity.
See my comments above. Modularity generally does increase code size
compared to spaghetti code, but it has a number of advantages: 1) In
the case discussed above, it allows one to remove rendering code for
script systems that do not need to be supported at build time; 2) it
allows you to develop code step by step or in parallel with others.
You can start with simple Latin scripts, add support for Cyrillic and
Greek. Then you need to tackle the simplest right-to-left script,
Hebrew. Next you can deal with something like Vietnamese, which uses
multiple code points from Latin-1 to render single glyphs. And so on;
3) It is more easily maintained than non-modular designs.

In addition, well-designed modular code is not particularly larger
than non-modular code. Bulky code is usually feature-bloated code
generated by tacking these features onto a core application that was
in itself badly designed. MS Word is a good example: there is still
an annoying rendering flaw in it that has been there since its
inception, but nobody can fix it because the original code was so
badly designed and written, and the add-ons only bury it deeper.
Post by Rich Felker
Post by George W Gerrity
unified) way of slotting in support for handling various alphabets
and languages,
The view that supporting a new alphabet or language requires a new
module is fundamentally wrong. All it should require is proper
information in the font.
Not true! Rendering of non-latin fonts is much more complex than
that. Rendering involves a complex (multiple) code point-to-glyph
mapping that can be context dependent and may require reordering of
the code points before mapping. The mapping is script dependent,
language dependent, and font-dependent. Rendering Arabic Script, for
instance, is highly context dependent, since the form of glyph used
depends on whether the consonant is at the beginning, middle, or end
of the word, and on what vowel is associated with the consonant. It is
also language dependent, since, for instance, Farsi (spoken in Iran
and Afghanistan), uses some extra glyphs not found in Arabic-language
Arabic script. I don't believe that reordering is required, but then
I am not a user of Arabic script.

Both modern Greek and modern Hebrew also have a few consonants that
are rendered differently when they are at the end of a word.

All Indic scripts require extensive code-point reordering, since
traditionally, some vowels are placed before certain consonants, even
though they are pronounced after the consonant, and the reason is
that they are a key to the combined vowel-consonant glyph to be
rendered. I could go on, but I repeat: rendering of non-latin scripts
is _not_ trivial. Moreover, like Arabic scripts, rendering of an
Indic script is also language dependent. A Devanagari font has
alternate and extra glyphs in order to cater for a number of Sanskrit-
derived modern languages that use it.
Post by Rich Felker
Post by George W Gerrity
based on Unicode categories, that can be easily set up at system
build time.
So at build time you either choose "bloatware with m17n" or "legacy
ascii/latin1 crap"? Sounds like the current problem we're stuck
with. The bloatware distros will mostly choose the former and the
ones targeting more advanced users who dislike bloat will choose
the latter, perpetuating the problem that competent developers
I would use "one-eyed" rather than "competent". Competency is not
limited to *NIX system coders, nor are all *NIX coders competent.
Post by Rich Felker
despise m17n and i18n and therefore do not include support in their
programs.
Once again, you misunderstand my position: bloatware _will_ result if
you tack on L10n, m17n, and I18n to the existing "legacy ascii/latin1
crap". You need to start from scratch, maybe even changing some APIs,
to remove the ascii/latin1 design bias built into original UNIX. But,
handling and rendering non-latin-1 scripts is complex, and hence the
code will be considerably larger than what is being replaced.
Moreover, this needs to be done whatever font encoding is used, and
font encoding is largely orthogonal to L10n, m17n, and I18n. Only the
rendering engine is dependent on font encoding.
Post by Rich Felker
Post by George W Gerrity
Moreover, *NIX is greatly in need of a way of unifying all the
various ways for formatting and representing characters at all
levels, using system-level code.
Huh? What does this even mean? Are you confusing glyphs with
characters? Representing characters is trivial.
My slip of the tongue. What I was trying to say is that various
applications including printer drivers, terminal drivers, text
editors, etc, seem to use different APIs, software, and tables to
format and to render text. Rewriting from scratch with the idea that
all text is Unicode and with the rendering done at a low level based
on L10n and included Unicode control characters (some of which are
multiple byte and some of which specifiy language and script system
to be rendered), then the system presents a uniform API for all
applications that need code rendered. Perhaps some text processing
systems will need to do their own rendering, but I imagine that in
some cases they can access a fixed-width rendering system applied to
a variable-width font, and do their own spacing. Alternatively, if
rendered fonts use anti-aliasing and colours/shades, maybe your basic
renderer won't be useful.
Post by Rich Felker
Post by George W Gerrity
This may even imply some minor tweaking of the POSIX standard.
.....
Post by George W Gerrity
I know that a real-life problem (with a deadline?) has got you
No deadline except being tired of having a legacy system.
Great! You have time to spend on a careful, modular design based on a
good understanding of the problems that arise with m17n being
included at the basic console level.
Post by Rich Felker
Post by George W Gerrity
energised to tackle this can of worms, but a quick fix or re-
invention of the wheel is just not the way to go.
Someone once said: "when the wheel is square you need to reinvent it".
Agreed.
Post by Rich Felker
Post by George W Gerrity
Someone with energy and know-how has got to get a team together
and fix what is broken in the guts of *NIX so that it presents a
good, clean interface for I18n and multiple character set
representation.
Absolutely not. This is the bloatware doctrine, that new interfaces
and libs are a panacea, that they're best designed by teams and
committees, etc. What's needed is _simplicity_. When you have
simplicity everything else follows.
A hammer is a simple tool. If that's all you've got, then every
problem is a nail that needs to be whacked in. Unfortunately, not all
mechanical problems can be rectified with just a hammer.

Cars are much more complex than the original Model T, and a lot of it
is bloatware. But no one in their right mind would want to drive at
100kph with mechanical-linkage brakes. No one could afford an engine
as inefficient as the V-12 that was in some 40s and 50s Oldsmobiles.
No one today would design an engine w/o fuel injection and computer
control of fuel/air ratio, because in this case, complexity yields
efficiency.

Your basic premise is wrong: I18n, m17n, and L10n _are_ very complex,
even if implemented with fixed-width fonts at the console level. Your
little hammer won't do, and the solution will be big compared to
handling ascii.
Post by Rich Felker
There is a possibility here to solve a simple, almost-trivial
unsolved problem.
If it were trivial, it would have been solved long ago: it is not.
Post by Rich Felker
What you propose is abandoning the simple problem
I repeat: the problem is not simple.
Post by Rich Felker
and trying to solve much more difficult problems instead, many of
which will not be solved anytime in the near future due as much to
personal and political reasons as to technical ones.
The only political problem has to do with font encoding: both Adobe
and MS want to keep control of the huge font rendering and font
foundary industry, Adobe because they have a near monopoly on
firmware rendering systems, and MS so they can keep technology away
from any competitors, including Apple. Their union is unstable, and
conflicting interests have indeed led to complex font encoding and to
inadequate standardisation. Having said that, Adobe has been trying
to deal with the complexities of rendering non-Latin script systems
without having to start from scratch: that leads to bloat, but you do
have to take economics into consideration.
Post by Rich Felker
Moreover, even if the more difficult problem is solved, the
solution will not be useful to anyone except the ./configure --
enable-bloat crowd (myself included). Why would I want to abandon a
real, solvable problem in order to attempt to solve a problem
that's uninteresting to me?
Because you misunderstand the problem? Because if it really were
simple, it would have been simply done already?

George
------
Werner LEMBERG
2006-08-04 07:17:18 UTC
Permalink
Rendering in some scripts maps multiple code points to one glyph
position. For instance, Vietnamese, which uses the Latin alphabet,
can have up to five accents applied to a basic Latin character, to
present one glyph fitting into one (wide or narrow) character
position.
Hmm, only two, AFAIK: Vietnamese has a set of tone marks (acute,
grave, hook, tilde, dot below) which can be applied to a base letter,
and some of the base letters have an accent (breve, circumflex, horn).
Representing Vietnamese in a fixed-width simple terminal emulator
requires considerable rendering code, even though most of the
required accents and all of the alphabet are found in ascii.
This isn't correct since Unicode contains all Vietnamese letters in
precomposed form. Anyway, this is an exception.


Werner
Rich Felker
2006-08-05 06:49:59 UTC
Permalink
Post by George W Gerrity
Actually, that is what I was opposing. But any solution to console
representation has to handle three things together — localisation,
internationalisation, and multilingualisation — or there will still
be the mess where these things are dealt with inconsistently and in
multiple separate places in existing *NIX systems, and even in
the POSIX standard.
If you're going to make bold claims like this you need to back them
up, especially claiming POSIX is inconsistent on the matter.
Post by George W Gerrity
The font encoding is incidental unless it is too simple to provide
the rendering required for complex script systems. Moreover, the
That is exactly the topic: rendering "complex" scripts. Better Unicode
console support would still be interesting even without this (for
example, CJK would be very useful to many many people), but I'm not
interested in any solution that doesn't cover the so-called (IMO a
misnomer) complex scripts.
Post by George W Gerrity
[...] then you are wasting your time.
How am I "wasting my time" if the end result is something I can use,
whereas nothing I can use exists now??
Post by George W Gerrity
A font requires more than
the encoding of glyph representation if it is to be compact: there
must be some way to combine simple glyphs to form a more complex
glyph before rendering as a glyph image.
If you'd read any of this thread you'd know that I'm quite aware of
combining and the requirements for characters to have varying glyphs
under different contexts and combinations. However, the combining
process is not complex. It can all be accomplished with simple
overstrike provided you have sufficiently powerful rules for
expressing which glyph to use for a character depending on context.
Developing such a system is the question this thread was started in
order to answer.
Post by George W Gerrity
Experts in font encoding
Proof by appeal to authority generally does not impress
mathematicians. :)
Post by George W Gerrity
have spent years in developing their encoding methods to be both
efficient in time and in space,
It's been established that their methods are _not_ efficient in space.
They're only efficient in time because of their severe limitations (at
most 64k glyphs, etc.).
Post by George W Gerrity
while at the same time enabling the
Only with added script-specific knowledge from the rendering
implementation, which may need to be upgraded when new scripts are
added.. This is not acceptable since some software will not be
updated, due to laziness/disappearance by authors, etc. As long as the
only burden is on the font files, then any font containing glyphs for
a script will necessarily provide full support for that script in any
application, without the application's authors having to explicitly
include support.
Post by George W Gerrity
Post by Rich Felker
The system level has nothing to do with fonts... Until you get to
fonts and rendering, m17n and i18n are extremely trivial.
It depends on how character strings are handled before they get to
the console application. In some *NIX systems, this is handled in the
kernel, mixed up with I/O handling. This was done for efficient I/O
handling, including efficient buffering. As I said in my first email,
I am no longer cognisant of how this sort of code is handled, but
when I was working on *NIX, I had to rewrite a lot of that code to
remove assumptions about what a word was, what a char was, what a
byte was. I know that this has been cleaned up since, but I would be
surprised if all the dependencies of low-lying data handling have
been removed.
This entire paragraph shows a complete ignorance about unix which
almost amounts to trolling. There is exactly one place where the
kernel needs to have an awareness of character encoding, and this is
at the 'cooked/icanon' tty level where the kernel handles simple
line-editing operations. Failure to be aware of multibyte encoding,
fullwidth characters, and nonspacing characters will result in
backspace and such behaving incorrectly when the terminal is in
canonical input mode.
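For the record, the needed logic is tiny. A userspace sketch (assuming
a UTF-8 locale; erase_last is an invented name, not actual kernel code):

#define _XOPEN_SOURCE 600  /* for wcwidth() */
#include <stdio.h>
#include <string.h>
#include <stdlib.h>
#include <locale.h>
#include <wchar.h>

/* Erase the last character from a UTF-8 line buffer the way an
 * icanon-mode tty must: remove the whole multibyte sequence and echo
 * "\b \b" once per screen column. Width-0 combining characters are
 * erased together with their base character. */
size_t erase_last(char *buf, size_t len)
{
    while (len) {
        size_t start = len - 1;
        while (start && (buf[start] & 0xC0) == 0x80)
            start--;                /* back up over continuation bytes */
        wchar_t wc;
        int w = 1;                  /* fallback for invalid sequences */
        if (mbtowc(&wc, buf + start, len - start) > 0)
            w = wcwidth(wc);
        len = start;
        if (w != 0) {               /* reached the base character */
            int i;
            for (i = 0; i < w; i++)
                fputs("\b \b", stdout);
            break;
        }                           /* w == 0: also erase the base */
    }
    return len;
}

int main(void)
{
    setlocale(LC_ALL, "");          /* assumes a UTF-8 locale */
    char line[] = "abce\xCC\x81";   /* "abce" + U+0301 combining acute */
    size_t len = erase_last(line, strlen(line));
    printf("\n%zu bytes left\n", len);
    return 0;
}
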

Otherwise, there are a few optional things like filename translation
when mounting windows UTF-16 filesystems, which the kernel can handle.
But for the most part the kernel is unaware of encoding and has no
need to care about encoding.
Post by George W Gerrity
The other point is that rendering _is_ required at the console level
The kernel-internal terminal on the console video device is another
issue that can be handled by the kernel, but there's no fundamental
reason one of these is needed. It can also be implemented in
userspace, which is what I'm doing for the time being. If my work (or
someone else's better work) can eventually be integrated into Linux,
I'll be happy, but it's not essential. Terminals that run on the
framebuffer device, svgalib, or under X are all perfectly usable.
Post by George W Gerrity
for more complex script systems: you cannot special-case consoles to
fixed width and avoid rendering problems in the _majority_ of non-
Latin scripts.
A terminal is a character-cell device, with fixed-width character
cells. This is not open to discussion, but fear not, it's not a
problem! On a modern terminal there are three character widths: zero
(nonspacing/combining), one (most scripts including Latin, Greek,
Cyrillic, Indic, ...), and two ("full width" CJK ideographs, etc.).

To my knowledge there is still no official standard as to which
characters have which width, but POSIX specifies the function used to
obtain the width of each character (and defines the results as
'locale-specific'), and Markus Kuhn's implementation is the de facto
standard and is based on applying very reasonable rules to the
published Unicode data (East Asian Width tables and Mn and Cf classes,
mainly).
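A sketch of the shape of those rules (is_combining and is_wide stand
in for the bisection searches over static Unicode tables that the real
implementation does; the ranges below are illustrative subsets only):

#include <wchar.h>

/* Illustrative stand-ins for the real tables (each covers only one
 * Unicode block here; the actual tables cover many): */
static int is_combining(wchar_t c) { return c >= 0x0300 && c <= 0x036F; }
static int is_wide(wchar_t c)      { return c >= 0x4E00 && c <= 0x9FFF; }

int uc_width(wchar_t c)
{
    if (c == 0)
        return 0;
    if (c < 32 || (c >= 0x7F && c < 0xA0))
        return -1;                  /* control characters */
    if (is_combining(c))
        return 0;                   /* Mn/Me and most Cf characters */
    if (is_wide(c))
        return 2;                   /* East Asian Wide and Fullwidth */
    return 1;                       /* everything else */
}
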
Post by George W Gerrity
With proper m17n, L10n and I18n, someone speaking Hindi, for instance
(more of them than English speakers!) should be able to boot into
single user with both prompts and commands using the appropriate
script for Hindi.
I agree completely.
Post by George W Gerrity
Correct rendering of Indic scripts is _not_ trivial
(and therefore the code is bulky).
This is a non sequitur. I've written plenty of non-trivial code in
well under 10k. Also rendering Indic scripts is nowhere near as
complicated as people make it sound. The main issue, vowel placement,
can be handled with simple glyph selection rules in the font as
follows: any character followed by the reordering vowel uses the vowel
glyph as its glyph; the vowel takes its glyph from the character
appearing before it. Notice that neither the application using the
terminal nor the terminal itself had to know anything about the
concept of reordering. Everything takes place at the glyph selection
stage.
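In code, with glyph_of as a hypothetical stand-in for the font's
default character-to-glyph mapping, the whole trick is a swap:

#include <stdint.h>

#define VOWEL_SIGN_I 0x093F  /* U+093F, drawn to the LEFT of its consonant */

/* Stand-in for the font's default character-to-glyph mapping. */
uint32_t glyph_of(uint32_t ch) { return ch; }

/* Glyph selection for a consonant followed by a reordering vowel: the
 * two cells simply swap glyphs. Neither the terminal nor the
 * application ever needs a concept of "reordering". */
void select_pair(uint32_t cons, uint32_t vowel,
                 uint32_t *cell0, uint32_t *cell1)
{
    if (vowel == VOWEL_SIGN_I) {
        *cell0 = glyph_of(vowel);   /* consonant's cell shows the vowel */
        *cell1 = glyph_of(cons);    /* vowel's cell shows the consonant */
    } else {
        *cell0 = glyph_of(cons);
        *cell1 = glyph_of(vowel);
    }
}
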
Post by George W Gerrity
Naturally, one wouldn't want this code for rendering every script
system incorporated into each and every system, and therefore,
Why not? If it's 5k what's the harm?? My claim is that I can support
all scripts in under 10k of code, provided you have an appropriate
font with the right (also small) tables. In this implementation, users
wanting to minimize system size can select a font with only the
scripts they need, rather than compiling a crippled application. The
same application/kernel binaries will then support all scripts if you
just give them a more-complete font.
Post by George W Gerrity
Post by Rich Felker
I'm not English-only speaking yet I'm quite confident that it
should be trivial and not bulky in code, and that applications
should not even have to think about it.
But do your linguistic skills extend to a language using a non-latin
script -- or, more relevant -- a language that uses a complex script
system?
Yes. བོད་སྐད་དང་བོད་ཡིག་ཤེས་ཀྱི་ཡོད༌
(Hope I didn't mess that up... I typed it blind since I don't yet have
a terminal that can display it.)
Post by George W Gerrity
Post by Rich Felker
The difference between your approach (and the approach of people
who have written most of the existing applications with extensive
script support) and mine is exactly the same as the difference
between the early efforts at converting to Unicode (especially by
MS) and UTF-8: The MS/Unicode approach was to pull the rug out from
under everyone and force them to drop C, drop UNIX, drop all
existing internet protocols, and store text as 16bit characters in
UCS-2.
Please don't imply that I would support MS in any way. They have used
I'm not; I was just implying that your "pull out the rug" idea is
similar to theirs, and that it's based on a wrong assumption that
unix/C/etc. are somehow broken when it comes to m17n, which they're
not.
Post by George W Gerrity
them hidden from independent developers. In any case, they had no
choice but to pull the rug out from under anyone (including their own
people) because there was no other way to upgrade to Unicode from the
crap API they had.
UTF-8 would have worked just as well on Windows as it does on UNIX.
Maybe someday they'll finally offer UTF-8 as an option for the 8bit
character encoding, though somehow I doubt it since that would be like
admitting UCS-2 was a stupid idea.
Post by George W Gerrity
I applaud the extension of C/
C++, etc, to use and represent variable names and commands in scripts
other than latin-1.
Yes! Actually AFAIK Latin-1 was never legal. They went straight from
ASCII to Unicode.
Post by George W Gerrity
Post by Rich Felker
Now, the same requires mbtowc/wcwidth, but it's not any huge
burden. Surely a lot less burden than doing the text rendering
yourself.
If you have to map code to representation, then you are doing
rendering.
You don't. A program with a terminal interface works only with
characters, never glyphs/representation. There are still some
questions about r2l/bidi stuff which I don't think have satisfactory
answers yet, but this is a different issue from glyphs/characters.
Post by George W Gerrity
Representing Vietnamese in a fixed-width simple
terminal emulator requires considerable rendering code, even though
most of the required accents and all of the alphabet is found in ascii.
The rendering code is trivial. The terminal emulator just accumulates
all nonspacing characters after the base character in one character
cell, and blits the appropriate glyphs on top of one another. The only
nontrivial part is when the glyphs to use depend on the context. This
is the problem I'm working on -- efficiently mapping characters
(including combining characters) to glyphs based on context.
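The per-cell work is just a loop of ORs (a sketch assuming a packed
1-bit-per-pixel format, 8 pixels per row, in the spirit of classic
console fonts):

#include <stdint.h>
#include <stddef.h>

#define CELL_H 16   /* glyph height in pixel rows */

/* Render one character cell: OR the base glyph and every accumulated
 * combining glyph on top of one another. Overstrike is a bitwise OR. */
void blit_cell(uint8_t cell[CELL_H],
               const uint8_t *const glyphs[], size_t nglyphs)
{
    size_t i;
    int row;
    for (i = 0; i < nglyphs; i++)
        for (row = 0; row < CELL_H; row++)
            cell[row] |= glyphs[i][row];
}
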
Post by George W Gerrity
Not true! Rendering of non-latin fonts is much more complex than
that. Rendering involves a complex (multiple) code point-to-glyph
mapping that can be context dependent
There's never a need for a mapping of a multi-element sequence of
codepoints to one glyph. You just use overstrike with appropriate
variants. In the worst case you can emulate many-to-one with
sufficient use of contextual glyph selection to make the base
character map to a glyph containing all of the modifications while the
combining characters all map to a blank glyph, but in practice you can
almost always do something much more reasonable with fewer contextual
rules.
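As a hypothetical illustration (invented glyph names, not real font
data), the Vietnamese u + U+031B COMBINING HORN case could be written
as rules like these:

#include <stdint.h>

/* Hypothetical glyph indices, for illustration only. */
enum { GLYPH_NOTDEF, GLYPH_U, GLYPH_U_HORN, GLYPH_BLANK };

struct rule {
    uint32_t ch;     /* character being mapped */
    uint32_t next;   /* required following character, 0 = any */
    uint32_t glyph;  /* glyph to select; first matching rule wins */
};

/* 'u' before U+031B COMBINING HORN draws the whole precomposed shape
 * from the base cell; the horn itself then draws nothing. */
const struct rule rules[] = {
    { 'u',    0x031B, GLYPH_U_HORN },
    { 0x031B, 0,      GLYPH_BLANK  },
    { 'u',    0,      GLYPH_U      },  /* default glyph */
};
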
Post by George W Gerrity
It is
also language dependent, since, for instance, Farsi (spoken in Iran
and Afghanistan), uses some extra glyphs not found in Arabic-language
Arabic script. I don't believe that reordering is required, but then
I am not a user of Arabic script.
AFAIK it's not required, but I'm not a user of Arabic either.
Post by George W Gerrity
Both modern Greek and modern Hebrew also have a few consonants that
are rendered differently when they are at the end of a word.
Greek has separate codepoints for final/nonfinal sigma. I don't know
whether Greek users are used to typing these as separate characters or
whether the font should do it for them.. Anyway it's a trivial
context-based replacement.
Post by George W Gerrity
Post by Rich Felker
Post by George W Gerrity
energised to tackle this can of worms, but a quick fix or re-
invention of the wheel is just not the way to go.
Someone once said: "when the wheel is square you need to reinvent it".
Agreed.
:)
Post by George W Gerrity
Your basic premise is wrong: I18n, m17n, and L10n _are_ very complex,
even if implemented with fixed-width fonts at the console level. Your
little hammer won't do, and the solution will be big compared to
handling ascii.
Yes, compared to ASCII handling it will be bigger. It's about a
thousand times bigger. 5k instead of 5 bytes. :)
Post by George W Gerrity
Post by Rich Felker
There is a possibility here to solve a simple, almost-trivial
unsolved problem.
If it were trivial, it would have been solved long ago: it is not.
UTF-8 encoding: solved a long time ago with mb[r]towc and friends.

Unicode terminals with nonspacing and doublewide characters: solved by
the wcwidth library call and implemented by urxvt, xterm, mlterm, ...

"Complex" scripts on terminal: not solved. mlterm handles some scripts
but only the ones it specifically supports. The problem is not
complexity but the lack of the appropriate data. Due to the rendering
methods mlterm uses (especially if using core X font system) it
doesn't have access to GSUB/GPOS type info. This can be solved for
most scripts just by getting more direct/intelligent access to the
fonts, but I'm taking a different approach (which works for all
scripts not just some) of letting the font tell you what glyphs to
use.

The amount that remains to be done here is really small. It's like
putting in the last few pieces of a puzzle. I don't claim getting bidi
perfect or getting all the data tables right for perfect glyph mapping
will be easy or something that I can do on my own, but I _do_ claim
that building the framework that can support all cases is trivial.

Rich
Werner LEMBERG
2006-08-05 07:11:39 UTC
Permalink
Post by Rich Felker
A terminal is a character-cell device, with fixed-width character
cells. This is not open to discussion, but fear not, it's not a
problem!
Actually, this limitation makes some things more complicated, because
you have to use glyph variants instead of glyph composition in case
the width of the composed glyph changes.
Post by Rich Felker
Representing Vietnamese in a fixed-width simple terminal emulator
requires considerable rendering code, even though most of the
required accents and all of the alphabet are found in ascii.
The rendering code is trivial. The terminal emulator just
accumulates all nonspacing characters after the base character in
one character cell, and blits the appropriate glyphs on top of one
another.
At least for the u -> uhorn `accent' this doesn't work well...
Post by Rich Felker
There's never a need for a mapping of a multi-element sequence of
codepoints to one glyph. You just use overstrike with appropriate
variants. In the worst case you can emulate many-to-one with
sufficient use of contextual glyph selection to make the base
character map to a glyph containing all of the modifications while
the combining characters all map to a blank glyph, but in practice
you can almost always do something much more reasonable with fewer
contextual rules.
Isn't the first sentence a contradiction to the rest of the
paragraph? I think I'm misunderstanding. Please give an example.


Werner
Rich Felker
2006-08-05 07:39:19 UTC
Permalink
Post by Werner LEMBERG
Post by Rich Felker
A terminal is a character-cell device, with fixed-width character
cells. This is not open to discussion, but fear not, it's not a
problem!
Actually, this limitation makes some things more complicated, because
you have to use glyph variants instead of glyph composition in case
the width of the composed glyph changes.
For nonspacing combining characters (Mn), the width of the combined
character is the same as the width of the base character. Spacing
combining characters (Mc) occupy their own character cell of course
and behave exactly like a noncombining character except that they
almost surely have ligatures. I'm sure you can give examples of where
this behavior is not ideal, but at least it works.
Post by Werner LEMBERG
Post by Rich Felker
The rendering code is trivial. The terminal emulator just
accumulates all nonspacing characters after the base character in
one character cell, and blits the appropriate glyphs on top of one
another.
At least for the u -> uhorn `accent' this doesn't work well...
For a few cases where it doesn't work well, you make a special glyph
variant, optionally making the glyph for the combining character
become blank in this case. That's how the Indic consonants where part
of the glyph is deleted can be handled as well.
Post by Werner LEMBERG
Post by Rich Felker
There's never a need for a mapping of a multi-element sequence of
codepoints to one glyph. You just use overstrike with appropriate
variants. In the worst case you can emulate many-to-one with
sufficient use of contextual glyph selection to make the base
character map to a glyph containing all of the modifications while
the combining characters all map to a blank glyph, but in practice
you can almost always do something much more reasonable with fewer
contextual rules.
Isn't the first sentence a contradiction to the rest of the
paragraph? I think I'm misunderstanding. Please give an example.
You're just misunderstanding the terminology I'm using. I'm speaking
with regard to how it's implemented in the font's tables, not how you
might think about it.

Rich
Chris Heath
2006-08-06 11:34:16 UTC
Permalink
Post by Rich Felker
To my knowledge there is still no official standard as to which
characters have which width, but POSIX specifies the function used to
obtain the width of each character (and defines the results as
'locale-specific'), and Markus Kuhn's implementation is the de facto
standard and is based on applying very reasonable rules to the
published Unicode data (East Asian Width tables and Mn and Cf classes,
mainly).
I think you are making some incorrect assumptions about wcwidth.

Firstly, it *is* locale-dependent. On my Fedora Core 4 system, I used
this simple C program to test:

#define _XOPEN_SOURCE 600 /* for wcwidth() */
#include <stdio.h>
#include <locale.h>
#include <wchar.h>        /* declares wcwidth() */

int main(int argc, char** argv) {
    int i;
    sscanf(argv[2], "%x", &i);      /* codepoint given in hex */
    if (setlocale(LC_ALL, argv[1])) /* locale name given first */
        printf("wcwidth(0x%04X)=%d in locale %s\n", i, wcwidth(i), argv[1]);
    else
        printf("Locale '%s' not found.\n", argv[1]);
    return 0;
}

And I got this output:

wcwidth(0x00C0)=2 in locale ja_JP.eucJP
wcwidth(0x00C0)=1 in locale ja_JP.UTF8


Secondly, wcwidth doesn't appear to be derived from the East Asian width
tables any more. UAX #11 lists U+00C0 as neutral, but the above example
demonstrates that it is treated as ambiguous.


Thirdly, I'm not sure how you plan to handle Hangul Jamo. Because
wcwidth works on the level of Unicode characters, not glyphs, I can't
see how you can handle the more general cases described in Section 3.12
of the Unicode standard.

On my machine, wcwidth returns 2 for the leading Jamo consonants (L),
and zero for the vowels (V) and trailing consonants (T). So if you have
two leading consonants in a row, the second one should overstrike the
first, but also has an extra width of 2 associated with it.

I tried wcswidth to see if it returned 2 or 4 when you pass it a string
with two leading consonants. It returned 4, which might be "incorrect"
to a Korean eye, but at least it is consistent with wcwidth. However,
the Single Unix Specification doesn't mandate that wcswidth return the
sum of wcwidth for each character in the string. So maybe your font
system should base its widths on wcswidth instead of wcwidth, in case
wcswidth is changed to handle this more general case in the future.
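For reference, that wcswidth experiment boils down to this complete
test program (the numbers will of course vary with libc and locale):

#define _XOPEN_SOURCE 600  /* for wcwidth()/wcswidth() */
#include <stdio.h>
#include <locale.h>
#include <wchar.h>

int main(void)
{
    wchar_t jamo[] = { 0x1100, 0x1100, 0 };  /* two leading consonants (L, L) */
    setlocale(LC_ALL, "");                   /* assumes a UTF-8 locale */
    printf("wcwidth sum = %d, wcswidth = %d\n",
           wcwidth(jamo[0]) + wcwidth(jamo[1]),
           wcswidth(jamo, 2));
    return 0;
}
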

Chris
Rich Felker
2006-08-17 14:10:48 UTC
Permalink
Post by Chris Heath
Post by Rich Felker
To my knowledge there is still no official standard as to which
characters have which width, but POSIX specifies the function used to
obtain the width of each character (and defines the results as
'locale-specific'), and Markus Kuhn's implementation is the de facto
standard and is based on applying very reasonable rules to the
published Unicode data (East Asian Width tables and Mn and Cf classes,
mainly).
I think you are making some incorrect assumptions about wcwidth.
Not entirely, but yes, some.
Post by Chris Heath
Firstly, it *is* locale-dependent. On my Fedora Core 4 system, I used
#include <stdio.h>
#include <locale.h>
int main(int argc, char** argv) {
int i;
sscanf(argv[2], "%x", &i);
if (setlocale(LC_ALL, argv[1]))
printf("wcwidth(0x%04X)=%d in locale %s\n", i, wcwidth(i), argv[1]);
else
printf("Locale '%s' not found.\n", argv[1]);
}
wcwidth(0x00C0)=2 in locale ja_JP.eucJP
wcwidth(0x00C0)=1 in locale ja_JP.UTF8
This is nothing but glibc being idiotic. Yes it's _allowed_ to do this
according to POSIX (POSIX makes no requirements about correspondence
of the values returned to any other standard) but it's obviously
incorrect for the width of À to be anything but 1, even if it was
historically displayed wide (wtf?!) on some legacy CJK terminal types.

In practice the only way wcwidth's results should be "locale
dependent" is when __STDC_ISO_10646__ is not defined, i.e. when the
implementation does not use UCS codepoints for wchar_t in non-UTF-8
locales. Some implementations (including [old?] BSD) use a one-to-one
mapping of char to wchar_t for legacy 8bit locales and other simpler
mappings for legacy CJK locales.

Keep in mind that as long as your wchar_t values come from mb*towc
functions, the two locale dependencies will cancel out and in practice
the widths returned will be "locale independent" in this case.
Post by Chris Heath
Secondly, wcwidth doesn't appear to be derived from the East Asian width
tables any more. UAX #11 lists U+00C0 as neutral, but the above example
demonstrates that it is treated as ambiguous.
Again this is glibc being idiotic. File a bug report. :)
Post by Chris Heath
Thirdly, I'm not sure how you plan to handle Hangul Jamo. Because
wcwidth works on the level of Unicode characters, not glyphs, I can't
see how you can handle the more general cases described in Section 3.12
of the Unicode standard.
I'm aware that there's an issue with Hangul Jamo, but uncertain how
severe it is and what all the implications are.
Post by Chris Heath
On my machine, wcwidth returns 2 for the leading Jamo consonants (L),
and zero for the vowels (V) and trailing consonants (T). So if you have
two leading consonants in a row, the second one should overstrike the
first, but also has an extra width of 2 associated with it.
Why should two leading consonants in a row overstrike one another? Is
this actually used in the script? I seriously doubt that overstriking
is the correct behavior there but I don't know the script.
Post by Chris Heath
I tried wcswidth to see if it returned 2 or 4 when you pass it a string
with two leading consonants. It returned 4, which might be "incorrect"
to a Korean eye, but at least it is consistent with wcwidth. However,
the Single Unix Specification doesn't mandate that wcswidth return the
sum of wcwidth for each character in the string.
Interesting. I hadn't realized that wcswidth and wcwidth were allowed
to disagree.
Post by Chris Heath
So maybe your font
system should base its widths on wcswidth instead of wcwidth, in case
wcswidth is changed to handle this more general case in the future.
My font system has nothing to do with wcwidth or wcswidth. Column
width is an issue of the terminal emulator or other program displaying
text, not the font.

Rich
David Starner
2006-08-17 20:08:13 UTC
Permalink
Post by Rich Felker
This is nothing but glibc being idiotic. Yes it's _allowed_ to do this
according to POSIX (POSIX makes no requirements about correspondence
of the values returned to any other standard) but it's obviously
incorrect for the width of À to be anything but 1, even if it was
historically displayed wide (wtf?!) on some legacy CJK terminal types.
It's not obviously incorrect; in a CJK terminal, everything but ASCII
was double-width, which was actually a very convenient way of doing
things. Many of these fonts are still around, and I suspect that many
users still use terminals that expect everything but ASCII to be
double-width. glibc here is merely supporting the way things work.
Ben Wiley Sittler
2006-08-19 18:07:40 UTC
Permalink
for displaying doublebyte-charset documents the east asian width
semantics are indispensable. there are very good reasons to have two
modes for the terminal — east asian (all but ascii and explicitly
narrow kana/hangeul/etc. as two cells) and non-east-asian (all but
kanji/hanzi/hanja, hangeul, and kana single-width). the first is
cell-compatible with the DBCS terminals (useful for viewing forms,
character-cell art, webpages, etc., including e.g. doublewidth
cyrillic characters used as graphics) and the second with non-DBCS
terminals (actual cyrillic text, for example.)
iuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuv c
Post by David Starner
Post by Rich Felker
This is nothing but glibc being idiotic. Yes it's _allowed_ to do this
according to POSIX (POSIX makes no requirements about correspondence
of the values returned to any other standard) but it's obviously
incorrect for the width of À to be anything but 1, even if it was
historically displayed wide (wtf?!) on some legacy CJK terminal types.
It's not obviously incorrect; in a CJK terminal, everything but ASCII
was double-width, which was actually a very convenient way of doing
things. Many of these fonts are still around, and I suspect that many
users still use terminals that expect everything but ASCII to be
double-width. glibc here is merely supporting the way things work.
--
Linux-UTF8: i18n of Linux on all levels
Archive: http://mail.nl.linux.org/linux-utf8/
Ben Wiley Sittler
2006-08-19 18:20:55 UTC
Permalink
sorry, cat-typing sent that email a bit early. here's the rest:

for indic scripts and arabic having triple-cell ligatures is really
indispensable for readable text.

for east asian text a ttb, rtl columnar display mode is really, really
nice. mongolian of course needs ttb, ltr columnar display. this may
require contextual rotation of some paired glyphs (some of the rotated
forms have unicode compatibility mappings, some don't.) yes, i realize
unicode doesn't really handle vertical text yet, and the layout
algorithms are theoretically slightly different; however mlterm does
a passable job at least for CJK. how to handle single-cell vs.
double-cell vs. triple-cell glyphs in vertical presentation is a
tricky problem: short runs (<= 2 cells) should probably be displayed
as horizontal inclusions, longer runs should probably be rotated.

why don't we have escape sequences for switching between the DBCS and
non-DBCS cell behaviors, and for rotating the terminal display for
vertical text vs. horizontal text? Note that mixing vertical and
horizontal is sometimes done in the typographic world but is probably
not needed for terminal emulators (this requires a layout engine much
more advanced than the unicode bidi algorithm, capable of laying out
nested rectangular regions in all four mixed script directions, and
presumably an escape sequence scheme that is nestable too, akin to
the bidi overrides and directional controls.)
Rich Felker
2006-08-20 23:01:07 UTC
Permalink
Post by Ben Wiley Sittler
for indic scripts and arabic having triple-cell ligatures is really
indispensable for readable text.
for east asian text a ttb, rtl columnar display mode is really, really
nice.
For a terminal? Why? Do you want to see:

l
s

-
l
[...]

??? I suspect not. If anyone really does want this behavior, then by
all means they can make a terminal with a different orientation. But
until I hear about someone really wanting this I'll assume such claims
come from faux-counter-imperial chauvinism where western academics in
ivory towers tell people in other cultures that they must "preserve
their traditions" for their own sake with no regard for practicality,
and end up doing nothing but _disadvantaging_ people.
Post by Ben Wiley Sittler
a passable job at least for CJK. how to handle single-cell vs.
double-cell vs. triple-cell glyphs in vertical presentation is a
I've never heard of a triple-cell glyph. Certainly the "standard"
wcwidth (Kuhn's version) has no such thing.
Post by Ben Wiley Sittler
tricky problem - short runs (<= 2 cells) should probably be displayed
as horizontal inclusions, longer runs should probably be rotated.
Nonsense. A terminal does not have the luxury to decide such things.
You're confusing "terminal" with "word processor" or maybe even with
TeX...
Post by Ben Wiley Sittler
why don't we have escape sequences for switching between the DBCS and
non-DBCS cell behaviors, and for rotating the terminal display for
vertical text vs. horizontal text?
Because it's not useful. Applications will not use it. All the
terminal emulator needs to do is:

1. display raw text in a form that's not offensive -- this is
necessary so that terminal-unaware programs just writing to stdout
will work.

2. provide cursor positioning functions (minimal) and (optionally)
scrolling/insert/delete and other small optimizations.

Anything more is just pure bloat because it won't be supported by
curses and applications are written either to curses or to vt102.
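(For concreteness: the "minimal" positioning in item 2 really is
minimal. Absolute positioning boils down to the vt102/ANSI
cursor-position sequence, which even a curses-free program can emit
directly:)

    #include <stdio.h>

    /* move the cursor to 1-based (row, col) with the vt102/ANSI CUP
       sequence; essentially the whole "cursor positioning" item above */
    void cursor_to(int row, int col)
    {
        printf("\033[%d;%dH", row, col);
    }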
Post by Ben Wiley Sittler
Note that mixing vertical and
horizontal is sometimes done in the typographic world but is probably
not needed for terminal emulators (this requires a layout engine much
more advanced than the unicode bidi algorithm, capable of laying out
This most certainly does not belong in a terminal emulator. Apps
(such as text based web browsers) wishing to do elegant
multi-orientation formatting can do the cursor positioning and such
themselves. Users preferring a vertical orientation can configure
their terminals as such. This is a matter of user preference, not
application control, and thus there should NOT be a way for
applications to control or override it.

Rich
Ben Wiley Sittler
2006-08-21 01:50:17 UTC
Permalink
see mlterm, please: some of these are very useful display forms, and
already in use in character-cell terminal emulators.

as for triple-cell glyphs, see emacs with arabic presentation forms.
Rich Felker
2006-08-05 08:10:14 UTC
Permalink
To follow up on my original proposal and some of the alterations and
simplifications I've made as a result of discussions here and with
other people outside this list, here's a summary of the problem I'm
trying to solve and how I plan to solve it:

Practical problems:
- no terminal emulators with broad support for scripts
- existing font formats put the burden of layout and parts of the
substitution on software, and have very high size overhead
- existing formats don't support more than 64k glyphs or characters
(and if they do, it'll be through some hacks...)

Ivory tower problems:
- existing font formats don't respect the unicode distinction between
characters and glyphs and use hacks to work around this problem

Requirements:
- low overhead -- should average 4 or fewer bytes per character
- complete separation of the notions of character and glyph
- oriented towards character-cell devices
- ability to select the correct glyph when doing character-at-a-time
rendering, without passing large strings to a substitution library
or having to guess how much context is needed.
- "efficient access path to these glyphs" (-- Kuhn '99)
- derived from source format which consists of glyphs only along with
lists of which characters each glyph can represent and under what
conditions (Kuhn's idea).
- reversibility: must be feasible to recover source format from a
"compiled" font file.

Implementation:
- completely interpreter based: glyph selector code interprets a
program in the font file to map characters to glyphs
- interpreted language is intentionally extremely weak; it does not
admit any unbounded constructs.
- interpreted language is designed to be used efficiently by a font
compiler starting from the source format with typical real-world
font data.

"Variables" in interpreted language:
- character number, initially set to the desired character
- glyph number, initially set to zero

Operations in interpreted language (all args are unsigned):
- 0. end program, using current glyph number
- 1. if (ch>=arg1) { jump by arg2 code bytes; ch=0; }
- 2. glyph += ch*arg; ch=0;
- 3. jump by ch*arg code bytes; ch=0;
- 4. jump by arg bytes
- 5. if (in context specified by arg1) { glyph += arg2; end; }

Usage: 0 is obvious. 1 allows conditional treatment of ranges (best
use is to construct a binary tree with it), especially huge unassigned
or unsupported character ranges. 2 allows entire ranges without
ligatures or variants to map directly to glyphs without per-character
cpu-time or file-size overhead. 3 allows a sort of jump table where
the font can specify a code vector for each character in the range.
4 is self-explanatory.

Finally, 5 is the key feature. While the former ops are there to allow
efficient mapping of one million (or more..?) codepoints to glyphs, 5
allows conditional glyph selection based on context. Contexts are
essentially RE bracket expressions (much weaker than RE) for the
adjacent character positions surrounding the character whose glyph has
been requested. For example the 'low accent mark' context for latin
might be [acegijmnopqrsuvwxyz] immediately prior to the accent mark.
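To make the dispatch concrete, here's a minimal sketch of the
selector loop in C. The byte encoding is invented purely for
illustration (one opcode byte followed by 32-bit little-endian
arguments, with jumps counted from the end of the instruction), since
the format doesn't fix one yet, and in_context() is a placeholder for
the still-unsettled context test:

    #include <stdint.h>

    static uint32_t get_u32(const unsigned char *p)
    {
        return p[0] | (uint32_t)p[1] << 8
                    | (uint32_t)p[2] << 16 | (uint32_t)p[3] << 24;
    }

    /* run a font's glyph-selection program for character ch */
    uint32_t select_glyph(const unsigned char *code, uint32_t ch,
                          int (*in_context)(uint32_t arg, void *state),
                          void *state)
    {
        uint32_t glyph = 0, a1, a2;
        for (;;) switch (*code++) {
        case 0:                         /* end; use current glyph */
            return glyph;
        case 1:                         /* conditional jump on range */
            a1 = get_u32(code); a2 = get_u32(code+4); code += 8;
            if (ch >= a1) { code += a2; ch = 0; }
            break;
        case 2:                         /* map a whole range directly */
            glyph += ch * get_u32(code); code += 4; ch = 0;
            break;
        case 3:                         /* jump table indexed by ch */
            a1 = get_u32(code); code += 4 + ch * a1; ch = 0;
            break;
        case 4:                         /* unconditional jump */
            a1 = get_u32(code); code += 4 + a1;
            break;
        case 5:                         /* contextual substitution */
            a1 = get_u32(code); a2 = get_u32(code+4); code += 8;
            if (in_context(a1, state)) return glyph + a2;
            break;
        default:                        /* bad opcode: fail safe */
            return 0;
        }
    }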

The precise requirements for context are among the details
that I'm still working out, and which I would like help with since I'm
_not_ familiar with every script on the planet. Of course if I just go
with the draft spec and then refine it along the way while building my
font (with large parts derived from the GNU unifont project, but
corrected for the horrible character==glyph assumption it makes and
lack of correct nonspacing/wide glyphs), by the time it's done I'll
probably have something working very well.

A few more details: whenever you want to look up a glyph for a
character, you begin at the interpreted program's entry point and
interpret it. Typically this process will begin with one or more type
1 operations to eliminate unused portions of the codepoint space, then
use type 2 (if the entire range maps directly to glyphs) or 3 (to
implement individual processing for each character in the range).
Optimality of the lookup process depends on having a good compiler to
convert the source file to such a 'program', but thanks to
reversibility, a poorly compiled font can be restored to source and
recompiled with a better compiler.
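As a tiny worked example (same invented encoding as the sketch
above), a font whose glyph order simply mirrors Latin-1 could compile
to three operations, with everything above U+00FF falling back to
glyph 0:

    /* hypothetical compiled program: glyph = ch for U+0000..U+00FF,
       glyph 0 for everything else */
    static const unsigned char latin1_program[] = {
        1, 0x00,0x01,0x00,0x00, 5,0,0,0, /* if (ch >= 0x100) jump +5 */
        2, 1,0,0,0,                      /* glyph += ch*1; ch = 0 */
        0                                /* end */
    };

The taken branch skips the 5-byte type-2 op and lands on the final
type-0 op with glyph still 0; the fall-through path maps the
character straight to its glyph.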

Rich


P.S. I'm going to be travelling in Taiwan for the next week and a half
and not working on this project during that time. Please don't think
I've disappeared or dropped it. I'll be back (with more rigorous specs
and maybe some completed code) by the end of the month.
Werner LEMBERG
2006-08-10 06:39:14 UTC
Permalink
Post by Rich Felker
- existing font formats don't respect the unicode distinction
between characters and glyphs and use hacks to work around this
problem
You mean bitmap fonts, right? Fonts based on the SFNT format of course
have this distinction.
Post by Rich Felker
The specific precise requirements for context are one of the details
that I'm still working out, and which I would like help with since
I'm _not_ familiar with every script on the planet. Of course if I
just go with the draft spec and then refine it along the way while
building my font (with large parts derived from the GNU unifont
project, but corrected for the horrible character==glyph assumption
it makes and lack of correct nonspacing/wide glyphs), by the time
it's done I'll probably have something working very well.
Try your code on, say, Arabic and Hindi, and it should work for most
other scripts too, I think. The most complicated latin-based script
is classical Greek, AFAIK; this would be a good test for glyph
composition also.


Werner