Rich Felker
2006-08-02 02:53:50 UTC
To Markus et al.:
I read in the ancient archives for this list some ideas regarding a
so-called next generation console font, supporting Unicode level-3
combining in a character cell environment. I'm presently working on a
new terminal emulator called uuterm (think of the uu as µ-ucs or
something) with the goal of extremely efficient but complete Unicode
support, and I'm at the point of needing a font format. Using OpenType
fonts is not an option since I want it to run on small systems and
possibly even serve as a basis for integrating into Linux as a
replacement console device (the current design is set up to run both
on the Linux fb from userspace and as an X client, but is adaptable to
various display and input devices).
Was any progress ever made towards specifying the format or
requirements for NCF? Here are some ideas I have, quoting Markus's
earlier message:

> - space for up to 1 million glyphs

I see no realistic need for this many, but placing arbitrary limits
doesn't help us anyway.

> - efficient access path to these glyphs (i.e., something
>   better than linear search, that can be accessed from a
>   memory mapped file)

Absolutely. See below for my (albeit possibly overcomplex) proposal.

> - support for glyph variant options (i.e., the user can activate
>   or deactivate some style options, which will influence the
>   character/glyph mapping)
> - glyph variations could include bold/italic/wide/etc., they could also

I question the necessity. Character cell devices do not lend
themselves well to italics unless you're going to allow glyphs to
extend outside their cells, and then you're getting into the realm of
advanced pretty-typesetting where you should probably be using
opentype fonts and mlterm.. Bold is a nice option but the vast
majority of scripts simply do not admit legible bold glyphs in
reasonable character-cell sizes. (Arguably even latin does not due to
the letter m).

> include CJK style variations, as well as more mundane things

CJK variations are definitely a consideration. My inclination is to
take the Unicode approach and assume that the user will choose a font
suited to the form of the glyphs they're most comfortable with.
However I've also seen cases where traditional Chinese text looks very
strange when the characters existing in Japanese are shown in a
Japanese style while the rest (which have no Japanese variant) are
shown in traditional style. Unfortunately I see no easy way for a
character cell device to handle this anyway, short of adding attribute
modes for CJK variant or interpreting the nonspacing Unicode
characters for variant selection. Still, neither of these will be
present when you run "ls" and some of your filenames are Chinese while
others are Japanese, unless you fill your filenames with escape
sequences or something disgusting like that...

> such as whether you want to have a slash through a zero or
> whether you want to have visible codes for the many different
> Unicode space characters (for debugging)

Feature creep.. Wouldn't it make more sense to load a different font
or use something like 'sed -n l'. By the way my proposed design below
lends itself fairly well to 'stacking' multiple fonts if you want to
use a specialized supplemental font for some characters and fall back
to another font for the rest. "Debug glyphs" seems like a case where
an approach like that might be appropriate.

> - support for ligature substitution (for languages that depend
>   on spacing combining characters); this means that a sequence of
>   several Unicode characters can be replaced by a single wide glyph.

You can look at it that way, or you can look at it as "f gets
displayed as [lefthand part of fi ligature] if following character is
i". While the former is maybe slightly more natural for designing a
font, the latter is much more natural for a character cell device IMO.
During any ligature substitution, the number of cells must always be
invariant anyway (even if this looks ugly); otherwise applications
cannot rely on column formatting without knowing detailed information
about the script (beyond what wcwidth() reports) and the terminal
contents will get corrupted (and look much uglier than overwide
ligatures).
[Yes I know that my "fi" example is idiotic on a character cell device
but I didn't want to try to type Arabic characters which I don't have
any clue about.. :) ]

> - support for glyph variations such as smaller uppercase characters,
>   which have enough space on top to fit a combining character over them.

This can again be a contextual variation, like ligatures: "A gets
displayed as [stubby A] if followed by [list of all combining marks
that go above latin characters]".
Now, a question: should the font have to include such context
definitions inline, or should the font format specify predefined
classes like "superscribed combining marks"?

> - support for combining characters, not only by simple overstriking,
>   but also by allowing some offset. In other words, I want the
>   diaeresis over the "a" to be 2 pixels lower than over the "A",
>   therefore there should be a way to add some "combining_shift(0,-2)"
>   attribute to the "a" character to make this happen.

Offsets are not necessary; they complicate rendering and can be
implemented equally well with ligature substitution: "[diaeresis]
gets displayed as [low diaeresis] if attached to a lowercase latin
letter".
To qualify this claim of non-necessity: my main goal is character cell
support for Tibetan script, which is probably the most complicated
stacking script in Unicode. I've successfully produced a set of glyphs
that stack and attach to one another correctly using only simple
ligature-context-style rules, without any offsets.

> So an NCF (Next Generation Console Font) file would contain
> a set of glyphs, each of which comes with some attributes.
> These attributes could be
> - the sets of Unicode characters that this glyph can represent
>   under the condition that some style variant has been activated

I strongly _like_ this approach of defining glyphs and listing the
characters they can represent, rather than indexing glyphs by
character, at least as a way of thinking. Whether it's efficient to
store them that way remains to be seen. It may be ideal to have a
"source" format for fonts that works along these lines coupled with a
binary format that's optimized for looking up characters.

> - the set of Unicode sequences that this glyph can
>   represent (ligature substitution), again perhaps conditioned
>   by style variants

I would prefer contexts to sequences.

> - shift offsets for various types of combining characters

Again, IMO not needed. Overstrike handles it all as long as you have
ligature substitutions.

> - etc.

:)

Now some ideas of mine for the design of a file format. This is not
likely to be a good source format for editing, but is designed for
runtime use by
applications (and possibly the kernel) which want to keep memory usage
and cpu expenditure low. It's also intended that it should be possible
to convert back and forth to the source format without losing any
information about the relationship between glyphs and characters.
So, here goes an idea:
----------------------------------------------------------------------
The file begins with a header consisting of a magic number, the file
format version, the character cell width and height, a search depth
for the binary lookup tree (see below), and an offset to the start of
the tree.
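
A minimal sketch in C of what such a header might look like; the
field names, widths, and little-endian byte-order choice are purely
illustrative assumptions, nothing fixed:

#include <stdint.h>

/* Hypothetical on-disk header; all multi-byte fields little-endian
 * by assumption. A real loader would read the fields byte-by-byte
 * for portability rather than overlaying a struct on the file. */
struct ncf_header {
        uint32_t magic;       /* magic number identifying the format */
        uint32_t version;     /* file format version */
        uint8_t  cell_width;  /* character cell width in pixels */
        uint8_t  cell_height; /* character cell height in pixels */
        uint8_t  tree_depth;  /* search depth of the binary lookup tree */
        uint8_t  pad;
        uint32_t tree_offset; /* file offset of the root tree node */
};
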
Next (actually the location does not matter) is the binary lookup
tree. Each tree node consists of three numbers: the pivot codepoint
and the offsets of the left and right branches. All codepoints less
than the pivot are to the left; codepoints greater than or equal to
the pivot are to the right.
After walking exactly /depth/ steps down the tree (exactness
requirement allows performance-oriented implementations to perform the
walk branchlessly using an unrolled loop and does no harm to other
implementations; typical depth will be 4-6 depending on presence or
absence of CJK characters), a block header is found at the resulting
file offset.
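
Under the same illustrative assumptions, the node layout and the
fixed-depth walk might look like this (the struct overlay again
stands in for byte-wise field reads):

#include <stdint.h>

struct ncf_node {
        uint32_t pivot; /* codepoints < pivot go left, >= pivot right */
        uint32_t left;  /* file offset of left child or block header */
        uint32_t right; /* file offset of right child or block header */
};

/* Walk exactly 'depth' steps from the root node at offset 'root' in
 * the mapped file 'base', returning the offset of the block header
 * for codepoint 'cp'. The comparison compiles readily to a
 * conditional move, and the fixed trip count lets the loop be
 * unrolled. */
static uint32_t ncf_lookup_block(const unsigned char *base,
                                 uint32_t root, int depth, uint32_t cp)
{
        uint32_t off = root;
        for (int i = 0; i < depth; i++) {
                const struct ncf_node *n =
                        (const struct ncf_node *)(base + off);
                off = cp < n->pivot ? n->left : n->right;
        }
        return off;
}
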
Blocks group characters and associated glyphs, allowing more compact
data representations. After an initial header, blocks contain a list
of offsets to the character data for each character in the block.
In their least sophisticated form, blocks could simply correspond to
roughly contiguous portions of the UCS character space in order to
avoid having thousands of table entries for unassigned characters. A
depth of zero is even possible, indicating no binary tree and just one
block with a million character table pointers in it. However, much more
sophisticated and efficient storage is possible; in order to see how
they might be used I first detail the character/glyph structure and
the relationship between characters and glyphs.
Characters are essentially (hybrid linked/sequential) lists of glyph
substitution rules. In many cases the rule is simply a single inline
glyph. Character data begins with a control code indicating a
substitution rule. If the substitution rule matches, the immediately
following glyph data is used. It may be an inline glyph or a global
offset to any other glyph in the file, for characters which share
glyphs. If the substitution rule fails, the corresponding data is
skipped and another control code is read at the new position.
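
A sketch of the rule-walking loop this implies; the entry layout
(control byte, a 2-byte entry length so failed rules can be skipped
without parsing their glyph data, rule payload, glyph data) and the
control codes are invented purely for illustration:

#include <stdint.h>

static uint32_t get24(const uint8_t *p)
{
        return (uint32_t)p[0] << 16 | (uint32_t)p[1] << 8 | p[2];
}

/* Walk a character's substitution-rule list until one matches and
 * return a pointer to the winning glyph data. Hypothetical encoding:
 * control 0 = unconditional match (the mandatory final entry);
 * control 1 = "next character in [lo,hi]", payload two 24-bit bounds. */
static const uint8_t *ncf_pick_glyph(const uint8_t *p, uint32_t next_cp)
{
        for (;;) {
                uint8_t ctl = p[0];
                uint32_t entlen = (uint32_t)p[1] << 8 | p[2];
                const uint8_t *payload = p + 3;
                if (ctl == 0)
                        return payload;        /* default glyph */
                if (ctl == 1 && next_cp >= get24(payload)
                             && next_cp <= get24(payload + 3))
                        return payload + 6;    /* glyph after rule data */
                p += 3 + entlen;               /* rule failed; skip entry */
        }
}
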
Glyph data consists of the glyph height (in scanlines), the row number
of the first scanline of the glyph, and the byte-aligned scanline
bitmaps. Width in character cells (single or double) is implicit from
the codepoint (is this reasonable?).
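
A sketch of the decoded glyph record and the overstrike-style
rendering it permits, assuming for simplicity a cell width of at most
8 pixels, i.e. one byte per scanline (double-width cells would take
two):

#include <stdint.h>

struct ncf_glyph {
        uint8_t height;        /* scanlines actually stored */
        uint8_t row;           /* cell row of the first stored scanline */
        const uint8_t *bitmap; /* 'height' bytes of scanline data */
};

/* OR the glyph into a one-byte-per-row cell buffer; combining marks
 * then come for free by drawing further glyphs into the same cell. */
static void ncf_draw(uint8_t *cell, const struct ncf_glyph *g)
{
        for (int i = 0; i < g->height; i++)
                cell[g->row + i] |= g->bitmap[i];
}
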
Any or all of these properties can be block-global. If all characters
in the block have only a single glyph, then a flag can be set
indicating that the character pointers point to glyphs directly (no
control code). There is also a flag indicating that height and row
offset are shared among all glyphs. Regardless of which properties are
shared by all characters/glyphs in the block, a fixed offset increment
between characters may be specified in lieu of a table of offsets.
This is particularly useful in minimizing the size of huge CJK blocks.
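
As a sketch, the sharing options might surface as block-header flags
like these (names and layout again hypothetical):

#include <stdint.h>

enum {
        NCF_BLK_DIRECT  = 1 << 0, /* entries point straight at glyphs,
                                     no control codes */
        NCF_BLK_METRICS = 1 << 1, /* one height/row shared by all glyphs */
        NCF_BLK_STRIDE  = 1 << 2, /* fixed offset increment instead of
                                     an offset table */
};

struct ncf_block {
        uint32_t first_cp; /* first codepoint covered by this block */
        uint32_t count;    /* number of characters in the block */
        uint32_t flags;
        uint32_t stride;   /* increment, if NCF_BLK_STRIDE is set */
        /* offset table follows unless NCF_BLK_STRIDE is set */
};
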
----------------------------------------------------------------------
Comments?
The main questions I have are:
What format to use for numbers? Some numbers have fixed range
depending on, for example, the character cell width and height. Glyph
height and row offset are 4 bits each for height=16 and could be
packed into one byte when they fit. Other numbers can widely vary in
range.
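
For example, storing height-1 lets the full 16-scanline case still
fit in four bits (the nibble order is an arbitrary choice):

#include <stdint.h>

/* Pack height (1..16, stored as height-1) and row offset (0..15)
 * into a single byte for cell heights up to 16. */
static uint8_t pack_hr(unsigned height, unsigned row)
{
        return (uint8_t)(((height - 1) & 0xf) << 4 | (row & 0xf));
}

static unsigned unpack_height(uint8_t b) { return (b >> 4) + 1; }
static unsigned unpack_row(uint8_t b)    { return b & 0xf; }
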
As long as a portable, non-host-specific byte order is used I see no
reason to worry about alignment or using special word sizes. Numbers
will need to be read byte-by-byte anyway. They could be stored in a
vlc (variable length coding) format where one bit of each byte is
stolen for use as a continuation bit, or the common length in bytes
could be predeclared in an appropriate context.
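
A sketch of the vlc variant, with seven data bits per byte,
most-significant group first, and the high bit as the continuation
flag (the bit assignments are just one possible choice):

#include <stdint.h>

/* Decode one variable-length number and advance *pp past it. */
static uint32_t vlc_read(const uint8_t **pp)
{
        const uint8_t *p = *pp;
        uint32_t v = 0;
        do {
                v = v << 7 | (*p & 0x7f);
        } while (*p++ & 0x80);
        *pp = p;
        return v;
}
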
The other major issue is glyph substitution (ligature) context rules.
What rules are needed to accommodate all scripts? Here are some ideas:
1. following [characters]
2. base character in [characters]
3. attached to [characters]
4. has [characters] attached
5. followed by [characters]
where each occurrence of [characters] means a bracket-expression-like
pattern of character codepoints to match. What I wonder is if this
list can be compressed down to just two types of rules, one for
previous character (previous cell if the character is spacing, and
previous character in combining stack if character is nonspacing) and
another for next character (next cell or next in stack). The latter
seems sufficient for the most complicated stacking script and the one
I'm familiar with, Tibetan, but I'm unsure if it's sufficient for
everything one would need to do with ligatures. Particularly of
interest is what's needed for vowel-placement-swapping in Indic
languages and everything to do with Arabic ligatures.
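
If the reduction to two rule types holds, the matcher never needs
more than the adjacent character, in the cell row for spacing
characters or in the combining stack for nonspacing ones; a sketch of
that reduced form (all names hypothetical):

#include <stdbool.h>
#include <stdint.h>

/* The two candidate context types. "Previous"/"next" mean the
 * adjacent cell for spacing characters and the adjacent element of
 * the combining stack for nonspacing ones. */
enum ncf_ctx { NCF_CTX_PREV, NCF_CTX_NEXT };

struct ncf_rule {
        enum ncf_ctx which; /* which neighbor to test */
        uint32_t lo, hi;    /* one range of a bracket-like expression */
};

static bool ncf_rule_matches(const struct ncf_rule *r,
                             uint32_t prev_cp, uint32_t next_cp)
{
        uint32_t cp = r->which == NCF_CTX_PREV ? prev_cp : next_cp;
        return cp >= r->lo && cp <= r->hi;
}
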
Experts, please step up and make suggestions. I expect this to go
through various stages of revision and I will probably implement and
use drafts in uuterm in the meantime, but I'd like to reach a good
stable final spec, especially if there's hope of finally getting
proper Unicode support into the Linux console.
Rich