Discussion: Bidi considered harmful? :)
Rich Felker
2006-09-01 03:33:06 UTC
I read an old thread on the XFree86 i18n list started by Markus Kuhn
suggesting (rather strongly) that bidi should not be supported at the
terminal level, as well as accusations (from other sources) by the author
of Yudit that the UAX#9 bidi algo results in serious security issues due
to the irreversibility of the transformation and that it inevitably
butchers mathematical formulae.

I've also considered examples on my own, such as a program (not
necessarily terminal-aware, just text output) that prints lines of the
form "%s %d %d %s" without any special treatment (such as putting
explicit embedding marks around the %s fields) for bidi text, or a
terminal-based program that draws interface elements over top of
existing RTL text, resulting in nonsense.
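
To make the first case concrete, here is a minimal sketch (invented
data, with uppercase standing in for RTL letters the way UAX#9's own
examples do):

    #include <stdio.h>

    int main(void)
    {
        /* All-LTR data: the columns are stable on any terminal. */
        printf("%s %d %d %s\n", "abc def", 123, 456, "ghi jkl");

        /* string1 ends in RTL, string4 begins with RTL. The bytes
           below are in the same logical order as above, but a
           terminal applying UAX#9 resolves "DEF 123 456 GHI" as one
           right-to-left run and displays the line as
               abc IHG 456 123 FED jkl
           -- the two number columns silently switch places. */
        printf("%s %d %d %s\n", "abc DEF", 123, 456, "GHI jkl");
        return 0;
    }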

In all cases, my personal opinion has been not just that UAX#9 is
broken, but that there's no way to implement any sort of implicit bidi
in a terminal emulator or in the display of text/plain data without
every single program having to go _far_ out of its way to ensure that
it won't give incorrect output when the input contains RTL characters,
which simply isn't going to happen, especially since it would
interfere with use in non-RTL scenarios. Other people may have
different opinions but I have not seen any viable solutions.



At the same time, I'm also very dissatisfied with the lack of proper
support for RTL scripts/languages in most applications and especially
at the terminal level, especially since Arabic is in such widespread
use and has great political importance in world affairs these days. I
do not accept that the solution is just to print characters in the
wrong visual order.

.eerga ll'uoy tcepxe I ylbatrofmoc ecnetnes siht daer nac uoy sselnU



I experimented with the idea of mirroring glyphs to improve
readability, and was fairly surprised by how little it helped my
perception. Reading English text that had been graphically mirrored
remained almost as difficult as reading the above line, with the b/d
and p/q pairs causing significant pause in comprehension.

So then, reading UAX#9 again, I stumbled across the only section
that's not completely stupid (IMO of course):

5.4 Vertical Text

In the case of vertical line orientation, the bidirectional
algorithm is still used to determine the levels of the text.
However, these levels are not used to reorder the text, since the
characters are usually ordered uniformly from top to bottom.
Instead, the levels are used to determine the rotation of the
text. Sometimes vertical lines follow a vertical baseline in which
each character is oriented as normal (with no rotation), with
characters ordered from top to bottom whether they are Hebrew,
numbers, or Latin. When setting text using the Arabic script in
vertical lines, it is more common to employ a horizontal baseline
that is rotated by 90° counterclockwise so that the characters are
ordered from top to bottom. Latin text and numbers may be rotated
90° clockwise so that the characters are also ordered from top to
bottom.

What this provides is a suggested formatting that makes RTL and LTR
scripts both readable in a single-directional context, a vertical one.
Combined with the recent Mongolian script discussion on this list, I
believe this offers an alternate presentation form for documents that
mix LTR and RTL text without using bidi.

I'm not suggesting that everyone should switch to vertically-oriented
terminals or text-file-presentation, although Mongolian users might
like such a setup and it can certainly be one presentation option
that's fair to both RTL and LTR users by making both RTL and LTR
scripts quite readable.

The key idea to take from the Mongolian discussion and from UAX#9 5.4
is that, by having glyphs for LTR and RTL scripts rotated 180°
relative to one another, both can appear legible in a common
directionality. Thus, perhaps LTR users could present legible
RTL-script text by rotating all glyphs 180° and displaying them in LTR
order, and likewise RTL users could use a dominant RTL direction with
LTR glyphs rotated 180° [1]. Like with Mongolian, directionality could
become a localized user preference, rather than a property of the
script.

Does this actually work?

I repeated my experiment with English text reading, rotating the
graphic representation by 180° rather than mirroring it left-right. I
was pleased to find that I could read it with similar ease (but not
quite as fast) as ordinary LTR English text. Surprisingly, p/d and b/q
confusion did not arise, perhaps due to the obvious visual distinction
between the ascent/descent space of the glyphs.

I do not claim these tests are scientific, since the only subject
participating was myself. :) But they are suggestive of an alternative
possible presentation form for mixed LTR/RTL scripts without utilizing
bidirectionality. I consider bidirectionality harmful because:

- It is inherently slow for one's eyes to jump back and forth
switching directions while reading a single paragraph.
- It quickly becomes impossible to read quotations with multiple
levels of directional embedding. Forget UAX#9's 61 levels; 3 levels
are already undecipherable without slow and meticulous work.
- Implicit directionality is impossible to resolve without interfering
with sane people's expectations under string operations. In
particular the UAX#9 insanity involves _semantic_ interpretations of
text contents based on presupposed cultural conventions (like
whether a comma is a thousands separator or a list separator), which
are simply not valid assumptions you can make at such a low level.
- Visual order does not uniquely convey the logical order.

This is not to say that bidirectional formatting doesn't have its
place, or that, used correctly without multiple embedding levels,
with well-set block quotes, etc., it won't be legible. I also do not
preclude use of advanced ECMA-48 features for explicit bidi at the
terminal level. But I'd like to propose unidirectional formatting with
adjusted glyph orientation as a more logical (and perhaps more easily
readable) alternative to be used in terminal emulators and perhaps
also other contexts where accurate representation of the logical order
is required or where multiple levels of quoting are in use.

The most important thing to realize is that this proposal is not to
reject traditional ways of writing RTL scripts. The proposal is to
reject the (very stupid IMO) idea of mixing LTR and RTL
directionalities in a single paragraph context, except in the case
where higher-level formatting (which is inherently not available in a
plain text file or text printed to stdout) can control it.


Rich





[1] There is a small problem that even without LTR scripts mixed in,
most RTL scripts are "bidirectional" due to numbers being written LTR.
However supporting reversed display of individual numbers (or even
individual words) is a trivial problem compared to full bidi text flow
and can be done without compromising reversibility and without complex
algorithms that cause misinterpretation of adjacent text.
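
For instance, a sketch of that per-number pass (ASCII digits only for
brevity; a real one would walk UTF-8 and also cover Arabic-Indic
digits, and the function names here are invented):

    #include <stdio.h>
    #include <string.h>

    /* Reverse s[i..j] in place. */
    static void revspan(char *s, size_t i, size_t j)
    {
        while (i < j) {
            char t = s[i];
            s[i++] = s[j];
            s[j--] = t;
        }
    }

    /* For a display that draws the whole line right-to-left,
       re-reverse each maximal digit run so numbers still read
       left-to-right. There is no long-range movement, so the
       transform is its own inverse and never misinterprets the
       text around the number. */
    static void reverse_digit_runs(char *s)
    {
        size_t i = 0, n = strlen(s);
        while (i < n) {
            if (s[i] < '0' || s[i] > '9') { i++; continue; }
            size_t j = i;
            while (j + 1 < n && s[j+1] >= '0' && s[j+1] <= '9')
                j++;
            revspan(s, i, j);
            i = j + 1;
        }
    }

    int main(void)
    {
        char line[] = "ABC 1234 DEF";   /* uppercase = RTL letters */
        reverse_digit_runs(line);
        printf("%s\n", line);   /* "ABC 4321 DEF": drawn cell-by-cell
                                   right-to-left, the number reads
                                   1234 again */
        return 0;
    }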
George W Gerrity
2006-09-01 06:32:40 UTC
Post by Rich Felker
I read an old thread on the XFree86 i18n list started by Markus Kuhn
suggesting (rather strongly) that bidi should not be supported at the
terminal level, as well as accusations (from other sources) by the author
of Yudit that the UAX#9 bidi algo results in serious security issues due
to the irreversibility of the transformation and that it inevitably
butchers mathematical formulae.
I've also considered examples on my own, such as a program (not
necessarily terminal-aware, just text output) that prints lines of the
form "%s %d %d %s" without any special treatment (such as putting
explicit embedding marks around the %s fields) for bidi text, or a
terminal-based program that draws interface elements over top of
existing RTL text, resulting in nonsense.
In all cases, my personal opinion has been not just that UAX#9 is
broken, but that there's no way to implement any sort of implicit bidi
in a terminal emulator or in the display of text/plain data without
every single program having to go _far_ out of its way to ensure that
it won't give incorrect output when the input contains RTL characters,
which simply isn't going to happen, especially since it would
interfere with use in non-RTL scenarios. Other people may have
different opinions but I have not seen any viable solutions.
I did try to tell you that doing a terminal emulation properly would
be complex. I don't know if the algorithm is broken: I doubt it. But
it is difficult getting it to work properly and it essentially
requires internal tables for every glyph describing its direction and
orientation.
Post by Rich Felker
At the same time, I'm also very dissatisfied with the lack of proper
support for RTL scripts/languages in most applications and especially
at the terminal level, especially since Arabic is in such widespread
use and has great political importance in world affairs these days. I
do not accept that the solution is just to print characters in the
wrong visual order.
.eerga ll'uoy tcepxe I ylbatrofmoc ecnetnes siht daer nac uoy sselnU
I experimented with the idea of mirroring glyphs to improve
readability, and was fairly surprised by how little it helped my
perception. Reading English text that had been graphically mirrored
remained almost as difficult as reading the above line, with the b/d
and p/q pairs causing significant pause in comprehension.
So then, reading UAX#9 again, I stumbled across the only section
that's not completely stupid (IMO of course):
5.4 Vertical Text
In the case of vertical line orientation, the bidirectional
algorithm is still used to determine the levels of the text.
However, these levels are not used to reorder the text, since the
characters are usually ordered uniformly from top to bottom.
Instead, the levels are used to determine the rotation of the
text. Sometimes vertical lines follow a vertical baseline in which
each character is oriented as normal (with no rotation), with
characters ordered from top to bottom whether they are Hebrew,
numbers, or Latin. When setting text using the Arabic script in
vertical lines, it is more common to employ a horizontal baseline
that is rotated by 90° counterclockwise so that the characters are
ordered from top to bottom. Latin text and numbers may be rotated
90° clockwise so that the characters are also ordered from top to
bottom.
What this provides is a suggested formatting that makes RTL and LTR
scripts both readable in a single-directional context, a vertical one.
Combined with the recent Mongolian script discussion on this list, I
believe this offers an alternate presentation form for documents that
mix LTR and RTL text without using bidi.
I'm not suggesting that everyone should switch to vertically-oriented
terminals or text-file-presentation, although Mongolian users might
like such a setup and it can certainly be one presentation option
that's fair to both RTL and LTR users by making both RTL and LTR
scripts quite readable.
The key idea to take from the Mongolian discussion and from UAX#9 5.4
is that, by having glyphs for LTR and RTL scripts rotated 180°
relative to one another, both can appear legible in a common
directionality. Thus, perhaps LTR users could present legible
RTL-script text by rotating all glyphs 180° and displaying them in LTR
order, and likewise RTL users could use a dominant RTL direction with
LTR glyphs rotated 180° [1]. Like with Mongolian, directionality could
become a localized user preference, rather than a property of the
script.
Does this actually work?
I repeated my experiment with English text reading, rotating the
graphic representation by 180° rather than mirroring it left-right. I
was pleased to find that I could read it with similar ease (but not
quite as fast) as ordinary LTR English text. Surprisingly, p/d and b/q
confusion did not arise, perhaps due to the obvious visual distinction
between the ascent/descent space of the glyphs.
I do not claim these tests are scientific, since the only subject
participating was myself. :) But they are suggestive of an alternative
possible presentation form for mixed LTR/RTL scripts without utilizing
bidirectionality. I consider bidirectionality harmful because:
- It is inherently slow for one's eyes to jump back and forth
switching directions while reading a single paragraph.
- It quickly becomes impossible to read quotations with multiple
levels of directional embedding. Forget UAX#9's 61 levels; 3 levels
are already undecipherable without slow and meticulous work.
- Implicit directionality is impossible to resolve without interfering
with sane people's expectations under string operations. In
particular the UAX#9 insanity involves _semantic_ interpretations of
text contents based on presupposed cultural conventions (like
whether a comma is a thousands separator or a list separator), which
are simply not valid assumptions you can make at such a low level.
- Visual order does not uniquely convey the logical order.
This is not to say that bidirectional formatting doesn't have its
place, or that, used correctly without multiple embedding levels,
with well-set block quotes, etc., it won't be legible. I also do not
preclude use of advanced ECMA-48 features for explicit bidi at the
terminal level. But I'd like to propose unidirectional formatting with
adjusted glyph orientation as a more logical (and perhaps more easily
readable) alternative to be used in terminal emulators and perhaps
also other contexts where accurate representation of the logical order
is required or where multiple levels of quoting are in use.
The most important thing to realize is that this proposal is not to
reject traditional ways of writing RTL scripts. The proposal is to
reject the (very stupid IMO) idea of mixing LTR and RTL
directionalities in a single paragraph context, except in the case
where higher-level formatting (which is inherently not available in a
plain text file or text printed to stdout) can control it.
Rich
[1] There is a small problem that even without LTR scripts mixed in,
most RTL scripts are "bidirectional" due to numbers being written LTR.
However supporting reversed display of individual numbers (or even
individual words) is a trivial problem compared to full bidi text flow
and can be done without compromising reversibility and without complex
algorithms that cause misinterpretation of adjacent text.
No one using Arabic script would accept reading it top to bottom: it
is simply never done (to the best of my knowledge), and so any
terminal emulator claiming to work with any script had better be able
to render the text correctly, including mixing RTL and LTR.

George
------
Rich Felker
2006-09-01 13:41:44 UTC
Post by George W Gerrity
I did try to tell you that doing a terminal emulation properly would
be complex. I don't know if the algorithm is broken: I doubt it. But
it is difficult getting it to work properly and it essentially
requires internal tables for every glyph describing its direction and
orientation.
If that were the problem it would be trivial. The problems are much
more fundamental. The key examples you should look at are things like:
printf("%s %d %d %s\n", string1, number2, number3, string4); where the
output is intended to be columnar. Everything is fine until someone
puts in data where string1 ends in RTL text and string4 begins with
RTL text, in which case the numbers switch places. This kind of
instability is not just awkward; it shows that implicit bidi is
fundamentally broken. Even if it can be handled at the terminal
emulator level with special escapes and whatnot (and I believe it can,
albeit in very ugly ways) it simply cannot be handled in a plain text
file, for reasons like these (uppercase standing in for RTL
characters, as in UAX#9's own examples):

columna COLUMNB 1234 5678 columnc
columna COLUMNB 1234 5678 COLUMNC

Implicit bidi requires interpreting a flow of plain text as
sentence/paragraph content which is simply not a reasonable
assumption. Consider also what would happen if your text file is two
preformatted 32-character-wide paragraph columns side-by-side. Now
imagine the kind of havok that could result if this sort of insanity
took place in the presentation of configuration files with critical
security settings, for instance where the strings are usernames (which
MUST be able to contain any letter character from any language) and
the numbers are permission levels. And certainly you can't just throw
explicit direction markers into a config file like that because they'd
alter the semantics (which should be purely byte-oriented; there's no
reason any program not displaying text should include code to process
the contents).
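
As a sketch of why the markers would alter the semantics (the bytes
are just standard UTF-8 for U+202B; the username is invented):

    #include <stdio.h>
    #include <string.h>

    int main(void)
    {
        /* A username as stored in the config file (uppercase = RTL
           letters), and the "same" name with a RIGHT-TO-LEFT
           EMBEDDING mark (U+202B, UTF-8 E2 80 AB) prepended to fix
           its display. */
        const char *stored = "ABC";
        const char *marked = "\xE2\x80\xAB" "ABC";

        /* A byte-oriented consumer (login, chown, ...) now sees two
           different strings: the "display fix" has changed the data. */
        printf("%d\n", strcmp(stored, marked) == 0);   /* prints 0 */
        return 0;
    }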

One of the unacceptable things that the Unicode consortium has done
(as opposed to ISO 10646 which, after their initial debacle, has been
quite reasonable and conservative in what they specify) is to presume
they can redefine what a text file is. This has included BOMs,
paragraph break character, implicit(?) deprecation of newline
character as a line/paragraph break, etc. Notice that all of these
redefinitions have been universally rejected by *NIX users because
they are incompatible with the *NIX notion of a text file. My view is
that implicit bidi is equally incompatible with text files and should
be rejected for the same reasons.

This does not mean that storing text in 'visual order' is acceptable
either; that's just disgusting and makes correct ligatures/shaping
impossible. It just means that you cannot create a bidirectional
presentation from a text file without higher level markup. Instead you
can use a vertical presentation or either LTR or RTL presentation with
the opposite-directionality glyphs rotated 180°.

My observations were that this sort of presentation is much easier to
edit and quite possibly easier to read than a format where your eyes
have to switch scanning directions.

I'm not unwilling to support implicit bidi if somebody else wants to
code it, but the output WILL BE WRONG in many cases and thus will be
off by default. The data needed to do it correctly is simply not
there.
Post by George W Gerrity
Post by Rich Felker
[...]
[1] There is a small problem that even without LTR scripts mixed in,
most RTL scripts are "bidirectional" due to numbers being written LTR.
However supporting reversed display of individual numbers (or even
individual words) is a trivial problem compared to full bidi text flow
and can be done without compromising reversibility and without complex
algorithms that cause misinterpretation of adjacent text.
No one using Arabic script would accept reading it top to bottom: it
is simply never done (to the best of my knowledge), and so any
terminal emulator claiming to work with any script had better be able
to render the text correctly, including mixing RTL and LTR.
You misread the above. Of course no one using LTR scripts would want
to read top-to-bottom either. The intent is that users of RTL scripts
could use an _entirely_ RTL terminal with the LTR characters' glyphs
rotated 180° while LTR users could use an _entirely_ LTR terminal with
RTL glyphs rotated 180°. The exception noted in the footnote is that
RTL scripts actually require "bidi" for numbers, but I comment that
this is trivial compared to bidi and suffers from none of the
fundamental problems of bidi.

The vertical orientation thing is mostly of interest to Mongolian
users and perhaps some East Asian users, but it could also be
interesting to (a very few) users of both LTR and RTL scripts who use
both frequently and who want a more equal treatment of both,
especially if they find reading upside-down difficult.

Rich


P.S. Do you have any good screenshots with RTL or LTR embedded text?
If so I can prepare some modified images to show what I mean and you
can see what you think of readability.
Mark Leisher
2006-09-01 15:36:44 UTC
Post by Rich Felker
If that were the problem it would be trivial. The problems are much
more fundamental. The key examples you should look at are things like:
printf("%s %d %d %s\n", string1, number2, number3, string4); where the
output is intended to be columnar. Everything is fine until someone
puts in data where string1 ends in RTL text and string4 begins with
RTL text, in which case the numbers switch places. This kind of
instability is not just awkward; it shows that implicit bidi is
fundamentally broken.
I can say with certainty born of 10+ years of trying to implement an
implicit bidi reordering routine that "just does the right thing," there
are ambiguities that simply can't be avoided. Like your example.

Are one or both numbers associated with the RTL text or the LTR text?
Simple question, multiple answers. Some answers are simple, some are not.

The Unicode bidi reordering algorithm is not fundamentally broken, it
simply provides a result that is correct in many, but not all cases. If
you can defy 30 years of experience in implicit bidi reordering
implementations and come up with one that does the correct thing all the
time, you could be a very rich man.
Post by Rich Felker
Implicit bidi requires interpreting a flow of plain text as
sentence/paragraph content which is simply not a reasonable
assumption. Consider also what would happen if your text file is two
preformatted 32-character-wide paragraph columns side-by-side. Now
imagine the kind of havoc that could result if this sort of insanity
took place in the presentation of configuration files with critical
security settings, for instance where the strings are usernames (which
MUST be able to contain any letter character from any language) and
the numbers are permission levels. And certainly you can't just throw
explicit direction markers into a config file like that because they'd
alter the semantics (which should be purely byte-oriented; there's no
reason any program not displaying text should include code to process
the contents).
So you have a choice: adapt your config file reader to ignore a few
characters, or come up with an algorithm that displays plain text
correctly all the time.
Post by Rich Felker
One of the unacceptable things that the Unicode consortium has done
(as opposed to ISO 10646 which, after their initial debacle, has been
quite reasonable and conservative in what they specify) is to presume
they can redefine what a text file is. This has included BOMs,
paragraph break character, implicit(?) deprecation of newline
character as a line/paragraph break, etc. Notice that all of these
redefinitions have been universally rejected by *NIX users because
they are incompatible with the *NIX notion of a text file. My view is
that implicit bidi is equally incompatible with text files and should
be rejected for the same reasons.
You left out the part where Unicode says that none of these things is
strictly required. The *NIX community didn't reject anything. They
didn't need to. You also seem unaware of how much effort was made by
ISO, the Unicode Consortium, and all the national standards bodies to
avoid breaking a lot of existing practice.

I highly recommend participating in any standards development process
managed by any national or international standards body. You will find
an obsession with avoidance of breaking existing practice.
Post by Rich Felker
I'm not unwilling to support implicit bidi if somebody else wants to
code it, but the output WILL BE WRONG in many cases and thus will be
off by default. The data needed to do it correctly is simply not
there.
Why is it someone else's responsibility to code it? You are the one that
finds decades of experience unacceptable. Stop whining and fix it.
That's what I did. I'm still working on it 13 years later, but I'm not
complaining any more.

Human languages and the scripts used to represent them are messy. There
are no neat solutions. Get used to it.

Good day and good luck.
--
------------------------------------------------------------------------
Mark Leisher
Computing Research Lab          We find comfort among those who
New Mexico State University     agree with us, growth among those
Box 30001, MSC 3CRL             who don't.
Las Cruces, NM 88003                -- Frank A. Clark
Rich Felker
2006-09-01 18:08:03 UTC
Post by Mark Leisher
Post by Rich Felker
If that were the problem it would be trivial. The problems are much
more fundamental. The key examples you should look at are things like:
printf("%s %d %d %s\n", string1, number2, number3, string4); where the
output is intended to be columnar. Everything is fine until someone
puts in data where string1 ends in RTL text and string4 begins with
RTL text, in which case the numbers switch places. This kind of
instability is not just awkward; it shows that implicit bidi is
fundamentally broken.
I can say with certainty born of 10+ years of trying to implement an
implicit bidi reordering routine that "just does the right thing," there
are ambiguities that simply can't be avoided. Like your example.
Are one or both numbers associated with the RTL text or the LTR text?
Simple question, multiple answers. Some answers are simple, some are not.
Exactly. The Unicode bidi algorithm assumes that anyone putting bidi
characters in a text stream will give them special consideration and
manually resolve these issues with explicit embedding. That is, it
comes from the word processor mentality of the designers of Unicode.
They never stop to think that maybe an automated process that doesn't
know about character semantics could be writing strings, or that
syntax in a particular text file (like passwd, csv files, tsv files,
etc.) could preclude such treatment.
Post by Mark Leisher
The Unicode bidi reordering algorithm is not fundamentally broken, it
simply provides a result that is correct in many, but not all cases. If
you can defy 30 years of experience in implicit bidi reordering
implementations and come up with one that does the correct thing all the
time, you could be a very rich man.
Why is implicit so important? A bidi algorithm with minimal/no
implicit behavior works fine as long as you are not mixing
languages/scripts, and when mixing scripts it makes sense to use
explicit embedding -- especially since the cases of mixed scripts that
MUST work without formatting controls are files that are meant to be
machine-interpreted as opposed to pretty-printed for human
consumption.

In particular, an algorithm that only applies reordering within single
'words' would give the desired effects for writing numbers in an RTL
context and for writing single LTR words in a RTL context or single
RTL words in a LTR context. Anything more than that (with unlimited
long range reordering behavior) would then require explicit embedding.
Post by Mark Leisher
So you have a choice: adapt your config file reader to ignore a few
characters, or come up with an algorithm that displays plain text
correctly all the time.
What should happen when editing source code? Should x = FOO(BAR);
have the argument on the left while x = FOO(bar); has it on the right?
Should source code require all RTL identifiers to be wrapped in
embedding codes? (They're illegal in ISO C and any language taking
identifier rules from ISO/IEC TR 10176, yet Hebrew and Arabic
characters are legal like all other characters used to write non-dead
languages.)
Post by Mark Leisher
Post by Rich Felker
One of the unacceptable things that the Unicode consortium has done
(as opposed to ISO 10646 which, after their initial debacle, has been
quite reasonable and conservative in what they specify) is to presume
they can redefine what a text file is. This has included BOMs,
paragraph break character, implicit(?) deprecation of newline
character as a line/paragraph break, etc. Notice that all of these
redefinitions have been universally rejected by *NIX users because
they are incompatible with the *NIX notion of a text file. My view is
that implicit bidi is equally incompatible with text files and should
be rejected for the same reasons.
You left out the part where Unicode says that none of these things is
strictly required.
This is blatantly false. UAX#9 talks about PARAGRAPHS. Text files
consist of LINES. If you don't believe me read ISO/IEC 9899 or IEEE
1003.1-2001. The algorithm in UAX#9 simply cannot be applied to text
files as it is written because text files do not define paragraphs.
Post by Mark Leisher
The *NIX community didn't reject anything. They
didn't need to. You also seem unaware of how much effort was made by
ISO, the Unicode Consortium, and all the national standards bodies to
avoid breaking a lot of existing practice.
I'm aware that unlike many other standardization processes, the
Unicode Consortium was very inconsistent in its application of this
rule. Many people consider Han unification to break existing practice.
UCS-2 which they initially tried to push onto people, as well as
UTF-1, heavily broke existing practice as well. The semantics of SHY
break existing standards from ISO-8859. Replacing UCS-2 with UTF-16
broke existing practice on Windows by causing MS's implementation of
wchar_t to violate C/POSIX by not representing a complete character
anymore. Etc. etc. etc.

On the other hand they had no problem with filling the beginning of
the BMP with useless legacy characters for the sake of compatibility,
thus forcing South[east] Asian scripts which use many characters in
each word into the 3-byte range of UTF-8...
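
The boundary arithmetic, for reference (this is just standard UTF-8,
nothing assumed):

    #include <stdio.h>

    /* Encoded length in UTF-8 of a Unicode scalar value. */
    static int utf8_len(unsigned c)
    {
        return c < 0x80 ? 1 : c < 0x800 ? 2 : c < 0x10000 ? 3 : 4;
    }

    int main(void)
    {
        /* The 2-byte range ends at U+07FF. Hebrew alef (U+05D0)
           fits; Devanagari starts at U+0900, already past it, so
           every letter of every Devanagari word costs 3 bytes. */
        printf("%d %d\n", utf8_len(0x05D0), utf8_len(0x0900)); /* 2 3 */
        return 0;
    }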

Unicode is far from ideal, but it's what we're stuck with, I agree.
However UAX#9 is inconsistent with the definition of a text file and
with good programming practice and thus alternate ways to present RTL
text acceptably (such as an entirely RTL display for RTL users) are
needed. I've read rants from some of the Arabeyes folks that they're
so disappointed with UAX#9 that they'd rather go the awful route of
storing text backwards!!
Post by Mark Leisher
Post by Rich Felker
I'm not unwilling to support implicit bidi if somebody else wants to
code it, but the output WILL BE WRONG in many cases and thus will be
off by default. The data needed to do it correctly is simply not
there.
Why is it someone else's responsibility to code it? You are the one that
finds decades of experience unacceptable. Stop whining and fix it.
That's what I did. I'm still working on it 13 years later, but I'm not
complaining any more.
You did not fix it because it cannot be fixed any more than you can
tell me whether 1,200 means the number 1200 (printed in an ugly legacy
form) or a CSV list of 1 and 200. Nor can I fix it. I'm well aware
that any implicit bidi at the terminal level WILL display blatantly
wrong and misleading information in numerous real world cases, and
that text will jump around the terminal in an unpredictable and
illogical fashion under cursor control and deletion, replacement, or
insertion over existing text. As such, I deem such a feature a waste
of time to implement. It will mess up more stuff than it 'fixes'.

The alternatives are either to display characters in the wrong order
(siht ekil) or to unify the flow of text to one direction without
altering the visual representation (which rotation accomplishes).
Naturally fullscreen programs can draw bidi text in its natural
directionality either by swapping character order (but some special
treatment of combining marks and Arabic shaping must be done by the
application first in order for this to work, and it will render
copy-and-paste from the terminal mostly useless) or by using ECMA-48
bidi controls (but don't expect anyone to use these until curses is
adapted for it and until screen supports it).
Post by Mark Leisher
Human languages and the scripts used to represent them are messy.
There are no neat solutions. Get used to it.
Languages are messy. Scripts are not, except for a very few bidi
scripts. Even Indic scripts are relatively easy. UAX#9 requires
imposing language semantics onto characters which is blatantly wrong
and which is the source of the mess. This insistence on making simple
things into messes is why no real developers want to support i18n and
why only the crap like GNOME and KDE and OOO support it decently. I
believe that both 'sides' are wrong and that universal i18n on unix is
possible but only after you accept that unix lives by the New Jersey
approach and not the MIT approach.

Rich
Mark Leisher
2006-09-01 21:46:44 UTC
Post by Rich Felker
Post by Mark Leisher
I can say with certainty born of 10+ years of trying to implement an
implicit bidi reordering routine that "just does the right thing," there
are ambiguities that simply can't be avoided. Like your example.
Are one or both numbers associated with the RTL text or the LTR text?
Simple question, multiple answers. Some answers are simple, some are not.
Exactly. The Unicode bidi algorithm assumes that anyone putting bidi
characters in a text stream will give them special consideration and
manually resolve these issues with explicit embedding. That is, it
comes from the word processor mentality of the designers of Unicode.
They never stop to think that maybe an automated process that doesn't
know about character semantics could be writing strings, or that
syntax in a particular text file (like passwd, csv files, tsv files,
etc.) could preclude such treatment.
Did it ever occur to you that it wasn't the "word processing mentality"
of the Unicode designers that led to ambiguities surviving in plain
text? It is simply the fact that there is no nice neat solution. Unicode
went farther than just about anyone else in solving the general case of
reordering plain bidi text for display without explicit directional codes.
Post by Rich Felker
Post by Mark Leisher
The Unicode bidi reordering algorithm is not fundamentally broken, it
simply provides a result that is correct in many, but not all cases. If
you can defy 30 years of experience in implicit bidi reordering
implementations and come up with one that does the correct thing all the
time, you could be a very rich man.
Why is implicit so important?
Why does plain text still exist?
Post by Rich Felker
A bidi algorithm with minimal/no
implicit behavior works fine as long as you are not mixing
languages/scripts, and when mixing scripts it makes sense to use
explicit embedding -- especially since the cases of mixed scripts that
MUST work without formatting controls are files that are meant to be
machine-interpreted as opposed to pretty-printed for human
consumption.
I'm not quite sure what point you are trying to make here. Do away with
plain text?
Post by Rich Felker
In particular, an algorithm that only applies reordering within single
'words' would give the desired effects for writing numbers in an RTL
context and for writing single LTR words in a RTL context or single
RTL words in a LTR context. Anything more than that (with unlimited
long range reordering behavior) would then require explicit embedding.
You are aware that numeric expressions can be written differently in
Hebrew and Arabic, yes? Sometimes the reordering of numeric expressions
differs (e.g. 1/2 in Latin and Hebrew would be presented as 2/1 in
Arabic). This also affects other characters often used with numbers such
as percent and dollar sign. So even within strictly RTL scripts,
different reordering is required depending on which script is being
used. But if you know a priori which script is in use, reordering is
trivial.
Post by Rich Felker
Post by Mark Leisher
So you have a choice: adapt your config file reader to ignore a few
characters, or come up with an algorithm that displays plain text
correctly all the time.
What should happen when editing source code? Should x = FOO(BAR);
have the argument on the left while x = FOO(bar); has it on the right?
Should source code require all RTL identifiers to be wrapped in
embedding codes? (They're illegal in ISO C and any language taking
identifier rules from ISO/IEC TR 10176, yet Hebrew and Arabic
characters are legal like all other characters used to write non-dead
languages.)
This is the choice of each programming language designer: either allow
directional override codes in the source or ban them. Those that ban
them obviously assume that knowledge of the language's syntax is
sufficient to allow an editor to present the source code text reasonably
well.
Post by Rich Felker
Post by Mark Leisher
You left out the part where Unicode says that none of these things is
strictly required.
This is blatantly false. UAX#9 talks about PARAGRAPHS. Text files
consist of LINES. If you don't believe me read ISO/IEC 9899 or IEEE
1003.1-2001. The algorithm in UAX#9 simply cannot be applied to text
files as it is written because text files do not define paragraphs.
How is a line ending with newline in a text file not a paragraph? A
poorly formatted paragraph, to be sure, but a paragraph nonetheless. The
Unicode Standard says paragraph separators are required for the
reordering algorithm. There is no reason why a line can't be viewed as a
paragraph. And it even works reasonably well most of the time.

BTW, what part of ISO/IEC 9899 are you referring to? All I see is
§7.19.2.7 which says something about lines being limited to 254
characters and a terminating newline character. No definitions of lines
or paragraphs that I see off hand.
Post by Rich Felker
I'm aware that unlike many other standardization processes, the
Unicode Consortium was very inconsistent in its application of this
rule. Many people consider Han unification to break existing practice.
UCS-2 which they initially tried to push onto people, as well as
UTF-1, heavily broke existing practice as well. The semantics of SHY
break existing standards from ISO-8859. Replacing UCS-2 with UTF-16
broke existing practice on Windows by causing MS's implementation of
wchar_t to violate C/POSIX by not representing a complete character
anymore. Etc. etc. etc.
Han unification did indeed break existing practice, but I think you will
find that the IRG (group of representatives from all Han-using
countries) feels that in the long run, it was the best thing to do.

UCS-2 didn't so much break existing practice as come along at one of the
most confusing periods of internationalization retrofitting of the C
libraries and language. The wchar_t type was in the works before UCS-2
came along. And in most implementations it could hold a UCS-2 character.
I don't recall UTF-1 being around long enough to have much of an impact.
Consider how quickly it was discarded in favor of UTF-8. And I certainly
don't recall UTF-1 being forced on anyone.
Post by Rich Felker
On the other hand they had no problem with filling the beginning of
the BMP with useless legacy characters for the sake of compatibility,
thus forcing South[east] Asian scripts which use many characters in
each word into the 3-byte range of UTF-8...
Those "useless" legacy characters avoided breaking many existing
applications, most of which were not written for Southeast Asia. Some
scripts had to end up in the 3-byte range of UTF-8. Are you in a
position to determine who should and should not be in that range? Have
you even considered why they ended up in that range?
Post by Rich Felker
Unicode is far from ideal, but it's what we're stuck with, I agree.
However UAX#9 is inconsistent with the definition of a text file and
with good programming practice and thus alternate ways to present RTL
text acceptably (such as an entirely RTL display for RTL users) are
needed. I've read rants from some of the Arabeyes folks that they're
so disappointed with UAX#9 that they'd rather go the awful route of
storing text backwards!!
So are you implying that good programming practice requires lines to
be ended with a newline and paragraphs to be separated by two newlines?
What about the 25-year convention of CRLF on DOS/Win? What about the
20-year practice of using CR on Mac? Should we denounce them as heretics to
be excommunicated and unilaterally dictate to all that newline is the
only answer, just like you seem to think the Unicode Consortium did?

Like others who didn't like the Unicode bidi reordering approach, the
Arabeyes people were welcome to continue doing things the way they
wanted. Interoperability problems often either kill these companies or
force them to go Unicode at some level.
Post by Rich Felker
Post by Mark Leisher
Why is it someone else's responsibility to code it? You are the one that
finds decades of experience unacceptable. Stop whining and fix it.
That's what I did. I'm still working on it 13 years later, but I'm not
complaining any more.
You did not fix it because it cannot be fixed any more than you can
tell me whether 1,200 means the number 1200 (printed in an ugly legacy
form) or a CSV list of 1 and 200. Nor can I fix it. I'm well aware
that any implicit bidi at the terminal level WILL display blatantly
wrong and misleading information in numerous real world cases, and
that text will jump around the terminal in an unpredictable and
illogical fashion under cursor control and deletion, replacement, or
insertion over existing text. As such, I deem such a feature a waste
of time to implement. It will mess up more stuff than it 'fixes'.
So you do understand. If it isn't fixable, what point is there in
complaining about it? Find a better way.
Post by Rich Felker
The alternatives are either to display characters in the wrong order
(siht ekil) or to unify the flow of text to one direction without
altering the visual representation (which rotation accomplishes).
Naturally fullscreen programs can draw bidi text in its natural
directionality either by swapping character order (but some special
treatment of combining marks and Arabic shaping must be done by the
application first in order for this to work, and it will render
copy-and-paste from the terminal mostly useless) or by using ECMA-48
bidi controls (but don't expect anyone to use these until curses is
adapted for it and until screen supports it).
Hmm. Sounds just like a bidi reordering algorithm I heard about. You
know. The one the Unicode Consortium is touting.

I have a lot of experience with ECMA-48 (ISO/IEC 6429) and ISO/IEC 2022.
All I will say about them is Unicode is a lot easier to deal with. Have
a look at the old kterm code if you want to see how complicated things
can get. And that was one of the cleaner implementations I've seen over
the years.

Again, I would encourage you to try it yourself. Talk is cheap,
experience teaches.
Post by Rich Felker
Languages are messy. Scripts are not, except for a very few bidi
scripts. Even Indic scripts are relatively easy.
Hah! I often hear the same sentiment from people who don't know the
difference between a glyph and a character. Yes, it is true that Indic,
and even Khmer and Burmese scripts are relatively easy. All you need to
do is create the right set of glyphs.

This frequently gives you multiple glyph codes for each abstract
character. To do anything with the text, a mapping between glyph and
abstract character is necessary for every program that uses that text.

Again, I would encourage you to try it yourself. Talk is cheap,
experience teaches.
Post by Rich Felker
UAX#9 requires
imposing language semantics onto characters which is blatantly wrong
and which is the source of the mess.
If 30 years of experience has led to blatantly wrong semantics, then
quit whining about it and fix it! The Unicode Consortium isn't deaf,
dumb, or stupid. They have been known to honor actual evidence of
incorrect behavior and change things when necessary. But they aren't
going to change things just because you find it inconveniently complicated.
Post by Rich Felker
This insistence on making simple
things into messes is why no real developers want to support i18n and
why only the crap like GNOME and KDE and OOO support it decently. I
believe that both 'sides' are wrong and that universal i18n on unix is
possible but only after you accept that unix lives by the New Jersey
approach and not the MIT approach.
I have been complaining about the general trend to over-complicate and
over-standardize software for years. These days the "art" of programming
only exists in the output of a rare handful of programmers. Don't worry
about it. Software will collapse under its own weight in time. You just
have to be patient and wait until that happens and be ready with all
your simpler solutions.

<sarcasm>
But you better hurry up with those simpler solutions, the increasing
creep of unnecessary complexity into software is happening fast. The
crash is coming! It will probably arrive with /The Singularity/.
</sarcasm>
--
------------------------------------------------------------------------
Mark Leisher
Computing Research Lab          We find comfort among those who
New Mexico State University     agree with us, growth among those
Box 30001, MSC 3CRL             who don't.
Las Cruces, NM 88003                -- Frank A. Clark
Rich Felker
2006-09-02 00:01:58 UTC
Post by Mark Leisher
Did it ever occur to you that it wasn't the "word processing mentality"
of the Unicode designers that led to ambiguities surviving in plain
text? It is simply the fact that there is no nice neat solution. Unicode
went farther than just about anyone else in solving the general case of
reordering plain bidi text for display without explicit directional codes.
It went farther because it imposed language-specific semantics in
places where they do not belong. These semantics are correct with
sentences written in human languages which would not have been hard to
explicitly mark up, especially with a word processor doing it for you.
On the other hand they're horribly wrong in computer languages
(meaning any text file meant to be computer-read and -interpreted, not
just programming languages) where explicit markup is highly
undesirable or even illegal.
Post by Mark Leisher
Why does plain text still exist?
Read Eric Raymond's "The Art of Unix Programming". He answers the
question quite well.

Or I could just ask: should we write C code in MS Word .doc format?
Post by Mark Leisher
Post by Rich Felker
A bidi algorithm with minimal/no
implicit behavior works fine as long as you are not mixing
languages/scripts, and when mixing scripts it makes sense to use
explicit embedding -- especially since the cases of mixed scripts that
MUST work without formatting controls are files that are meant to be
machine-interpreted as opposed to pretty-printed for human
consumption.
I'm not quite sure what point you are trying to make here. Do away with
plain text?
No, rather that handling of bidi scripts in plain text should be
biased towards computer languages rather than human languages. This is
both because plain text files are declining in use for human language
texts and increasing in use for computer language texts, and because
the display issues in human language texts can be solved with explicit
embedding markers (which an editor or word processor could even
auto-insert for you) while the same marks are unwelcome in computer
languages.
Post by Mark Leisher
Post by Rich Felker
In particular, an algorithm that only applies reordering within single
'words' would give the desired effects for writing numbers in an RTL
context and for writing single LTR words in a RTL context or single
RTL words in a LTR context. Anything more than that (with unlimited
long range reordering behavior) would then require explicit embedding.
You are aware that numeric expressions can be written differently in
Hebrew and Arabic, yes? Sometimes the reordering of numeric expressions
differs (e.g. 1/2 in Latin and Hebrew would be presented as 2/1 in
Arabic). This also affects other characters often used with numbers such
as percent and dollar sign. So even within strictly RTL scripts,
different reordering is required depending on which script is being
used. But if you know a priori which script is in use, reordering is
trivial.
This is part of the "considered harmful" of bidi. :)
I'm not familiar with all this stuff, but as a mathematician I'm
curious how mathematicians working in these languages write. BTW
mathematical notation is an interesting example where traditional
storage order is visual and not logical.
Post by Mark Leisher
This is the choice of each programming language designer: either allow
directional override codes in the source or ban them. Those that ban
them obviously assume that knowledge of the language's syntax is
sufficient to allow an editor to present the source code text reasonably
well.
It's simply not acceptable to need an editor that's aware of language
syntax in order to present the code for viewing and editing. You could
work around the problem by inserting dummy comments to prevent the
bidi algo from taking effect but that's really ugly and essentially
makes RTL scripts unusable in programming if the editor applies
Unicode bidi algo to the display.
Post by Mark Leisher
Post by Rich Felker
Post by Mark Leisher
You left out the part where Unicode says that none of these things is
strictly required.
This is blatantly false. UAX#9 talks about PARAGRAPHS. Text files
consist of LINES. If you don't believe me read ISO/IEC 9899 or IEEE
1003.1-2001. The algorithm in UAX#9 simply cannot be applied to text
files as it is written because text files do not define paragraphs.
How is a line ending with newline in a text file not a paragraph? A
poorly formatted paragraph, to be sure, but a paragraph nonetheless. The
Unicode Standard says paragraph separators are required for the
reordering algorithm. There is no reason why a line can't be viewed as a
paragraph. And it even works reasonably well most of the time.
Actually it does not work with embedding. If a (semantic) paragraph
has been split into multiple lines, the bidi embedding levels will be
broken and cannot be processed by the UAX#9 algorithm without trying
to reconstruct an idea of what the whole "paragraph" meant. Also, a
problem occurs if the first character of a new line happens to be an
LTR character (some embedded English text?) in a (semantic) paragraph
that's Arabic or Hebrew.

As you acknowledge below, a line is not necessarily an
unlimited-length object and in email it should not be longer than 80
characters (or preferably 72 or so to allow for quoting). So you can't
necessarily just take the MS Notepad approach of omitting newlines and
treating lines as paragraphs, although this may be appropriate in some
uses of text files.
Post by Mark Leisher
BTW, what part of ISO/IEC 9899 are you referring to? All I see is
§7.19.2.7 which says something about lines being limited to 254
characters and a terminating newline character. No definitions of lines
or paragraphs that I see off hand.
I'm talking about the definition of a text file as a sequence of
lines, which might (on stupid legacy implementations) even be
fixed-width fields. It's under the stdio stuff about the difference
between text and binary mode. I could look it up but I don't feel like
digging thru the pdf file right now..
Post by Mark Leisher
Han unification did indeed break existing practice, but I think you will
find that the IRG (group of representatives from all Han-using
countries) feels that in the long run, it was the best thing to do.
I agree it was best to do too. I just pointed it out as being contrary
to your claim that they made every effort not to break existing
practice.
Post by Mark Leisher
UCS-2 didn't so much break existing practice as come along at one of the
most confusing periods of internationalization retrofitting of the C
libraries and language.
I suppose this is true and I don't know the history and internal
politics well enough to know who was responsible for what. However,
unlike doublebyte/multibyte charsets which were becoming prevalent at
the time, UCS-2 data does not form valid C strings. A quick glance at
some historical remarks from unicode.org and Rob Pike suggests that
UTF-8 was invented well before any serious deployment of Unicode, i.e.
that the push for UCS-2 was deliberately aimed at breaking things,
though I suspect it was Apple, Sun, and Microsoft pushing UCS-2 more
than the consortium as a whole.
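
To spell out the "not valid C strings" point (a minimal sketch; the
bytes are just UCS-2BE):

    #include <stdio.h>
    #include <string.h>

    int main(void)
    {
        /* "AB" in UCS-2 big-endian is 00 41 00 42: every ASCII
           character drags a 0x00 byte along with it. */
        const char ucs2[] = { 0x00, 0x41, 0x00, 0x42, 0x00, 0x00 };

        /* Every str* function stops at the first 0x00 byte, i.e.
           immediately, so the text is indistinguishable from an
           empty string. */
        printf("%zu\n", strlen(ucs2));   /* prints 0 */
        return 0;
    }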
Post by Mark Leisher
Those "useless" legacy characters avoided breaking many existing
applications, most of which were not written for Southeast Asia. Some
scripts had to end up in the 3-byte range of UTF-8. Are you in a
position to determine who should and should not be in that range? Have
IMO the answer is common sense. Languages that have a low information
per character density (lots of letters/marks per word, especially
Indic) should be in 2-byte range and those with high information
density (especially ideographic) should be in 3-byte range. If it
weren't for so many legacy Latin blocks near the beginning of the
character set, most or all scripts for low-density languages could
have fit in the 2-byte range.

Of course it's pointless to discuss this since we can't change it now
anyway.
Post by Mark Leisher
you even considered why they ended up in that range?
Probably at the time of allocation, UTF-8 was not even around yet. I
haven't studied that portion of Unicode history. Still the legacy
characters could have been put at the end with CJK compat forms and
preshaped Arabic forms, etc. or even outside the BMP.
Post by Mark Leisher
Like others who didn't like the Unicode bidi reordering approach, the
Arabeyes people were welcome to continue doing things the way they
wanted. Interoperability problems often either kill these companies or
force them to go Unicode at some level.
Thankfully there's not too much room for interoperability problems
with the data itself as long as you stick to logical order, especially
since the need for more than a single embedding level is rare. Unless
you're arguing for visual order, the question is entirely a display
matter, whether bidi display is compatible with other requirements.
Post by Mark Leisher
So you do understand. If it isn't fixable, what point is there in
complaining about it? Find a better way.
That's what I'm trying to do... Maybe some Hebrew or Arabic users who
dislike the whole bidi mess (the one Israeli user I'm in contact with
hates bidi and thinks it's backwards...not a good sample size but
interesting nonetheless) will agree and try my ideas for a
unidirectional presentation and like them. Or maybe they'll think it's
ugly and look for other solutions.
Post by Mark Leisher
Post by Rich Felker
Naturally fullscreen programs can draw bidi text in its natural
directionality either by swapping character order (but some special
treatment of combining marks and Arabic shaping must be done by the
application first in order for this to work, and it will render
copy-and-paste from the terminal mostly useless) or by using ECMA-48
bidi controls (but don't expect anyone to use these until curses is
adapted for it and until screen supports it).
Hmm. Sounds just like a bidi reordering algorithm I heard about. You
know. The one the Unicode Consortium is touting.
Applications can draw their own bidi text with higher level formatting
information, of course. I'm thinking of a terminal-mode browser that
has the bidi text in HTML with <dir> tags and whatnot, or apps with a
text 'gui' consisting of separated interface elements.
Post by Mark Leisher
I have a lot of experience
Could you tell me some of what you've worked on and what conclusions
you reached? I'm not familiar with your work.
Post by Mark Leisher
with ECMA-48 (ISO/IEC 6429) and ISO/IEC 2022.
ISO 2022 is an abomination, certainly not an acceptable way to store
text due to its stateful nature, and although it works for
_displaying_ text, it's ugly even for that.

I've read ECMA-48 bidi stuff several times and still can't make any
sense of it, so I agree it's disgusting too. It does seem powerful but
powerful is often a bad thing. :)
Post by Mark Leisher
All I will say about them is Unicode is a lot easier to deal with. Have
Easier to deal with because it solves an easier problem. UAX#9 tells
you what to do when you have explicit paragraph division and unbounded
search capability forwards and backwards. Neither of these exists in a
character cell device environment, and (depending on your view of what
constitutes a proper text file) possibly not in a text file either. My
view of a text file (maybe not very popular these days?) is that it's
a more-restricted version of a character cell terminal (no cursor
positioning allowed) but with unlimited height.
Post by Mark Leisher
a look at the old kterm code if you want to see how complicated things
can get. And that was one of the cleaner implementations I've seen over
the years.
Does it implement ECMA-48 version of bidi? Or random unspecified bidi
like mlterm? Or..?
Post by Mark Leisher
Post by Rich Felker
Languages are messy. Scripts are not, except for a very few bidi
scripts. Even Indic scripts are relatively easy.
Hah! I often hear the same sentiment from people who don't know the
difference between a glyph and a character.
I think we've established that I know the difference..
Post by Mark Leisher
Yes, it is true that Indic,
and even Khmer and Burmese scripts are relatively easy. All you need to
do is create the right set of glyphs.
Exactly. That's a lot of work...for the font designer. Almost no work
for the application author or for the machine at runtime.
Post by Mark Leisher
This frequently gives you multiple glyph codes for each abstract
character. To do anything with the text, a mapping between glyph and
abstract character is necessary for every program that uses that text.
No, it's necessary only for the terminal. The programs using the text
need not have any idea what language/script it comes from. This is the
whole beauty of using such apps.

The same applies to gui apps too if they're using a nice widget kit.
Unfortunately all the existing widget kits are horribly bloated and
very painful to work with for someone not coming from a MS Windows
mentality (i.e. if you want to actually have control over the flow of
execution of your program..).
Post by Mark Leisher
Again, I would encourage you to try it yourself. Talk is cheap,
experience teaches.
That's what I'm working on, but sometimes discussing the issues at the
same time helps.
Post by Mark Leisher
Post by Rich Felker
UAX#9 requires
imposing language semantics onto characters, which is blatantly wrong
and which is the source of the mess.
If 30 years of experience has led to blatantly wrong semantics, then
quit whining about it and fix it! The Unicode Consortium isn't deaf,
dumb, or stupid. They have been known to honor actual evidence of
incorrect behavior and change things when necessary. But they aren't
going to change things just because you find it inconveniently complicated.
They generally don't change things in incompatible ways, certainly not
in ways that would require retrofitting existing data with proper
embedding codes. What they might consider doing though is adding a
support level 1.5 or such. Right now UAX#9 (implicitly?) says that an
application not implementing at least the implicit bidi algorithm must not
interpret RTL characters visually at all.
Post by Mark Leisher
Post by Rich Felker
This insistence on making simple
things into messes is why no real developers want to support i18n and
why only the crap like GNOME and KDE and OOO support it decently. I
believe that both 'sides' are wrong and that universal i18n on unix is
possible but only after you accept that unix lives by the New Jersey
approach and not the MIT approach.
I have been complaining about the general trend to over-complicate and
over-standardize software for years. These days the "art" of programming
only exists in the output of a rare handful of programmers. Don't worry
about it. Software will collapse under its own weight in time. You just
have to be patient and wait until that happens and be ready with all
your simpler solutions.
Well in many cases my "simple solutions" are too simple for people
who've gotten used to bloated featuresets and gotten used to putting
up with slowness, bugs, and insecurity. But we'll see. My whole family
of i18n-related projects started out with a desire to switch to UTF-8
everywhere and to have Latin, Tibetan, and Japanese support at the
console level without increased bloat, performance penalties, and huge
dependency trees. From there I first wrote a super-small UTF-8-only C
library and then turned towards the terminal emulator issue, which in
turn led to the font format issue, etc. etc. :) Maybe after a whole
year passes I'll have roughly what I wanted.
Post by Mark Leisher
<sarcasm>
But you better hurry up with those simpler solutions, the increasing
creep of unnecessary complexity into software is happening fast. The
crash is coming! It will probably arrive with /The Singularity/.
</sarcasm>
Keep an eye on busybox. It's quickly gaining in features while
shrinking in size, and while currently the i18n support is rather poor
the developers are open to adding good support as long as it's an
option at compiletime. Along with my project I've been documenting the
quality, portability, i18n/m17n support, bloat, etc. of lots of other
software too and I'll eventually be making the results available
publicly.

Rich
Post by Mark Leisher
------------------------------------------------------------------------
Mark Leisher
Computing Research Lab We find comfort among those who
New Mexico State University agree with us, growth among those
Box 30001, MSC 3CRL who don't.
Las Cruces, NM 88003 -- Frank A. Clark
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Somehow seems appropriate
to the topic at hand.
Mark Leisher
2006-09-05 02:19:02 UTC
Permalink
Post by Rich Felker
It went farther because it imposed language-specific semantics in
places where they do not belong. These semantics are correct with
sentences written in human languages which would not have been hard to
explicitly mark up, especially with a word processor doing it for you.
On the other hand they're horribly wrong in computer languages
(meaning any text file meant to be computer-read and -interpreted, not
just programming languages) where explicit markup is highly
undesirable or even illegal.
The Unicode Consortium is quite correctly more concerned with human
languages than programming languages. I think you are arguing yourself
into a dead end. Programming languages are ephemeral and some might
argue they are in fact slowly converging with human languages.
Post by Rich Felker
Post by Mark Leisher
Why does plain text still exist?
Read Eric Raymond's "The Art of Unix Programming". He answers the
question quite well.
You missed the point completely. Support of implicit bidirectionality
exists precisely because plain text exists. And it isn't going away any
time soon.
Post by Rich Felker
Or I could just ask: should we write C code in MS Word .doc format?
No reason to. Programming editors work well as they are and will
continue to work well after being adapted for Unicode.
Post by Rich Felker
Post by Mark Leisher
I'm not quite sure what point you are trying to make here. Do away with
plain text?
No, rather that handling of bidi scripts in plain text should be
biased towards computer languages rather than human languages. This is
both because plain text files are declining in use for human language
texts and increasing in use for computer language texts, and because
the display issues in human language texts can be solved with explicit
embedding markers (which an editor or word processor could even
auto-insert for you) while the same marks are unwelcome in computer
languages.
You don't appear to have any experience writing lexical scanners for
programming languages. If you did, you would know how utterly trivial it
is to ignore embedded bidi codes an editor might introduce.

Though I haven't checked myself, I wouldn't be surprised if Perl,
Python, PHP, and a host of other programming languages weren't already
doing this, making your concerns pointless. You would probably find it
instructive to look at some lexical scanners.
Post by Rich Felker
Post by Mark Leisher
Post by Rich Felker
In particular, an algorithm that only applies reordering within single
'words' would give the desired effects for writing numbers in an RTL
context and for writing single LTR words in a RTL context or single
RTL words in a LTR context. Anything more than that (with unlimited
long range reordering behavior) would then require explicit embedding.
You are aware that numeric expressions can be written differently in
Hebrew and Arabic, yes? Sometimes the reordering of numeric expressions
differs (e.g. 1/2 in Latin and Hebrew would be presented as 2/1 in
Arabic). This also affects other characters often used with numbers such
as percent and dollar sign. So even within strictly RTL scripts,
different reordering is required depending on which script is being
used. But if you know a priori which script is in use, reordering is
trivial.
This is part of the "considered harmful" of bidi. :)
I'm not familiar with all this stuff, but as a mathematician I'm
curious how mathematicians working in these languages write. BTW
mathematical notation is an interesting example where traditional
storage order is visual and not logical.
Considered harmful? This is standard practice in these languages and has
been for a long time. You can't seriously expect readers of RTL
languages to just throw away everything they've learned since childhood
and learn to read their mathematical expressions backwards? Or simply
require that their scripts never appear in a plain text file? That is
ignorant at best and arrogant at worst.
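
The mechanism behind this script-dependent digit handling is visible
in the character data itself: European and Arabic-Indic digits carry
different bidi classes, which the UAX#9 weak-type rules resolve
differently. A quick standard-library check (illustrative only):

  import unicodedata

  # '1' is class EN (European Number), Arabic-Indic one is AN
  # (Arabic Number), and '/' is CS (Common Separator); EN and AN
  # runs are ordered differently, hence 1/2 versus 2/1:
  for ch in '1', '\u0661', '/':
      print(repr(ch), unicodedata.bidirectional(ch))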
Post by Rich Felker
Post by Mark Leisher
This is the choice of each programming language designer: either allow
directional override codes in the source or ban them. Those that ban
them obviously assume that knowledge of the language's syntax is
sufficient to allow an editor to present the source code text reasonably
well.
It's simply not acceptable to need an editor that's aware of language
syntax in order to present the code for viewing and editing. You could
work around the problem by inserting dummy comments to prevent the
bidi algo from taking effect but that's really ugly and essentially
makes RTL scripts unusable in programming if the editor applies
Unicode bidi algo to the display.
You really need to start looking at code and stop pontificating from a
poorly understood position. Just about every programming editor out
there is already aware of programming language syntax. Many different
programming languages in most cases.
Post by Rich Felker
Post by Mark Leisher
Post by Rich Felker
Post by Mark Leisher
You left out the part where Unicode says that none of these things is
strictly required.
This is blatantly false. UAX#9 talks about PARAGRAPHS. Text files
consist of LINES. If you don't believe me read ISO/IEC 9899 or IEEE
1003.1-2001. The algorithm in UAX#9 simply cannot be applied to text
files as it is written because text files do not define paragraphs.
How is a line ending with newline in a text file not a paragraph? A
poorly formatted paragraph, to be sure, but a paragraph nonetheless. The
Unicode Standard says paragraph separators are required for the
reordering algorithm. There is no reason why a line can't be viewed as a
paragraph. And it even works reasonably well most of the time.
Actually it does not work with embedding. If a (semantic) paragraph
has been split into multiple lines, the bidi embedding levels will be
broken and cannot be processed by the UAX#9 algorithm without trying
to reconstruct an idea of what the whole "paragraph" meant. Also, a
problem occurs if the first character of a new line happens to be an
LTR character (some embedded English text?) in a (semantic) paragraph
that's Arabic or Hebrew.
This is trivially obvious. Why do you think I said "poorly formatted
paragraph"? The obvious implication is that every once in a while,
reordering errors will happen because the algorithm is being applied to
a single line of a paragraph.
Post by Rich Felker
As you acknowledge below, a line is not necessarily an
unlimited-length object and in email it should not be longer than 80
characters (or preferably 72 or so to allow for quoting). So you can't
necessarily just take the MS Notepad approach of omitting newlines and
treating lines as paragraphs, although this may be appropriate in some
uses of text files.
So instead of a substantive argument why a line can't be viewed as a
paragraph, you simply imply that it just can't be done. Weak.
Post by Rich Felker
I'm talking about the definition of a text file as a sequence of
lines, which might (on stupid legacy implementations) even be
fixed-width fields. It's under the stdio stuff about the difference
between text and binary mode. I could look it up but I don't feel like
digging thru the pdf file right now..
That section doesn't provide definitions of line or paragraph.
Post by Rich Felker
Post by Mark Leisher
Han unification did indeed break existing practice, but I think you will
find that the IRG (group of representatives from all Han-using
countries) feels that in the long run, it was the best thing to do.
I agree it was best to do too. I just pointed it out as being contrary
to your claim that they made every effort not to break existing
practice.
For a mathematician, you are quite good at ignoring inconvenient logic.
The phrase "every effort to avoid breaking existing practice" does not
logically imply that no existing practice was broken. Weak.
Post by Rich Felker
I suppose this is true and I don't know the history and internal
politics well enough to know who was responsible for what. However,
unlike doublebyte/multibyte charsets which were becoming prevalent at
the time, UCS-2 data does not form valid C strings. A quick glance at
some historical remarks from unicode.org and Rob Pike suggests that
UTF-8 was invented well before any serious deployment of Unicode, i.e.
that the push for UCS-2 was deliberately aimed at breaking things,
though I suspect it was Apple, Sun, and Microsoft pushing UCS-2 more
than the consortium as a whole.
You can ask any of the Unicode people from those companies and will get
the same answer. Something had to be done and UCS-2 was the answer at
the time. Conspiracy theories do not substantive argument make.
Post by Rich Felker
Post by Mark Leisher
Those "useless" legacy characters avoided breaking many existing
applications, most of which were not written for Southeast Asia. Some
scripts had to end up in the 3-byte range of UTF-8. Are you in a
position to determine who should and should not be in that range? Have
IMO the answer is common sense. Languages that have a low information
per character density (lots of letters/marks per word, especially
Indic) should be in 2-byte range and those with high information
density (especially ideographic) should be in 3-byte range. If it
weren't for so many legacy Latin blocks near the beginning of the
character set, most or all scripts for low-density languages could
have fit in the 2-byte range.
So you simply assume that nobody bothered to look into things like
information density et al during the formation of the Unicode
Standard? You don't appear to be aware of the social and political
ramifications involved in making decisions like that. It doesn't matter
if it makes sense from a mathematical point of view, nations and people
are involved.
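
For reference, the byte ranges under discussion are fixed by the
UTF-8 encoding itself: code points up to U+07FF take two bytes, and
U+0800 through U+FFFF take three. A one-liner check (illustrative
only):

  # UTF-8 length by code point: prints 1, 2, 2, 3, 3, 4 bytes.
  for cp in (0x7F, 0x80, 0x7FF, 0x800, 0xFFFF, 0x10000):
      print(hex(cp), len(chr(cp).encode('utf-8')))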
Post by Rich Felker
Post by Mark Leisher
you even considered why they ended up in that range?
Probably at the time of allocation, UTF-8 was not even around yet. I
haven't studied that portion of Unicode history. Still the legacy
characters could have been put at the end with CJK compat forms and
preshaped Arabic forms, etc. or even outside the BMP.
Scripts were placed when information about their encodings became
available to the Unicode Consortium. It's that simple. No big conspiracy
to give SEA scripts short shrift.
Post by Rich Felker
Post by Mark Leisher
So you do understand. If it isn't fixable, what point is there in
complaining about it? Find a better way.
That's what I'm trying to do... Maybe some Hebrew or Arabic users who
dislike the whole bidi mess (the one Israeli user I'm in contact with
hates bidi and thinks it's backwards...not a good sample size but
interesting nonetheless) will agree and try my ideas for a
unidirectional presentation and like them. Or maybe they'll think it's
ugly and look for other solutions.
Sure. Lots of people don't like the situation, but nobody has come up
with anything better. There is a very good reason for that.
Post by Rich Felker
Applications can draw their own bidi text with higher level formatting
information, of course. I'm thinking of a terminal-mode browser that
has the bidi text in HTML with <dir> tags and whatnot, or apps with a
text 'gui' consisting of separated interface elements.
Ahh. Yes. That sounds a lot like lynx. A popular terminal-mode browser.
Have you checked out how it handles Unicode?
Post by Rich Felker
Post by Mark Leisher
I have a lot of experience
Could you tell me some of what you've worked on and what conclusions
you reached? I'm not familiar with your work.
Well, you can refer to the kterm code for some of my work with ISO/IEC
2022, and I may be able to dig up an ancient version of Motif (ca. 1993)
I adapted to use ISO/IEC 6429 and ISO/IEC 2022, and shortly after that
first Motif debacle, I attempted unsuccessfully to get a variant of
cxterm working with a combination of the two standards.

The conclusion was simple. The code quickly got too complicated to
debug. All kinds of little boundary (buffer/screen) effects kept
cropping up thanks to multi-byte escape sequences.
Post by Rich Felker
ISO 2022 is an abomination, certainly not an acceptable way to store
text due to its stateful nature, and although it works for
_displaying_ text, it's ugly even for that.
I've read ECMA-48 bidi stuff several times and still can't make any
sense of it, so I agree it's disgusting too. It does seem powerful but
powerful is often a bad thing. :)
Well, ISO/IEC 2022 and ISO/IEC 6429 do things the same way: multibyte
escape sequences.
Post by Rich Felker
Post by Mark Leisher
All I will say about them is Unicode is a lot easier to deal with. Have
Easier to deal with because it solves an easier problem. UAX#9 tells
you what to do when you have explicit paragraph division and unbounded
search capability forwards and backwards. Neither of these exists in a
character cell device environment, and (depending on your view of what
constitutes a proper text file) possibly not in a text file either. My
view of a text file (maybe not very popular these days?) is that it's
a more-restricted version of a character cell terminal (no cursor
positioning allowed) but with unlimited height.
Having implemented UAX #9 and a couple of other approaches that produce
the same or similar results, I don't see any problem using it to render
text files. If your text file has one paragraph per line, then you will
see occasional glitches in mixed LTR & RTL text.
Post by Rich Felker
Post by Mark Leisher
a look at the old kterm code if you want to see how complicated things
can get. And that was one of the cleaner implementations I've seen over
the years.
Does it implement ECMA-48 version of bidi? Or random unspecified bidi
like mlterm? Or..?
kterm had ISO/IEC 2022 support. Very few people attempted to use ISO/IEC
6429 because they didn't understand it very well and they knew how
complicated ISO/IEC 2022 was all by itself.
Post by Rich Felker
Post by Mark Leisher
Hah! I often hear the same sentiment from people who don't know the
difference between a glyph and a character.
I think we've established that I know the difference..
Post by Mark Leisher
Yes, it is true that Indic,
and even Khmer and Burmese scripts are relatively easy. All you need to
do is create the right set of glyphs.
Exactly. That's a lot of work...for the font designer. Almost no work
for the application author or for the machine at runtime.
Post by Mark Leisher
This frequently gives you multiple glyph codes for each abstract
character. To do anything with the text, a mapping between glyph and
abstract character is necessary for every program that uses that text.
No, it's necessary only for the terminal. The programs using the text
need not have any idea what language/script it comes from. This is the
whole beauty of using such apps.
I suspect you missed my point. Using glyph codes as an encoding gets
complicated fast. You can ask anyone who has tried to do any serious NLP
work with pre-Unicode Indic text. We are still having to write analysers
and converters to figure out the correct abstract characters and their
order for many scripts. I can provide a mapping table for one Burmese
encoding that shows how hideously complicated it can get to map a glyph
encoding to the underlying linear abstract character necessary to do any
kind of linguistic analysis.
Post by Rich Felker
They generally don't change things in incompatible ways, certainly not
in ways that would require retrofitting existing data with proper
embedding codes. What they might consider doing though is adding a
support level 1.5 or such. Right now UAX#9 (implicitly?) says that an
application not implementing at least the implicit bidi algorithm must not
interpret RTL characters visually at all.
Well, they don't want a program that simply reverses RTL segments
claiming conformance with UAX #9; it is better to see it backward than
to see it wrong. You can ask native users of RTL scripts about that. And
ask more than one.
Post by Rich Felker
Well in many cases my "simple solutions" are too simple for people
who've gotten used to bloated featuresets and gotten used to putting
up with slowness, bugs, and insecurity. But we'll see. My whole family
of i18n-related projects started out with a desire to switch to UTF-8
everywhere and to have Latin, Tibetan, and Japanese support at the
console level without increased bloat, performance penalties, and huge
dependency trees. From there I first wrote a super-small UTF-8-only C
library and then turned towards the terminal emulator issue, which in
turn led to the font format issue, etc. etc. :) Maybe after a whole
year passes I'll have roughly what I wanted.
I don't recall having seen your "simple solutions" so I can't dismiss
them off-hand as not being complicated enough yet. Like I said a couple
emails ago, sometimes it doesn't matter if you have a better answer, but
if it really is simple, accurate, and on the Internet, you can count on
it supplanting the bloat eventually.

BTW, now that the holiday has passed, I probably won't have time to
reply at similar length. But it's been fun.
--
---------------------------------------------------------------------------
Mark Leisher
Computing Research Lab Nowadays, the common wisdom is to
New Mexico State University celebrate diversity - as long as you
Box 30001, MSC 3CRL don't point out that people are
Las Cruces, NM 88003 different. -- Colin Quinn
Behdad Esfahbod
2006-09-05 02:57:08 UTC
Permalink
Post by Mark Leisher
Though I haven't checked myself, I wouldn't be surprised if Perl,
Python, PHP, and a host of other programming languages weren't already
doing this, making your concerns pointless. You would probably find it
instructive to look at some lexical scanners.
To add a sidenote to this otherwise pointless conversation, the
ECMAScript (aka JavaScript) standard actually ignores all format
characters
(gen-cat=Cf) from the source code. This has caused a problem for
Persian computing as U+200C ZERO WIDTH NON-JOINER is Cf and used in
Persian text. Brendan Eich is working on changing the standard to not
ignore formatting characters in string literals (and regexps probably
too.)
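
The category claim is easy to verify from Python's standard library;
ZWNJ shares general category Cf with the bidi controls, which is
exactly why a blanket ignore-all-Cf rule damages Persian text:

  import unicodedata

  print(unicodedata.category('\u200c'))  # Cf: ZERO WIDTH NON-JOINER
  print(unicodedata.category('\u202b'))  # Cf: RIGHT-TO-LEFT EMBEDDING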
--
behdad
http://behdad.org/

"Commandment Three says Do Not Kill, Amendment Two says Blood Will Spill"
-- Dan Bern, "New American Language"
Rich Felker
2006-09-05 05:13:35 UTC
Permalink
Post by Mark Leisher
Post by Rich Felker
It went farther because it imposed language-specific semantics in
places where they do not belong. These semantics are correct with
sentences written in human languages which would not have been hard to
explicitly mark up, especially with a word processor doing it for you.
On the other hand they're horribly wrong in computer languages
(meaning any text file meant to be computer-read and -interpreted, not
just programming languages) where explicit markup is highly
undesirable or even illegal.
The Unicode Consortium is quite correctly more concerned with human
languages than programming languages. I think you are arguing yourself
into a dead end. Programming languages are ephemeral and some might
argue they are in fact slowly converging with human languages.
Arrg, C is not going away anytime soon. C is THE LANGUAGE as far as
POSIX is concerned. The reason I said "arrg" is that I feel like this
gap between the core values of the "i18n bloatware crowd" and the
"hardcore lowlevel efficient software crowd" is what keeps good i18n
out of the best software. When you talk about programming languages
converging with human languages, somehow all I can think of is Perl...
yuck! Larry Wall's been great about pushing Unicode and UTF-8, but
Perl itself is a horrible mess. The implementation is hopelessly bad
and there's little hope of there ever being a reimplementation.

Anyway as I've said again and again, it's no problem for human
language text to have explicit embedding tagging. It doesn't need to
conform to syntax rules (oh yeah Perl code doesn't need to either ;)).
Fancy editors can even insert tags for you. On the other hand,
stuffing extra control characters into machine-read texts with
specific syntactical and semantic rules is not possible. You can't
even just strip these characters when processing because, depending on
the semantics of the file, they may either be controlling the display
of the file or be literal embedding controls to be used when the strings
from the file are printed to their final destination.
Post by Mark Leisher
Post by Rich Felker
Or I could just ask: should we write C code in MS Word .doc format?
No reason to. Programming editors work well as they are and will
continue to work well after being adapted for Unicode.
No, if they perform the algorithm in UAX#9 they will display garbled
unreadable code. Or does C somehow qualify as a "higher level
protocol" for formatting?
Post by Mark Leisher
You don't appear to have any experience writing lexical scanners for
programming languages. If you did, you would know how utterly trivial it
is to ignore embedded bidi codes an editor might introduce.
I'm quite aware that it's simple to code, but also illegal according
to the specs. Also you're ignoring the more troublesome issues...
Obviously you can't remove them inside strings. :) Issues with
comments too..
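
To spell out the caveat: a stripping pass has to track string-literal
state, because the same code points are noise outside a literal and
data inside one. A toy sketch for a C-like language follows
(hypothetical and heavily simplified: no comment or character-constant
handling, escapes are bare backslash pairs):

  BIDI_CONTROLS = set('\u200e\u200f\u202a\u202b\u202c\u202d\u202e')

  def strip_bidi_outside_strings(src):
      out, in_string, escaped = [], False, False
      for ch in src:
          if in_string:
              out.append(ch)  # inside a literal: keep everything
              if escaped:
                  escaped = False
              elif ch == '\\':
                  escaped = True
              elif ch == '"':
                  in_string = False
          elif ch in BIDI_CONTROLS:
              continue        # dropped: outside any string literal
          else:
              out.append(ch)
              if ch == '"':
                  in_string = True
      return ''.join(out)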
Post by Mark Leisher
Though I haven't checked myself, I wouldn't be surprised if Perl,
Python, PHP, and a host of other programming languages weren't already
doing this, making your concerns pointless.
I doubt it, but even if they do, these are toy languages with one
implementation and no specification (and in Perl's case, for which
it's hopeless to even try to write a specification). It's easy to hack
whatever you want and break compatibility with every new release of
the language when your implementation is the only one. It's much
harder when you're working with an international standard for a
language that's been around (and rather stable!) for approaching 40 years
and intended to have multiple interoperable implementations.
Post by Mark Leisher
You can't seriously expect readers of RTL
languages to just throw away everything they've learned since childhood
and learn to read their mathematical expressions backwards? Or simply
require that their scripts never appear in a plain text file? That is
ignorant at best and arrogant at worst.
I've seen examples that show that UAX#9 just butchers mathematical
expressions in the absence of explicit bidi control.
Post by Mark Leisher
You really need to start looking at code and stop pontificating from a
poorly understood position. Just about every programming editor out
there is already aware of programming language syntax. Many different
programming languages in most cases.
Cheap regex-based syntax highlighting is not the same thing at all. But
this is aside from the point, that it's fundamentally WRONG to need a
special tool that knows about the syntax of your computer language in
order to edit it. What if you've designed your own language to solve a
particular problem? Do you have to go and modify your editor to make
it display this text correctly for this language? NO! That's the whole
reason we have plain text. You can edit it without having to have a
special program!
Post by Mark Leisher
Post by Rich Felker
As you acknowledge below, a line is not necessarily an
unlimited-length object and in email it should not be longer than 80
characters (or preferably 72 or so to allow for quoting). So you can't
necessarily just take the MS Notepad approach of omitting newlines and
treating lines as paragraphs, although this may be appropriate in some
uses of text files.
So instead of a substantive argument why a line can't be viewed as a
paragraph, you simply imply that it just can't be done. Weak.
No, I agree that it can be. I'm just saying that a line can't do the
things you expect a paragraph to do, though. In particular it can't be
arbitrarily long in any plain text context, although it could be in
some.
Post by Mark Leisher
Post by Rich Felker
I'm talking about the definition of a text file as a sequence of
lines, which might (on stupid legacy implementations) even be
fixed-width fields. It's under the stdio stuff about the difference
between text and binary mode. I could look it up but I don't feel like
digging thru the pdf file right now..
That section doesn't provide definitions of line or paragraph.
See 7.19.2 Streams.
Post by Mark Leisher
Post by Rich Felker
I agree it was best to do too. I just pointed it out as being contrary
to your claim that they made every effort not to break existing
practice.
For a mathematician, you are quite good at ignoring inconvenient logic.
The phrase "every effort to avoid breaking existing practice" does not
logically imply that no existing practice was broken. Weak.
Read the history. Han unification was one of the very first points of
Unicode, even though it was obvious that it would break much existing
practice. This seems to have been connected to the misguided goal of
trying to make everything into fixed-width 16bit characters. From what
I understand, early Unicode was making every effort _to break_
existing practice. Their motto was "...begin at 0 and add the next
character" which to me implies "throw out everything that already
exists and start from scratch." I've never seen the early drafts but I
wouldn't be surprised if the original characters 0-127 didn't even
match ASCII.
Post by Mark Leisher
Post by Rich Felker
I suppose this is true and I don't know the history and internal
politics well enough to know who was responsible for what. However,
unlike doublebyte/multibyte charsets which were becoming prevalent at
the time, UCS-2 data does not form valid C strings. A quick glance at
some historical remarks from unicode.org and Rob Pike suggests that
UTF-8 was invented well before any serious deployment of Unicode, i.e.
that the push for UCS-2 was deliberately aimed at breaking things,
though I suspect it was Apple, Sun, and Microsoft pushing UCS-2 more
than the consortium as a whole.
You can ask any of the Unicode people from those companies and will get
the same answer. Something had to be done and UCS-2 was the answer at
the time. Conspiracy theories do not substantive argument make.
I've been researching what I can with the little information available
and it seems that the early Unicode architects got a strong disgust
for variable-size characters from their experience with Shift_JIS
(which was extremely poorly designed) and other CJK encodings and
developed a dogma that fixed-width was the way to go. There are
numerous references to this sort of thinking in "10 Years of Unicode"
published under history on unicode.org.
Post by Mark Leisher
So you simply assume that nobody bothered to look into things like
information density et al during the formation of the Unicode
Standard? You don't appear to be aware of the social and political
ramifications involved in making decisions like that. It doesn't matter
if it makes sense from a mathematical point of view, nations and people
are involved.
Latin text (which is mostly ASCII anyway) would go up in size by a few
percent while many languages would go down by 33%. Sounds like a fair
trade. I'm sure there are political ramifications, and of course the
answer is always: do what pleases the countries with the most
money/power rather than doing what serves the largest population and
the population that has the greatest scarcity of storage space...
Post by Mark Leisher
Scripts were placed when information about their encodings became
available to the Unicode Consortium. It's that simple. No big conspiracy
to give SEA scripts short shrift.
Honestly I think they just didn't care about UTF-8 at the time because
they still had delusions that people would switch to UCS-2 for
everything. Also I've been told that the arrangement was intended to
be "West to East"..
Post by Mark Leisher
Post by Rich Felker
Applications can draw their own bidi text with higher level formatting
information, of course. I'm thinking of a terminal-mode browser that
has the bidi text in HTML with <dir> tags and whatnot, or apps with a
text 'gui' consisting of separated interface elements.
Ahh. Yes. That sounds a lot like lynx. A popular terminal-mode browser.
Have you checked out how it handles Unicode?
The only app I've seriously checked out is mined simply because most
apps don't have support for bidi on the console (and many still don't
even know how to use wcwidth...! including emacs!! :( ).

If lynx handles bidi specially I'd be interested in seeing what it
does. However this brings up another interesting question: what should
lynx -dump do? :) Naturally dumping in visual order is wrong, but
generating a text file that will look right when displayed according
to UAX#9 sounds quite difficult, especially when you take multiple
columns, etc. into account. Of course lynx is old broken crap that
doesn't even support tables so maybe it has it easier.. :) These days
I use ELinks, but it has very very poor i18n support. :(
Post by Mark Leisher
Post by Rich Felker
I've read ECMA-48 bidi stuff several times and still can't make any
sense of it, so I agree it's disgusting too. It does seem powerful but
powerful is often a bad thing. :)
Well, ISO/IEC 2022 and ISO/IEC 6429 do things the same way: multibyte
escape sequences.
I'm confused what you mean by multi-byte escape sequences. What I know
of as ISO 2022 is the charset-switching escapes used for legacy CJK
support and "vt100 linedrawing characters", but you seem to be talking
about something related to bidi. Does ISO 2022 have bidi controls as
well?
Post by Mark Leisher
Post by Rich Felker
Post by Mark Leisher
All I will say about them is Unicode is a lot easier to deal with. Have
Easier to deal with because it solves an easier problem. UAX#9 tells
you what to do when you have explicit paragraph division and unbounded
search capability forwards and backwards. Neither of these exists in a
character cell device environment, and (depending on your view of what
constitutes a proper text file) possibly not in a text file either. My
view of a text file (maybe not very popular these days?) is that it's
a more-restricted version of a character cell terminal (no cursor
positioning allowed) but with unlimited height.
Having implemented UAX #9 and a couple of other approaches that produce
the same or similar results, I don't see any problem using it to render
text files. If your text file has one paragraph per line, then you will
see occasional glitches in mixed LTR & RTL text.
Seek somewhere in the middle of the line and type a character of the
opposite directionality. Watch the whole line jump around and the
character you just typed end up in a different column from where your
cursor was placed.

This sort of thing will happen all the time in a terminal when the app
goes to draw interface elements, etc. over top of part of the text. If
it doesn't, i.e. if the terminal implements a sort of "hard implicit
bidi", then the terminal will just hopelessly corrupt unless the
program has explicit bidi logic matching the terminal's.
Post by Mark Leisher
Post by Rich Felker
Post by Mark Leisher
This frequently gives you multiple glyph codes for each abstract
character. To do anything with the text, a mapping between glyph and
abstract character is necessary for every program that uses that text.
No, it's necessary only for the terminal. The programs using the text
need not have any idea what language/script it comes from. This is the
whole beauty of using such apps.
I suspect you missed my point. Using glyph codes as an encoding gets
complicated fast.
Yes but where did I say anything about glyph codes? In both Unicode
and ISCII text everything is character codes, not glyph codes. Sorry
but I don't understand what you were trying to say..
Post by Mark Leisher
Well, they don't want a program that simply reverses RTL segments
claiming conformance with UAX #9, it is better to see it backward than
to see it wrong. You can ask native users of RTL scripts about that. And
ask more than one.
It says more than that; it says that a program is forbidden from
interpreting the characters visually at all if it doesn't perform at
least the implicit part of UAX#9. From my reading, this means that
UAX#9 deems it worse to show the RTL characters in LTR order than not
to show them at all. It also precludes display strategies like the one
I proposed.
Post by Mark Leisher
Post by Rich Felker
Well in many cases my "simple solutions" are too simple for people
who've gotten used to bloated featuresets and gotten used to putting
up with slowness, bugs, and insecurity. But we'll see. My whole family
of i18n-related projects started out with a desire to switch to UTF-8
everywhere and to have Latin, Tibetan, and Japanese support at the
console level without increased bloat, performance penalties, and huge
dependency trees. From there I first wrote a super-small UTF-8-only C
library and then turned towards the terminal emulator issue, which in
turn led to the font format issue, etc. etc. :) Maybe after a whole
year passes I'll have roughly what I wanted.
I don't recall having seen your "simple solutions" so I can't dismiss
them off-hand as not being complicated enough yet.
http://svn.mplayerhq.hu/libc/trunk/
About 100kb of code and a few kb of data. E.g. iconv is 2kb, missing
support for CJK legacy encodings at present, final size should be
about 2.5-2.7kb.

Terminal emulator uuterm isn't checked in yet but it's looking like
the whole program with support for all scripts (except RTL scripts, if
you don't count non-UAX#9-conformant display as support) will come to
about 50kb of code static linked. Plus about 1.5 meg for a complete
font.


On a separate note... maybe it would help if I express and clarify my
view on UAX#9:

I think it very much has its place and it's great when formatting
content that is known to be human-language text for display in the
traditional form expected by most readers. However, IMO what UAX#9
should be seen as is a specification of the correspondence between the
stored "logical order" text and the traditional print form, in a way
as a definition of "logical order" text. It's important to have this
kind of definition for legal purposes especially, so e.g. if someone
has signed a document containing particular bidi text, it's clear what
printed text ordering that binary text is meant to represent and thus
clear what was signed.

On the other hand, I find the whole idea of bidirectionality harmful.
Human language text has always involved ambiguity as far as
interpreting the meaning, but aside from bidi text, at least there is
an unambiguous way to display the characters so that their logical
order is clear to the reader, and this method does not require the
machine to interpret the human language at all.

With bidi thrown in, the presentation completely _fails_ to represent
the logical order of the text. In fact it's possible to
construct bidi text where the presentation order is completely
deceptive... this could, for example, be used for googlebombing or
evading spam filters by permuting the characters of your text to
include or avoid certain words or phrases. The author of Yudit also
identifies examples that have security implications.
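
A single override character is enough to demonstrate the divergence
between logical and presented order (a minimal sketch; what the first
print actually shows depends on whether your terminal applies UAX#9):

  s = 'abc\u202efed'    # U+202E is RIGHT-TO-LEFT OVERRIDE
  print(s)              # a bidi-aware display renders: abcdef
  print(s == 'abcdef')  # False: the logical content is different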

Along with the other reasons I have discussed regarding breaking text
file and character cell sanity, this is why, in my view, bidi is
"considered harmful". I don't expect RTL script users to switch to
LTR. What I do propose is a way for LTR users to view text containing
RTL characters without the need for bidi and without "ekil esnesnon
siht", as well as a way for RTL users to have an entirely-RTL
environment rather than a bidi one. The latter still requires some
more consideration regarding mathematical expressions and numerals. At
this point I have no idea whether such a thing would be of interest to
a significant number of RTL users but I suspect primarily-LTR users
with an occasional need for reading Arabic or Hebrew words or phrases
would like it. Both of these approaches have the side-effect of making
RTL scripts "just work" in any application without the need for
special bidi support at the application level or the terminal level.
Post by Mark Leisher
BTW, now that the holiday has passed, I probably won't have time to
reply at similar length. But it's been fun.
Ah well, I tried to strip my reply down to the most
interesting/relevant parts in case you do have time for some replies,
but it looks like I've still left a lot in.

Thanks for discussing in any case.

Rich
Rich Felker
2006-09-06 04:11:49 UTC
Permalink
Post by Mark Leisher
My last gasp on this conversation: I don't think you really understand
what you are talking about and won't until you get some hands-on
experience.
I'm not sure how to take this but whatever it is, it sounds
condescending and impolite. Was that the intent? What makes you think
I lack hands-on experience? The fact that my code is "too small" and
going to stay that way? Or just that it's not yet checked in for you
to view?

I'm sorry if my long messages to this list have offended, but my
intent was to seek input and discussion. I don't think anything I said
was any more offensive than similar things which Markus and other
people respected in this community have said. If it's just that you
don't have time to deal with this thread anymore, no problem, I won't
take offense.
Post by Mark Leisher
Goodbye and good luck.
Thanks I suppose......

Rich
Mark Leisher
2006-09-05 14:07:14 UTC
Permalink
My last gasp on this conversation: I don't think you really understand
what you are talking about and won't until you get some hands-on
experience. Goodbye and good luck.
--
------------------------------------------------------------------------
Mark Leisher
Computing Research Lab We find comfort among those who
New Mexico State University agree with us, growth among those
Box 30001, MSC 3CRL who don't.
Las Cruces, NM 88003 -- Frank A. Clark
David Starner
2006-09-05 04:44:26 UTC
Permalink
Post by Rich Felker
IMO the answer is common sense. Languages that have a low information
per character density (lots of letters/marks per word, especially
Indic) should be in 2-byte range and those with high information
density (especially ideographic) should be in 3-byte range. If it
weren't for so many legacy Latin blocks near the beginning of the
character set, most or all scripts for low-density languages could
have fit in the 2-byte range.
Once you compress the data with a decent compression scheme, you may
as well store the data by writing out the full Unicode name (e.g.
"LATIN CAPITAL LETTER OU"); the final result will be about the same
size. Furthermore, you can fit a decent sized novel on a floppy
uncompressed and a decent sized library on a DVD uncompressed. The
only application I've seen where text data size was really crucial was
text messaging. Hence, common sense tells _me_ that we should put
scripts used by heavily text-messaging cultures in the 2-byte range;
that is, Latin, Hiragana and Katakana.
Rich Felker
2006-09-05 05:28:29 UTC
Permalink
Post by David Starner
Post by Rich Felker
IMO the answer is common sense. Languages that have a low information
per character density (lots of letters/marks per word, especially
Indic) should be in 2-byte range and those with high information
density (especially ideographic) should be in 3-byte range. If it
weren't for so many legacy Latin blocks near the beginning of the
character set, most or all scripts for low-density languages could
have fit in the 2-byte range.
Once you compress the data with a decent compression scheme, you may
as well store the data by writing out the full Unicode name (e.g.
"LATIN CAPITAL LETTER OU"); the final result will be about the same
size.
With some compression methods this is true, particularly bz2.
Post by David Starner
Furthermore, you can fit a decent sized novel on a floppy
uncompressed and a decent sized library on a DVD uncompressed.
Yet somehow the firefox source code is still 36 megs (bz2), and god
only knows how large OOO is. Imagine now if all the variable and
function names were written in Hindi or Thai... It would be an
interesting test to transliterate the Latin letters to Devanagari and
see how much the compressed tarball size goes up.
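
The proposed experiment is easy to approximate (a sketch only: the
file name is a placeholder and the letter mapping is an arbitrary
stand-in, not a real transliteration scheme):

  import bz2

  # Map ASCII letters onto Devanagari code points so each 1-byte
  # letter becomes a 3-byte UTF-8 sequence, then compare bz2 sizes.
  devanagari = {c: chr(0x0905 + i) for i, c in
                enumerate('abcdefghijklmnopqrstuvwxyz')}
  with open('some_source_file.c') as f:  # placeholder input file
      latin = f.read()
  indic = ''.join(devanagari.get(c, c) for c in latin.lower())
  for label, text in ('latin', latin), ('devanagari', indic):
      raw = text.encode('utf-8')
      print(label, len(raw), len(bz2.compress(raw)))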
Post by David Starner
The
only application I've seen where text data size was really crucial was
text messaging. Hence, common sense tells _me_ that we should put
scripts used by heavily text-messaging cultures in the 2-byte range;
that is, Latin, Hiragana and Katakana.
ROTFL! :)

In all seriousness, though, unless you're dealing with image, music,
or movie files, text weighs in quite heavy in size. It's true that in
html 75-90% of the size is usually tags (in ASCII) but that's due to
incompetence of the web designers and their inability to use CSS
correctly, not anything fundamental. If you're making a website
without fluff and with lots of information, text size will be the
dominant factor in traffic. It's quite unfortunate that native
language text is 3 to 6(*) times larger in countries where bandwidth
is very expensive.

Rich


(*) 6 because a large number of characters in Indic scripts will have
the virama (a combining character) attached to them to remove the
inherent vowel and attach them into clusters.
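
A concrete data point for the footnote, using the word "Hindi" itself
(standard library only):

  # 'Hindi' in ASCII versus Devanagari (6 code points including a
  # vowel-killing virama, 3 bytes each in UTF-8):
  print(len('Hindi'.encode('utf-8')))  # 5 bytes
  word = '\u0939\u093f\u0928\u094d\u0926\u0940'
  print(len(word), len(word.encode('utf-8')))  # 6 code points, 18 bytes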
David Starner
2006-09-05 05:57:08 UTC
Permalink
Post by Rich Felker
Post by David Starner
Once you compress the data with a decent compression scheme, you may
as well store the data by writing out the full Unicode name (e.g.
"LATIN CAPITAL LETTER OU"); the final result will be about the same
size.
With some compression methods this is true, particularly bz2.
Post by David Starner
Furthermore, you can fit a decent sized novel on a floppy
uncompressed and a decent sized library on a DVD uncompressed.
Yet somehow the firefox source code is still 36 megs (bz2), and god
only knows how large OOO is. Imagine now if all the variable and
function names were written in Hindi or Thai... It would be an
interesting test to transliterate the Latin letters to Devanagari and
see how much the compressed tarball size goes up.
The very point of the above test is that it would change the size
minimally. It shouldn't make much if any difference.
Post by Rich Felker
In all seriousness, though, unless you're dealing with image, music,
or movie files, text weighs in quite heavy in size.
As opposed to what? The vast majority of content is one of the four,
and what's left--say, Flash files--doesn't seem particularly small
compared to text.
Post by Rich Felker
If you're making a website
without fluff and with lots of information, text size will be the
dominant factor in traffic. It's quite unfortunate that native
language text is 3 to 6(*) times larger in countries where bandwidth
is very expensive.
Welcome to HTTP 1.1. There's no reason not to compress the data while
you're sending it across the network, which will fix the vast majority
of this problem.
Rich Felker
2006-09-05 07:11:52 UTC
Permalink
Post by David Starner
Post by Rich Felker
In all seriousness, though, unless you're dealing with image, music,
or movie files, text weighs in quite heavy in size.
As opposed to what? The vast majority of content is one of the four,
and what's left--say, Flash files--don't seem particularly small
compared to text.
I wasn't thinking of a website but rather a complete computer system.
I have several gigabytes of email which is larger than even a very
bloated OS and several hundred thousand times bigger than a
non-bloated OS. Multiply this by a factor of 3 or more and it could
quite easily go from "feasible to store" to "infeasible to store".
Post by David Starner
Post by Rich Felker
If you're making a website
without fluff and with lots of information, text size will be the
dominant factor in traffic. It's quite unfortunate that native
language text is 3 to 6(*) times larger in countries where bandwidth
is very expensive.
Welcome to HTTP 1.1. There's no reason not to compress the data while
you're sending it across the network, which will fix the vast majority
of this problem.
Here you have the issue of compression performance versus bandwidth,
especially relevant on a heavily loaded server (of course you can
precompress static texts). Also gzip doesn't perform so well on UTF-8
so bzip2 would be better but also much more cpu-hungry and I doubt any
clients support it.

Anyway all of this discussion is in a sense pointless since none of us
have the power to change any of the problem and since there's no real
solution even if we could. But sometimes you just have to bitch about
the stuff the Unicode folks messed up on..

Rich
Alexandros Diamantidis
2006-09-02 11:15:50 UTC
Permalink
Post by Rich Felker
The vertical orientation thing is mostly of interest to Mongolian
users and perhaps some East Asian users, but it could also be
Note that Mongolian is mostly written with the Cyrillic alphabet today.
From what I've seen in movies, articles etc. - never been to
Mongolia myself - the traditional vertical script is still used on signs on
public buildings, monuments, and similar cultural contexts, but not to
write longer texts.
--
Alexandros Diamantidis * ***@hellug.gr