Discussion: Bidi considered harmful? :)
Rich Felker
2006-09-01 03:33:06 UTC
I read an old thread on the XFree86 i18n list started by Markus Kuhn
suggesting (rather strongly) that bidi should not be supported at the
terminal level, as well as accusations (from other sources) by the author
of Yudit that the UAX#9 bidi algo results in serious security issues due
to the irreversibility of the transformation and that it inevitably
butchers mathematical formulae.

I've also considered examples on my own, such as a program (not
necessarily terminal-aware, just text output) that prints lines of the
form "%s %d %d %s" without any special treatment (such as putting
explicit embedding marks around the %s fields) for bidi text, or a
terminal-based program that draws interface elements over top of
existing RTL text, resulting in nonsense.
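
To make the first case concrete, here is a minimal sketch (invented
data, with uppercase standing in for RTL letters the way UAX#9's own
examples do):

    #include <stdio.h>

    int main(void)
    {
        /* All-LTR data: the columns are stable on any terminal. */
        printf("%s %d %d %s\n", "abc def", 123, 456, "ghi jkl");

        /* string1 ends in RTL, string4 begins with RTL. The bytes
           below are in the same logical order as above, but a
           terminal applying UAX#9 resolves "DEF 123 456 GHI" as one
           right-to-left run and displays the line as
               abc IHG 456 123 FED jkl
           -- the two number columns silently switch places. */
        printf("%s %d %d %s\n", "abc DEF", 123, 456, "GHI jkl");
        return 0;
    }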

In all cases, my personal opinion has been not just that UAX#9 is
broken, but that there's no way to implement any sort of implicit bidi
in a terminal emulator or in the display of text/plain data without
every single program having to go _far_ out of its way to ensure that
it won't give incorrect output when the input contains RTL characters,
which simply isn't going to happen, especially since it would
interfere with use in non-RTL scenarios. Other people may have
different opinions but I have not seen any viable solutions.



At the same time, I'm also very dissatisfied with the lack of proper
support for RTL scripts/languages in most applications and especially
at the terminal level, especially since Arabic is in such widespread
use and has great political importance in world affairs these days. I
do not accept that the solution is just to print characters in the
wrong visual order.

.eerga ll'uoy tcepxe I ylbatrofmoc ecnetnes siht daer nac uoy sselnU



I experimented with the idea of mirroring glyphs to improve
readability, and was fairly surprised by how little it helped my
perception. Reading English text that had been graphically mirrored
remained almost as difficult as reading the above line, with the b/d
and p/q pairs causing significant pause in comprehension.

So then, reading UAX#9 again, I stumbled across the only section
that's not completely stupid (IMO of course):

5.4 Vertical Text

In the case of vertical line orientation, the bidirectional
algorithm is still used to determine the levels of the text.
However, these levels are not used to reorder the text, since the
characters are usually ordered uniformly from top to bottom.
Instead, the levels are used to determine the rotation of the
text. Sometimes vertical lines follow a vertical baseline in which
each character is oriented as normal (with no rotation), with
characters ordered from top to bottom whether they are Hebrew,
numbers, or Latin. When setting text using the Arabic script in
vertical lines, it is more common to employ a horizontal baseline
that is rotated by 90° counterclockwise so that the characters are
ordered from top to bottom. Latin text and numbers may be rotated
90° clockwise so that the characters are also ordered from top to
bottom.

What this provides is a suggested formatting that makes RTL and LTR
scripts both readable in a single-directional context, a vertical one.
Combined with the recent Mongolian script discussion on this list, I
believe this offers an alternate presentation form for documents that
mix LTR and RTL text without using bidi.

I'm not suggesting that everyone should switch to vertically-oriented
terminals or text-file-presentation, although Mongolian users might
like such a setup and it can certainly be one presentation option
that's fair to both RTL and LTR users by making both RTL and LTR
scripts quite readable.

The key idea to take from the Mongolian discussion and from UAX#9 5.4
is that, by having glyphs for LTR and RTL scripts rotated 180°
relative to one another, both can appear legible in a common
directionality. Thus, perhaps LTR users could present legible
RTL-script text by rotating all glyphs 180° and displaying them in LTR
order, and likewise RTL users could use a dominant RTL direction with
LTR glyphs rotated 180° [1]. Like with Mongolian, directionality could
become a localized user preference, rather than a property of the
script.

Does this actually work?

I repeated my experiment with English text reading, rotating the
graphic representation by 180° rather than mirroring it left-right. I
was pleased to find that I could read it with similar ease (but not
quite as fast) as ordinary LTR English text. Surprisingly, p/d and b/q
confusion did not arise, perhaps due to the obvious visual distinction
between the ascent/descent space of the glyphs.

I do not claim these tests are scientific, since the only subject
participating was myself. :) But they are suggestive of an alternative
possible presentation form for mixed LTR/RTL scripts without utilizing
bidirectionality. I consider bidirectionality harmful because:

- It is inherently slow for one's eyes to jump back and forth
switching directions while reading a single paragraph.
- It quickly becomes impossible to read quotations with multiple
levels of directional embedding. Forget UAX#9's 61 levels; 3 levels
are already undecipherable without slow and meticulous work.
- Implicit directionality is impossible to resolve without interfering
with sane people's expectations under string operations. In
particular the UAX#9 insanity involves _semantic_ interpretations of
text contents based on presupposed cultural conventions (like
whether a comma is a thousands separator or a list separator), which
are simply not valid assumptions you can make at such a low level.
- Visual order does not uniquely convey the logical order.

This is not to say that bidirectional formatting doesn't have its
place, or that, used correctly without multiple embedding levels,
with well-set block quotes, etc., it won't be legible. I also do not
preclude use of advanced ECMA-48 features for explicit bidi at the
terminal level. But I'd like to propose unidirectional formatting with
adjusted glyph orientation as a more logical (and perhaps more easily
readable) alternative to be used in terminal emulators and perhaps
also other contexts where accurate representation of the logical order
is required or where multiple levels of quoting are in use.

The most important thing to realize is that this proposal is not to
reject traditional ways of writing RTL scripts. The proposal is to
reject the (very stupid IMO) idea of mixing LTR and RTL
directionalities in a single paragraph context, except in the case
where higher-level formatting (which is inherently not available in a
plain text file or text printed to stdout) can control it.


Rich





[1] There is a small problem that even without LTR scripts mixed in,
most RTL scripts are "bidirectional" due to numbers being written LTR.
However supporting reversed display of individual numbers (or even
individual words) is a trivial problem compared to full bidi text flow
and can be done without compromising reversibility and without complex
algorithms that cause misinterpretation of adjacent text.
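
For instance, a sketch of that per-number pass (ASCII digits only for
brevity; a real one would walk UTF-8 and also cover Arabic-Indic
digits, and the function names here are invented):

    #include <stdio.h>
    #include <string.h>

    /* Reverse s[i..j] in place. */
    static void revspan(char *s, size_t i, size_t j)
    {
        while (i < j) {
            char t = s[i];
            s[i++] = s[j];
            s[j--] = t;
        }
    }

    /* For a display that draws the whole line right-to-left,
       re-reverse each maximal digit run so numbers still read
       left-to-right. There is no long-range movement, so the
       transform is its own inverse and never misinterprets the
       text around the number. */
    static void reverse_digit_runs(char *s)
    {
        size_t i = 0, n = strlen(s);
        while (i < n) {
            if (s[i] < '0' || s[i] > '9') { i++; continue; }
            size_t j = i;
            while (j + 1 < n && s[j+1] >= '0' && s[j+1] <= '9')
                j++;
            revspan(s, i, j);
            i = j + 1;
        }
    }

    int main(void)
    {
        char line[] = "ABC 1234 DEF";   /* uppercase = RTL letters */
        reverse_digit_runs(line);
        printf("%s\n", line);   /* "ABC 4321 DEF": drawn cell-by-cell
                                   right-to-left, the number reads
                                   1234 again */
        return 0;
    }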
George W Gerrity
2006-09-01 06:32:40 UTC
Post by Rich Felker
I read an old thread on the XFree86 i18n list started by Markus Kuhn
suggesting (rather strongly) that bidi should not be supported at the
terminal level, as well as accusations (from other sources) by the author
of Yudit that the UAX#9 bidi algo results in serious security issues due
to the irreversibility of the transformation and that it inevitably
butchers mathematical formulae.
I've also considered examples on my own, such as a program (not
necessarily terminal-aware, just text output) that prints lines of the
form "%s %d %d %s" without any special treatment (such as putting
explicit embedding marks around the %s fields) for bidi text, or a
terminal-based program that draws interface elements over top of
existing RTL text, resulting in nonsense.
In all cases, my personal opinion has been not just that UAX#9 is
broken, but that there's no way to implement any sort of implicit bidi
in a terminal emulator or in the display of text/plain data without
every single program having to go _far_ out of its way to ensure that
it won't give incorrect output when the input contains RTL characters,
which simply isn't going to happen, especially since it would
interfere with use in non-RTL scenarios. Other people may have
different opinions but I have not seen any viable solutions.
I did try to tell you that doing a terminal emulation properly would
be complex. I don't know if the algorithm is broken: I doubt it. But
it is difficult getting it to work properly and it essentially
requires internal tables for every glyph describing its direction and
orientation.
Post by Rich Felker
At the same time, I'm also very dissatisfied with the lack of proper
support for RTL scripts/languages in most applications and especially
at the terminal level, especially since Arabic is in such widespread
use and has great political importance in world affairs these days. I
do not accept that the solution is just to print characters in the
wrong visual order.
.eerga ll'uoy tcepxe I ylbatrofmoc ecnetnes siht daer nac uoy sselnU
I experimented with the idea of mirroring glyphs to improve
readability, and was fairly surprised by how little it helped my
perception. Reading English text that had been graphically mirrored
remained almost as difficult as reading the above line, with the b/d
and p/q pairs causing significant pause in comprehension.
So then, reading UAX#9 again, I stumbled across the only section
that's not completely stupid (IMO of course):
5.4 Vertical Text
In the case of vertical line orientation, the bidirectional
algorithm is still used to determine the levels of the text.
However, these levels are not used to reorder the text, since the
characters are usually ordered uniformly from top to bottom.
Instead, the levels are used to determine the rotation of the
text. Sometimes vertical lines follow a vertical baseline in which
each character is oriented as normal (with no rotation), with
characters ordered from top to bottom whether they are Hebrew,
numbers, or Latin. When setting text using the Arabic script in
vertical lines, it is more common to employ a horizontal baseline
that is rotated by 90° counterclockwise so that the characters are
ordered from top to bottom. Latin text and numbers may be rotated
90° clockwise so that the characters are also ordered from top to
bottom.
What this provides is a suggested formatting that makes RTL and LTR
scripts both readable in a single-directional context, a vertical one.
Combined with the recent Mongolian script discussion on this list, I
believe this offers an alternate presentation form for documents that
mix LTR and RTL text without using bidi.
I'm not suggesting that everyone should switch to vertically-oriented
terminals or text-file-presentation, although Mongolian users might
like such a setup and it can certainly be one presentation option
that's fair to both RTL and LTR users by making both RTL and LTR
scripts quite readable.
The key idea to take from the Mongolian discussion and from UAX#9 5.4
is that, by having glyphs for LTR and RTL scripts rotated 180°
relative to one another, both can appear legible in a common
directionality. Thus, perhaps LTR users could present legible
RTL-script text by rotating all glyphs 180° and displaying them in LTR
order, and likewise RTL users could use a dominant RTL direction with
LTR glyphs rotated 180° [1]. Like with Mongolian, directionality could
become a localized user preference, rather than a property of the
script.
Does this actually work?
I repeated my experiment with English text reading, rotating the
graphic representation by 180° rather than mirroring it left-right. I
was pleased to find that I could read it with similar ease (but not
quite as fast) as ordinary LTR English text. Surprisingly, p/d and b/q
confusion did not arise, perhaps due to the obvious visual distinction
between the ascent/descent space of the glyphs.
I do not claim these tests are scientific, since the only subject
participating was myself. :) But they are suggestive of an alternative
possible presentation form for mixed LTR/RTL scripts without utilizing
bidirectionality. I consider bidirectionality harmful because:
- It is inherently slow for one's eyes to jump back and forth
switching directions while reading a single paragraph.
- It quickly becomes impossible to read quotations with multiple
levels of directional embedding. Forget UAX#9's 61 levels; 3 levels
are already undecipherable without slow and meticulous work.
- Implicit directionality is impossible to resolve without interfering
with sane people's expectations under string operations. In
particular the UAX#9 insanity involves _semantic_ interpretations of
text contents based on presupposed cultural conventions (like
whether a comma is a thousands separator or a list separator), which
are simply not valid assumptions you can make at such a low level.
- Visual order does not uniquely convey the logical order.
This is not to say that bidirectional formatting doesn't have its
place, or that, used correctly without multiple embedding levels,
with well-set block quotes, etc., it won't be legible. I also do not
preclude use of advanced ECMA-48 features for explicit bidi at the
terminal level. But I'd like to propose unidirectional formatting with
adjusted glyph orientation as a more logical (and perhaps more easily
readable) alternative to be used in terminal emulators and perhaps
also other contexts where accurate representation of the logical order
is required or where multiple levels of quoting are in use.
The most important thing to realize is that this proposal is not to
reject traditional ways of writing RTL scripts. The proposal is to
reject the (very stupid IMO) idea of mixing LTR and RTL
directionalities in a single paragraph context, except in the case
where higher-level formatting (which is inherently not available in a
plain text file or text printed to stdout) can control it.
Rich
[1] There is a small problem that even without LTR scripts mixed in,
most RTL scripts are "bidirectional" due to numbers being written LTR.
However supporting reversed display of individual numbers (or even
individual words) is a trivial problem compared to full bidi text flow
and can be done without compromising reversibility and without complex
algorithms that cause misinterpretation of adjacent text.
No one using Arabic script would accept reading it top to bottom: it
is simply never done (to the best of my knowledge), and so any
terminal emulator claiming to work with any script had better be able
to render the text correctly, including mixing RTL and LTR.

George
------
Rich Felker
2006-09-01 13:41:44 UTC
Post by George W Gerrity
I did try to tell you that doing a terminal emulation properly would
be complex. I don't know if the algorithm is broken: I doubt it. But
it is difficult getting it to work properly and it essentially
requires internal tables for every glyph describing its direction and
orientation.
If that were the problem it would be trivial. The problems are much
more fundamental. The key examples you should look at are things like:
printf("%s %d %d %s\n", string1, number2, number3, string4); where the
output is intended to be columnar. Everything is fine until someone
puts in data where string1 ends in RTL text and string4 begins with
RTL text, in which case the numbers switch places. This kind of
instability is not just awkward; it shows that implicit bidi is
fundamentally broken. Even if it can be handled at the terminal
emulator level with special escapes and whatnot (and I believe it can,
albeit in very ugly ways) it simply cannot be handled in a plain text
file, for reasons like these (uppercase standing in for RTL
characters, as in UAX#9's own examples):

columna COLUMNB 1234 5678 columnc
columna COLUMNB 1234 5678 COLUMNC

Implicit bidi requires interpreting a flow of plain text as
sentence/paragraph content which is simply not a reasonable
assumption. Consider also what would happen if your text file is two
preformatted 32-character-wide paragraph columns side-by-side. Now
imagine the kind of havok that could result if this sort of insanity
took place in the presentation of configuration files with critical
security settings, for instance where the strings are usernames (which
MUST be able to contain any letter character from any language) and
the numbers are permission levels. And certainly you can't just throw
explicit direction markers into a config file like that because they'd
alter the semantics (which should be purely byte-oriented; there's no
reason any program not displaying text should include code to process
the contents).
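
As a sketch of why the markers would alter the semantics (the bytes
are just standard UTF-8 for U+202B; the username is invented):

    #include <stdio.h>
    #include <string.h>

    int main(void)
    {
        /* A username as stored in the config file (uppercase = RTL
           letters), and the "same" name with a RIGHT-TO-LEFT
           EMBEDDING mark (U+202B, UTF-8 E2 80 AB) prepended to fix
           its display. */
        const char *stored = "ABC";
        const char *marked = "\xE2\x80\xAB" "ABC";

        /* A byte-oriented consumer (login, chown, ...) now sees two
           different strings: the "display fix" has changed the data. */
        printf("%d\n", strcmp(stored, marked) == 0);   /* prints 0 */
        return 0;
    }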

One of the unacceptable things that the Unicode consortium has done
(as opposed to ISO 10646 which, after their initial debacle, has been
quite reasonable and conservative in what they specify) is to presume
they can redefine what a text file is. This has included BOMs,
paragraph break character, implicit(?) deprecation of newline
character as a line/paragraph break, etc. Notice that all of these
redefinitions have been universally rejected by *NIX users because
they are incompatible with the *NIX notion of a text file. My view is
that implicit bidi is equally incompatible with text files and should
be rejected for the same reasons.

This does not mean that storing text in 'visual order' is acceptable
either; that's just disgusting and makes correct ligatures/shaping
impossible. It just means that you cannot create a bidirectional
presentation from a text file without higher level markup. Instead you
can use a vertical presentation or either LTR or RTL presentation with
the opposite-directionality glyphs rotated 180°.

My observations were that this sort of presentation is much easier to
edit and quite possibly easier to read than a format where your eyes
have to switch scanning directions.

I'm not unwilling to support implicit bidi if somebody else wants to
code it, but the output WILL BE WRONG in many cases and thus will be
off by default. The data needed to do it correctly is simply not
there.
Post by George W Gerrity
Post by Rich Felker
[...]
[1] There is a small problem that even without LTR scripts mixed in,
most RTL scripts are "bidirectional" due to numbers being written LTR.
However supporting reversed display of individual numbers (or even
individual words) is a trivial problem compared to full bidi text flow
and can be done without compromising reversibility and without complex
algorithms that cause misinterpretation of adjacent text.
No one using Arabic script would accept reading it top to bottom: it
is simply never done (to the best of my knowledge), and so any
terminal emulator claiming to work with any script had better be able
to render the text correctly, including mixing RTL and LTR.
You misread the above. Of course no one using LTR scripts would want
to read top-to-bottom either. The intent is that users of RTL scripts
could use an _entirely_ RTL terminal with the LTR characters' glyphs
rotated 180° while LTR users could use an _entirely_ LTR terminal with
RTL glyphs rotated 180°. The exception noted in the footnote is that
RTL scripts actually require "bidi" for numbers, but I comment that
this is trivial compared to bidi and suffers from none of the
fundamental problems of bidi.

The vertical orientation thing is mostly of interest to Mongolian
users and perhaps some East Asian users, but it could also be
interesting to (a very few) users of both LTR and RTL scripts who use
both frequently and who want a more equal treatment of both,
especially if they find reading upside-down difficult.

Rich


P.S. Do you have any good screenshots with RTL or LTR embedded text?
If so I can prepare some modified images to show what I mean and you
can see what you think of readability.
Mark Leisher
2006-09-01 15:36:44 UTC
Post by Rich Felker
If that were the problem it would be trivial. The problems are much
more fundamental. The key examples you should look at are things like:
printf("%s %d %d %s\n", string1, number2, number3, string4); where the
output is intended to be columnar. Everything is fine until someone
puts in data where string1 ends in RTL text and string4 begins with
RTL text, in which case the numbers switch places. This kind of
instability is not just awkward; it shows that implicit bidi is
fundamentally broken.
I can say with certainty born of 10+ years of trying to implement an
implicit bidi reordering routine that "just does the right thing," there
are ambiguities that simply can't be avoided. Like your example.

Are one or both numbers associated with the RTL text or the LTR text?
Simple question, multiple answers. Some answers are simple, some are not.

The Unicode bidi reordering algorithm is not fundamentally broken, it
simply provides a result that is correct in many, but not all cases. If
you can defy 30 years of experience in implicit bidi reordering
implementations and come up with one that does the correct thing all the
time, you could be a very rich man.
Post by Rich Felker
Implicit bidi requires interpreting a flow of plain text as
sentence/paragraph content which is simply not a reasonable
assumption. Consider also what would happen if your text file is two
preformatted 32-character-wide paragraph columns side-by-side. Now
imagine the kind of havoc that could result if this sort of insanity
took place in the presentation of configuration files with critical
security settings, for instance where the strings are usernames (which
MUST be able to contain any letter character from any language) and
the numbers are permission levels. And certainly you can't just throw
explicit direction markers into a config file like that because they'd
alter the semantics (which should be purely byte-oriented; there's no
reason any program not displaying text should include code to process
the contents).
So you have a choice: adapt your config file reader to ignore a few
characters, or come up with an algorithm that displays plain text
correctly all the time.
Post by Rich Felker
One of the unacceptable things that the Unicode consortium has done
(as opposed to ISO 10646 which, after their initial debacle, has been
quite reasonable and conservative in what they specify) is to presume
they can redefine what a text file is. This has included BOMs,
paragraph break character, implicit(?) deprecation of newline
character as a line/paragraph break, etc. Notice that all of these
redefinitions have been universally rejected by *NIX users because
they are incompatible with the *NIX notion of a text file. My view is
that implicit bidi is equally incompatible with text files and should
be rejected for the same reasons.
You left out the part where Unicode says that none of these things is
strictly required. The *NIX community didn't reject anything. They
didn't need to. You also seem unaware of how much effort was made by
ISO, the Unicode Consortium, and all the national standards bodies to
avoid breaking a lot of existing practice.

I highly recommend participating in any standards development process
managed by any national or international standards body. You will find
an obsession with avoidance of breaking existing practice.
Post by Rich Felker
I'm not unwilling to support implicit bidi if somebody else wants to
code it, but the output WILL BE WRONG in many cases and thus will be
off by default. The data needed to do it correctly is simply not
there.
Why is it someone else's responsibility to code it? You are the one that
finds decades of experience unacceptable. Stop whining and fix it.
That's what I did. I'm still working on it 13 years later, but I'm not
complaining any more.

Human languages and the scripts used to represent them are messy. There
are no neat solutions. Get used to it.

Good day and good luck.
--
------------------------------------------------------------------------
Mark Leisher
Computing Research Lab          We find comfort among those who
New Mexico State University     agree with us, growth among those
Box 30001, MSC 3CRL             who don't.
Las Cruces, NM 88003                -- Frank A. Clark
Rich Felker
2006-09-01 18:08:03 UTC
Post by Mark Leisher
Post by Rich Felker
If that were the problem it would be trivial. The problems are much
more fundamental. The key examples you should look at are things like:
printf("%s %d %d %s\n", string1, number2, number3, string4); where the
output is intended to be columnar. Everything is fine until someone
puts in data where string1 ends in RTL text and string4 begins with
RTL text, in which case the numbers switch places. This kind of
instability is not just awkward; it shows that implicit bidi is
fundamentally broken.
I can say with certainty born of 10+ years of trying to implement an
implicit bidi reordering routine that "just does the right thing," there
are ambiguities that simply can't be avoided. Like your example.
Are one or both numbers associated with the RTL text or the LTR text?
Simple question, multiple answers. Some answers are simple, some are not.
Exactly. The Unicode bidi algorithm assumes that anyone putting bidi
characters in a text stream will give them special consideration and
manually resolve these issues with explicit embedding. That is, it
comes from the word processor mentality of the designers of Unicode.
They never stop to think that maybe an automated process that doesn't
know about character semantics could be writing strings, or that
syntax in a particular text file (like passwd, csv files, tsv files,
etc.) could preclude such treatment.
Post by Mark Leisher
The Unicode bidi reordering algorithm is not fundamentally broken, it
simply provides a result that is correct in many, but not all cases. If
you can defy 30 years of experience in implicit bidi reordering
implementations and come up with one that does the correct thing all the
time, you could be a very rich man.
Why is implicit so important? A bidi algorithm with minimal/no
implicit behavior works fine as long as you are not mixing
languages/scripts, and when mixing scripts it makes sense to use
explicit embedding -- especially since the cases of mixed scripts that
MUST work without formatting controls are files that are meant to be
machine-interpreted as opposed to pretty-printed for human
consumption.

In particular, an algorithm that only applies reordering within single
'words' would give the desired effects for writing numbers in an RTL
context and for writing single LTR words in a RTL context or single
RTL words in a LTR context. Anything more than that (with unlimited
long range reordering behavior) would then require explicit embedding.
Post by Mark Leisher
So you have a choice: adapt your config file reader to ignore a few
characters, or come up with an algorithm that displays plain text
correctly all the time.
What should happen when editing source code? Should x = FOO(BAR);
have the argument on the left while x = FOO(bar); has it on the right?
Should source code require all RTL identifiers to be wrapped in
embedding codes? (They're illegal in ISO C and any language taking
identifier rules from ISO/IEC TR 10176, yet Hebrew and Arabic
characters are legal like all other characters used to write non-dead
languages.)
Post by Mark Leisher
Post by Rich Felker
One of the unacceptable things that the Unicode consortium has done
(as opposed to ISO 10646 which, after their initial debacle, has been
quite reasonable and conservative in what they specify) is to presume
they can redefine what a text file is. This has included BOMs,
paragraph break character, implicit(?) deprecation of newline
character as a line/paragraph break, etc. Notice that all of these
redefinitions have been universally rejected by *NIX users because
they are incompatible with the *NIX notion of a text file. My view is
that implicit bidi is equally incompatible with text files and should
be rejected for the same reasons.
You left out the part where Unicode says that none of these things is
strictly required.
This is blatantly false. UAX#9 talks about PARAGRAPHS. Text files
consist of LINES. If you don't believe me read ISO/IEC 9899 or IEEE
1003.1-2001. The algorithm in UAX#9 simply cannot be applied to text
files as it is written because text files do not define paragraphs.
Post by Mark Leisher
The *NIX community didn't reject anything. They
didn't need to. You also seem unaware of how much effort was made by
ISO, the Unicode Consortium, and all the national standards bodies to
avoid breaking a lot of existing practice.
I'm aware that unlike many other standardization processes, the
Unicode Consortium was very inconsistent in its application of this
rule. Many people consider Han unification to break existing practice.
UCS-2 which they initially tried to push onto people, as well as
UTF-1, heavily broke existing practice as well. The semantics of SHY
break existing standards from ISO-8859. Replacing UCS-2 with UTF-16
broke existing practice on Windows by causing MS's implementation of
wchar_t to violate C/POSIX by not representing a complete character
anymore. Etc. etc. etc.

On the other hand they had no problem with filling the beginning of
the BMP with useless legacy characters for the sake of compatibility,
thus forcing South[east] Asian scripts which use many characters in
each word into the 3-byte range of UTF-8...
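
The boundary arithmetic, for reference (this is just standard UTF-8,
nothing assumed):

    #include <stdio.h>

    /* Encoded length in UTF-8 of a Unicode scalar value. */
    static int utf8_len(unsigned c)
    {
        return c < 0x80 ? 1 : c < 0x800 ? 2 : c < 0x10000 ? 3 : 4;
    }

    int main(void)
    {
        /* The 2-byte range ends at U+07FF. Hebrew alef (U+05D0)
           fits; Devanagari starts at U+0900, already past it, so
           every letter of every Devanagari word costs 3 bytes. */
        printf("%d %d\n", utf8_len(0x05D0), utf8_len(0x0900)); /* 2 3 */
        return 0;
    }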

Unicode is far from ideal, but it's what we're stuck with, I agree.
However UAX#9 is inconsistent with the definition of a text file and
with good programming practice and thus alternate ways to present RTL
text acceptably (such as an entirely RTL display for RTL users) are
needed. I've read rants from some of the Arabeyes folks that they're
so disappointed with UAX#9 that they'd rather go the awful route of
storing text backwards!!
Post by Mark Leisher
Post by Rich Felker
I'm not unwilling to support implicit bidi if somebody else wants to
code it, but the output WILL BE WRONG in many cases and thus will be
off by default. The data needed to do it correctly is simply not
there.
Why is it someone else's responsibility to code it? You are the one that
finds decades of experience unacceptable. Stop whining and fix it.
That's what I did. I'm still working on it 13 years later, but I'm not
complaining any more.
You did not fix it because it cannot be fixed any more than you can
tell me whether 1,200 means the number 1200 (printed in an ugly legacy
form) or a CSV list of 1 and 200. Nor can I fix it. I'm well aware
that any implicit bidi at the terminal level WILL display blatantly
wrong and misleading information in numerous real world cases, and
that text will jump around the terminal in an unpredictable and
illogical fashion under cursor control and deletion, replacement, or
insertion over existing text. As such, I deem such a feature a waste
of time to implement. It will mess up more stuff than it 'fixes'.

The alternatives are either to display characters in the wrong order
(siht ekil) or to unify the flow of text to one direction without
altering the visual representation (which rotation accomplishes).
Naturally fullscreen programs can draw bidi text in its natural
directionality either by swapping character order (but some special
treatment of combining marks and Arabic shaping must be done by the
application first in order for this to work, and it will render
copy-and-paste from the terminal mostly useless) or by using ECMA-48
bidi controls (but don't expect anyone to use these until curses is
adapted for it and until screen supports it).
Post by Mark Leisher
Human languages and the scripts used to represent them are messy.
There are no neat solutions. Get used to it.
Languages are messy. Scripts are not, except for a very few bidi
scripts. Even Indic scripts are relatively easy. UAX#9 requires
imposing language semantics onto characters which is blatantly wrong
and which is the source of the mess. This insistence on making simple
things into messes is why no real developers want to support i18n and
why only the crap like GNOME and KDE and OOO support it decently. I
believe that both 'sides' are wrong and that universal i18n on unix is
possible but only after you accept that unix lives by the New Jersey
approach and not the MIT approach.

Rich
Mark Leisher
2006-09-01 21:46:44 UTC
Post by Rich Felker
Post by Mark Leisher
I can say with certainty born of 10+ years of trying to implement an
implicit bidi reordering routine that "just does the right thing," there
are ambiguities that simply can't be avoided. Like your example.
Are one or both numbers associated with the RTL text or the LTR text?
Simple question, multiple answers. Some answers are simple, some are not.
Exactly. The Unicode bidi algorithm assumes that anyone putting bidi
characters in a text stream will give them special consideration and
manually resolve these issues with explicit embedding. That is, it
comes from the word processor mentality of the designers of Unicode.
They never stop to think that maybe an automated process that doesn't
know about character semantics could be writing strings, or that
syntax in a particular text file (like passwd, csv files, tsv files,
etc.) could preclude such treatment.
Did it ever occur to you that it wasn't the "word processing mentality"
of the Unicode designers that led to ambiguities surviving in plain
text? It is simply the fact that there is no nice neat solution. Unicode
went farther than just about anyone else in solving the general case of
reordering plain bidi text for display without explicit directional codes.
Post by Rich Felker
Post by Mark Leisher
The Unicode bidi reordering algorithm is not fundamentally broken, it
simply provides a result that is correct in many, but not all cases. If
you can defy 30 years of experience in implicit bidi reordering
implementations and come up with one that does the correct thing all the
time, you could be a very rich man.
Why is implicit so important?
Why does plain text still exist?
Post by Rich Felker
A bidi algorithm with minimal/no
implicit behavior works fine as long as you are not mixing
languages/scripts, and when mixing scripts it makes sense to use
explicit embedding -- especially since the cases of mixed scripts that
MUST work without formatting controls are files that are meant to be
machine-interpreted as opposed to pretty-printed for human
consumption.
I'm not quite sure what point you are trying to make here. Do away with
plain text?
Post by Rich Felker
In particular, an algorithm that only applies reordering within single
'words' would give the desired effects for writing numbers in an RTL
context and for writing single LTR words in a RTL context or single
RTL words in a LTR context. Anything more than that (with unlimited
long range reordering behavior) would then require explicit embedding.
You are aware that numeric expressions can be written differently in
Hebrew and Arabic, yes? Sometimes the reordering of numeric expressions
differs (e.g. 1/2 in Latin and Hebrew would be presented as 2/1 in
Arabic). This also affects other characters often used with numbers such
as percent and dollar sign. So even within strictly RTL scripts,
different reordering is required depending on which script is being
used. But if you know a priori which script is in use, reordering is
trivial.
Post by Rich Felker
Post by Mark Leisher
So you have a choice: adapt your config file reader to ignore a few
characters, or come up with an algorithm that displays plain text
correctly all the time.
What should happen when editing source code? Should x = FOO(BAR);
have the argument on the left while x = FOO(bar); has it on the right?
Should source code require all RTL identifiers to be wrapped in
embedding codes? (They're illegal in ISO C and any language taking
identifier rules from ISO/IEC TR 10176, yet Hebrew and Arabic
characters are legal like all other characters used to write non-dead
languages.)
This is the choice of each programming language designer: either allow
directional override codes in the source or ban them. Those that ban
them obviously assume that knowledge of the language's syntax is
sufficient to allow an editor to present the source code text reasonably
well.
Post by Rich Felker
Post by Mark Leisher
You left out the part where Unicode says that none of these things is
strictly required.
This is blatantly false. UAX#9 talks about PARAGRAPHS. Text files
consist of LINES. If you don't believe me read ISO/IEC 9899 or IEEE
1003.1-2001. The algorithm in UAX#9 simply cannot be applied to text
files as it is written because text files do not define paragraphs.
How is a line ending with newline in a text file not a paragraph? A
poorly formatted paragraph, to be sure, but a paragraph nonetheless. The
Unicode Standard says paragraph separators are required for the
reordering algorithm. There is no reason why a line can't be viewed as a
paragraph. And it even works reasonably well most of the time.

BTW, what part of ISO/IEC 9899 are you referring to? All I see is
§7.19.2.7 which says something about lines being limited to 254
characters and a terminating newline character. No definitions of lines
or paragraphs that I see off hand.
Post by Rich Felker
I'm aware that unlike many other standardization processes, the
Unicode Consortium was very inconsistent in its application of this
rule. Many people consider Han unification to break existing practice.
UCS-2 which they initially tried to push onto people, as well as
UTF-1, heavily broke existing practice as well. The semantics of SHY
break existing standards from ISO-8859. Replacing UCS-2 with UTF-16
broke existing practice on Windows by causing MS's implementation of
wchar_t to violate C/POSIX by not representing a complete character
anymore. Etc. etc. etc.
Han unification did indeed break existing practice, but I think you will
find that the IRG (group of representatives from all Han-using
countries) feels that in the long run, it was the best thing to do.

UCS-2 didn't so much break existing practice as come along at one of the
most confusing periods of internationalization retrofitting of the C
libraries and language. The wchar_t type was in the works before UCS-2
came along. And in most implementations it could hold a UCS-2 character.
I don't recall UTF-1 being around long enough to have much of an impact.
Consider how quickly it was discarded in favor of UTF-8. And I certainly
don't recall UTF-1 being forced on anyone.
Post by Rich Felker
On the other hand they had no problem with filling the beginning of
the BMP with useless legacy characters for the sake of compatibility,
thus forcing South[east] Asian scripts which use many characters in
each word into the 3-byte range of UTF-8...
Those "useless" legacy characters avoided breaking many existing
applications, most of which were not written for Southeast Asia. Some
scripts had to end up in the 3-byte range of UTF-8. Are you in a
position to determine who should and should not be in that range? Have
you even considered why they ended up in that range?
Post by Rich Felker
Unicode is far from ideal, but it's what we're stuck with, I agree.
However UAX#9 is inconsistent with the definition of a text file and
with good programming practice and thus alternate ways to present RTL
text acceptably (such as an entirely RTL display for RTL users) are
needed. I've read rants from some of the Arabeyes folks that they're
so disappointed with UAX#9 that they'd rather go the awful route of
storing text backwards!!
So are you implying that good programming practice requires lines to
be ended with a newline and paragraphs to be separated by two newlines?
What about the 25-year convention of CRLF on DOS/Win? What about the
20-year practice of using CR on Mac? Should we denounce them as heretics to
be excommunicated and unilaterally dictate to all that newline is the
only answer, just like you seem to think the Unicode Consortium did?

Like others who didn't like the Unicode bidi reordering approach, the
Arabeyes people were welcome to continue doing things the way they
wanted. Interoperability problems often either kill these companies or
force them to go Unicode at some level.
Post by Rich Felker
Post by Mark Leisher
Why is it someone else's responsibility to code it? You are the one that
finds decades of experience unacceptable. Stop whining and fix it.
That's what I did. I'm still working on it 13 years later, but I'm not
complaining any more.
You did not fix it because it cannot be fixed any more than you can
tell me whether 1,200 means the number 1200 (printed in an ugly legacy
form) or a CSV list of 1 and 200. Nor can I fix it. I'm well aware
that any implicit bidi at the terminal level WILL display blatantly
wrong and misleading information in numerous real world cases, and
that text will jump around the terminal in an unpredictable and
illogical fashion under cursor control and deletion, replacement, or
insertion over existing text. As such, I deem such a feature a waste
of time to implement. It will mess up more stuff than it 'fixes'.
So you do understand. If it isn't fixable, what point is there in
complaining about it? Find a better way.
Post by Rich Felker
The alternatives are either to display characters in the wrong order
(siht ekil) or to unify the flow of text to one direction without
altering the visual representation (which rotation accomplishes).
Naturally fullscreen programs can draw bidi text in its natural
directionality either by swapping character order (but some special
treatment of combining marks and Arabic shaping must be done by the
application first in order for this to work, and it will render
copy-and-paste from the terminal mostly useless) or by using ECMA-48
bidi controls (but don't expect anyone to use these until curses is
adapted for it and until screen supports it).
Hmm. Sounds just like a bidi reordering algorithm I heard about. You
know. The one the Unicode Consortium is touting.

I have a lot of experience with ECMA-48 (ISO/IEC 6429) and ISO/IEC 2022.
All I will say about them is Unicode is a lot easier to deal with. Have
a look at the old kterm code if you want to see how complicated things
can get. And that was one of the cleaner implementations I've seen over
the years.

Again, I would encourage you to try it yourself. Talk is cheap,
experience teaches.
Post by Rich Felker
Languages are messy. Scripts are not, except for a very few bidi
scripts. Even Indic scripts are relatively easy.
Hah! I often hear the same sentiment from people who don't know the
difference between a glyph and a character. Yes, it is true that Indic,
and even Khmer and Burmese scripts are relatively easy. All you need to
do is create the right set of glyphs.

This frequently gives you multiple glyph codes for each abstract
character. To do anything with the text, a mapping between glyph and
abstract character is necessary for every program that uses that text.

Again, I would encourage you to try it yourself. Talk is cheap,
experience teaches.
Post by Rich Felker
UAX#9 requires
imposing language semantics onto characters which is blatantly wrong
and which is the source of the mess.
If 30 years of experience has led to blatantly wrong semantics, then
quit whining about it and fix it! The Unicode Consortium isn't deaf,
dumb, or stupid. They have been known to honor actual evidence of
incorrect behavior and change things when necessary. But they aren't
going to change things just because you find it inconveniently complicated.
Post by Rich Felker
This insistence on making simple
things into messes is why no real developers want to support i18n and
why only the crap like GNOME and KDE and OOO support it decently. I
believe that both 'sides' are wrong and that universal i18n on unix is
possible but only after you accept that unix lives by the New Jersey
approach and not the MIT approach.
I have been complaining about the general trend to over-complicate and
over-standardize software for years. These days the "art" of programming
only exists in the output of a rare handful of programmers. Don't worry
about it. Software will collapse under its own weight in time. You just
have to be patient and wait until that happens and be ready with all
your simpler solutions.

<sarcasm>
But you better hurry up with those simpler solutions, the increasing
creep of unnecessary complexity into software is happening fast. The
crash is coming! It will probably arrive with /The Singularity/.
</sarcasm>
--
------------------------------------------------------------------------
Mark Leisher
Computing Research Lab          We find comfort among those who
New Mexico State University     agree with us, growth among those
Box 30001, MSC 3CRL             who don't.
Las Cruces, NM 88003                -- Frank A. Clark
Rich Felker
2006-09-02 00:01:58 UTC
Post by Mark Leisher
Did it ever occur to you that it wasn't the "word processing mentality"
of the Unicode designers that led to ambiguities surviving in plain
text? It is simply the fact that there is no nice neat solution. Unicode
went farther than just about anyone else in solving the general case of
reordering plain bidi text for display without explicit directional codes.
It went farther because it imposed language-specific semantics in
places where they do not belong. These semantics are correct with
sentences written in human languages which would not have been hard to
explicitly mark up, especially with a word processor doing it for you.
On the other hand they're horribly wrong in computer languages
(meaning any text file meant to be computer-read and -interpreted, not
just programming languages) where explicit markup is highly
undesirable or even illegal.
Post by Mark Leisher
Why does plain text still exist?
Read Eric Raymond's "The Art of Unix Programming". He answers the
question quite well.

Or I could just ask: should we write C code in MS Word .doc format?
Post by Mark Leisher
Post by Rich Felker
A bidi algorithm with minimal/no
implicit behavior works fine as long as you are not mixing
languages/scripts, and when mixing scripts it makes sense to use
explicit embedding -- especially since the cases of mixed scripts that
MUST work without formatting controls are files that are meant to be
machine-interpreted as opposed to pretty-printed for human
consumption.
I'm not quite sure what point you are trying to make here. Do away with
plain text?
No, rather that handling of bidi scripts in plain text should be
biased towards computer languages rather than human languages. This is
both because plain text files are declining in use for human language
texts and increasing in use for computer language texts, and because
the display issues in human language texts can be solved with explicit
embedding markers (which an editor or word processor could even
auto-insert for you) while the same marks are unwelcome in computer
languages.
Post by Mark Leisher
Post by Rich Felker
In particular, an algorithm that only applies reordering within single
'words' would give the desired effects for writing numbers in an RTL
context and for writing single LTR words in a RTL context or single
RTL words in a LTR context. Anything more than that (with unlimited
long range reordering behavior) would then require explicit embedding.
You are aware that numeric expressions can be written differently in
Hebrew and Arabic, yes? Sometimes the reordering of numeric expressions
differs (e.g. 1/2 in Latin and Hebrew would be presented as 2/1 in
Arabic). This also affects other characters often used with numbers such
as percent and dollar sign. So even within strictly RTL scripts,
different reordering is required depending on which script is being
used. But if you know a priori which script is in use, reordering is
trivial.
This is part of the "considered harmful" of bidi. :)
I'm not familiar with all this stuff, but as a mathematician I'm
curious how mathematicians working in these languages write. BTW
mathematical notation is an interesting example where traditional
storage order is visual and not logical.
Post by Mark Leisher
This is the choice of each programming language designer: either allow
directional override codes in the source or ban them. Those that ban
them obviously assume that knowledge of the language's syntax is
sufficient to allow an editor to present the source code text reasonably
well.
It's simply not acceptable to need an editor that's aware of language
syntax in order to present the code for viewing and editing. You could
work around the problem by inserting dummy comments to prevent the
bidi algo from taking effect but that's really ugly and essentially
makes RTL scripts unusable in programming if the editor applies
Unicode bidi algo to the display.
Post by Mark Leisher
Post by Rich Felker
Post by Mark Leisher
You left out the part where Unicode says that none of these things is
strictly required.
This is blatantly false. UAX#9 talks about PARAGRAPHS. Text files
consist of LINES. If you don't believe me read ISO/IEC 9899 or IEEE
1003.1-2001. The algorithm in UAX#9 simply cannot be applied to text
files as it is written because text files do not define paragraphs.
How is a line ending with newline in a text file not a paragraph? A
poorly formatted paragraph, to be sure, but a paragraph nonetheless. The
Unicode Standard says paragraph separators are required for the
reordering algorithm. There is no reason why a line can't be viewed as a
paragraph. And it even works reasonably well most of the time.
Actually it does not work with embedding. If a (semantic) paragraph
has been split into multiple lines, the bidi embedding levels will be
broken and cannot be processed by the UAX#9 algorithm without trying
to reconstruct an idea of what the whole "paragraph" meant. Also, a
problem occurs if the first character of a new line happens to be an
LTR character (some embedded English text?) in a (semantic) paragraph
that's Arabic or Hebrew.

As you acknowledge below, a line is not necessarily an
unlimited-length object and in email it should not be longer than 80
characters (or preferably 72 or so to allow for quoting). So you can't
necessarily just take the MS Notepad approach of omitting newlines and
treating lines as paragraphs, although this may be appropriate in some
uses of text files.
Post by Mark Leisher
BTW, what part of ISO/IEC 9899 are you referring to? All I see is
§7.19.2.7 which says something about lines being limited to 254
characters and a terminating newline character. No definitions of lines
or paragraphs that I see off hand.
I'm talking about the definition of a text file as a sequence of
lines, which might (on stupid legacy implementations) even be
fixed-width fields. It's under the stdio stuff about the difference
between text and binary mode. I could look it up but I don't feel like
digging thru the pdf file right now..
Post by Mark Leisher
Han unification did indeed break existing practice, but I think you will
find that the IRG (group of representatives from all Han-using
countries) feels that in the long run, it was the best thing to do.
I agree it was best to do too. I just pointed it out as being contrary
to your claim that they made every effort not to break existing
practice.
Post by Mark Leisher
UCS-2 didn't so much break existing practice as come along at one of the
most confusing periods of internationalization retrofitting of the C
libraries and language.
I suppose this is true and I don't know the history and internal
politics well enough to know who was responsible for what. However,
unlike doublebyte/multibyte charsets which were becoming prevalent at
the time, UCS-2 data does not form valid C strings. A quick glance at
some historical remarks from unicode.org and Rob Pike suggests that
UTF-8 was invented well before any serious deployment of Unicode, i.e.
that the push for UCS-2 was deliberately aimed at breaking things,
though I suspect it was Apple, Sun, and Microsoft pushing UCS-2 more
than the consortium as a whole.
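
To spell out the "not valid C strings" point (a minimal sketch; the
bytes are just UCS-2BE):

    #include <stdio.h>
    #include <string.h>

    int main(void)
    {
        /* "AB" in UCS-2 big-endian is 00 41 00 42: every ASCII
           character drags a 0x00 byte along with it. */
        const char ucs2[] = { 0x00, 0x41, 0x00, 0x42, 0x00, 0x00 };

        /* Every str* function stops at the first 0x00 byte, i.e.
           immediately, so the text is indistinguishable from an
           empty string. */
        printf("%zu\n", strlen(ucs2));   /* prints 0 */
        return 0;
    }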
Post by Mark Leisher
Those "useless" legacy characters avoided breaking many existing
applications, most of which were not written for Southeast Asia. Some
scripts had to end up in the 3-byte range of UTF-8. Are you in a
position to determine who should and should not be in that range? Have
IMO the answer is common sense. Languages that have a low information
per character density (lots of letters/marks per word, especially
Indic) should be in 2-byte range and those with high information
density (especially ideographic) should be in 3-byte range. If it
weren't for so many legacy Latin blocks near the beginning of the
character set, most or all scripts for low-density languages could
have fit in the 2-byte range.

Of course it's pointless to discuss this since we can't change it now
anyway.
Post by Mark Leisher
you even considered why they ended up in that range?
Probably at the time of allocation, UTF-8 was not even around yet. I
haven't studied that portion of Unicode history. Still the legacy
characters could have been put at the end with CJK compat forms and
preshaped Arabic forms, etc. or even outside the BMP.
Post by Mark Leisher
Like others who didn't like the Unicode bidi reordering approach, the
Arabeyes people were welcome to continue doing things the way they
wanted. Interoperability problems often either kill these companies or
force them to go Unicode at some level.
Thankfully there's not too much room for interoperability problems
with the data itself as long as you stick to logical order, especially
since the need for more than a single embedding level is rare. Unless
you're arguing for visual order, the question is entirely a display
matter, whether bidi display is compatible with other requirements.
Post by Mark Leisher
So you do understand. If it isn't fixable, what point is there in
complaining about it? Find a better way.
That's what I'm trying to do... Maybe some Hebrew or Arabic users who
dislike the whole bidi mess (the one Israeli user I'm in contact with
hates bidi and thinks it's backwards...not a good sample size but
interesting nonetheless) will agree and try my ideas for a
unidirectional presentation and like them. Or maybe they'll think it's
ugly and look for other solutions.
Post by Mark Leisher
Post by Rich Felker
Naturally fullscreen programs can draw bidi text in its natural
directionality either by swapping character order (but some special
treatment of combining marks and Arabic shaping must be done by the
application first in order for this to work, and it will render
copy-and-paste from the terminal mostly useless) or by using ECMA-48
bidi controls (but don't expect anyone to use these until curses is
adapted for it and until screen supports it).
Hmm. Sounds just like a bidi reordering algorithm I heard about. You
know. The one the Unicode Consortium is touting.
Applications can draw their own bidi text with higher level formatting
information, of course. I'm thinking of a terminal-mode browser that
has the bidi text in HTML with <dir> tags and whatnot, or apps with a
text 'gui' consisting of separated interface elements.
Post by Mark Leisher
I have a lot of experience
Could you tell me some of what you've worked on and what conclusions
you reached? I'm not familiar with your work.
Post by Mark Leisher
with ECMA-48 (ISO/IEC 6429) and ISO/IEC 2022.
ISO 2022 is an abomination, certainly not an acceptable way to store
text due to its stateful nature, and although it works for
_displaying_ text, it's ugly even for that.

I've read ECMA-48 bidi stuff several times and still can't make any
sense of it, so I agree it's disgusting too. It does seem powerful but
powerful is often a bad thing. :)
Post by Mark Leisher
All I will say about them is Unicode is a lot easier to deal with. Have
Easier to deal with because it solves an easier problem. UAX#9 tells
you what to do when you have explicit paragraph division and unbounded
search capability forwards and backwards. Neither of these exists in a
character cell device environment, and (depending on your view of what
constitutes a proper text file) possibly not in a text file either. My
view of a text file (maybe not very popular these days?) is that it's
a more-restricted version of a character cell terminal (no cursor
positioning allowed) but with unlimited height.
Post by Mark Leisher
a look at the old kterm code if you want to see how complicated things
can get. And that was one of the cleaner implementations I've seen over
the years.
Does it implement ECMA-48 version of bidi? Or random unspecified bidi
like mlterm? Or..?
Post by Mark Leisher
Post by Rich Felker
Languages are messy. Scripts are not, except for a very few bidi
scripts. Even Indic scripts are relatively easy.
Hah! I often hear the same sentiment from people who don't know the
difference between a glyph and a character.
I think we've established that I know the difference..
Post by Mark Leisher
Yes, it is true that Indic,
and even Khmer and Burmese scripts are relatively easy. All you need to
do is create the right set of glyphs.
Exactly. That's a lot of work...for the font designer. Almost no work
for the application author or for the machine at runtime.
Post by Mark Leisher
This frequently gives you multiple glyph codes for each abstract
character. To do anything with the text, a mapping between glyph and
abstract character is necessary for every program that uses that text.
No, it's necessary only for the terminal. The programs using the text
need not have any idea what language/script it comes from. This is the
whole beauty of using such apps.

The same applies to gui apps too if they're using a nice widget kit.
Unfortunately all the existing widget kits are horribly bloated and
very painful to work with for someone not coming from a MS Windows
mentality (i.e. if you want to actually have control over the flow of
execution of your program..).
Post by Mark Leisher
Again, I would encourage you to try it yourself. Talk is cheap,
experience teaches.
That's what I'm working on, but sometimes discussing the issues at the
same time helps.
Post by Mark Leisher
Post by Rich Felker
UAX#9 requires
imposing language semantics onto characters, which is blatantly wrong
and which is the source of the mess.
If 30 years of experience has led to blatantly wrong semantics, then
quit whining about it and fix it! The Unicode Consortium isn't deaf,
dumb, or stupid. They have been known to honor actual evidence of
incorrect behavior and change things when necessary. But they aren't
going to change things just because you find it inconveniently complicated.
They generally don't change things in incompatible ways, certainly not
in ways that would require retrofitting existing data with proper
embedding codes. What they might consider doing though is adding a
support level 1.5 or such. Right now UAX#9 (implicitly?) says that an
application not implementing at least the implicit bidi algorithm must not
interpret RTL characters visually at all.
Post by Mark Leisher
Post by Rich Felker
This insistence on making simple
things into messes is why no real developers want to support i18n and
why only the crap like GNOME and KDE and OOO support it decently. I
believe that both 'sides' are wrong and that universal i18n on unix is
possible but only after you accept that unix lives by the New Jersey
approach and not the MIT approach.
I have been complaining about the general trend to over-complicate and
over-standardize software for years. These days the "art" of programming
only exists in the output of a rare handful of programmers. Don't worry
about it. Software will collapse under its own weight in time. You just
have to be patient and wait until that happens and be ready with all
your simpler solutions.
Well in many cases my "simple solutions" are too simple for people
who've gotten used to bloated featuresets and gotten used to putting
up with slowness, bugs, and insecurity. But we'll see. My whole family
of i18n-related projects started out with a desire to switch to UTF-8
everywhere and to have Latin, Tibetan, and Japanese support at the
console level without increased bloat, performance penalties, and huge
dependency trees. From there I first wrote a super-small UTF-8-only C
library and then turned towards the terminal emulator issue, which in
turn led to the font format issue, etc. etc. :) Maybe after a whole
year passes I'll have roughly what I wanted.
Post by Mark Leisher
<sarcasm>
But you better hurry up with those simpler solutions, the increasing
creep of unnecessary complexity into software is happening fast. The
crash is coming! It will probably arrive with /The Singularity/.
</sarcasm>
Keep an eye on busybox. It's quickly gaining in features while
shrinking in size, and while currently the i18n support is rather poor
the developers are open to adding good support as long as it's an
option at compiletime. Along with my project I've been documenting the
quality, portability, i18n/m17n support, bloat, etc. of lots of other
software too and I'll eventually be making the results available
publicly.

Rich
Post by Mark Leisher
------------------------------------------------------------------------
Mark Leisher
Computing Research Lab We find comfort among those who
New Mexico State University agree with us, growth among those
Box 30001, MSC 3CRL who don't.
Las Cruces, NM 88003 -- Frank A. Clark
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Somehow seems appropriate
to the topic at hand.
Mark Leisher
2006-09-05 02:19:02 UTC
Permalink
Post by Rich Felker
It went farther because it imposed language-specific semantics in
places where they do not belong. These semantics are correct with
sentences written in human languages which would not have been hard to
explicitly mark up, especially with a word processor doing it for you.
On the other hand they're horribly wrong in computer languages
(meaning any text file meant to be computer-read and -interpreted, not
just programming languages) where explicit markup is highly
undesirable or even illegal.
The Unicode Consortium is quite correctly more concerned with human
languages than programming languages. I think you are arguing yourself
into a dead end. Programming languages are ephemeral and some might
argue they are in fact slowly converging with human languages.
Post by Rich Felker
Post by Mark Leisher
Why does plain text still exist?
Read Eric Raymond's "The Art of Unix Programming". He answers the
question quite well.
You missed the point completely. Support of implicit bidirectionality
exists precisely because plain text exists. And it isn't going away any
time soon.
Post by Rich Felker
Or I could just ask: should we write C code in MS Word .doc format?
No reason to. Programming editors work well as they are and will
continue to work well after being adapted for Unicode.
Post by Rich Felker
Post by Mark Leisher
I'm not quite sure what point you are trying to make here. Do away with
plain text?
No, rather that handling of bidi scripts in plain text should be
biased towards computer languages rather than human languages. This is
both because plain text files are declining in use for human language
texts and increasing in use for computer language texts, and because
the display issues in human language texts can be solved with explicit
embedding markers (which an editor or word processor could even
auto-insert for you) while the same marks are unwelcome in computer
languages.
You don't appear to have any experience writing lexical scanners for
programming languages. If you did, you would know how utterly trivial it
is to ignore embedded bidi codes an editor might introduce.

Though I haven't checked myself, I wouldn't be surprised if Perl,
Python, PHP, and a host of other programming languages weren't already
doing this, making your concerns pointless. You would probably find it
instructive to look at some lexical scanners.
Post by Rich Felker
Post by Mark Leisher
Post by Rich Felker
In particular, an algorithm that only applies reordering within single
'words' would give the desired effects for writing numbers in an RTL
context and for writing single LTR words in a RTL context or single
RTL words in a LTR context. Anything more than that (with unlimited
long range reordering behavior) would then require explicit embedding.
You are aware that numeric expressions can be written differently in
Hebrew and Arabic, yes? Sometimes the reordering of numeric expressions
differs (e.g. 1/2 in Latin and Hebrew would be presented as 2/1 in
Arabic). This also affects other characters often used with numbers such
as percent and dollar sign. So even within strictly RTL scripts,
different reordering is required depending on which script is being
used. But if you know a priori which script is in use, reordering is
trivial.
This is part of the "considered harmful" of bidi. :)
I'm not familiar with all this stuff, but as a mathematician I'm
curious how mathematicians working in these languages write. BTW
mathematical notation is an interesting example where traditional
storage order is visual and not logical.
Considered harmful? This is standard practice in these languages and has
been for a long time. You can't seriously expect readers of RTL
languages to just throw away everything they've learned since childhood
and learn to read their mathematical expressions backwards? Or simply
require that their scripts never appear in a plain text file? That is
ignorant at best and arrogant at worst.
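
The mechanism behind this script-dependent digit handling is visible
in the character data itself: European and Arabic-Indic digits carry
different bidi classes, which the UAX#9 weak-type rules resolve
differently. A quick standard-library check (illustrative only):

  import unicodedata

  # '1' is class EN (European Number), Arabic-Indic one is AN
  # (Arabic Number), and '/' is CS (Common Separator); EN and AN
  # runs are ordered differently, hence 1/2 versus 2/1:
  for ch in '1', '\u0661', '/':
      print(repr(ch), unicodedata.bidirectional(ch))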
Post by Rich Felker
Post by Mark Leisher
This is the choice of each programming language designer: either allow
directional override codes in the source or ban them. Those that ban
them obviously assume that knowledge of the language's syntax is
sufficient to allow an editor to present the source code text reasonably
well.
It's simply not acceptable to need an editor that's aware of language
syntax in order to present the code for viewing and editing. You could
work around the problem by inserting dummy comments to prevent the
bidi algo from taking effect but that's really ugly and essentially
makes RTL scripts unusable in programming if the editor applies
Unicode bidi algo to the display.
You really need to start looking at code and stop pontificating from a
poorly understood position. Just about every programming editor out
there is already aware of programming language syntax. Many different
programming languages in most cases.
Post by Rich Felker
Post by Mark Leisher
Post by Rich Felker
Post by Mark Leisher
You left out the part where Unicode says that none of these things is
strictly required.
This is blatantly false. UAX#9 talks about PARAGRAPHS. Text files
consist of LINES. If you don't believe me read ISO/IEC 9899 or IEEE
1003.1-2001. The algorithm in UAX#9 simply cannot be applied to text
files as it is written because text files do not define paragraphs.
How is a line ending with newline in a text file not a paragraph? A
poorly formatted paragraph, to be sure, but a paragraph nonetheless. The
Unicode Standard says paragraph separators are required for the
reordering algorithm. There is no reason why a line can't be viewed as a
paragraph. And it even works reasonably well most of the time.
Actually it does not work with embedding. If a (semantic) paragraph
has been split into multiple lines, the bidi embedding levels will be
broken and cannot be processed by the UAX#9 algorithm without trying
to reconstruct an idea of what the whole "paragraph" meant. Also, a
problem occurs if the first character of a new line happens to be an
LTR character (some embedded English text?) in a (semantic) paragraph
that's Arabic or Hebrew.
This is trivially obvious. Why do you think I said "poorly formatted
paragraph"? The obvious implication is that every once in a while,
reordering errors will happen because the algorithm is being applied to
a single line of a paragraph.
Post by Rich Felker
As you acknowledge below, a line is not necessarily an
unlimited-length object and in email it should not be longer than 80
characters (or preferably 72 or so to allow for quoting). So you can't
necessarily just take the MS Notepad approach of omitting newlines and
treating lines as paragraphs, although this may be appropriate in some
uses of text files.
So instead of a substantive argument why a line can't be viewed as a
paragraph, you simply imply that it just can't be done. Weak.
Post by Rich Felker
I'm talking about the definition of a text file as a sequence of
lines, which might (on stupid legacy implementations) even be
fixed-width fields. It's under the stdio stuff about the difference
between text and binary mode. I could look it up but I don't feel like
digging thru the pdf file right now..
That section doesn't provide definitions of line or paragraph.
Post by Rich Felker
Post by Mark Leisher
Han unification did indeed break existing practice, but I think you will
find that the IRG (group of representatives from all Han-using
countries) feels that in the long run, it was the best thing to do.
I agree it was best to do too. I just pointed it out as being contrary
to your claim that they made every effort not to break existing
practice.
For a mathematician, you are quite good at ignoring inconvenient logic.
The phrase "every effort to avoid breaking existing practice" does not
logically imply that no existing practice was broken. Weak.
Post by Rich Felker
I suppose this is true and I don't know the history and internal
politics well enough to know who was responsible for what. However,
unlike doublebyte/multibyte charsets which were becoming prevalent at
the time, UCS-2 data does not form valid C strings. A quick glance at
some historical remarks from unicode.org and Rob Pike suggests that
UTF-8 was invented well before any serious deployment of Unicode, i.e.
that the push for UCS-2 was deliberately aimed at breaking things,
though I suspect it was Apple, Sun, and Microsoft pushing UCS-2 more
than the consortium as a whole.
You can ask any of the Unicode people from those companies and will get
the same answer. Something had to be done and UCS-2 was the answer at
the time. Conspiracy theories do not substantive argument make.
Post by Rich Felker
Post by Mark Leisher
Those "useless" legacy characters avoided breaking many existing
applications, most of which were not written for Southeast Asia. Some
scripts had to end up in the 3-byte range of UTF-8. Are you in a
position to determine who should and should not be in that range? Have
IMO the answer is common sense. Languages that have a low information
per character density (lots of letters/marks per word, especially
Indic) should be in 2-byte range and those with high information
density (especially ideographic) should be in 3-byte range. If it
weren't for so many legacy Latin blocks near the beginning of the
character set, most or all scripts for low-density languages could
have fit in the 2-byte range.
So you simply assume that nobody bothered to look into things like
information density et al during the formation of the Unicode
Standard? You don't appear to be aware of the social and political
ramifications involved in making decisions like that. It doesn't matter
if it makes sense from a mathematical point of view, nations and people
are involved.
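
For reference, the byte ranges under discussion are fixed by the
UTF-8 encoding itself: code points up to U+07FF take two bytes, and
U+0800 through U+FFFF take three. A one-liner check (illustrative
only):

  # UTF-8 length by code point: prints 1, 2, 2, 3, 3, 4 bytes.
  for cp in (0x7F, 0x80, 0x7FF, 0x800, 0xFFFF, 0x10000):
      print(hex(cp), len(chr(cp).encode('utf-8')))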
Post by Rich Felker
Post by Mark Leisher
you even considered why they ended up in that range?
Probably at the time of allocation, UTF-8 was not even around yet. I
haven't studied that portion of Unicode history. Still the legacy
characters could have been put at the end with CJK compat forms and
preshaped Arabic forms, etc. or even outside the BMP.
Scripts were placed when information about their encodings became
available to the Unicode Consortium. It's that simple. No big conspiracy
to give SEA scripts short shrift.
Post by Rich Felker
Post by Mark Leisher
So you do understand. If it isn't fixable, what point is there in
complaining about it? Find a better way.
That's what I'm trying to do... Maybe some Hebrew or Arabic users who
dislike the whole bidi mess (the one Israeli user I'm in contact with
hates bidi and thinks it's backwards...not a good sample size but
interesting nonetheless) will agree and try my ideas for a
unidirectional presentation and like them. Or maybe they'll think it's
ugly and look for other solutions.
Sure. Lots of people don't like the situation, but nobody has come up
with anything better. There is a very good reason for that.
Post by Rich Felker
Applications can draw their own bidi text with higher level formatting
information, of course. I'm thinking of a terminal-mode browser that
has the bidi text in HTML with <dir> tags and whatnot, or apps with a
text 'gui' consisting of separated interface elements.
Ahh. Yes. That sounds a lot like lynx. A popular terminal-mode browser.
Have you checked out how it handles Unicode?
Post by Rich Felker
Post by Mark Leisher
I have a lot of experience
Could you tell me some of what you've worked on and what conclusions
you reached? I'm not familiar with your work.
Well, you can refer to the kterm code for some of my work with ISO/IEC
2022, and I may be able to dig up an ancient version of Motif (ca. 1993)
I adapted to use ISO/IEC 6429 and ISO/IEC 2022, and shortly after that
first Motif debacle, I attempted unsuccessfully to get a variant of
cxterm working with a combination of the two standards.

The conclusion was simple. The code quickly got too complicated to
debug. All kinds of little boundary (buffer/screen) effects kept
cropping up thanks to multi-byte escape sequences.
Post by Rich Felker
ISO 2022 is an abomination, certainly not an acceptable way to store
text due to its stateful nature, and although it works for
_displaying_ text, it's ugly even for that.
I've read ECMA-48 bidi stuff several times and still can't make any
sense of it, so I agree it's disgusting too. It does seem powerful but
powerful is often a bad thing. :)
Well, ISO/IEC 2022 and ISO/IEC 6429 do things the same way: multibyte
escape sequences.
Post by Rich Felker
Post by Mark Leisher
All I will say about them is Unicode is a lot easier to deal with. Have
Easier to deal with because it solves an easier problem. UAX#9 tells
you what to do when you have explicit paragraph division and unbounded
search capability forwards and backwards. Neither of these exists in a
character cell device environment, and (depending on your view of what
constitutes a proper text file) possibly not in a text file either. My
view of a text file (maybe not very popular these days?) is that it's
a more-restricted version of a character cell terminal (no cursor
positioning allowed) but with unlimited height.
Having implemented UAX #9 and a couple of other approaches that produce
the same or similar results, I don't see any problem using it to render
text files. If your text file has one paragraph per line, then you will
see occasional glitches in mixed LTR & RTL text.
Post by Rich Felker
Post by Mark Leisher
a look at the old kterm code if you want to see how complicated things
can get. And that was one of the cleaner implementations I've seen over
the years.
Does it implement ECMA-48 version of bidi? Or random unspecified bidi
like mlterm? Or..?
kterm had ISO/IEC 2022 support. Very few people attempted to use ISO/IEC
6429 because they didn't understand it very well and they knew how
complicated ISO/IEC 2022 was all by itself.
Post by Rich Felker
Post by Mark Leisher
Hah! I often hear the same sentiment from people who don't know the
difference between a glyph and a character.
I think we've established that I know the difference..
Post by Mark Leisher
Yes, it is true that Indic,
and even Khmer and Burmese scripts are relatively easy. All you need to
do is create the right set of glyphs.
Exactly. That's a lot of work...for the font designer. Almost no work
for the application author or for the machine at runtime.
Post by Mark Leisher
This frequently gives you multiple glyph codes for each abstract
character. To do anything with the text, a mapping between glyph and
abstract character is necessary for every program that uses that text.
No, it's necessary only for the terminal. The programs using the text
need not have any idea what language/script it comes from. This is the
whole beauty of using such apps.
I suspect you missed my point. Using glyph codes as an encoding gets
complicated fast. You can ask anyone who has tried to do any serious NLP
work with pre-Unicode Indic text. We are still having to write analysers
and converters to figure out the correct abstract characters and their
order for many scripts. I can provide a mapping table for one Burmese
encoding that shows how hideously complicated it can get to map a glyph
encoding to the underlying linear abstract character necessary to do any
kind of linguistic analysis.
Post by Rich Felker
They generally don't change things in incompatible ways, certainly not
in ways that would require retrofitting existing data with proper
embedding codes. What they might consider doing though is adding a
support level 1.5 or such. Right now UAX#9 (implicitly?) says that an
application not implementing at least the implicit bidi algorithm must not
interpret RTL characters visually at all.
Well, they don't want a program that simply reverses RTL segments
claiming conformance with UAX #9; it is better to see it backward than
to see it wrong. You can ask native users of RTL scripts about that. And
ask more than one.
Post by Rich Felker
Well in many cases my "simple solutions" are too simple for people
who've gotten used to bloated featuresets and gotten used to putting
up with slowness, bugs, and insecurity. But we'll see. My whole family
of i18n-related projects started out with a desire to switch to UTF-8
everywhere and to have Latin, Tibetan, and Japanese support at the
console level without increased bloat, performance penalties, and huge
dependency trees. From there I first wrote a super-small UTF-8-only C
library and then turned towards the terminal emulator issue, which in
turn led to the font format issue, etc. etc. :) Maybe after a whole
year passes I'll have roughly what I wanted.
I don't recall having seen your "simple solutions" so I can't dismiss
them off-hand as not being complicated enough yet. Like I said a couple
emails ago, sometimes it doesn't matter if you have a better answer, but
if it really is simple, accurate, and on the Internet, you can count on
it supplanting the bloat eventually.

BTW, now that the holiday has passed, I probably won't have time to
reply at similar length. But it's been fun.
--
---------------------------------------------------------------------------
Mark Leisher
Computing Research Lab Nowadays, the common wisdom is to
New Mexico State University celebrate diversity - as long as you
Box 30001, MSC 3CRL don't point out that people are
Las Cruces, NM 88003 different. -- Colin Quinn
Behdad Esfahbod
2006-09-05 02:57:08 UTC
Permalink
Post by Mark Leisher
Though I haven't checked myself, I wouldn't be surprised if Perl,
Python, PHP, and a host of other programming languages weren't already
doing this, making your concerns pointless. You would probably find it
instructive to look at some lexical scanners.
To add a sidenote to this otherwise pointless conversation, the
ECMAScript (aka JavaScript) standard actually ignores all format
characters
(gen-cat=Cf) from the source code. This has caused a problem for
Persian computing as U+200C ZERO WIDTH NON-JOINER is Cf and used in
Persian text. Brendan Eich is working on changing the standard to not
ignore formatting characters in string literals (and regexps probably
too.)
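
The category claim is easy to verify from Python's standard library;
ZWNJ shares general category Cf with the bidi controls, which is
exactly why a blanket ignore-all-Cf rule damages Persian text:

  import unicodedata

  print(unicodedata.category('\u200c'))  # Cf: ZERO WIDTH NON-JOINER
  print(unicodedata.category('\u202b'))  # Cf: RIGHT-TO-LEFT EMBEDDING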
--
behdad
http://behdad.org/

"Commandment Three says Do Not Kill, Amendment Two says Blood Will Spill"
-- Dan Bern, "New American Language"
Rich Felker
2006-09-05 05:13:35 UTC
Permalink
Post by Mark Leisher
Post by Rich Felker
It went farther because it imposed language-specific semantics in
places where they do not belong. These semantics are correct with
sentences written in human languages which would not have been hard to
explicitly mark up, especially with a word processor doing it for you.
On the other hand they're horribly wrong in computer languages
(meaning any text file meant to be computer-read and -interpreted, not
just programming languages) where explicit markup is highly
undesirable or even illegal.
The Unicode Consortium is quite correctly more concerned with human
languages than programming languages. I think you are arguing yourself
into a dead end. Programming languages are ephemeral and some might
argue they are in fact slowly converging with human languages.
Arrg, C is not going away anytime soon. C is THE LANGUAGE as far as
POSIX is concerned. The reason I said "arrg" is that I feel like this
gap between the core values of the "i18n bloatware crowd" and the
"hardcore lowlevel efficient software crowd" is what keeps good i18n
out of the best software. When you talk about programming languages
converging with human languages, somehow all I can think of is Perl...
yuck! Larry Wall's been great about pushing Unicode and UTF-8, but
Perl itself is a horrible mess. The implementation is hopelessly bad
and there's little hope of there ever being a reimplementation.

Anyway as I've said again and again, it's no problem for human
language text to have explicit embedding tagging. It doesn't need to
conform to syntax rules (oh yeah Perl code doesn't need to either ;)).
Fancy editors can even insert tags for you. On the other hand,
stuffing extra control characters into machine-read texts with
specific syntactical and semantic rules is not possible. You can't
even just strip these characters when processing because, depending on
the semantics of the file, they may either be controlling the display
of the file or be literal embedding controls to be used when the strings
from the file are printed to their final destination.
Post by Mark Leisher
Post by Rich Felker
Or I could just ask: should we write C code in MS Word .doc format?
No reason to. Programming editors work well as they are and will
continue to work well after being adapted for Unicode.
No, if they perform the algorithm in UAX#9 they will display garbled
unreadable code. Or does C somehow qualify as a "higher level
protocol" for formatting?
Post by Mark Leisher
You don't appear to have any experience writing lexical scanners for
programming languages. If you did, you would know how utterly trivial it
is to ignore embedded bidi codes an editor might introduce.
I'm quite aware that it's simple to code, but also illegal according
to the specs. Also you're ignoring the more troublesome issues...
Obviously you can't remove them inside strings. :) Issues with
comments too..
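
To spell out the caveat: a stripping pass has to track string-literal
state, because the same code points are noise outside a literal and
data inside one. A toy sketch for a C-like language follows
(hypothetical and heavily simplified: no comment or character-constant
handling, escapes are bare backslash pairs):

  BIDI_CONTROLS = set('\u200e\u200f\u202a\u202b\u202c\u202d\u202e')

  def strip_bidi_outside_strings(src):
      out, in_string, escaped = [], False, False
      for ch in src:
          if in_string:
              out.append(ch)  # inside a literal: keep everything
              if escaped:
                  escaped = False
              elif ch == '\\':
                  escaped = True
              elif ch == '"':
                  in_string = False
          elif ch in BIDI_CONTROLS:
              continue        # dropped: outside any string literal
          else:
              out.append(ch)
              if ch == '"':
                  in_string = True
      return ''.join(out)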
Post by Mark Leisher
Though I haven't checked myself, I wouldn't be surprised if Perl,
Python, PHP, and a host of other programming languages weren't already
doing this, making your concerns pointless.
I doubt it, but even if they do, these are toy languages with one
implementation and no specification (and in Perl's case, for which
it's hopeless to even try to write a specification). It's easy to hack
whatever you want and break compatibility with every new release of
the language when your implementation is the only one. It's much
harder when you're working with an international standard for a
language that's been around (and rather stable!) for approaching 40 years
and intended to have multiple interoperable implementations.
Post by Mark Leisher
You can't seriously expect readers of RTL
languages to just throw away everything they've learned since childhood
and learn to read their mathematical expressions backwards? Or simply
require that their scripts never appear in a plain text file? That is
ignorant at best and arrogant at worst.
I've seen examples that show that UAX#9 just butchers mathematical
expressions in the absence of explicit bidi control.
Post by Mark Leisher
You really need to start looking at code and stop pontificating from a
poorly understood position. Just about every programming editor out
there is already aware of programming language syntax. Many different
programming languages in most cases.
Cheap regex-based syntax highlighting is not the same thing at all. But
this is aside from the point, that it's fundamentally WRONG to need a
special tool that knows about the syntax of your computer language in
order to edit it. What if you've designed your own language to solve a
particular problem? Do you have to go and modify your editor to make
it display this text correctly for this language? NO! That's the whole
reason we have plain text. You can edit it without having to have a
special program!
Post by Mark Leisher
Post by Rich Felker
As you acknowledge below, a line is not necessarily an
unlimited-length object and in email it should not be longer than 80
characters (or preferably 72 or so to allow for quoting). So you can't
necessarily just take the MS Notepad approach of omitting newlines and
treating lines as paragraphs, although this may be appropriate in some
uses of text files.
So instead of a substantive argument why a line can't be viewed as a
paragraph, you simply imply that it just can't be done. Weak.
No, I agree that it can be. I'm just saying that a line can't do the
things you expect a paragraph to do, though. In particular it can't be
arbitrarily long in any plain text context, although it could be in
some.
Post by Mark Leisher
Post by Rich Felker
I'm talking about the definition of a text file as a sequence of
lines, which might (on stupid legacy implementations) even be
fixed-width fields. It's under the stdio stuff about the difference
between text and binary mode. I could look it up but I don't feel like
digging thru the pdf file right now..
That section doesn't provide definitions of line or paragraph.
See 7.19.2 Streams.
Post by Mark Leisher
Post by Rich Felker
I agree it was best to do too. I just pointed it out as being contrary
to your claim that they made every effort not to break existing
practice.
For a mathematician, you are quite good at ignoring inconvenient logic.
The phrase "every effort to avoid breaking existing practice" does not
logically imply that no existing practice was broken. Weak.
Read the history. Han unification was one of the very first points of
Unicode, even though it was obvious that it would break much existing
practice. This seems to have been connected to the misguided goal of
trying to make everything into fixed-width 16bit characters. From what
I understand, early Unicode was making every effort _to break_
existing practice. Their motto was "...begin at 0 and add the next
character" which to me implies "throw out everything that already
exists and start from scratch." I've never seen the early drafts but I
wouldn't be surprised if the original characters 0-127 didn't even
match ASCII.
Post by Mark Leisher
Post by Rich Felker
I suppose this is true and I don't know the history and internal
politics well enough to know who was responsible for what. However,
unlike doublebyte/multibyte charsets which were becoming prevalent at
the time, UCS-2 data does not form valid C strings. A quick glance at
some historical remarks from unicode.org and Rob Pike suggests that
UTF-8 was invented well before any serious deployment of Unicode, i.e.
that the push for UCS-2 was deliberately aimed at breaking things,
though I suspect it was Apple, Sun, and Microsoft pushing UCS-2 more
than the consortium as a whole.
You can ask any of the Unicode people from those companies and will get
the same answer. Something had to be done and UCS-2 was the answer at
the time. Conspiracy theories do not substantive argument make.
I've been researching what I can with the little information available
and it seems that the early Unicode architects got a strong disgust
for variable-size characters from their experience with Shift_JIS
(which was extremely poorly designed) and other CJK encodings and
developed a dogma that fixed-width was the way to go. There are
numerous references to this sort of thinking in "10 Years of Unicode"
published under history on unicode.org.
Post by Mark Leisher
So you simply assume that nobody bothered to look into things like
information density et al during the formation of the Unicode
Standard? You don't appear to be aware of the social and political
ramifications involved in making decisions like that. It doesn't matter
if it makes sense from a mathematical point of view, nations and people
are involved.
Latin text (which is mostly ASCII anyway) would go up in size by a few
percent while many languages would go down by 33%. Sounds like a fair
trade. I'm sure there are political ramifications, and of course the
answer is always: do what pleases the countries with the most
money/power rather than doing what serves the largest population and
the population that has the greatest scarcity of storage space...
Post by Mark Leisher
Scripts were placed when information about their encodings became
available to the Unicode Consortium. It's that simple. No big conspiracy
to give SEA scripts short shrift.
Honestly I think they just didn't care about UTF-8 at the time because
they still had delusions that people would switch to UCS-2 for
everything. Also I've been told that the arrangement was intended to
be "West to East"..
Post by Mark Leisher
Post by Rich Felker
Applications can draw their own bidi text with higher level formatting
information, of course. I'm thinking of a terminal-mode browser that
has the bidi text in HTML with <dir> tags and whatnot, or apps with a
text 'gui' consisting of separated interface elements.
Ahh. Yes. That sounds a lot like lynx. A popular terminal-mode browser.
Have you checked out how it handles Unicode?
The only app I've seriously checked out is mined simply because most
apps don't have support for bidi on the console (and many still don't
even know how to use wcwidth...! including emacs!! :( ).

If lynx handles bidi specially I'd be interested in seeing what it
does. However this brings up another interesting question: what should
lynx -dump do? :) Naturally dumping in visual order is wrong, but
generating a text file that will look right when displayed according
to UAX#9 sounds quite difficult, especially when you take multiple
columns, etc. into account. Of course lynx is old broken crap that
doesn't even support tables so maybe it has it easier.. :) These days
I use ELinks, but it has very very poor i18n support. :(
Post by Mark Leisher
Post by Rich Felker
I've read ECMA-48 bidi stuff several times and still can't make any
sense of it, so I agree it's disgusting too. It does seem powerful but
powerful is often a bad thing. :)
Well, ISO/IEC 2022 and ISO/IEC 6429 do things the same way: multibyte
escape sequences.
I'm confused what you mean by multi-byte escape sequences. What I know
of as ISO 2022 is the charset-switching escapes used for legacy CJK
support and "vt100 linedrawing characters", but you seem to be talking
about something related to bidi. Does ISO 2022 have bidi controls as
well?
Post by Mark Leisher
Post by Rich Felker
Post by Mark Leisher
All I will say about them is Unicode is a lot easier to deal with. Have
Easier to deal with because it solves an easier problem. UAX#9 tells
you what to do when you have explicit paragraph division and unbounded
search capability forwards and backwards. Neither of these exists in a
character cell device environment, and (depending on your view of what
constitutes a proper text file) possibly not in a text file either. My
view of a text file (maybe not very popular these days?) is that it's
a more-restricted version of a character cell terminal (no cursor
positioning allowed) but with unlimited height.
Having implemented UAX #9 and a couple of other approaches that produce
the same or similar results, I don't see any problem using it to render
text files. If your text file has one paragraph per line, then you will
see occasional glitches in mixed LTR & RTL text.
Seek somewhere in the middle of the line and type a character of the
opposite directionality. Watch the whole line jump around and the
character you just typed end up in a different column from where your
cursor was placed.

This sort of thing will happen all the time in a terminal when the app
goes to draw interface elements, etc. over top of part of the text. If
it doesn't, i.e. if the terminal implements a sort of "hard implicit
bidi", then the terminal will just hopelessly corrupt unless the
program has explicit bidi logic matching the terminal's.
Post by Mark Leisher
Post by Rich Felker
Post by Mark Leisher
This frequently gives you multiple glyph codes for each abstract
character. To do anything with the text, a mapping between glyph and
abstract character is necessary for every program that uses that text.
No, it's necessary only for the terminal. The programs using the text
need not have any idea what language/script it comes from. This is the
whole beauty of using such apps.
I suspect you missed my point. Using glyph codes as an encoding gets
complicated fast.
Yes but where did I say anything about glyph codes? In both Unicode
and ISCII text everything is character codes, not glyph codes. Sorry
but I don't understand what you were trying to say..
Post by Mark Leisher
Well, they don't want a program that simply reverses RTL segments
claiming conformance with UAX #9, it is better to see it backward than
to see it wrong. You can ask native users of RTL scripts about that. And
ask more than one.
It says more than that; it says that a program is forbidden from
interpreting the characters visually at all if it doesn't perform at
least the implicit part of UAX#9. From my reading, this means that
UAX#9 deems it worse to show the RTL characters in LTR order than not
to show them at all. It also precludes display strategies like the one
I proposed.
Post by Mark Leisher
Post by Rich Felker
Well in many cases my "simple solutions" are too simple for people
who've gotten used to bloated featuresets and gotten used to putting
up with slowness, bugs, and insecurity. But we'll see. My whole family
of i18n-related projects started out with a desire to switch to UTF-8
everywhere and to have Latin, Tibetan, and Japanese support at the
console level without increased bloat, performance penalties, and huge
dependency trees. From there I first wrote a super-small UTF-8-only C
library and then turned towards the terminal emulator issue, which in
turn led to the font format issue, etc. etc. :) Maybe after a whole
year passes I'll have roughly what I wanted.
I don't recall having seen your "simple solutions" so I can't dismiss
them off-hand as not being complicated enough yet.
http://svn.mplayerhq.hu/libc/trunk/
About 100kb of code and a few kb of data. E.g. iconv is 2kb, missing
support for CJK legacy encodings at present, final size should be
about 2.5-2.7kb.

Terminal emulator uuterm isn't checked in yet but it's looking like
the whole program with support for all scripts (except RTL scripts, if
you don't count non-UAX#9-conformant display as support) will come to
about 50kb of code static linked. Plus about 1.5 meg for a complete
font.


On a separate note... maybe it would help if I express and clarify my
view on UAX#9:

I think it very much has its place and it's great when formatting
content that is known to be human-language text for display in the
traditional form expected by most readers. However, IMO what UAX#9
should be seen as is a specification of the correspondence between the
stored "logical order" text and the traditional print form, in a way
as a definition of "logical order" text. It's important to have this
kind of definition for legal purposes especially, so e.g. if someone
has signed a document containing particular bidi text, it's clear what
printed text ordering that binary text is meant to represent and thus
clear what was signed.

On the other hand, I find the whole idea of bidirectionality harmful.
Human language text has always involved ambiguity as far as
interpreting the meaning, but aside from bidi text, at least there is
an unambiguous way to display the characters so that their logical
order is clear to the reader, and this method does not require the
machine to interpret the human language at all.

With bidi thrown in, the presentation completely _fails_ to represent
the logical order of the text. In fact it's possible to
construct bidi text where the presentation order is completely
deceptive... this could, for example, be used for googlebombing or
evading spam filters by permuting the characters of your text to
include or avoid certain words or phrases. The author of Yudit also
identifies examples that have security implications.
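
A single override character is enough to demonstrate the divergence
between logical and presented order (a minimal sketch; what the first
print actually shows depends on whether your terminal applies UAX#9):

  s = 'abc\u202efed'    # U+202E is RIGHT-TO-LEFT OVERRIDE
  print(s)              # a bidi-aware display renders: abcdef
  print(s == 'abcdef')  # False: the logical content is different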

Along with the other reasons I have discussed regarding breaking text
file and character cell sanity, this is why, in my view, bidi is
"considered harmful". I don't expect RTL script users to switch to
LTR. What I do propose is a way for LTR users to view text containing
RTL characters without the need for bidi and without "ekil esnesnon
siht", as well as a way for RTL users to have an entirely-RTL
environment rather than a bidi one. The latter still requires some
more consideration regarding mathematical expressions and numerals. At
this point I have no idea whether such a thing would be of interest to
a significant number of RTL users but I suspect primarily-LTR users
with an occasional need for reading Arabic or Hebrew words or phrases
would like it. Both of these approaches have the side-effect of making
RTL scripts "just work" in any application without the need for
special bidi support at the application level or the terminal level.
Post by Mark Leisher
BTW, now that the holiday has passed, I probably won't have time to
reply at similar length. But it's been fun.
Ah well, I tried to strip my reply down to the most
interesting/relevant parts in case you do have time for some replies,
but it looks like I've still left a lot in.

Thanks for discussing in any case.

Rich
Rich Felker
2006-09-06 04:11:49 UTC
Permalink
Post by Mark Leisher
My last gasp on this conversation: I don't think you really understand
what you are talking about and won't until you get some hands-on
experience.
I'm not sure how to take this but whatever it is, it sounds
condescending and impolite. Was that the intent? What makes you think
I lack hands-on experience? The fact that my code is "too small" and
going to stay that way? Or just that it's not yet checked in for you
to view?

I'm sorry if my long messages to this list have offended, but my
intent was to seek input and discussion. I don't think anything I said
was any more offensive than similar things which Markus and other
people respected in this community have said. If it's just that you
don't have time to deal with this thread anymore, no problem, I won't
take offense.
Post by Mark Leisher
Goodbye and good luck.
Thanks I suppose......

Rich
Mark Leisher
2006-09-05 14:07:14 UTC
Permalink
My last gasp on this conversation: I don't think you really understand
what you are talking about and won't until you get some hands-on
experience. Goodbye and good luck.
--
------------------------------------------------------------------------
Mark Leisher
Computing Research Lab We find comfort among those who
New Mexico State University agree with us, growth among those
Box 30001, MSC 3CRL who don't.
Las Cruces, NM 88003 -- Frank A. Clark
David Starner
2006-09-05 04:44:26 UTC
Permalink
Post by Rich Felker
IMO the answer is common sense. Languages that have a low information
per character density (lots of letters/marks per word, especially
Indic) should be in 2-byte range and those with high information
density (especially ideographic) should be in 3-byte range. If it
weren't for so many legacy Latin blocks near the beginning of the
character set, most or all scripts for low-density languages could
have fit in the 2-byte range.
Once you compress the data with a decent compression scheme, you may
as well store the data by writing out the full Unicode name (e.g.
"LATIN CAPITAL LETTER OU"); the final result will be about the same
size. Furthermore, you can fit a decent sized novel on a floppy
uncompressed and a decent sized library on a DVD uncompressed. The
only application I've seen where text data size was really crucial was
text messaging. Hence, common sense tells _me_ that we should put
scripts used by heavily text-messaging cultures in the 2-byte range;
that is, Latin, Hiragana and Katakana.
Rich Felker
2006-09-05 05:28:29 UTC
Permalink
Post by David Starner
Post by Rich Felker
IMO the answer is common sense. Languages that have a low information
per character density (lots of letters/marks per word, especially
Indic) should be in 2-byte range and those with high information
density (especially ideographic) should be in 3-byte range. If it
weren't for so many legacy Latin blocks near the beginning of the
character set, most or all scripts for low-density languages could
have fit in the 2-byte range.
Once you compress the data with a decent compression scheme, you may
as well store the data by writing out the full Unicode name (e.g.
"LATIN CAPITAL LETTER OU"); the final result will be about the same
size.
With some compression methods this is true, particularly bz2.
Post by David Starner
Furthermore, you can fit a decent sized novel on a floppy
uncompressed and a decent sized library on a DVD uncompressed.
Yet somehow the firefox source code is still 36 megs (bz2), and god
only knows how large OOO is. Imagine now if all the variable and
function names were written in Hindi or Thai... It would be an
interesting test to transliterate the Latin letters to Devanagari and
see how much the compressed tarball size goes up.
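
The proposed experiment is easy to approximate (a sketch only: the
file name is a placeholder and the letter mapping is an arbitrary
stand-in, not a real transliteration scheme):

  import bz2

  # Map ASCII letters onto Devanagari code points so each 1-byte
  # letter becomes a 3-byte UTF-8 sequence, then compare bz2 sizes.
  devanagari = {c: chr(0x0905 + i) for i, c in
                enumerate('abcdefghijklmnopqrstuvwxyz')}
  with open('some_source_file.c') as f:  # placeholder input file
      latin = f.read()
  indic = ''.join(devanagari.get(c, c) for c in latin.lower())
  for label, text in ('latin', latin), ('devanagari', indic):
      raw = text.encode('utf-8')
      print(label, len(raw), len(bz2.compress(raw)))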
Post by David Starner
The
only application I've seen where text data size was really crucial was
text messaging. Hence, common sense tells _me_ that we should put
scripts used by heavily text-messaging cultures in the 2-byte range;
that is, Latin, Hiragana and Katakana.
ROTFL! :)

In all seriousness, though, unless you're dealing with image, music,
or movie files, text weighs in quite heavy in size. It's true that in
html 75-90% of the size is usually tags (in ASCII) but that's due to
incompetence of the web designers and their inability to use CSS
correctly, not anything fundamental. If you're making a website
without fluff and with lots of information, text size will be the
dominant factor in traffic. It's quite unfortunate that native
language text is 3 to 6(*) times larger in countries where bandwidth
is very expensive.

Rich


(*) 6 because a large number of characters in Indic scripts will have
the virama (a combining character) attached to them to remove the
inherent vowel and attach them into clusters.
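
A concrete data point for the footnote, using the word "Hindi" itself
(standard library only):

  # 'Hindi' in ASCII versus Devanagari (6 code points including a
  # vowel-killing virama, 3 bytes each in UTF-8):
  print(len('Hindi'.encode('utf-8')))  # 5 bytes
  word = '\u0939\u093f\u0928\u094d\u0926\u0940'
  print(len(word), len(word.encode('utf-8')))  # 6 code points, 18 bytes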
David Starner
2006-09-05 05:57:08 UTC
Permalink
Post by Rich Felker
Post by David Starner
Once you compress the data with a decent compression scheme, you may
as well store the data by writing out the full Unicode name (e.g.
"LATIN CAPITAL LETTER OU"); the final result will be about the same
size.
With some compression methods this is true, particularly bz2.
Post by David Starner
Furthermore, you can fit a decent sized novel on a floppy
uncompressed and a decent sized library on a DVD uncompressed.
Yet somehow the firefox source code is still 36 megs (bz2), and god
only knows how large OOO is. Imagine now if all the variable and
function names were written in Hindi or Thai... It would be an
interesting test to transliterate the Latin letters to Devanagari and
see how much the compressed tarball size goes up.
The very point of the above test is that it would change the size
minimally. It shouldn't make much if any difference.
Post by Rich Felker
In all seriousness, though, unless you're dealing with image, music,
or movie files, text weighs in quite heavy in size.
As opposed to what? The vast majority of content is one of the four,
and what's left--say, Flash files--doesn't seem particularly small
compared to text.
Post by Rich Felker
If you're making a website
without fluff and with lots of information, text size will be the
dominant factor in traffic. It's quite unfortunate that native
language text is 3 to 6(*) times larger in countries where bandwidth
is very expensive.
Welcome to HTTP 1.1. There's no reason not to compress the data while
you're sending it across the network, which will fix the vast majority
of this problem.
Rich Felker
2006-09-05 07:11:52 UTC
Permalink
Post by David Starner
Post by Rich Felker
In all seriousness, though, unless you're dealing with image, music,
or movie files, text weighs in quite heavy in size.
As opposed to what? The vast majority of content is one of the four,
and what's left--say, Flash files--don't seem particularly small
compared to text.
I wasn't thinking of a website but rather a complete computer system.
I have several gigabytes of email which is larger than even a very
bloated OS and several hundred thousand times bigger than a
non-bloated OS. Multiply this by a factor of 3 or more and it could
quite easily go from "feasible to store" to "infeasible to store".
Post by David Starner
Post by Rich Felker
If you're making a website
without fluff and with lots of information, text size will be the
dominant factor in traffic. It's quite unfortunate that native
language text is 3 to 6(*) times larger in countries where bandwidth
is very expensive.
Welcome to HTTP 1.1. There's no reason not to compress the data while
you're sending it across the network, which will fix the vast majority
of this problem.
Here you have the issue of compression performance versus bandwidth,
especially relevant on a heavily loaded server (of course you can
precompress static texts). Also gzip doesn't perform so well on UTF-8
so bzip2 would be better but also much more cpu-hungry and I doubt any
clients support it.

Anyway all of this discussion is in a sense pointless since none of us
have the power to change any of the problem and since there's no real
solution even if we could. But sometimes you just have to bitch about
the stuff the Unicode folks messed up on..

Rich
Alexandros Diamantidis
2006-09-02 11:15:50 UTC
Permalink
Post by Rich Felker
The vertical orientation thing is mostly of interest to Mongolian
users and perhaps some East Asian users, but it could also be
Note that Mongolian is mostly written with the Cyrillic alphabet today.
From what I've seen in movies, articles etc. - never been to
Mongolia myself - the traditional vertical script is still used on signs on
public buildings, monuments, and similar cultural contexts, but not to
write longer texts.
--
Alexandros Diamantidis * ***@hellug.gr