Discussion:
perl unicode support
SrinTuar
2007-03-26 21:28:43 UTC
I frequently run into problems with utf-8 in perl, and I was wondering
if anyone else had encountered similar things.

One thing I've noticed is that when processing characters, I often get
"Wide character in print" warnings, or have input/output get horribly mangled.

I've been trying to work around it in various ways, commonly doing things such as:
binmode STDIN,":utf8";
binmode STDOUT,":utf8";

or using functions such as:
use Encode;   # needed for Encode::decode and Encode::FB_CROAK

sub unfunge_string
{
    # decode each argument (passed by reference) from UTF-8 bytes to characters, in place
    foreach my $ref (@_)
    {
        $$ref = Encode::decode("utf8", $$ref, Encode::FB_CROAK);
    }
}


but this feels wrong to me.

For a language that really goes out of its way to support encodings, I
wonder if it wouldn't have been better off if it had just ignored the entire
concept altogether and treated strings as arrays of bytes...

I've found pages wherein people complain of similar problems, such as:
http://ahinea.com/en/tech/perl-unicode-struggle.html

And I'm wondering if, in its attempt to be a good i18n citizen, perl
hasn't gone overboard and made a mess of things instead.
Rich Felker
2007-03-26 22:17:47 UTC
Post by SrinTuar
I frequently run into problems with utf-8 in perl, and I was wondering
if anyone else had encountered similar things.
One thing I've noticed is that when processing characters, I often get
"Wide character in print" warnings, or have input/output get horribly mangled.
binmode STDIN,":utf8";
binmode STDOUT,":utf8";
sub unfunge_string
{
foreach my $ref (@_)
{
$$ref = Encode::decode("utf8",$$ref,Encode::FB_CROAK);
}
}
but this feels wrong to me.
For a language that really goes out of its way to support encodings, I
wonder if it wouldn't have been better off if it had just ignored the entire
concept altogether and treated strings as arrays of bytes...
Read the ancient linux-utf8 list archives and you should get a good
feel for Larry Wall's views on the matter.
Post by SrinTuar
http://ahinea.com/en/tech/perl-unicode-struggle.html
And I'm wondering if, in its attempt to be a good i18n citizen, perl
hasn't gone overboard and made a mess of things instead.
I agree, but maybe there are workarounds. I have a system that's
completely UTF-8-only. I don't have or want support for any legacy
encodings except in a few isolated tools (certainly nothing related to
perl) for converting legacy data I receive from outside.

With that in mind, I built perl without PerlIO, wanting to take
advantage of my much smaller and faster stdio implementation. But now,
binmode doesn't work, so the only way I can get rid of the nasty
warning is by disabling it explicitly.
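(For the record, "disabling it explicitly" means roughly the following - just a
sketch, on the assumption that the wide-character warnings fall under the
"utf8" warnings category:)

no warnings 'utf8';     # silence "Wide character in print" and friends in this scope
print "\x{263a}\n";     # a smiley; would otherwise warn when STDOUT has no :utf8 layer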

Is there any way to get perl to behave sanely in this regard? I don't
really use perl much (mainly for irssi) so if not, I guess I'll just
leave it how it is and hope nothing seriously breaks..

Rich
Egmont Koblinger
2007-03-27 10:06:06 UTC
Post by SrinTuar
I frequently run into problems with utf-8 in perl, and I was wondering
if anyone else had encountered similar things.
Of course :-) I guess everyone who has ever tried perl's utf8 support has
faced similar problems.
Post by SrinTuar
One thing I've noticed is that when processing characters, I often get
"Wide character in print" warnings, or have input/output get horribly mangled.
Yes, that's the case if some function expects bytes but receives a utf-8
string that happens to contain characters not representable in latin-1, or
something like that. Printing them is the usual case where this message
appears, but recently I ran into the same thing with the crypt() call when
I wanted to crypt utf-8 passwords.
Post by SrinTuar
binmode STDIN,":utf8";
binmode STDOUT,":utf8";
This is okay; it puts these file descriptors into UTF-8 mode. I usually begin
my perl scripts with
#!/usr/bin/perl -CSDA
which turns on utf8 mode for all the file descriptors (except for sockets);
see "man perlrun" for details.

It's also advisable to have a "use utf8;" pragma at the beginning of the
file, in case the source itself contains utf-8 characters.
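Putting the two together, a minimal skeleton looks something like this (just a
sketch; the -C letters are described in perlrun - S covers STDIN/STDOUT/STDERR,
D makes utf8 the default layer for other handles, A decodes @ARGV - and the
shebang form assumes a perl that still honours -C there):

#!/usr/bin/perl -CSDA
use strict;
use warnings;
use utf8;                                # the source file itself is UTF-8

while (my $line = <STDIN>) {             # already decoded to characters by -CS
    chomp $line;
    print scalar reverse($line), "\n";   # reverses characters, not bytes
}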
Post by SrinTuar
sub unfunge_string
{
foreach my $ref (@_)
{
$$ref = Encode::decode("utf8",$$ref,Encode::FB_CROAK);
}
}
I can't understand this. Here I guess your goal is to convert an internally
UTF-8 encoded string into a sequence of bytes that can be passed to any file
descriptor. In this case you need the encode function; decode is the reverse
direction. E.g.

use utf8;
use Encode;
my $password_string = "pásswőrd";
#my $encoded_wrong = crypt($password_string, "xx"); # wrong!
my $password_bytes = Encode::encode("utf8", $password_string);
my $encoded_good = crypt($password_bytes, "xx"); # gives "xx2TrBZ2zni6o"

I also used this trick recently when writing to a socket. If you turn on the
utf8 flag on the socket, you can simply send the utf8 string over it, but
the return value of the send() call is the number of characters written,
and I don't know what happens if a character is only partially sent. If you
send bytes instead of characters, you have total control over this. The
choice is yours.
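For example, something along these lines (a sketch only - the socketpair is a
throwaway connected pair just so the snippet actually runs, and the string is
an arbitrary accented sample):

use utf8;
use Encode;
use Socket;

socketpair(my $sock, my $peer, AF_UNIX, SOCK_STREAM, PF_UNSPEC)
    or die "socketpair: $!";

my $chars = "pásswőrd\n";                     # perl character string
my $bytes = Encode::encode("utf8", $chars);   # the same text as plain bytes
my $sent  = send($sock, $bytes, 0);           # return value counts bytes, so a
                                              # partial write is unambiguous
print "$sent bytes sent\n";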
Post by SrinTuar
For a language that really goes out of its way to support encodings, I
wonder if it wouldn't have been better off if it had just ignored the entire
concept altogether and treated strings as arrays of bytes...
That would be contradictory to the whole concept of Unicode. A
human-readable string should never be considered an array of bytes, it is an
array of characters!

The problem is that in the good old days perl only knew about strings as
arrays of bytes, and later they had to implement Unicode support without
breaking backwards compatibility. Hence currently perl strings are being
used to store both types of data: byte sequences and character sequences.

For each string variable, there's a bit telling whether it's known to be a
UTF-8 encoded human-readable string. See "man Encode" and the is_utf8,
_utf8_on and _utf8_off functions. You can think of this bit as the piece of
information saying whether this string is to be treated as a sequence of bytes
or a sequence of characters. Having this bit set when the string is not valid
utf8 can yield unexpected behavior - never do that. However, having an
otherwise valid utf-8 string which doesn't have this bit set is perfectly
legal, and in some circumstances (e.g. when printing to a file) it behaves
differently - e.g. if the file descriptor has its utf8 flag set, but the
string doesn't, then IIRC it is converted from latin1 to utf8 and hence
you'll get a different result, not what you expect.

In some cases you might want to use _utf8_on; this happens when you know
that the string is utf8 but perl doesn't know it. An example is a gettext
lookup if you've used bind_textdomain_codeset("...", "UTF-8"). In this case
the gettext() call always returns a valid utf-8 string, but perl doesn't know
that, so this bit is not set.
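In code it's just this (a sketch - lookup_message() here is only a stand-in for
a real gettext() call that returns valid UTF-8 bytes with the flag off):

use Encode;

# stand-in for a gettext() lookup returning valid UTF-8 bytes, flag not set
sub lookup_message { return "d\xC3\xA9j\xC3\xA0 vu" }

my $msg = lookup_message();
print length($msg), "\n";      # 9 - perl still counts bytes
Encode::_utf8_on($msg);        # safe only because we know the bytes are valid UTF-8
print length($msg), "\n";      # 7 - now perl counts characters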

Perl automatically sets this utf8 bit on strings read from file descriptors
that have utf8 mode turned on, on strings within the perl source if "use
utf8" is in effect, on the output of decode("charset", "text"), and in the
obvious string operations - e.g. if you concat two strings with this bit set,
do regexp matching on a utf8 string, join/split it, and so on.
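If you want to see the bit being set, Encode::is_utf8 reports it (a small
illustration, nothing more):

use Encode;

my $bytes = "\xC3\xA9";                               # two bytes: UTF-8 for "é"
print Encode::is_utf8($bytes) ? "on" : "off", "\n";   # off - just a byte string
my $chars = Encode::decode("UTF-8", $bytes);          # decoding sets the flag
print Encode::is_utf8($chars) ? "on" : "off", "\n";   # on - one character, length 1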
Post by SrinTuar
And I'm wondering if, in its attempt to be a good i18n citizen, perl
hasn't gone overboard and made a mess of things instead.
Probably. Just look around and see how many pieces of software, file
formats, protocols and so on suffer from the same problem (no, not
particularly utf8, I mean being an overcomplicated mess) due to
compatibility issues. Plenty! Perl is just one of them, far from being the
worst :-)
--
Egmont
SrinTuar
2007-03-27 15:16:58 UTC
Post by Egmont Koblinger
That would be contradictory to the whole concept of Unicode. A
human-readable string should never be considered an array of bytes, it is an
array of characters!
Hrm, that statement I think I would object to. For the overwhelming
majority of programs, strings are simply arrays of bytes (regardless of
encoding). The only time source code needs to care about characters is
when it has to lay them out or format them for display.

If perl did not have a "utf-8" bit on its scalars, it would probably
handle utf-8 a lot better and more naturally, imo.

Functions and routines which need to know the printable charcell
width, or how to look up glyphs in a font, could easily parse the
codepoints out of the array based on either the locale encoding, or by
simply assuming utf-8 (as is increasingly preferable, imo), and then perform
the appropriate formatting lookups.

Aside from that tiny handful of libraries, no one else should have to
bother with encoding, imo. (Regular expressions supporting utf-8 are
useful as well.)

When I write a basic little perl script that reads in lines from a
file, does trivial string operations on them, then prints them back
out, there should be absolutely no need for my code to make any
special considerations for encoding.
Egmont Koblinger
2007-03-27 16:31:11 UTC
Post by SrinTuar
Post by Egmont Koblinger
That would be contradictory to the whole concept of Unicode. A
human-readable string should never be considered an array of bytes, it is an
array of characters!
Hrm, that statement I think I would object to. For the overwhelming
majority of programs, strings are simply arrays of bytes (regardless of
encoding).
In order to be able to write applications that correctly handle accented
letters, Unicode taught us that we clearly have to distinguish between bytes
and characters, and when handling texts we have to think in terms of
characters. These characters are eventually stored in memory or on disk as
several bytes, though. But in most of the cases you have to _think_ in
characters, otherwise it's quite unlikely that your application will work
correctly.
Post by SrinTuar
The only time source code needs to care about characters is when it has
to lay them out or format them for display.
No, there are many more situations. Even if your job is so simple that you
only have to convert a text to uppercase, you already have to know what
encoding (and actually what locale) is being used. Finding a particular
letter (especially in case insensitive mode), performing regexp matching,
alphabetical sorting etc. are just a few trivial examples where you must
think in characters.
Post by SrinTuar
If perl did not have a "utf-8" bit on its scalars, it would probably
handle utf-8 a lot better and more naturally, imo.
Probably. Probably not. I'm really unable to compare an existing programming
language with a hypothetical one. For example in PHP a string is simply a
sequence of bytes, and you have mb...() functions that handle them according
to the selected locale. I don't think it's either better or worse than perl,
it's just a different approach.
Post by SrinTuar
When I write a basic little perl script that reads in lines from a
file, does trivial string operations on them, then prints them back
out, there should be absolutely no need for my code to make any
special considerations for encoding.
If none of these trivial string operations depend on the encoding then you
don't have to use this feature of perl, that's all. Simply make sure that
the file descriptors are not set to utf8, and neither are the strings that you
concat or match against, etc., so you stay in the world of pure bytes.
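For instance, a filter like this never touches the utf8 machinery at all, and
UTF-8 input passes through it byte-for-byte (a sketch of what "staying in the
world of pure bytes" looks like):

#!/usr/bin/perl
use strict;
use warnings;
# no "use utf8", no -C switches, no :utf8 layers: everything below is bytes

while (my $line = <STDIN>) {
    $line =~ s/\t/    /g;   # byte-oriented edit: expand each tab to four spaces
    print $line;            # multibyte sequences in the input come out unchanged
}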
--
Egmont
Rich Felker
2007-03-27 17:42:33 UTC
Post by Egmont Koblinger
Post by SrinTuar
Post by Egmont Koblinger
That would be contradictory to the whole concept of Unicode. A
human-readable string should never be considered an array of bytes, it is an
array of characters!
Hrm, that statement I think I would object to. For the overwhelming
majority of programs, strings are simply arrays of bytes (regardless of
encoding).
In order to be able to write applications that correctly handle accented
letters, Unicode taught us that we clearly have to distinguish between bytes
and characters,
No, accented characters have nothing to do with the byte/character
distinction. That applies to any non-ascii character. However, it only
matters when you'll be performing display, editing, and pattern-based
(not literal-string-based, though) searching.

Accents and combining marks have to do with the character/grapheme
distinction, which is pretty much relevant only for display.

None of this is relevant to most processing of text which is just
storage, retrieval, concatenation, and exact substring search.
Post by Egmont Koblinger
and when handling texts we have to think in terms of
characters. These characters are eventually stored in memory or on disk as
several bytes, though. But in most of the cases you have to _think_ in
characters, otherwise it's quite unlikely that your application will work
correctly.
It's the other way around, too: you have to think in terms of bytes.
If you're thinking in terms of characters too much you'll end up doing
noninvertible transformations and introduce vulnerabilities when data
has been maliciously crafted not to be valid utf-8 (or just bugs due
to normalizing data, etc.).
Post by Egmont Koblinger
Post by SrinTuar
The only time source code needs to care about characters is when it has
to lay them out or format them for display.
No, there are many more situations. Even if your job is so simple that you
only have to convert a text to uppercase, you already have to know what
encoding (and actually what locale) is being used.
This is not a simple task at all, and in fact it's a task that a
computer should (almost) never do... Case-insensitivity is bad enough,
but case conversion is a horrible horrible mistake. Create your data
in the case you want it in.

The whole idea of case conversion in programming languages is
disgustingly euro-centric. The rest of the world doesn't have such a
stupid thing as case...
Post by Egmont Koblinger
Finding a particular
letter (especially in case insensitive mode),
Hardly. A byte-based regex for all case matches (e.g. "(ä|Ä)") will
work just as well even for case-insensitive matching, and literal
character matching is simple substring matching identical to any other
sane encoding. I get the impression you don't understand UTF-8..
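(In perl terms, the sort of thing I mean is just spelling out both byte
sequences - a sketch, with \xC3\xA4 and \xC3\x84 being the UTF-8 bytes for
"ä" and "Ä":)

my $bytes = "\xC3\x84rger";                # the UTF-8 bytes of "Ärger", no utf8 flag anywhere
if ($bytes =~ /(?:\xC3\xA4|\xC3\x84)/) {   # matches either case, purely as bytes
    print "matched an a-umlaut\n";
}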
Post by Egmont Koblinger
performing regexp matching,
alphabetical sorting etc. are just a few trivial examples where you must
think in characters.
Character-based regex (which posix BRE/ERE is) needs to think in terms
of characters. Sometimes a byte-based regex is also useful. For
example my procmail rules reject mail containing any 8bit octets if
there's not an appropriate mime type for it. This kills a lot of east
asian spam. :)
Post by Egmont Koblinger
Post by SrinTuar
If perl did not have a "utf-8" bit on its scalars, it would probably
handle utf-8 a lot better and more naturally, imo.
Probably. Probably not. I'm really unable to compare an existing programming
language with a hypothetical one. For example in PHP a string is simply a
sequence of bytes, and you have mb...() functions that handle them according
to the selected locale. I don't think it's either better or worse than perl,
it's just a different approach.
Well it's definitely worse for someone who just wants text to work on
their system without thinking about encoding. And it WILL just work
(as evidenced by my disabling of the warning and still getting correct
behavior) as long as the whole system is consistent, regardless of
what encoding is used.

Yes, strings need to distinguish byte/character data. But streams
should not. A stream should accept bytes, and a character string
should always be interpreted as bytes according to the machine's
locale when read/written to a stream, or when incorporated into byte
strings.
Post by Egmont Koblinger
Post by SrinTuar
When I write a basic little perl script that reads in lines from a
file, does trivial string operations on them, then prints them back
out, there should be absolutely no need for my code to make any
special considerations for encoding.
If none of these trivial string operations depend on the encoding then you
don't have to use this feature of perl, that's all. Simply make sure that
the file descriptors are not set to utf8, neither are the strings that you
concat or match to. etc, so you stay in world of pure bytes.
But it should work even with strings interpreted as characters!
There's no legitimate reason for it not to.

Moreover, the warning is fundamentally stupid because it does not
trigger for characters in the range 128-255, only >255. This is an
implicit assumption that someone would want to use latin1, which is
simply backwards and wrong. A program printing characters in latin1
without associating an encoding with the stream is just as "wrong" as
a program writing arbitrary unicode characters.

Rich
David Starner
2007-03-27 23:44:42 UTC
Post by Rich Felker
This is not a simple task at all, and in fact it's a task that a
computer should (almost) never do...
Of course. Why shouldn't an editor go through and change 257 headings
to titlecase by hand? Humans are known for their abilities to do such
tedious things without error, aren't they?
Post by Rich Felker
The whole idea of case conversion in programming languages is
disgustingly euro-centric. The rest of the world doesn't have such a
stupid thing as case...
Really? Funny, I'm from North America, and we have a concept of case
here. 90% of the languages native to the continent are written in a
script that has a concept of case. In fact, I think you'd find that
most of the world's languages are written in scripts that have a
concept of case. Furthermore, the whole reason for Unicode is because
you have to accommodate every single script's idiosyncrasies; you
have to include case conversion because certain scripts demand it.
Rich Felker
2007-03-28 03:10:34 UTC
Post by David Starner
Post by Rich Felker
This is not a simple task at all, and in fact it's a task that a
computer should (almost) never do...
Of course. Why shouldn't an editor go through and change 257 headings
to titlecase by hand? Humans are known for their abilities to do such
tedious things without error, aren't they?
There was a reason I wrote "almost". This is one of the very few
places where a computer should ever perform case mappings: in a
powerful editor or word processor. Another I can think of is
linguistic software (e.g. machine based translation, or anything
that's performing semantic analysis or synthesis of human language
text). These comprise a tiny minority of computing applications and
certainly do not warrant punishing the rest; such functionality and
special handling should be isolated to the programs that use it.
Post by David Starner
Post by Rich Felker
The whole idea of case conversion in programming languages is
disgustingly euro-centric. The rest of the world doesn't have such a
stupid thing as case...
Really? Funny, I'm from North America, and we have a concept of case
Same thing. North American civilization is all European-derived.
Post by David Starner
here. 90% of the languages native to the continent are written in a
script that has a concept of case.
Is that so? I don't think so. Rather, most of the languages native to
the continent have no native writing system, or use a writing system
that was long ago lost/extincted. Perhaps you should look up the
meaning of the word native.. :)
Post by David Starner
In fact, I think you'd find that
most of the world's languages are written in scripts that have a
concept of case.
This is a very dubious assertion. Technically it depends on how you
measure "most" (language count vs speaker count... also the whole
dialect vs language debate), but otherwise I think it's bogus. I
believe a majority of the world's population has as their native
language a language that does not use case.

Just take India and China and you're already almost there. Now throw
in the rest of South Asia and East Asia, all of the Arabic speaking
countries, ....
Post by David Starner
Furthermore, the whole reason for Unicode is because
you have to accommodate every single script's idiosyncrasies; you
have to include case conversion because certain scripts demand it.
No, you only have to deal with the idiosyncrasies of the subset you
support. A good multilingual application will have sufficient support
for acceptable display and editing of most or all languages, but
there's no reason it should have lots of language-specific features
for each language. Why should all apps be forced to have
(Euro-centric) case mappings, but not also mappings between (for
example) the corresponding base-character and subjoined-character
forms of Tibetan letters, or transliteration mappings between Latin
and Cyrillic for East European languages?

My answer (maybe others disagree) is that most apps need none of this,
while editor/OS hybrids like GNU emacs probably want all of it. :) But
each app is free to choose which language-specific frills it wants to
include support for. I see no reason that case mappings should be
given such a special place aside from the forces of linguistic
imperialism.

~Rich
David Starner
2007-03-28 19:24:26 UTC
Post by Rich Felker
This is one of the very few
places where a computer should ever perform case mappings: in a
powerful editor or word processor
Just about any program that deals with text is going to have a need to
merge distinctions that the user considers irrelevant, which often
includes case. I use grep -i, even when searching the output of my own
programs sometimes. I could go back and check the case I used in the
messages, but I'd rather let the tools do that.
Post by Rich Felker
Post by David Starner
Post by Rich Felker
The whole idea of case conversion in programming languages is
disgustingly euro-centric. The rest of the world doesn't have such a
stupid thing as case...
Really? Funny, I'm from North America, and we have a concept of case
Same thing. North American civilization is all European-derived.
The civilization on North America, South America, Europe, Australia
and Antarctica is European-derived, but I find it horribly hard to
dismiss something that's universal in five of the seven continents as
"disgustingly euro-centric".
Post by Rich Felker
Post by David Starner
here. 90% of the languages native to the continent are written in a
script that has a concept of case.
Is that so? I don't think so. Rather, most of the languages native to
the continent have no native writing system, or use a writing system
that was long ago lost/extincted. Perhaps you should look up the
meaning of the word native.. :)
I wrote precisely what I meant, and I stand by it as correct. Read the
sentence I wrote. No language uses a writing system that was long ago
lost; that's logically absurd. I don't believe the concept of native
writing system is clear, nor do I believe it's useful. Arguably, the
"native" writing system of Irish is Oghma and the "native" writing
system of Greek is Linear B, but from a practical aspect, Irish uses
the Latin script and Greek uses Greek and those are the realities that
we need to be dealing with.
Post by Rich Felker
Post by David Starner
In fact, I think you'd find that
most of the world's languages are written in scripts that have a
concept of case.
This is a very dubious assertion. Technically it depends on how you
measure "most" (language count vs speaker count... also the whole
dialect vs language debate), but otherwise I think it's bogus.
The English meaning of "Most of the world's languages" is the number
of languages. All of the languages spoken in North and South America,
with the exception of Cherokee and some Canadian languages written in
the UCAS, are written in Latin. All of the languages spoken in Africa,
with the exception of a few languages written in Ethiopian and Arabic,
are written in Latin. All of the languages of Europe are written in
Latin, Greek or Cyrillic. All of the languages of Australia are
written in Latin. All of the languages of New Guinea (12% of the
world's languages) are written in Latin. Most of the languages of the
former USSR are written in Cyrillic.
Post by Rich Felker
I
believe a majority of the world's population has as their native
language a language that does not use case.
Just take India and China and you're already almost there. Now throw
in the rest of South Asia and East Asia, all of the Arabic speaking
countries, ....
According to Wikipedia, Asia has 60% of the world's population. By my
estimates, the part of that population, including Vietnam, Indonesia
and Russia, that uses casing scripts is larger than the number of
people outside Asia who don't use casing scripts (mainly North Africa,
population-wise). 60% may be a majority, but it's hardly a huge
majority.
Post by Rich Felker
No, you only have to deal with the idiosyncracies of the subset you
support. A good multilingual application will have sufficient support
for acceptable display and editing of most or all languages, but
there's no reason it should have lots of language-specific features
for each language. Why should all apps be forced to have
(Euro-centric) case mappings, but not also mappings between (for
example) the corresponding base-character and subjoined-character
forms of Tibetan letters, or transliteration mappings between Latin
and Cyrillic for East European languages?
Two issues:

Demand: 40% of the world uses cased scripts, including most of the
richest part of the world. (Compare to the .1% that use Tibetan.)
Furthermore, casing is a very low-level operation; uppercase,
lowercase and titlecase words are mixed freely with an understanding
of the fundamental identity. Market-share aside, I don't believe
writers of Eastern European languages frequently mix Latin and
Cyrillic in the same document.

History: I'm not aware of a computer model in the history of the world
that supported text and not Latin text. If there are such, I would be
stunned if they amounted to one in a million of all computers made.
Virtually all computers in use depend on ASCII, with the remainder
depending on ASCII variants or EBCDIC variants that are equally
Latin-dependent. Latin text is fundamental to computers, and the casing
operation is a part of many standards and commonly used APIs. Call it
language imperialism, but it's reality.
Post by Rich Felker
But
each app is free to choose which language-specific frills it wants to
include support for.
And I suspect that most of them will choose to support the
language-specific frills that 40% of the world's population demand. In
fact, I don't know of a single language-specific "frill" that has as
much demand as casing; the non-casing scripts are a pretty diverse
bunch, and the majority of them share no "frill" as key to them as
casing is to Cyrillic, Latin and Greek.
Rich Felker
2007-03-28 22:06:32 UTC
Post by David Starner
Post by Rich Felker
This is one of the very few
places where a computer should ever perform case mappings: in a
powerful editor or word processor
Just about any program that deals with text is going to have a need to
merge distinctions that the user considers irrelevant, which often
includes case. I use grep -i, even when searching the output of my own
programs sometimes. I could go back and check the case I used in the
messages, but I'd rather let the tools do that.
This is not case mapping but equivalence classes. A completely
different issue. Matching equivalence classes (including case and
other equivalences) is trivial and mostly language-independent. Case
mapping is ugly (think German “SS/ß”) and language-dependent (think
Turkish “I/ı” and “İ/i”).
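(A two-line perl illustration of the difference, assuming a perl that applies
Unicode case mapping to these strings:)

use utf8;
print uc("straße"), "\n";   # "STRASSE" - one character becomes two, not reversible
print lc("I"), "\n";        # "i" - right for German or English, wrong for Turkish,
                            # which wants the dotless ı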
Post by David Starner
Post by Rich Felker
Same thing. North American civilization is all European-derived.
The civilization on North America, South America, Europe, Australia
and Antarctica is European-derived, but I find it horribly hard to
dismiss something that's universal in five of the seven continents as
"disgustingly euro-centric".
It’s not universal. It’s universal among the european-descended
colonizers. In many of these places there are plenty of indigenous
populations which do not use the colonizer’s script because it’s not
suitable for their language, because the latin phonetic systems are
designed for pompous linguists rather than based on the way people see
their own languages. Often there is a colonial language (English,
Spanish, French, etc.) alongside an indigenous language, and while the
latter may often be written in latin letters, the orthography is often
inconsistent and should be perceived as a “foreign” spelling system
rather than something native.
Post by David Starner
Post by Rich Felker
Post by David Starner
In fact, I think you'd find that
most of the world's languages are written in scripts that have a
concept of case.
This is a very dubious assertion. Technically it depends on how you
measure "most" (language count vs speaker count... also the whole
dialect vs language debate), but otherwise I think it's bogus.
The English meaning of "Most of the world's languages" is the number
of languages. All of the languages spoken in North and South America,
with the exception of Cherokee and some Canadian languages written in
the UCAS, are written in Latin. All of the languages spoken in Africa,
with the exception of a few languages written in Ethiopian and Arabic,
are written in Latin.
Written by whom? European-descended scholars who imposed a Latin
alphabet for studying the language. Many of the speakers of many of
these languages don’t even write the language at all..

I maintain that you have a very euro-centric-imperialist view of the
world. It’s not to say that latin isn’t important or in widespread
use, but pretending like latin is the pinnacle of importance and like
frills for latin keep the world happy is something I find extremely
annoying.

Rich
David Starner
2007-03-29 13:16:02 UTC
Post by Rich Felker
Matching equivalence classes (including case and
other equivalences) is trivial and mostly language-independent. Case
mapping is ugly (think German "SS/ß") and language-dependent (think
Turkish "I/ı" and "İ/i").
In Turkish, I and i should be in different equivalence classes, unlike
in German. That's the main case where case mapping is
language-dependent, so I don't see a huge difference here.
Post by Rich Felker
It's not universal. It's universal among the european-descended
colonizers.
Humans aren't native to Europe; we're all Africa-descended colonizers,
except for the Africans. Besides which, most of the descendants of the
pre-Columbus inhabitants of the Americas now speak English, Spanish
or Portuguese as their native tongue and write said language in the
Latin script.
Post by Rich Felker
the orthography is often
inconsistent and should be perceived as a "foreign" spelling system
rather than something native.
Why? Orthography of English before the 1700s was inconsistent, and it
was and is still occasionally inconsistent after that. Standardized
orthography isn't found in many smaller language groups. Cherokee
written in the Cherokee script doesn't have standardized orthography,
despite that being an unquestionably native spelling system.
Post by Rich Felker
Written by whom? European-descended scholars who imposed a Latin
alphabet for studying the language. Many of the speakers of many of
these languages don't even write the language at all..
If they don't write the language, why are they a concern for the
programming of a text handling application?
Post by Rich Felker
It's not to say that latin isn't important or in widespread
use, but pretending like latin is the pinnacle of importance and like
frills for latin keep the world happy is something i find extremely
annoying.
I never said that Latin is the pinnacle of importance, nor that frills
for Latin keep the world happy. I said the casing operation in Latin,
Greek, and Cyrillic is relatively basic to the way people write their
languages in those scripts, and that those scripts are very common
among the
SrinTuar
2007-03-28 22:12:22 UTC
Post by David Starner
And I suspect that most of them will choose to support the
language-specific frills that 40% of the world's population demand. In
fact, I don't know of a single language-specific "frill" that has as
much demand as casing; the non-casing scripts are a pretty diverse
bunch, and the majority of them share no "frill" as key to them as
casing is to Cyrillic, Latin and Greek.
You are right that basic latin case folding is pretty important, and probably
deserves its special significance. The latin alphabet is used all the
time in Japanese and Vietnamese. (In Vietnam I'm pretty sure they
don't see it as foreign.)

The more advanced and language-specific versions of it are very tricky
though (German, Turkish, etc.), and rarely done right. I don't think the
standard C library even has a sequence-based toupper/tolower.
Titlecase strikes me as even weirder, because then you have to have
rules to decide what constitutes a word - which is not as trivial as it
sounds and is very language dependent.

In general, I try to avoid designing anything case-insensitive as much
as possible. But sometimes it's unavoidable, such as when handling
search. In those cases, it's impossible to write one routine that can
handle all languages, because of the incompatibilities between
different systems. Most of the time, doing a basic best effort gets
you by, but it does leave a bad aftertaste.

Even when making the be-all-end-all language-sensitive case folding
system, I can imagine problems that are unsolvable. How would you
handle this one: search for a mix of German, English, and Turkish
words in a case-insensitive manner?
Przemyslaw Brojewski
2007-03-29 13:25:56 UTC
Post by David Starner
Market-share aside, I don't believe
writers of Eastern European languages frequently mix Latin and
Cyrillic in the same document.
They mix it frequently enough. Take a look at http://bash.org.ru/
Or, if you need a "more proper" example, look at:
- Belarusian text on the TCP/IP protocol:
http://skif.bas-net.by/bsuir/base/node350.html
- Ukrainian text on SARS:
http://www.nbuv.gov.ua/e-journals/AMI/2005/05kivvap.pdf
- A list of periodicals issued by the Bulgarian Academy of Sciences:
http://www.bas.bg/index.php?pat=baspubl&glaven=gov&ezik=bg

So please, don't make such assumptions. A better assumption would be
that writers of any language will have to mix it with Latin sooner or
later, just because, as you mentioned in your post, Latin is
fundamental to computers - a simple consequence of the fact that computers
were invented and first put to massive use in Latin-based countries.

Przemek.
Daniel B.
2007-03-28 03:04:09 UTC
...
Post by David Starner
Post by Rich Felker
The whole idea of case conversion in programming languages is
disgustingly euro-centric. The rest of the world doesn't have such a
stupid thing as case...
Really? Funny, I'm from North America, and we have a concept of case
here. 90% of the languages native to the continent ...
I think he was including North America when he wrote "euro-centric"
(given the European component of most people and cultures--and
languages--here).



Daniel
--
Daniel Barclay
***@smart.net
Daniel B.
2007-03-28 02:55:32 UTC
Post by Rich Felker
...
None of this is relevant to most processing of text which is just
storage, retrieval, concatenation, and exact substring search.
It might be true that more-complicated processing is not relevant to those
operations. (I'm not 100% sure about exact substring matches, but maybe
if the byte string given to search for is proper (e.g., doesn't have any
partial representations of characters), it's okay).

However, I think you're stretching things too much to say "most." (I
guess it depends on what we're calling "text processing".)
Post by Rich Felker
... But in most of the cases you have to _think_ in
characters, otherwise it's quite unlikely that your application will work
correctly.
It's the other way around, too: you have to think in terms of bytes.
If you're thinking in terms of characters too much you'll end up doing
noninvertible transformations and introduce vulnerabilities when data
has been maliciously crafted not to be valid utf-8 (or just bugs due
to normalizing data, etc.).
Well of course you need to think in bytes when you're interpreting the
stream of bytes as a stream of characters, which includes checking for
invalid UTF-8 sequences.

Once you've checked that the bytes properly represent characters, from
then on you need to think in characters unless you're doing sufficiently
simple operations (e.g., your list above).
Post by Rich Felker
Hardly. A byte-based regex for all case matches (e.g. "(ä|�)") will
work just as well even for case-insensitive matching, and literal
character matching is simple substring matching identical to any other
sane encoding. I get the impression you don't understand UTF-8..
How do you match a single character? Would you want the programmer to
have to write an expression that matches a byte 0x00 through 0x7F, a
sequence of two bytes from 0xC2 0x80 through 0xDF 0xBF, a sequence of
three bytes from 0xE0 0xA0 0x80 through 0xEF 0xBF 0xBF, etc. [hoping I
got those bytes right] instead of simply "."?
Post by Rich Felker
... Sometimes a byte-based regex is also useful. For
example my procmail rules reject mail containing any 8bit octets if
there's not an appropriate mime type for it. This kills a lot of east
asian spam. :)
Yep.

Of course, you can still do that with character-based strings if you
can use other encodings. (E.g., in Java, you can read the mail
as ISO-8859-1, which maps bytes 0-255 to Unicode characters 0-255.
Then you can write the regular expression in terms of Unicode characters
0-255. The only disadvantage there is probably some time spent
decoding the byte stream into the internal representation of characters.)

Maybe the net result from your point is that one should be able to read
byte streams in encodings other than just UTF-8. (A language might
do that by converting anything else into UTF-8, or could use a different
internal representation (e.g., as Java uses UTF-16).)
Post by Rich Felker
A stream should accept bytes, and a character string
should always be interpreted as bytes according to the machine's
locale when read/written to a stream
Note that it's specific to the stream, not the machine. (Consider,
for example, HTTP's Content-Type header's charset parameter. A
web browser needs to handle different character encodings in different
responses. A MIME application needs to handle different character
encodings in different parts of a single multi-part message.)




Daniel
--
Daniel Barclay
***@smart.net
Rich Felker
2007-03-28 03:33:23 UTC
Post by Daniel B.
Post by Rich Felker
...
None of this is relevant to most processing of text which is just
storage, retrieval, concatenation, and exact substring search.
It might be true that more-complicated processing is not relevant to those
operations. (I'm not 100% sure about exact substring matches, but maybe
if the byte string given to search for is proper (e.g., doesn't have any
partial representations of characters), it's okay).
No character is a substring of another character in UTF-8. This is an
essential property of any sane multibyte encoding (incidentally, the
only other one is EUC-TW).
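(Concretely, that property is what makes dumb byte-level substring search
safe; a tiny perl sketch:)

my $haystack = "na\xC3\xAFve caf\xC3\xA9";   # the UTF-8 bytes of "naïve café"
my $needle   = "caf\xC3\xA9";                # the UTF-8 bytes of "café"
print index($haystack, $needle) >= 0 ? "found\n" : "not found\n";
# prints "found"; a match can never start or end in the middle of a character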
Post by Daniel B.
Well of course you need to think in bytes when you're interpreting the
stream of bytes as a stream of characters, which includes checking for
invalid UTF-8 sequences.
And what do you do if they're present? Under your philosophy, it would
be impossible for me to remove files with invalid sequences in their
names, since I could neither type the filename nor match it with glob
patterns (due to the filename causing an error at the byte-to-character
conversion phase before there's even a chance to match
anything). I'd have to write specialized tools to do it...

Other similar problem: I open a file in a text editor and it contains
illegal sequences. For example, Markus Kuhn's UTF-8 decoder test file,
or a file with mixed encodings (e.g. a mail spool) or with mixed-in
binary data. I want to edit it anyway and save it back without
trashing the data that does not parse as valid UTF-8, while still
being able to edit the data that is valid UTF-8 as UTF-8.

This is easy if the data is kept as bytes and the character
interpretation is only made "just in time" when performing display,
editing, pattern searches, etc. If I'm going to convert everything to
characters, it requires special hacks for encoding the invalid
sequences in a reversible way. Markus Kuhn experimented with ideas for
this a lot back in the early linux-utf8 days and eventually it was
found to be a bad idea as far as I could tell.

Also, I've found performance and implementation simplicity to be much
better when data is kept as UTF-8. For example, my implementation of
the POSIX fnmatch() function (used by the glob() function) is extremely
light and fast, due to performing all the matching as byte strings and
only considering characters "just in time" during bracket expression
matching (same as regex brackets). This also allows it to accept
strings with illegal sequences painlessly.
Post by Daniel B.
Post by Rich Felker
Hardly. A byte-based regex for all case matches (e.g. "(ä|�)") will
The fact that your mailer misinterpreted my UTF-8 as Latin-1 does not
instill faith...
Post by Daniel B.
Post by Rich Felker
work just as well even for case-insensitive matching, and literal
character matching is simple substring matching identical to any other
sane encoding. I get the impression you don't understand UTF-8..
How do you match a single character? Would you want the programmer to
have to write an expression that matches a byte 0x00 through 0x7F, a
sequence of two bytes from 0xC2 0x80 through 0xDF 0xBF, a sequence of
three bytes from 0xE0 0xA0 0x80 through 0xEF 0xBF 0xBF, etc. [hoping I
got those bytes right] instead of simply "."?
No, this is the situation where a character-based regex is wanted.
Ideally, a single regex system could exist which could do both
byte-based and character-based matching together in the same string.
Sadly that's not compatible with POSIX BRE/ERE, nor Perl AFAIK.
Post by Daniel B.
Post by Rich Felker
... Sometimes a byte-based regex is also useful. For
example my procmail rules reject mail containing any 8bit octets if
there's not an appropriate mime type for it. This kills a lot of east
asian spam. :)
Yep.
Of course, you can still do that with character-based strings if you
can use other encodings. (E.g., in Java, you can read the mail
as ISO-8859-1, which maps bytes 0-255 to Unicode characters 0-255.
Then you can write the regular expression in terms of Unicode characters
0-255. The only disadvantage there is probably some time spent
decoding the byte stream into the internal representation of characters.)
The biggest disadvantage of it is that it's WRONG. The data is not
Latin-1, and pretending it's Latin-1 is a hideous hack. The data is
bytes with either no meaning as characters, or (more often) an
interpretation as characters that's not available to the software
processing it. I've just seen waaaay too many bugs from pretending
that bytes are characters to consider doing this reasonable. It also
perpetuates the (IMO very bad) viewpoint among new users that UTF-8 is
"sequences of Latin-1 characters making up a character" instead of
"sequences of bytes making up a character".
Post by Daniel B.
Post by Rich Felker
A stream should accept bytes, and a character string
should always be interpreted as bytes according to the machine's
locale when read/written to a stream
Note that it's specific to the stream, not the machine. (Consider
for example, HTTP's Content-Type header's charset parameter. A
web browser needs to handle different character encodings in different
responses. A MIME application needs to handle different character
encodings in different parts of a single multi-part message.)
Yes, clients need to. Servers can just always serve UTF-8. However, in
the examples you give, the clean solution is just to treat the bytes
as bytes, not characters, until they've been processed.

Maybe 20 years from now we'll finally be able to get rid of the
nonsense and just assume everything is UTF-8...

Rich
Daniel B.
2007-03-29 02:39:49 UTC
Post by Daniel B.
Post by Rich Felker
...
None of this is relevant to most processing of text which is just
storage, retrieval, concatenation, and exact substring search.
It might be true that more-complicated processing is not relevant to those
operations. (I'm not 100% sure about exact substring matches, but maybe
if the byte string given to search for is proper (e.g., doesn't have any
partial representations of characters), it's okay).
No character is a substring of another character in UTF-8. ...
I know. That's why I addressed avoiding _partial_ byte sequences.
Post by Daniel B.
Well of course you need to think in bytes when you're interpreting the
stream of bytes as a stream of characters, which includes checking for
invalid UTF-8 sequences.
And what do you do if they're present?
Of course, it depends where they are present. You seem to be addressing
relative special cases.
Under your philosophy, it would
be impossible for me to remove files with invalid sequences in their
names, since I could neither type the filename nor match it with glob
patterns (due to the filename causing an error at the byte to
character conversion phase before there's even a chance to match
anything). ...
If the file name contains illegal byte sequences, then either they're
not in UTF-8 to start with or, if they're supposed to be, something
else let invalid sequences through.

If they're not always in UTF-8 (if they're sometimes in a different
encoding), then why would you be interpreting them as UTF-8 (why would
you hit the case where it seems there's an illegal sequence)? (How
do you know what encoding they're in? Or are you dealing with the
problem of not having any specification of what the encoding really
is and having to guess?)

If they're supposed to be UTF-8 and aren't, then certainly normal
tools shouldn't have to deal with malformed sequences. If you write
a special tool to fix malformed sequences somehow (e.g., delete files
with malformed sequences), then of course you're going to be dealing
with the byte level and not (just) the character level.
Other similar problem: I open a file in a text editor and it contains
illegal sequences. For example, Markus Kuhn's UTF-8 decoder test file,
Again, you seem to be dealing with special cases.

If a UTF-8 decoder test file contains illegal UTF-8 byte sequences, why
would you expect a UTF-8 text editor to work on it?

For the data that is parseable as a valid UTF-8 encoding of characters,
how do you propose to know whether it really is characters encoded as
UTF-8 or is characters encoded some other way?

(If you see the byte sequence 0xDF 0xBF, how do you know whether that
means the character U+003FF or the two characters U+00DF U+00BF? For
example, if at one point you see the UTF-8-illegal byte sequence
0x00 0xBF and assume that that 0xBF byte means character U+00BF, then
if you see the UTF-8-legal byte sequence 0xDF 0xBF, how would you
know that that 0xBF byte also represents U+00BF vs. whether it's really
part of the representation of character U+003FF?)

Either edit the test file with a byte editor, or edit it with a text
editor specifying an encoding that maps bytes to characters (e.g.,
ISO 8859-1), even if those characters aren't the same characters the
UTF-8-valid parts of the file represent.
or a file with mixed encodings (e.g. a mail spool) or with mixed-in
binary data. I want to edit it anyway and save it back without
trashing the data that does not parse as valid UTF-8, while still
being able to edit the data that is valid UTF-8 as UTF-8.
If the file uses mixed encodings, then of course you can't read the
entire file in one encoding.

But when you determine that a given section (e.g., a MIME part) is
in some given encoding, why not map that section's bytes to characters
and then work with characters from then on?

What if you're searching for a character string across multiple sections
of a mixed-encoding file like that? You certainly can't write a UTF-8-
byte regular expression that matches other encodings. And you can't
"OR" together the regular expression for a UTF-8 byte encoding of the
character string with each other encoding's byte sequences (since
there's no way for the regular expression matcher to know which
alternative in the regular expression it should be using in each
section of the mixed-encoding file.)


...
Post by Daniel B.
Post by Rich Felker
Hardly. A byte-based regex for all case matches (e.g. "(Ãf¤|Ãf?)") will
The fact that your mailer misinterpreted my UTF-8 as Latin-1 does not
instill faith...
Maybe you should think more clearly. I didn't write my mailer, so the
quality of its behavior doesn't reflect my knowledge.
Post by Daniel B.
Post by Rich Felker
... Sometimes a byte-based regex is also useful. For
example my procmail rules reject mail containing any 8bit octets if
there's not an appropriate mime type for it. This kills a lot of east
asian spam. :)
Yep.
Of course, you can still do that with character-based strings if you
can use other encodings. (E.g., in Java, you can read the mail
as ISO-8859-1, which maps bytes 0-255 to Unicode characters 0-255.
Then you can write the regular expression in terms of Unicode characters
0-255. The only disadvantage there is probably some time spent
decoding the byte stream into the internal representation of characters.)
The biggest disadvantage of it is that it's WRONG.
Is it any more wrong than your rejecting of 1xxxxxxx bytes? The bytes
represent characters in some encoding. You ignore those characters and
reject based on just the byte values.
The data is not
Latin-1, and pretending it's Latin-1 is a hideous hack.
It's not pretending the data is bytes encoding characters. It's mapping
bytes to characters to use methods defined on characters. Yes, it could
be misleading if it's not clear that it's a temporary mapping only for
that purpose (i.e., that the mapped-to characters are not the characters
that the byte sequence really represents). And yes, byte-based regular
expressions would be useful.
Maybe 20 years from now we'll finally be able to get rid of the
nonsense and just assume everything is UTF-8...
Hopefully.


Daniel
--
Daniel Barclay
***@smart.net
SrinTuar
2007-03-29 03:05:56 UTC
Post by Daniel B.
If they're supposed to be UTF-8 and aren't, then certainly normal
tools shouldn't have to deal with malformed sequences. If you write
a special tool to fix malformed sequences somehow (e.g., delete files
with malformed sequences), then of course you're going to be dealing
with the byte level and not (just) the character level.
If normal tools completely wet the bed at the sight of malformed sequences,
then they are poorly designed. Vim, for one, seems to handle garbage characters
quite admirably, while working perfectly with well-formed ones. If anything,
that's a behavior to strive towards.

The fact that the designers of certain filesystems and file
manipulation tools don't want anything to do with encoding is actually
very fortunate. That way you can delete the invalid files without
having to use a special tool coded for just such events.
Post by Daniel B.
Post by Rich Felker
The fact that your mailer misinterpreted my UTF-8 as Latin-1 does not
instill faith...
Maybe you should think more clearly. I didn't write my mailer, so the
quality of its behavior doesn't reflect my knowledge.
It does reflect your lack of interest in getting your email utf-8 compatible.
Daniel B.
2007-03-31 22:33:20 UTC
Post by SrinTuar
Post by Daniel B.
If they're supposed to be UTF-8 and aren't, then certainly normal
tools shouldn't have to deal with malformed sequences. If you write
a special tool to fix malformed sequences somehow (e.g., delete files
with malformed sequences), then of course you're going to be dealing
with the byte level and not (just) the character level.
If normal tools completely wet the bed at the sight of malformed sequences,
then they are poorly designed.
Some of them are following specifications (e.g., the specifications that
say certain UTF-8 readers (and XML processors, maybe?) should reject
malformed sequences or reject inputs with malformed sequences (for
security reasons)).



Daniel
--
Daniel Barclay
***@smart.net
Daniel B.
2007-03-31 22:36:06 UTC
Post by SrinTuar
Post by Daniel B.
Post by Rich Felker
The fact that your mailer misinterpreted my UTF-8 as Latin-1 does not
instill faith...
Maybe you should think more clearly. I didn't write my mailer, so the
quality of its behavior doesn't reflect my knowledge.
It does reflect your lack of interest in getting your email utf-8 compatible.
How the hell do you think you know what it reflects? (Have you ever
considered it might have something to do with bookmark management?)


Daniel
--
Daniel Barclay
***@smart.net
Rich Felker
2007-03-31 23:05:03 UTC
Post by Daniel B.
Post by SrinTuar
Post by Daniel B.
Post by Rich Felker
The fact that your mailer misinterpreted my UTF-8 as Latin-1 does not
instill faith...
Maybe you should think more clearly. I didn't write my mailer, so the
quality of its behavior doesn't reflect my knowledge.
It does reflect your lack of interest in getting your email utf-8 compatible.
How the hell do you think you know what it reflects? (Have you ever
considered it might have something to do with bookmark management?)
Just because you insist on using an ancient, horribly broken,
proprietary web browser to manage your bookmarks doesn't mean you have
to use it for email too... especially when it breaks email so badly.
In any case it reflects priorities I think, and also indicates that
you're using backwards software, which goes along with discussing
the UTF-8 issue as if we were living in 1997 instead of 2007.

All of this is stuff you're entitled to do if you like, and it's not
really my business to tell you what you should be using. But it
does reframe the discussion.

Rich
Rich Felker
2007-03-29 03:49:38 UTC
Post by Daniel B.
Post by Rich Felker
Post by Daniel B.
Well of course you need to think in bytes when you're interpreting the
stream of bytes as a stream of characters, which includes checking for
invalid UTF-8 sequences.
And what do you do if they're present?
Of course, it depends where they are present. You seem to be addressing
relative special cases.
I’m addressing corner cases. Robust systems engineering is ALWAYS
about handling the corner cases. Any stupid codemonkey can write code
that does what’s expected when you throw the expected input at it. The
problem is that coding like this blows up and gives your attacker root
as soon as they throw something unexpected at it. :)
Post by Daniel B.
Post by Rich Felker
Under your philosophy, it would
be impossible for me to remove files with invalid sequences in their
names, since I could neither type the filename nor match it with glob
patterns (due to the filename causing an error at the byte to
character conversion phase before there’s even a chance to match
anything). ...
If the file name contains illegal byte sequences, then either they’re
not in UTF-8 to start with or, if they’re supposed to be, something
else let invalid sequences through.
Several likely scenarios:

1. Attacker intentionally created invalid filenames. This might just
be annoying vandalism but on the other hand might be trying to
trick non-robust code into doing something bad (maybe throwing away
or replacing the invalid sequences so that the name collides with
another filename, or interpreting overlong UTF-8 sequences, etc.).

2. Foolish user copied filenames from a foreign system (e.g. scp or
rsync) with a different encoding, without conversion.

3. User (yourself or other) extracted files from a tar or zip archive
with names encoded in a foreign encoding, without using software
that could detect and correct the situation.
Post by Daniel B.
If they're supposed to be UTF-8 and aren't, then certainly normal
tools shouldn't have to deal with malformed sequences.
This is nonsense. Regardless of what they’re supposed to be, someone
could intentionally or unintentionally create files whose names are
not valid UTF-8. While it would be a nice kernel feature to make such
filenames illegal, you have to consider foreign removable media (where
someone might have already created such bad names), and since POSIX
makes no guarantee that strings which are illegal sequences in the
character encoding are illegal as filenames, any robust and portable
code MUST account for the fact that they could exist. Thus
filenames, commandlines, etc. MUST always be handled as bytes or in a
way that preserves invalid sequences.
Post by Daniel B.
If you write
a special tool to fix malformed sequences somehow (e.g., delete files
with malformed sequences), then of course you're going to be dealing
with the byte level and not (just) the character level.
Why should I need a special tool to do this?? Something like:
rm *known_ascii_substring*
should work, as long as the filename contains a unique ascii (or valid
UTF-8) substring.
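
(For concreteness, the same byte-level behaviour falls out naturally in Perl
as long as nothing forces a character interpretation onto the names; a
minimal sketch, with "known_ascii_substring" standing in for whatever unique
piece of the name you can still type:)

opendir(my $dh, ".") or die "opendir: $!";
for my $name (readdir $dh) {
    # readdir hands back the raw filename bytes; matching an ASCII
    # substring works no matter what invalid sequences the rest holds
    next unless $name =~ /known_ascii_substring/;
    unlink $name or warn "unlink failed: $!";
}
closedir $dh;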
Post by Daniel B.
Post by Rich Felker
Other similar problem: I open a file in a text editor and it contains
illegal sequences. For example, Markus Kuhn's UTF-8 decoder test file,
Again, you seem to be dealing with special cases.
Again, software which does not handle corner cases correctly is crap.
Post by Daniel B.
If a UTF-8 decoder test file contains illegal byte UTF-8 sequences, why
would you expect a UTF-8 text editor to work on it?
I expect my text editor to be able to edit any file without corrupting
it. Perhaps you have lower expectations... If you’re used to Windows
Notepad, that would be natural, but I’m used to GNU Emacs.
Post by Daniel B.
For the data that is parseable as a valid UTF-8 encoding of characters,
how do you propose to know whether it really is characters encoded as
UTF-8 or is characters encoded some other way?
It’s neither. It’s bytes, which when they are presented for editing,
are displayed as a character according to their interpretation as
UTF-8. :)

If I receive a foreign file in a legacy encoding and wish to interpret
it as characters in that encoding, then I’ll convert it to UTF-8 with
iconv (which deals with bytes) or using the C-x RET c prefix in Emacs to
visit the file with a particular encoding. What I absolutely do NOT
want is for a file to “magically” be interpreted as Latin-1 or some
other legacy codepage as soon as invalid sequences are detected. This
is clobbering the functionality of my system to edit its own native
data for the sake of accommodating foreign data.

I respect that others do want and regularly use such auto-detection
functionality, however.
Post by Daniel B.
(If you see the byte sequence 0xDF 0xBF, how do you know whether that
means the character U+003FF
It never means U+03FF in any case because U+03FF is 0xCF 0xBF...
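
(A quick way to check the arithmetic from Perl, for anyone following along:
0xDF 0xBF actually decodes to U+07FF, while U+03FF encodes as 0xCF 0xBF.)

use Encode qw(encode decode);
printf "U+%04X\n", ord decode("UTF-8", "\xDF\xBF");   # prints U+07FF
printf "%vX\n", encode("UTF-8", chr 0x3FF);           # prints CF.BF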
Post by Daniel B.
or the two characters U+00DF U+00BF? For
It never means this in text on my system because the text encoding is
UTF-8. It would mean this only if your local character encoding were
Latin-1.
Post by Daniel B.
example, if at one point you see the UTF-8-illegal byte sequence
0x00 0xBF and assume that that 0xBF byte means character U+00BF, then
This is incorrect. It means the byte 0xBF, and NOT ANY CHARACTER.
Assuming Latin-1 as soon as an illegal sequence is detected is
sometimes a useful hack, e.g. on IRC when some people are too stubborn
or uneducated to fix their encoding, but it’s fundamentally incorrect,
and in most cases will cause more harm than help in the long term. IRC
is a notable exception because the text is separated into individual
messages and you can selectively interpret individual messages in
legacy encodings without compromising the ability to accept valid
UTF-8.
Post by Daniel B.
Either edit the test file with a byte editor, or edit it with a text
editor specifying an encoding that maps bytes to characters (e.g.,
ISO 8859-1), even if those characters aren't the same characters the
UTF-8-valid parts of the file represent.
With a byte editor, how am I supposed to see any characters except
ascii correctly? Do all the UTF-8 math in my head and then look them
up in a table?!?!
Post by Daniel B.
Post by Rich Felker
or a file with mixed encodings (e.g. a mail spool) or with mixed-in
binary data. I want to edit it anyway and save it back without
trashing the data that does not parse as valid UTF-8, while still
being able to edit the data that is valid UTF-8 as UTF-8.
If the file uses mixed encodings, then of course you can't read the
entire file in one encoding.
Indeed. But I’m thinking of cases like:
cat file1.utf8 file2.latin1 file3.utf8 > foobar

Obviously this should not be done, but sometimes people are ignorant
and thus sometimes such files come to exist, and eventually arrive on
my system. :) What you’re saying is that I should have to use a ‘byte
editor’ (what is that? a hex editor?) to repair this file, rather than
just being able to load it like any other file in Emacs and edit it.
This strikes me as unnecessarily crippling.
Post by Daniel B.
But when you determine that a given section (e.g., a MIME part) is
in some given encoding, why not map that section's bytes to characters
and then work with characters from then on?
Of course that’s preferred, but requires specialized tools. The core
strength of Unix is being able to use general tools for many purposes
beyond what they were originally intended for.
Post by Daniel B.
Post by Rich Felker
Post by Daniel B.
Post by Rich Felker
Hardly. A byte-based regex for all case matches (e.g. "(Ãf¤|Ãf?)") will
The fact that your mailer misinterpreted my UTF-8 as Latin-1 does not
instill faith...
Maybe you should think more clearly. I didn't write my mailer, so the
quality of its behavior doesn't reflect my knowledge.
If you don’t actually use UTF-8, I think that reflects your lack of
qualification to talk about issues related to it. And I don’t see how
you could be using it if your mailer won’t even handle it...
Post by Daniel B.
Post by Rich Felker
Post by Daniel B.
Of course, you can still do that with character-based strings if you
can use other encodings. (E.g., in Java, you can read the mail
as ISO-8859-1, which maps bytes 0-255 to Unicode characters 0-255.
Then you can write the regular expression in terms of Unicode characters
0-255. The only disadvantage there is probably some time spent
decoding the byte stream into the internal representation of characters.)
The biggest disadvantage of it is that it's WRONG.
Is it any more wrong than your rejecting of 1xxxxxxx bytes? The bytes
represent characters in some encoding. You ignore those characters and
reject based on just the byte values.
Nobody said that it needs to be “rejected” (except Unicode fools who
think in terms of UTF-16...). It’s just not a character, but some
other binary data with no interpretation as a character.
Post by Daniel B.
Post by Rich Felker
The data is not
Latin-1, and pretending it's Latin-1 is a hideous hack.
It's not pretending the data is bytes encoding characters. It's mapping
bytes to characters to use methods defined on characters. Yes, it could
be misleading if it's not clear that it's a temporary mapping only for
that purpose (i.e., that the mapped-to characters are not the characters
that the byte sequence really represents). And yes, byte-based regular
expressions would be useful.
If you’re going to do this, at least map into the PUA rather than to
Latin-1..... At least that way it’s clear what the meaning is.

〜Rich
Daniel B.
2007-03-31 23:44:39 UTC
Permalink
Post by Rich Felker
Post by Daniel B.
Post by Rich Felker
Other similar problem: I open a file in a text editor and it contains
illegal sequences. For example, Markus Kuhn's UTF-8 decoder test file,
Again, you seem to be dealing with special cases.
Again, software which does not handle corner cases correctly is crap.
Why are you confusing "special-case" with "corner case"?

I never said that software shouldn't handle corner cases such as illegal
UTF-8 sequences.

I meant that an editor that handles illegal UTF-8 sequences other than
by simply rejecting the edit request is a bit of a special case compared
to general-purpose software, say an XML processor, for which some
specification requires (or recommends?) that the processor ignore or
reject any illegal sequences. The software isn't failing to handle the
corner case; it is handling it--by explicitly rejecting it.
Post by Rich Felker
Post by Daniel B.
If a UTF-8 decoder test file contains illegal byte UTF-8 sequences, why
would you expect a UTF-8 text editor to work on it?
I expect my text editor to be able to edit any file without corrupting
it.
Okay, then it's not a UTF-8-only text editor.
Post by Rich Felker
Perhaps you have lower expectations... If you’re used to Windows
Notepad, that would be natural, but I’m used to GNU Emacs.
I'm used to Emacs too. Quit casting implied aspersions.
Post by Rich Felker
Post by Daniel B.
(If you see the byte sequence 0xDF 0xBF, how do you know whether that
means the character U+003FF
It never means U+03FF in any case because U+03FF is 0xCF 0xBF...
Post by Daniel B.
or the two characters U+00DF U+00BF? For
It never means this in text on my system because the text encoding is
UTF-8. It would mean this only if your local character encoding were
Latin-1.
What I meant (given the quoted part below you replied before) was that
if you're dealing with a file that overall isn't valid UTF-8, how would
you know whether a particular part that looks like valid UTF-8,
representing some characters per the UTF-8 interpretation, really
represents those characters or is an erroneously mixed-in representation
of other characters in some other encoding?

Since you're talking about preserving what's there as opposed to doing
anything more than that, I would guess your answer is that it really
doesn't matter. (Whether you treated 0xCF 0xBF as a correct UTF-8
sequence and displayed the character U+03FF or, hypothetically, treated
it as an incorrectly-inserted Latin-1 encoding of U+00DF U+00BF and
displayed those characters, you'd still write the same bytes back out.)
Post by Rich Felker
Post by Daniel B.
example, if at one point you see the UTF-8-illegal byte sequence
0x00 0xBF and assume that that 0xBF byte means character U+00BF, then
This is incorrect. It means the byte 0xBF, and NOT ANY CHARACTER.
You said you're talking about a text editor, that reads bytes, displays
legal UTF-8 sequences as the characters they represent in UTF-8, doesn't
reject other UTF-8-illegal bytes, and does something with those bytes.

What does it do with such a byte? It seems you were talking about
mapping it to some character to display it. Are you talking about
something else, such as displaying the hex value of the byte?
Post by Rich Felker
Post by Daniel B.
Post by Rich Felker
Post by Daniel B.
Of course, you can still do that with character-based strings if you
can use other encodings. (E.g., in Java, you can read the mail
as ISO-8859-1, which maps bytes 0-255 to Unicode characters 0-255.
Then you can write the regular expression in terms of Unicode characters
0-255. The only disadvantage there is probably some time spent
decoding the byte stream into the internal representation of characters.)
The biggest disadvantage of it is that it's WRONG.
Is it any more wrong than your rejecting of 1xxxxxx bytes? The bytes
represent characters in some encoding. You ignore those characters and
reject based on just the byte values.
Nobody said that it needs to be “rejected” ...
Yes someone did--they wrote about rejecting spam mail by detecting
bytes/octets with the high bit set.


...
Post by Rich Felker
Post by Daniel B.
Post by Rich Felker
The data is not
Latin-1, and pretending it's Latin-1 is a hideous hack.
It's not pretending the data is bytes encoding characters. It's mapping
bytes to characters to use methods defined on characters. Yes, it could
be misleading if it's not clear that it's a temporary mapping only for
that purpose (i.e., that the mapped-to characters are not the characters
that the byte sequence really represents). And yes, byte-based regular
expressions would be useful.
If you’re going to do this, at least map into the PUA rather than to
Latin-1..... At least that way it’s clear what the meaning is.
That makes it a bit less convenient, since then the numeric values of
the characters don't match the numeric values of the bytes.

But yes, doing all that is not something you'd want to escape into the
wild (i.e., be seen outside the immediate code where you need to fake
byte-level regular expressions in Java).

Daniel
--
Daniel Barclay
***@smart.net
Rich Felker
2007-04-01 05:33:18 UTC
Permalink
Post by Daniel B.
Post by Rich Felker
Again, software which does not handle corner cases correctly is crap.
Why are you confusing "special-case" with "corner case"?
I never said that software shouldn't handle corner cases such as illegal
UTF-8 sequences.
I meant that an editor that handles illegal UTF-8 sequences other than
by simply rejecting the edit request is a bit of a special case compared
to general-purpose software, say an XML processor, for which some
specification requires (or recommends?) that the processor ignore or
reject any illegal sequences. The software isn't failing to handle the
corner case; it is handling it--by explicitly rejecting it.
It is a corner case! Imagine a situation like this:

1. I open a file in my text editor for editing, unaware that it
contains invalid sequences.

2. The editor either silently clobbers them, or presents some sort of
warning (which, as a newbie, I will skip past as quickly as I can) and
then clobbers them.

3. I save the file, and suddenly I’ve irreversibly destroyed huge
amounts of data.

It’s simply not acceptable for opening a file and resaving it to not
yield exactly the same, byte-for-byte identical file, because it can
lead either to horrible data corruption or inability to edit when your
file has somehow gotten malformed data into it. If your editor
corrupts files like this, it’s broken and I would never even consider
using it.

As an example of broken behavior (but different from what you’re
talking about since it’s not UTF-8), XEmacs converts all characters to
its own nasty mule encoding when it loads the file. It proceeds to
clobber all Unicode characters which don’t also exist in legacy mule
character sets, and upon saving, the file is horribly destroyed. Yes
this situation is different, but the only difference is that UTF-8 is
a proper standard and mule is a horrible hack. The clobbering is just
as wrong either way.

(I’m hoping that XEmacs developers will fix this someday soon since I
otherwise love XEmacs, but this is pretty much a show-stopper since it
clobbers characters I actually use..)
Post by Daniel B.
What I meant (given the quoted part below you replied before) was that
if you're dealing with a file that overall isn't valid UTF-8, how would
you know whether a particular part that looks like valid UTF-8,
representing some characters per the UTF-8 interpretation, really
represents those characters or is an erroneously mixed-in representation
of other characters in some other encoding?
Since you're talking about preserving what's there as opposed to doing
anything more than that, I would guess your answer is that it really
doesn't matter. (Whether you treated 0xCF 0xBF as a correct UTF-8
sequence and displayed the character U+03FF or, hypothetically, treated
it as an incorrectly-inserted Latin-1 encoding of U+00DF U+00BF and
displayed those characters, you'd still write the same bytes back out.)
Yes, that’s exactly my answer. You might as well show it as the
character in case it really was supposed to be the character. Now it
sounds like we at least understand what one another are saying.
Post by Daniel B.
Post by Rich Felker
Post by Daniel B.
example, if at one point you see the UTF-8-illegal byte sequence
0x00 0xBF and assume that that 0xBF byte means character U+00BF, then
This is incorrect. It means the byte 0xBF, and NOT ANY CHARACTER.
You said you're talking about a text editor, that reads bytes, displays
legal UTF-8 sequences as the characters they represent in UTF-8, doesn't
reject other UTF-8-illegal bytes, and does something with those bytes.
What does it do with such a byte? It seems you were talking about
mapping it to some character to display it. Are you talking about
something else, such as displaying the hex value of the byte?
Yes. Actually GNU Emacs displays octal instead of hex, but it’s the
same idea. The pager “less” displays hex, such as <BF>, in reverse
video, and shows legal sequences that make up illegal or unprintable
codepoints in the form <U+D800> (also reverse video).
Post by Daniel B.
Yes someone did--they wrote about rejecting spam mail by detecting
bytes/octets with the high bit set.
Oh that was me. I misunderstood what you meant, sorry.
Post by Daniel B.
Post by Rich Felker
If you’re going to do this, at least map into the PUA rather than to
Latin-1..... At least that way it’s clear what the meaning is.
That makes it a bit less convenient, since then the numeric values of
the characters don't match the numeric values of the bytes.
But yes, doing all that is not something you'd want to escape into the
wild (i.e., be seen outside the immediate code where you need to fake
byte-level regular expressions in Java).
*nod*

Rich
Ben Wiley Sittler
2007-04-01 07:00:07 UTC
Permalink
please before embarking on such a path think about what happens when
someone else happens to use an actual character in the PUA which
collides with your escape. better to use something invalid to
represent something invalid. markus kuhn said it best, see e.g. here:

http://hyperreal.org/~est/utf-8b/releases/utf-8b-20060413043934/kuhn-utf-8b.html

and specifically, "option d", "Emit a malformed UTF-16 sequence for
every byte in a malformed UTF-8 sequence", basically each invalid
input 0xnn byte is mapped to the unpaired surrogate 0xDCnn (which are
all in the range 0xDC80 ... 0xDCFF). on output, the reverse is done
(unpaired surrogates from that range are mapped to the corresponding
bytes.)

the particular scheme described there has a name ("utf-8b") and
several implementations, and is widely applicable to situations
involving mixed utf-8 and binary data where the binary needs to be
preserved while also treating the utf-8 parts with Unicode or UCS
semantics.
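
(For the curious, the mapping is simple enough to sketch in a few lines of
Perl. The function names here are made up, and a real implementation would
want to be more careful, e.g. about surrogate-related warnings:)

use Encode qw(decode encode FB_QUIET);

sub decode_utf8b {                 # bytes in, characters out
    my ($bytes) = @_;
    my $chars = "";
    while (length $bytes) {
        # FB_QUIET decodes the valid prefix and leaves the rest in $bytes
        $chars .= decode("UTF-8", $bytes, FB_QUIET);
        # map the first offending byte to U+DC80..U+DCFF and carry on
        $chars .= chr(0xDC00 | ord substr($bytes, 0, 1, "")) if length $bytes;
    }
    return $chars;
}

sub encode_utf8b {                 # characters in, the original bytes out
    my ($chars) = @_;
    my $bytes = "";
    for my $c (split //, $chars) {
        my $cp = ord $c;
        $bytes .= ($cp >= 0xDC80 && $cp <= 0xDCFF)
            ? chr($cp & 0xFF)      # restore the raw byte
            : encode("UTF-8", $c);
    }
    return $bytes;
}

The round trip encode_utf8b(decode_utf8b($anything)) should hand back the
original byte string unchanged, which is the whole point of the scheme.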

-ben
Post by Rich Felker
Post by Daniel B.
Post by Rich Felker
Again, software which does not handle corner cases correctly is crap.
Why are you confusing "special-case" with "corner case"?
I never said that software shouldn't handle corner cases such as illegal
UTF-8 sequences.
I meant that an editor that handles illegal UTF-8 sequences other than
by simply rejecting the edit request is a bit of a special case compared
to general-purpose software, say an XML processor, for which some
specification requires (or recommends?) that the processor ignore or
reject any illegal sequences. The software isn't failing to handle the
corner case; it is handling it--by explicitly rejecting it.
1. I open a file in my text editor for editing, unaware that it
contains invalid sequences.
2. The editor either silently clobbers them, or presents some sort of
warning (which, as a newbie, I will skip past as quickly as I can) and
then clobbers them.
3. I save the file, and suddenly I've irreversibly destroyed huge
amounts of data.
It's simply not acceptable for opening a file and resaving it to not
yield exactly the same, byte-for-byte identical file, because it can
lead either to horrible data corruption or inability to edit when your
file has somehow gotten malformed data into it. If your editor
corrupts files like this, it's broken and I would never even consider
using it.
As an example of broken behavior (but different from what you're
talking about since it's not UTF-8), XEmacs converts all characters to
its own nasty mule encoding when it loads the file. It proceeds to
clobber all Unicode characters which don't also exist in legacy mule
character sets, and upon saving, the file is horribly destroyed. Yes
this situation is different, but the only difference is that UTF-8 is
a proper standard and mule is a horrible hack. The clobbering is just
as wrong either way.
(I'm hoping that XEmacs developers will fix this someday soon since I
otherwise love XEmacs, but this is pretty much a show-stopper since it
clobbers characters I actually use..)
Post by Daniel B.
What I meant (given the quoted part below you replied before) was that
if you're dealing with a file that overall isn't valid UTF-8, how would
you know whether a particular part that looks like valid UTF-8,
representing some characters per the UTF-8 interpretation, really
represents those characters or is an erroneously mixed-in representation
of other characters in some other encoding?
Since you're talking about preserving what's there as opposed to doing
anything more than that, I would guess your answer is that it really
doesn't matter. (Whether you treated 0xCF 0xBF as a correct UTF-8
sequence and displayed the character U+03FF or, hypothetically, treated
it as an incorrectly-inserted Latin-1 encoding of U+00DF U+00BF and
displayed those characters, you'd still write the same bytes back out.)
Yes, that's exactly my answer. You might as well show it as the
character in case it really was supposed to be the character. Now it
sounds like we at least understand what one another are saying.
Post by Daniel B.
Post by Rich Felker
Post by Daniel B.
example, if at one point you see the UTF-8-illegal byte sequence
0x00 0xBF and assume that that 0xBF byte means character U+00BF, then
This is incorrect. It means the byte 0xBF, and NOT ANY CHARACTER.
You said you're talking about a text editor, that reads bytes, displays
legal UTF-8 sequences as the characters they represent in UTF-8, doesn't
reject other UTF-8-illegal bytes, and does something with those bytes.
What does it do with such a byte? It seems you were talking about
mapping it to some character to display it. Are you talking about
something else, such as displaying the hex value of the byte?
Yes. Actually GNU Emacs displays octal instead of hex, but it's the
same idea. The pager "less" displays hex, such as <BF>, in reverse
video, and shows legal sequences that make up illegal or unprintable
codepoints in the form <U+D800> (also reverse video).
Post by Daniel B.
Yes someone did--they wrote about rejecting spam mail by detecting
bytes/octets with the high bit set.
Oh that was me. I misunderstood what you meant, sorry.
Post by Daniel B.
Post by Rich Felker
If you’re going to do this, at least map into the PUA rather than to
Latin-1..... At least that way it’s clear what the meaning is.
That makes it a bit less convenient, since then the numeric values of
the characters don't match the numeric values of the bytes.
But yes, doing all that is not something you'd want to escape into the
wild (i.e., be seen outside the immediate code where you need to fake
byte-level regular expressions in Java).
*nod*
Rich
--
Linux-UTF8: i18n of Linux on all levels
Archive: http://mail.nl.linux.org/linux-utf8/
Daniel B.
2007-04-05 01:45:11 UTC
Permalink
Post by Rich Felker
Post by Daniel B.
Post by Rich Felker
Again, software which does not handle corner cases correctly is crap.
Why are you confusing "special-case" with "corner case"?
I never said that software shouldn't handle corner cases such as illegal
UTF-8 sequences.
I meant that an editor that handles illegal UTF-8 sequences other than
by simply rejecting the edit request is a bit of a special case compared
to general-purpose software, say an XML processor, for which some
specification requires (or recommends?) that the processor ignore or
reject any illegal sequences. The software isn't failing to handle the
corner case; it is handling it--by explicitly rejecting it.
It is a corner case!
We seem to be having a communication problem, but I don't quite see
what the cause is.

I agree that it is a corner case. However, (seemingly) clearly, what
you wrote indicates you think I don't or wouldn't.

(I was arguing that handling the corner case by doing something other
than simply rejecting the illegal UTF-8 sequences was a bit of a
special case, just like, say, handling ill-formed XML is not something
a general XML processor (parser) has to do (it rejects it) but _is_
something a typical XML editor would want to do.

And to be clear, I'm not arguing that an editor should _not_ be a
special case (that is, not arguing that it shouldn't be careful to avoid
changing the file unintentionally). I was only pointing out that it _is_
a special case (because whatever UTF-8 issues we were talking about
many messages ago seem to apply differently to special-case tools (e.g.,
a general text editor) vs. general tools (e.g., HTTP POST receiver code).

Maybe at first I thought you were talking about a UTF-8-_only_ editor.)
Post by Rich Felker
It’s simply not acceptable for opening a file and resaving it to not
yield exactly the same, byte-for-byte identical file, because it can
lead either to horrible data corruption or inability to edit when your
file has somehow gotten malformed data into it.
(Yes, I agree.)



...
Post by Rich Felker
Post by Daniel B.
You said you're talking about a text editor, that reads bytes, displays
legal UTF-8 sequences as the characters they represent in UTF-8, doesn't
reject other UTF-8-illegal bytes, and does something with those bytes.
What does it do with such a byte? It seems you were taking about
mapping it to some character to display it. Are you talking about
something else, such as displaying the hex value of the byte?
Yes.
Roger.


Daniel
SrinTuar
2007-03-27 17:51:59 UTC
Permalink
Post by Egmont Koblinger
But in most of the cases you have to _think_ in
characters, otherwise it's quite unlikely that your application will work
correctly.
I'm not quite sure how "thinking in characters" helps an application,
in general. I'd be interested if you had a concrete example...
Post by Egmont Koblinger
Post by SrinTuar
The only time source code needs to care about
characters is when it has to layout or format them for display.
No, there are many more situations. Even if your job is so simple that you
only have to convert a text to uppercase, you already have to know what
encoding (and actually what locale) is being used.
Thinking in characters for that, such as calling a function like
"toupper", is broken.
There is no guarantee that case folding will maintain a 1 to 1 mapping of
unicode codepoints.

Here you are better off working with whole strings, and when doing so you don't
have to think in characters or codepoints at all.
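
(Perl itself illustrates the point in any reasonably recent version: ß has
no single-character uppercase form, so the string grows under uc(), and full
case folding is naturally a whole-string operation.)

use utf8;                     # the source below is UTF-8
use feature 'fc';             # fc() needs perl 5.16 or later

my $word = "Straße";
print uc($word), "\n";                             # "STRASSE": one character became two
print "same\n" if fc("Straße") eq fc("STRASSE");   # case folding compares whole strings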
Post by Egmont Koblinger
Finding a particular
letter (especially in case insensitive mode), performing regexp matching,
alphabetical sorting etc. are just a few trivial examples where you must
think in characters.
It's probably advisable to use a library regex engine rather than to re-write
custom regex engines all the time. Once you have a regex library that handles
codepoints, the code that uses it doesn't have to care about them in particular.
Post by Egmont Koblinger
If none of these trivial string operations depend on the encoding then you
don't have to use this feature of perl, that's all. Simply make sure that
the file descriptors are not set to utf8, neither are the strings that you
concat or match to. etc, so you stay in world of pure bytes.
The problem is that as soon as you use a library routine that is utf-8 aware,
it sets the utf-8 flag on a string and problems start to result. If there were
no utf-8 flag on the scalar strings to be set, then you could stay in the byte
world all the time, while still using unicode functionality where you needed it.
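
(A small demonstration of the flag in question, for anyone who hasn't run
into it; the strings are just an example:)

use Encode;

my $bytes = "caf\xc3\xa9";                    # raw UTF-8 bytes, UTF8 flag off
my $chars = Encode::decode("UTF-8", $bytes);  # character string, UTF8 flag on

print utf8::is_utf8($bytes) ? "on\n" : "off\n";   # off
print utf8::is_utf8($chars) ? "on\n" : "off\n";   # on

# concatenating them silently upgrades the byte string as if each byte
# were Latin-1, so the 0xC3 0xA9 pair turns into the two characters "Ã©"
my $mixed = $bytes . $chars;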
Daniel B.
2007-03-28 02:57:44 UTC
Permalink
Post by SrinTuar
...
Thinking in characters for that: such as calling a function like
"toupper" is broken.
There is no guarantee that case folding will maintain a 1 to 1 mapping of
unicode codepoints.
Case conversion probably needs to operate on strings (multiple
characters), not single characters.

Daniel
--
Daniel Barclay
***@smart.net
Egmont Koblinger
2007-03-28 14:03:23 UTC
Permalink
Post by SrinTuar
I'm not quite sure how "thinking in characters" helps an application,
in general. I'd be interested if you had a concrete example...
I don't have a concrete example. It's just a level of abstraction you have
in your mind. When you are coding, you are not just randomly hitting your
keyboard (see infinite monkeys vs. Shakespeare), you have something in your
mind, you give your variables a meaning, you have an intent with your
code... By "thinking in characters" I meant this. Probably all your "if"
branches, all your pointer increments and everything happens because you know
that you handle a _character_ and write your code according to that. In most
cases you can't write good code if you don't know what kind of data you're
dealing with. For example it's impossible to implement a regexp matching
routine if you have no idea what encoding is being used.
Post by SrinTuar
It's probably advisable to use a library regex engine than to re-write
custom regex engines all the time.
Sure.
Post by SrinTuar
Once you have a regex library that handles codepoints, the code that uses
it doesnt have to care about them in particular.
It's not so simple. Suppose you have a byte sequence (decimal) 65 195 129
66. (This is the beginning of the Hungarian alphabet AÁB... encoded in
UTF-8). Suppose you want to test whether it matches to the regexp 65 46 66
("A.B"). Does it match? It depends. If the byte sequence really denotes AÁB
(i.e. it is encoded in UTF-8) then it does. If it has different semantics (a
different character sequence encoded in some other 8-bit encoding) then it
doesn't. How do you think perl is supposed to overcome this problem if it
didn't have Unicode support?

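(The difference is easy to see directly in Perl: the same four bytes give
opposite answers depending on whether they have been decoded.)

use Encode;

my $bytes = "\x41\xc3\x81\x42";               # decimal 65 195 129 66

# at the byte level "." matches exactly one byte, so "A.B" cannot match:
# there are two bytes (0xC3 0x81) between the A and the B
my $as_bytes = $bytes =~ /\AA.B\z/ ? "match" : "no match";

# decoded as UTF-8 the same data is the three characters A, Á, B
my $chars    = Encode::decode("UTF-8", $bytes);
my $as_chars = $chars =~ /\AA.B\z/ ? "match" : "no match";

print "bytes: $as_bytes, characters: $as_chars\n";   # no match vs. match
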
You have to make sure that the string to test and the regexp itself are
encoded in the same charset, and in turn this also matches the charset the
regexp library routine expects. Otherwise things will go plain wrong sooner
or later. In some languages regexp matching is done via functions, and
probably you may have an 8-bit match() and a Unicode-aware mbmatch() as
well. Remember that in perl regexp matching is part of the language itself,
the =~ and !~ operators do that. Offering mb=~ and mb!~ counterparts as
built-in operators would be IMHO terribly disgusting. If the operator itself
remains the same then these are the string and regexp objects (the arguments
of that operator) that have to carry the information which the regexp
matching operator can depend on.
Post by SrinTuar
The problem is that as soon as you use a library routine that is utf-8 aware,
it sets the utf-8 flag on a string and problems start to result. If there were
no utf-8 flag on the scalar strings to be set, then you could stay in the byte
world all the time, while still using unicode functionality where you needed it.
As I've already said, there's absolutely nothing preventing you from _not_
using the Unicode features of Perl at all. But then I'm just curious how you
would match accented characters to regexps for example.
--
Egmont
Rich Felker
2007-03-28 16:26:30 UTC
Permalink
Post by Egmont Koblinger
Post by SrinTuar
I'm not quite sure how "thinking in characters" helps an application,
in general. I'd be interested if you had a concrete example...
dealing with. For example it's impossible to implement a regexp matching
routine if you have no idea what encoding is being used.
Post by SrinTuar
It's probably advisable to use a library regex engine than to re-write
custom regex engines all the time.
Sure.
I think SrinTuar has made it clear that he agrees that a
regular expression engine needs to be able to interpret characters.
His point is that the calling code does not have to know anything
about characters, only strings.
Post by Egmont Koblinger
Post by SrinTuar
Once you have a regex library that handles codepoints, the code that uses
it doesn't have to care about them in particular.
It's not so simple. Suppose you have a byte sequence (decimal) 65 195 129
66. (This is the beginning of the Hungarian alphabet AÁB... encoded in
UTF-8). Suppose you want to test whether it matches to the regexp 65 46 66
("A.B"). Does it match? It depends. If the byte sequence really denotes AÁB
(i.e. it is encoded in UTF-8) then it does. If it has different semantics (a
different character sequence encoded in some other 8-bit encoding) then it
doesn't. How do you think perl is supposed to overcome this problem if it
didn't have Unicode support?
You have to make sure that the string to test and the regexp itself are
encoded in the same charset, and in turn this also matches the charset the
regexp library routine expects.
When interpreting bytes as characters, you do so according to the
system's character encoding, as exposed by the C multibyte character
handling functions. On systems which allow the user to choose an
encoding, the user then selects it via the LC_CTYPE category. On my
system, it's always UTF-8 and not a runtime option.

If you want to process foreign encodings (not the system/locale native
encoding) then you should convert them to your native encoding first
(via iconv or a similar library). If your native encoding is not able
to represent all the characters in the foreign encoding then you're
out of luck and you should give up your legacy codepage and switch to
UTF-8 if you want multilingual support.
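
(In Perl terms, a rough sketch of what "respect LC_CTYPE" could look like,
using only core modules; this illustrates the idea rather than describing how
perl's own utf8 layers behave:)

use POSIX qw(setlocale LC_CTYPE);
use I18N::Langinfo qw(langinfo CODESET);
use Encode qw(decode);

setlocale(LC_CTYPE, "");              # adopt the user's locale settings
my $codeset = langinfo(CODESET);      # e.g. "UTF-8" or "ISO-8859-2"

while (my $line = <STDIN>) {          # bytes come in
    my $text = decode($codeset, $line);   # characters, per the locale's charset
    # ... character-level processing on $text ...
}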
Post by Egmont Koblinger
Otherwise things will go plain wrong sooner
or later. In some languages regexp matching is done via functions, and
probably you may have an 8-bit match() and a Unicode-aware mbmatch() as
well.
I don't know which languages do this, but it's wrong. mbmatch() would
cover both cases just fine (i.e. it would work even if the native
encoding is 8bit). If you want a BYTE-based regex engine, that's
another matter, and AFAIK few languages provide such a thing
explicitly. (If they do, it's by misappropriating an
8bit-codepage-targetted matcher.) But treating bytes and 8bit codepage
encodings as the same thing is wrong. Bytes represent numbers in the
range 0-255. 8bit codepages represent 256-character subsets of
Unicode. These are not the same.
Post by Egmont Koblinger
Post by SrinTuar
The problem is that as soon as you use a library routine that is utf-8 aware,
it sets the utf-8 flag on a string and problems start to result. If there were
no utf-8 flag on the scalar strings to be set, then you could stay in the byte
world all the time, while still using unicode functionality where you needed it.
As I've already said, there's absolutely nothing preventing you from _not_
using the Unicode features of Perl at all. But then I'm just curious how you
would match accented characters to regexps for example.
Regex would always match characters...

Rich
SrinTuar
2007-03-28 16:44:44 UTC
Permalink
Post by Egmont Koblinger
Post by SrinTuar
Once you have a regex library that handles codepoints, the code that uses
it doesn't have to care about them in particular.
It's not so simple. Suppose you have a byte sequence (decimal) 65 195 129
66. (This is the beginning of the Hungarian alphabet AÁB... encoded in
UTF-8).
Why is it not so simple? I just want to know some basic information:
Does it match or not. What range of bytes in the string was matched.

I don't care what the regex library does under the covers, and I
shouldn't have to care...
I can safely extract substrings on those boundaries now if it did its job right.

If it knows how to match "Á" to ".", then I don't have to know how it
goes about doing so.
Even better if the regex engine handles both normalization forms
transparently. My code should never have to care. I shouldn't have to
jump through hoops, and use all sorts of fancy "binmode" settings or
perform "Encode::decode" incantations everywhere to turn my scalars
back.
Egmont Koblinger
2007-03-28 17:49:57 UTC
Permalink
Post by SrinTuar
Post by Egmont Koblinger
It's not so simple. Suppose you have a byte sequence (decimal) 65 195 129
66. (This is the beginning of the Hungarian alphabet AÁB... encoded in
UTF-8).
Does it match or not. What range of bytes in the string was matched.
Seems you didn't understand. It depends on how to interpret the byte
sequence above. If it is encoded in UTF-8 then it means "AÁB" and hence it
matches the regexp "A followed by a letter followed by B". However, the same
byte sequence may encode "A├üB" in CP437 ("A followed by a vertical+right
frame element followed by u with diaeresis followed by B"), and may also
encode "Aц│B" (cyrillic "tse" and a vertical frame element) in KOI8-R and so
on. In these latter cases it does not match the same regexp. See? Whether it
matches or not _does_ depend on the character set that you use. It's not
perl's flaw that it couldn't decide, it's impossible to decide in theory
unless you know the charset.
Post by SrinTuar
I don't care what the regex library does under the covers, and I
shouldn't have to care...
they should _just work_ without requiring any charset knowledge from me.

In an ideal world where no more than one character set (and one
representation) is used, a developer could expect the same from any
programming language or development environment. But our world is not ideal.
There _are_ many character sets out there, and it's _your_ job, the
programmer's job to tell the compiler/interpreter how to handle your bytes
and to hide all these charset issues from the users. Therefore you have to
be aware of the technical issues and have to be able to handle them.

Having a variable in your code that stores a sequence of bytes, without you
being able to tell what encoding is used there, is just like having a
variable to store the height of people, without knowing whether it's
measured in cm or meters or feet... The actions you may take are very limited
(e.g. you can add two of these to calculate how large they'd be if one stood
on top of the other (btw the answer would also lack the unit)),
but there are plenty of things you cannot answer.

There are many ways to solve charset problems, and which one to choose
depends on the goals of your software too. If you only handle _texts_ then
probably the best approach is to convert every string as soon as it arrives
at your application to some Unicode representation (UTF-8 for Perl, "String"
(which uses UTF-16) for Java and so on), then use this representation inside
your application, and convert (if necessary) when you output them. If you
must be able to handle arbitrary byte sequences, then (as Rich pointed out)
you should keep the array of bytes but you might need to adjust a few
functions that handle them, e.g. regexp matching might be a harder job in
this case (e.g. what does a dot (any character) mean in this case?).
Post by SrinTuar
If it knows how to match "Á" to ".", then I dont have to know how it
goes about doing so.
Recently you asked why perl didn't just simply work with bytes. Now you talk
about the "Á" letter. But you seem to forget about one very important step:
how should perl know that your sequence of bytes represents "Á" and not some
other letter(s)? It's _your_ job to tell it to perl, there's no way it could
tell it on its own. And this is where all this utf8 magic comes into play.
--
Egmont
Rich Felker
2007-03-28 18:35:32 UTC
Permalink
Post by Egmont Koblinger
matches or not _does_ depend on the character set that you use. It's not
perl's flaw that it couldn't decide, it's impossible to decide in theory
unless you know the charset.
It is perl's flaw. The LC_CTYPE category of the locale determines the
charset. This is how all sane languages work.
Post by Egmont Koblinger
Post by SrinTuar
I don't care what the regex library does under the covers, and I
shouldnt have to care...
they should _just work_ without requiring any charset knowledge from me.
In an ideal world where no more than one character set (and one
representation) is used, a developer could expect the same from any
programming language or development environment. But our world is not ideal.
There _are_ many character sets out there, and it's _your_ job, the
programmer's job to tell the compiler/interpreter how to handle your bytes
and to hide all these charset issues from the users. Therefore you have to
be aware of the technical issues and have to be able to handle them.
I don't have to be aware of it in any other language. It just works.
Perl is being unnecessarily difficult here.
Post by Egmont Koblinger
Having a variable in your code that stores sequence of bytes, without you
being able to tell what encoding is used there, is just like having a
variable to store the height of people, without knowing whether it's
measured in cm or meter or feet... The actions you may take are very limited
(e.g. you can add two of these to calculate how large they'd be if one would
stand on the top of the other (btw the answer would also lack the unit)),
but there are plenty of things you cannot answer.
Nonsense. As long as all the length variables are in the SAME unit,
your program has absolutely no reason to care whatsoever exactly what
that unit it. Any unit is just as good as long as it's consistent. The
same goes for character sets. There is a well-defined native character
encoding, which should be UTF-8 on any modern system. When importing
data from foreign encodings, it should be converted. This is just the
same as if you stored all your lengths in a database. As long as
they're all consistent (e.g. all in meters) then you don't have to
grossly increase complexity and redundancy by storing a unit with each
value. Instead, you just convert foreign values when they're input,
and assume all local data is already in the correct form. The same
applies to character encoding.
Post by Egmont Koblinger
your application, and convert (if necessary) when you output them. If you
must be able to handle arbitrary byte sequences, then (as Rich pointed out)
you should keep the array of bytes but you might need to adjust a few
functions that handle them, e.g. regexp matching might be a harder job in
this case (e.g. what does a dot (any character) mean in this case?).
Regex matching is also easier with bytes, even if your bytes represent
multibyte characters. The implementation that converts to UTF-32 or
similar is larger, slower, klunkier, and more error-prone.
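
(One illustration of what byte-level matching buys you: because UTF-8 never
embeds one character's bytes inside another's, a byte regex for a literal
UTF-8 string can't produce a false hit in the middle of some other character.
Classes like \w or case-insensitive matching are of course another story.)

my $haystack = "Stra\xc3\x9fe und Ma\xc3\x9f";    # "Straße und Maß" as raw UTF-8 bytes
my $needle   = "\xc3\x9f";                        # the two bytes of "ß"
my $count    = () = $haystack =~ /\Q$needle\E/g;  # count byte-level hits
print "$count\n";                                 # 2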
Post by Egmont Koblinger
Post by SrinTuar
If it knows how to match "Á" to ".", then I dont have to know how it
goes about doing so.
Recently you asked why perl didn't just simply work with bytes. Now you talk
how should perl know that your sequence of bytes represents "Á" and not some
other letter(s)?
Because the system defines this as part of LC_CTYPE.

Rich
Egmont Koblinger
2007-03-29 10:01:28 UTC
Permalink
Post by Rich Felker
Post by Egmont Koblinger
matches or not _does_ depend on the character set that you use. It's not
perl's flaw that it couldn't decide, it's impossible to decide in theory
unless you know the charset.
It is perl's flaw. The LC_CTYPE category of the locale determines the
charset. This is how all sane languages work.
LC_CTYPE determines the system charset. This is used when reading from /
writing to a terminal, to/from text files by default; this is the charset
you expect messages coming from glibc to be encoded in; etc...

But this is not necessarily the charset you want your application to work
with. Think of Gtk+-2 for example, internally it always uses UTF-8, no
matter what your locale is. So it _has_ to tell every external regexp
routine (if it uses any) to work with UTF-8, not with the charset implied by
LC_CTYPE.

And you can think of any web browser, mail client and so on, they have to
cope with the charset that particular web page or message uses, yet again
independently from the system locale.

So, to stay at our example of a fictional regexp matching library: If this
library insists on assuming that the strings are encoded according to
LC_CTYPE then it's quite hard to use it correctly in such circumstances.
(You might need to write a wrapper that alters the locale temporarily -- but
could you tell me how to find a locale whose charset is one particular
charset?) If the charset the regexp library expects _defaults_ to LC_CTYPE
but is overridable then it's much better. And for libraries such as
glib2/gtk2 which force using utf-8 internally it's of course perfectly okay
if they implement an utf8-only regexp matching function.
Post by Rich Felker
I don't have to be aware of it in any other language. It just works.
Show me your code that you think "just works" and I'll show you where you're
wrong. :-)
Post by Rich Felker
Perl is being unnecessarily difficult here.
You forget one very important thing: Compatibility. In the old days Perl
used 8-bit strings and there many people created many perl programs that
handled 8-bit (most likely iso-8859-1) data. These programs must continue to
work correctly with newer Perls. This implies that perl mustn't assume UTF-8
charset for the data flows (even if your locale says so) since in this case
it would produce different output.
Post by Rich Felker
Nonsense. As long as all the length variables are in the SAME unit,
your program has absolutely no reason to care whatsoever exactly what
that unit is. Any unit is just as good as long as it's consistent.
If you don't know what unit is used, then you're unable to answer questions
whether that man is most likely healthy, whether he's extremely tall or
extremely small.

If you don't know what unit is used, how do you fill up your structures from
external data source? What if you are supposed to store cm but the data
arrives in inches? How would you know that you need to convert?

What if multiple external data sources use different units? If you ignore
the whole problem you'll end up with different units in your database where
even adding two numbers doesn't make any sense - just as it doesn't make any
sense to simply concatenate two byte sequences that represent text in
different encodings.

I guess you've heard several stories about million (billion?) dollar
projects failing due to such stupid mistakes - one developer sending the
data in centimeters, the other expecting them to arrive in inches.
--
Egmont
Rich Felker
2007-03-29 17:05:57 UTC
Permalink
Post by Egmont Koblinger
Post by Rich Felker
Post by Egmont Koblinger
matches or not _does_ depend on the character set that you use. It's not
perl's flaw that it couldn't decide, it's impossible to decide in theory
unless you know the charset.
It is perl's flaw. The LC_CTYPE category of the locale determines the
charset. This is how all sane languages work.
LC_CTYPE determines the system charset. This is used when reading from /
writing to a terminal, to/from text files by default; this is the charset
you expect messages coming from glibc to be encoded in; etc...
But this is not necessarily the charset you want your application to work
with. Think of Gtk+-2 for example, internally it always uses UTF-8, no
matter what your locale is.
Gtk+-2’s approach is horribly incorrect and broken. By default it
writes UTF-8 filenames into the filesystem even if UTF-8 is not the
user’s encoding.
Post by Egmont Koblinger
So it _has_ to tell every external regexp
routine (if it uses any) to work with UTF-8, not with the charset implied by
LC_CTYPE.
This is their fault for designing it wrong. If they correctly used the
requested encoding, there would be no problem.
Post by Egmont Koblinger
And you can think of any web browser, mail client and so on, they have to
cope with the charset that particular web page or message uses, yet again
independently from the system locale.
Not independently. All they have to do is convert it to the local
encoding. And yes I’m quite aware that a lot of information might be
lost in the process. That’s fine. If users want to be able to read
multilingual text, they NEED to migrate to a character encoding that
supports multilingual text. Trying to “work around” this [non-]issue
by mixing encodings and failing to respect LC_CTYPE is a huge hassle
for negative gain.
Post by Egmont Koblinger
Post by Rich Felker
I don't have to be aware of it in any other language. It just works.
Show me your code that you think "just works" and I'll show you where you're
wrong. :-)
Mutt is an excellent example.
Post by Egmont Koblinger
Post by Rich Felker
Perl is being unnecessarily difficult here.
You forget one very important thing: Compatibility. In the old days Perl
used 8-bit strings, and many people created many perl programs that
handled 8-bit (most likely iso-8859-1) data. These programs must continue to
work correctly with newer Perls. This implies that perl mustn't assume UTF-8
charset for the data flows (even if your locale says so) since in this case
it would produce different output.
Such programs could just as easily be run in a legacy locale, if
available on the system. But unless the data they’re processing
actually contains Latin-1 (in which case you’re in a Latin-1
environment!), there’s no reason that treating the strings as UTF-8
should cause any harm. ASCII is the same either way of course. The
only possible exception is if a perl program is using regex on true
binary data, which is a bit dubious to begin with.
Post by Egmont Koblinger
Post by Rich Felker
Nonsense. As long as all the length variables are in the SAME unit,
your program has absolutely no reason to care whatsoever exactly what
that unit is. Any unit is just as good as long as it's consistent. The
If you don't know what unit is used, then you're unable to answer questions
whether that man is most likely healthy, whether he's extremely tall or
extremely small.
Thresholds/formulae for what height is tall/small/healthy/whatever
just need to be written using whatever unit you’ve selected as the
global units.
Post by Egmont Koblinger
If you don't know what unit is used, how do you fill up your structures from
external data source? What if you are supposed to store cm but the data
arrives in inches? How would you know that you need to convert?
Same way it works with character encodings. The code importing
external data knows what format the internal data must be in. The
internal code has no knowledge or care what the unit/encoding is. This
keeps the internal code clean and simple.
Post by Egmont Koblinger
I guess you've heard several stories about million (billion?) dollar
projects failing due to such stupid mistakes - one developer sending the
data in centimeters, the other expecting them to arrive in inches.
Yes.

Rich
Egmont Koblinger
2007-03-29 17:43:54 UTC
Permalink
Post by Rich Felker
Gtk+-2’s approach is horribly incorrect and broken. By default it
writes UTF-8 filenames into the filesystem even if UTF-8 is not the
user’s encoding.
There's an environment variable that tells Gtk+-2 to use legacy encoding in
filenames. Whether or not forcing UTF-8 on filenames is a good idea is
really questionable, you're right.

But I'm not just talking about filenames, there are many more strings
handled inside Glib/Gtk+. Strings coming from gettext that will be displayed
on the screen, error messages originating from libc's strerror, strings
typed by the user into entry widgets and so on. Gtk+-2 uses UTF-8
everywhere, and (except for the filenames) it's clearly a wise decision.
Post by Rich Felker
Not independently. All they have to do is convert it to the local
encoding. And yes I’m quite aware that a lot of information might be
lost in the process. That’s fine. If users want to be able to read
multilingual text, they NEED to migrate to a character encoding that
supports multilingual text. Trying to “work around” this [non-]issue
by mixing encodings and failing to respect LC_CTYPE is a huge hassle
for negative gain.
I think this is just plain wrong. Since when do you browse the net and read
accented pages? Since when do you use a UTF-8 locale?

I have used Linux with a Latin-2 locale since 1996. It was around 2003 that I
began using UTF-8 sometimes, and it was last year that I finally managed to
switch fully to UTF-8. There are still several applications that are a
nightmare with UTF-8 (midnight commander for example). A few years ago
software was even much worse; many programs were not ready for UTF-8, and it
would have been nearly impossible to switch to UTF-8. When did you switch to
unicode? Probably a few years earlier than I did, but I bet you also had
those old-fashioned 8-bit days...

So, I have used Linux for 10 years with an 8-bit locale set up. Still I
could visit French, Japanese etc. pages and the letters appeared correctly.
Believe me, I would have switched to Windows or whatever if Linux browsers
weren't able to perform this pretty simple job.

It's not about workarounds or non-issues. If a remote server tells my
browser to display a kanji then my browser _must_ display a kanji, even if
my default charset doesn't contain it. Having an old-fashioned system
configuration is no excuse for any application not to properly display the
characters. (Except for terminal applications that are forced to use the
charset I use.)
Post by Rich Felker
Post by Egmont Koblinger
Show me your code that you think "just works" and I'll show you where you're
wrong. :-)
Mutt is an excellent example.
As you might see from the header of my messages, I'm using Mutt too. In this
regard mutt is a nice piece of software that handles accented characters
correctly (nearly) always. In order to do this, it has to be aware of the
charset of messages (and its parts) and the charset of the terminal and has
to convert between them plenty of times. The fact it does its job (mostly)
correctly implies that the authors didn't just write "blindly copy the bytes
from the message to the terminal" kind of functions, they have taken charset
issues into account and converted the strings whenever necessary. From a
user's point of view, accent handling in Mutt "just works". This is because
the developers took care of it. If the developers had thought "copying those
bytes from the mail to the terminal" would "_just work_" then mutt would be
an unusable mess.
--
Egmont
Rich Felker
2007-03-29 20:24:39 UTC
Permalink
Post by Egmont Koblinger
Post by Rich Felker
Gtk+-2’s approach is horribly incorrect and broken. By default it
writes UTF-8 filenames into the filesystem even if UTF-8 is not the
user’s encoding.
There's an environment variable that tells Gtk+-2 to use legacy encoding in
filenames. Whether or not forcing UTF-8 on filenames is a good idea is
really questionable, you're right.
Well the real solution is forcing UTF-8 in filenames by forcing
everyone who wants to use multilingual text to switch to UTF-8
locales.
Post by Egmont Koblinger
But I'm not just talking about filenames, there are many more strings
handled inside Glib/Gtk+. Strings coming from gettext that will be displayed
on the screen, error messages originating from libc's strerror, strings
typed by the user into entry widgets and so on. Gtk+-2 uses UTF-8
everywhere, and (except for the filenames) it's clearly a wise decision.
Not if it will also be reading/writing text to stdout or text-based
config files, etc..
Post by Egmont Koblinger
I think this is just plain wrong. Since when do you browse the net and read
acccented pages? Since when do you use UTF-8 locale?
Using accented characters in your own language has always been
possible with legacy codepage locales, and is still possible with what
I consider the correct implementation. The only thing that's not
possible in legacy codepage locales is handling text from other
languages that need characters not present in your codepage.
Post by Egmont Koblinger
I have used Linux with a Latin-2 locale since 1996. It was around 2003 that I
began using UTF-8 sometimes, and it was last year that I finally managed to
switch fully to UTF-8. There are still several applications that are a
nightmare with UTF-8 (midnight commander for example). A few years ago
software was even much worse; many programs were not ready for UTF-8, and it
would have been nearly impossible to switch to UTF-8.
But now we’re living in 2007, not 2003 or 1996. Maybe your approaches
had some merit then, but that’s no reason to continue to use them now.
At this point anyone who wants multilingual text support should be
using UTF-8 natively, and if they have a good reason they’re not (e.g.
a particular piece of broken software) that software should be quickly
fixed.
Post by Egmont Koblinger
When did you switch to
unicode? Probably a few years earlier than I did, but I bet you also had
those old-fashioned 8-bit days...
I’ve always used UTF-8 since I started with Linux; until recently it
was just restricted to the first 128 characters of Unicode, though. :)
I never used 8bit codepages except to draw stuff on DOS waaaaay back.
Post by Egmont Koblinger
So, I have used Linux for 10 years with an 8-bit locale set up. Still I
could visit French, Japanese etc. pages and the letters appeared correctly.
UTF-8 has been around for almost 15 years now, longer than any real
character-aware 8bit locale support on Linux. It was a mistake that
8bit locales were ever implemented on Linux. If things had been done
right from the beginning we wouldn't even be having this discussion.

I’m sure you did have legitimate reasons to use Latin-2 when you did,
namely broken software without proper support for UTF-8. Here’s where
we have to agree to disagree I think: you’re in favor of workarounds
which get quick results while increasing the long-term maintenance
cost and hurting corner-case usability, while I’m in favor of omitting
functionality (even very desirable functions) until someone does it
right, with the goal of increasing the incentive for someone to do it
right.
Post by Egmont Koblinger
Believe me, I would have switched to Windows or whatever if Linux browsers
weren't able to perform this pretty simple job.
Your loss, not mine.
Post by Egmont Koblinger
It's not about workarounds or non-issues. If a remote server tells my
browser to display a kanji then my browser _must_ display a kanji, even if
Nonsense. If you don’t have kanji fonts installed then it can’t
display kanji anyway. Not having a compatible encoding is a comparable
obstacle to not having fonts. I see no reason that a system without
support for _doing_ anything with Japanese text should be able to
display it. What happens if you copy and paste it from your browser
into a terminal or text editor???

Even the Unicode standards talk about “supported subset” and give
official blessing to displaying characters outside the supported
subset as a ? or replacement glyph or whatever.
Post by Egmont Koblinger
Post by Rich Felker
Post by Egmont Koblinger
Show me your code that you think "just works" and I'll show you where you're
wrong. :-)
Mutt is an excellent example.
As you might see from the header of my messages, I'm using Mutt too. In this
regard mutt is a nice piece of software that handles accented characters
correctly (nearly) always. In order to do this, it has to be aware of the
charset of messages (and its parts) and the charset of the terminal and has
to convert between them plenty of times. The fact it does its job (mostly)
correctly implies that the authors didn't just write "blindly copy the bytes
from the message to the terminal" kind of functions, they have taken charset
issues into account and converted the strings whenever necessary. From a
user's point of view, accent handling in Mutt "just works". This is because
the developers took care of it.
Mutt “just works” in exactly the sense I described. I've RTFS'd mutt
and studied it a fair bit: it converts all data to your locale charset
(or an overridable configured charset, but that could cause problems).
This is absolutely necessary since it wants to be able to use external
editors and viewers which require data to be in the system's encoding,
use the system regex routines, etc. All of this is what makes mutt
light, clean, and a good citizen among other unix apps.
Post by Egmont Koblinger
If the developers had thought "copying those
bytes from the mail to the terminal" would "_just work_" then mutt would be
an unusable mess.
I have never suggested doing something idiotic like that, yet you keep
bringing it up again and again as if I did. “Just work” means using
the C/POSIX multibyte and iconv and regex, etc. functions the way
they’re intended to be used and treating all text as text (in the
LC_CTYPE sense) once it’s been read in from local or foreign sources
(with natural conversions required for the latter).

Rich
Egmont Koblinger
2007-03-30 09:39:01 UTC
Permalink
On Thu, Mar 29, 2007 at 04:24:39PM -0400, Rich Felker wrote:

Hi,
Post by Rich Felker
Using accented characters in your own language has always been
possible with legacy codepage locales
Of course.
Post by Rich Felker
The only thing that's not
possible in legacy codepage locales is handling text from other
languages that need characters not present in your codepage.
You say it's not possible??? Just launch firefox/opera/konqueror/whatever
modern browser with a legacy locale and see whether it displays all foreign
letters. It _does_, though you believe it's "not possible".

But let's reverse the whole story. I write a homepage in Hungarian, using
either latin2 or utf8 charset. Someone who lives in West Europe, America,
Asia, the Christmas Island... anywhere else happens to visit this page. It's
not only important for him, it's also important for me that my accents get
displayed correctly there, under a locale unknown to me. And luckily that's
how all good browsers work. I can't see why you're reasoning that this
shouldn't (or mustn't?) work.
Post by Rich Felker
At this point anyone who wants multilingual text support should be
using UTF-8 natively,
At this point everyone should be using UTF-8 natively. But not everybody
does.
Post by Rich Felker
I’ve always used UTF-8 since I started with Linux; until recently it
was just restricted to the first 128 characters of Unicode, though. :)
:-)
Post by Rich Felker
UTF-8 has been around for almost 15 years now, longer than any real
character-aware 8bit locale support on Linux. It was a mistake that
8bit locales were ever implemented on Linux. If things had been done
right from the beginning we wouldn't even be having this discussion.
I agree.
Post by Rich Felker
I’m sure you did have legitimate reasons to use Latin-2 when you did,
namely broken software without proper support for UTF-8.
Yes.
Post by Rich Felker
Here’s where
we have to agree to disagree I think: you’re in favor of workarounds
which get quick results while increasing the long-term maintenance
cost and hurting corner-case usability, while I'm in favor of omitting
functionality (even very desirable functions) until someone does it
right, with the goal of increasing the incentive for someone to do it
right.
Imagine the following. Way back in 1996 I had to create homepages, text
files, LaTeX documents etc. that contained Hungarian accented characters.
There were two ways to go. One was to use the legacy 8-bit encoding
(iso-8859-2 for Hungarian), the other was to fix software to work with UTF-8.
(Oh, there's a third way for homepages and latex files: use their disgusting
escaping mechanism.) In the first case I'm ready with my job in several
minutes. In the second case I would have had to fix dozens (maybe even hundreds) of
pieces of software -- a job that has taken developers around the world more
than 10 years and still isn't finished. Imagine that my boss asked me to
create a homepage and I answered him "okay, but I first have to fix a
complete operating system with its applications, I'll be ready in N years".
Are you still convinced that the 1st solution was just a "workaround" that
increased long-term maintenance cost? You would clearly be right if
software had been usable with UTF-8 in those days, but it's a fact that they
weren't.

Nowadays, when ninety-some percent of software is ready for UTF-8, you can
force UTF-8 and fix some remaining applications if needed. This approach --
though it might have been theoretically better -- just couldn't have worked in
the last century when only a minor part of software supported UTF-8.
Post by Rich Felker
Nonsense. If you don’t have kanji fonts installed then it can’t
display kanji anyway. Not having a compatible encoding is a comparable
obstacle to not having fonts.
You're mixing up two things: installed (having support for it) vs. default.

Of course if you have no kanji installed then you won't expect an
application to display it. If you don't have UTF-8 support _available_ then
no-one expects software to handle it. But if UTF-8 support is _available_
though it's not set as the _default_, it is reasonable to expect software
that needs it to use it despite the default locale.
Post by Rich Felker
I see no reason that a system without
support for _doing_ anything with Japanese text should be able to
display it.
See? You're talking about "support" too. If I set LC_ALL=hu_HU.ISO-8859-2
then my system still _supports_ UTF-8 and kanjis, though it's not the
default.
Post by Rich Felker
What happens if you copy and paste it from your browser
into a terminal or text editor???
Minor detail. Try it and see what happens. Depending on the terminal emulator, it might
skip unknown characters, replace them with a question mark, or refuse to
insert anything if the clipboard/selection contains an out-of-locale
character. It's not important at all.
Post by Rich Felker
Even the Unicode standards talk about “supported subset” and give
official blessing to displaying characters outside the supported
subset as a ? or replacement glyph or whatever.
Sure. But I still see no reason why any application should restrict its own
"supported subset" to the current locale's charset if it has no sane reason
to do so.
--
Egmont
Christopher Fynn
2007-03-30 11:07:55 UTC
Permalink
IMO these days all browsers should come with their default encoding set
to UTF-8 - and all HTML / XHTML editors should insert UTF-8 as the
default charset when creating new pages. If a user wants to go and
change this, OK, but usually there is no very good reason for creating a
page using any other encoding. Unicode is after all defined as the base
character set for HTML 4.0 and above.

Similarly all Linux distributions should use UTF-8 locales as the
default - and if a user wants to select a non UTF-8 locale at install
time they should probably receive some kind of mild warning.

- Chris
Post by David Starner
Hi,
Post by Rich Felker
Using accented characters in your own language has always been
possible with legacy codepage locales
Of course.
Post by Rich Felker
The only thing that's not
possible in legacy codepage locales is handling text from other
languages that need characters not present in your codepage.
You say it's not possible??? Just launch firefox/opera/konqueror/whatever
modern browser with a legacy locale and see whether it displays all foreign
letters. It _does_, though you believe it's "not possible".
It may display some of them incorrectly because of the overlap of
characters 128 -> 255 in that codepage and characters Unicode defines in
that range. Unless you're using an East Asian codepage, most browsers now
treat anything beyond 255 as a Unicode character.
Post by David Starner
But let's reverse the whole story. I write a homepage in Hungarian, using
either latin2 or utf8 charset. Someone who lives in West Europe, America,
Asia, the Christmas Island... anywhere else happens to visit this page. It's
not only important for him, it's also important for me that my accents get
displayed correctly there, under a locale unknown to me. And luckily that's
how all good browsers work. I can't see why you're reasoning that this
shouldn't (or mustn't?) work.
Post by Rich Felker
At this point anyone who wants multilingual text support should be
using UTF-8 natively,
At this point everyone should be using UTF-8 natively. But not everybody
does.
....
Egmont Koblinger
2007-03-30 11:30:58 UTC
Permalink
On Fri, Mar 30, 2007 at 05:07:55PM +0600, Christopher Fynn wrote:

Hi,
Post by Christopher Fynn
IMO these days all browsers should come with their default encoding set
to UTF-8
What do you mean by a browser's default encoding? Is it the encoding to be
assumed for pages lacking a charset specification? In this case iso-8859-1 is
a much better choice -- there are far more pages out there in the wild
encoded in latin1 that lack charset info than utf8 pages that lack this
info. (Maybe an utf8 auto-detection would be nice, though.) So my argument
for iso-8859-1 is not theoretical but practical.
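(Such auto-detection is also cheap: text in a legacy 8-bit charset is almost
never well-formed UTF-8, so a byte-level validity check is usually enough.
A rough sketch -- my own illustration, not code from any browser; it accepts
a few overlong/surrogate forms a strict validator would reject:)

#include <stddef.h>

/* Return 1 if buf[0..len-1] is plausibly UTF-8, 0 otherwise. */
static int looks_like_utf8(const unsigned char *buf, size_t len)
{
    size_t i = 0;
    while (i < len) {
        unsigned char c = buf[i];
        size_t follow;
        if (c < 0x80)                follow = 0;   /* ASCII */
        else if ((c & 0xE0) == 0xC0) follow = 1;   /* 2-byte sequence */
        else if ((c & 0xF0) == 0xE0) follow = 2;   /* 3-byte sequence */
        else if ((c & 0xF8) == 0xF0) follow = 3;   /* 4-byte sequence */
        else return 0;                             /* stray continuation byte etc. */
        if (follow && i + follow >= len)
            return 0;                              /* truncated sequence */
        for (size_t k = 1; k <= follow; k++)
            if ((buf[i + k] & 0xC0) != 0x80)
                return 0;                          /* not a continuation byte */
        i += follow + 1;
    }
    return 1;
}

If it returns 0 you fall back to iso-8859-1; if it returns 1, UTF-8 is almost
certainly the right guess.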
Post by Christopher Fynn
and all HTML / XHTML editors should insert UTF-8 as the
default charset when creating new pages.
Agree. You're also properly using the word "should". This is how they
_should_ work. Unfortunately this is not the way they actually do work.
See for example two bugs in Mozilla/Nvu:
https://bugzilla.mozilla.org/show_bug.cgi?id=315533
https://bugzilla.mozilla.org/show_bug.cgi?id=315543

My experiences with Seamonkey's charset handling in the html editor were
even worse than with Mozilla (the 1st bug report).
Post by Christopher Fynn
Similarly all Linux distributions should use UTF-8 locales as the
default - and if a user wants to select a non UTF-8 locale at install
time they should probably receive some kind of mild warning.
Perfectly agree. (Btw our distro doesn't even offer a non-utf8 locale at
installation. I do believe that asking the question "do you want your system
to do things right or wrong?" is always a bad idea. Software should behave
right and should not offer the possibility to behave wrong.)
Post by Christopher Fynn
It may display some of them incorrectly because of the overlap of
characters 128 -> 255 in that codepage and characters Unicode defines in
that range. Unless you're using an East Asian codepage, most browsers now
treat anything beyond 255 as a Unicode character.
I can't see this. Of course two different character sets may define
different symbols for the same byte. But it's not a problem as long as the
page properly mentions its character set. The page or the HTTP protocol
tells the browser which Asian codepage to use, and then the browser interprets the bytes
according to that charset and displays the result. I can't see any case
where it might be ambiguous. Could you please provide a concrete example?
--
Egmont
Rich Felker
2007-03-30 15:10:15 UTC
Permalink
Post by Egmont Koblinger
Hi,
Post by Christopher Fynn
IMO these days all browsers should come with their default encoding set
to UTF-8
What do you mean by a browser's default encoding? Is it the encoding to be
assumed for pages lacking a charset specification? In this case iso-8859-1 is
a much better choice -- there are far more pages out there in the wild
encoded in latin1 that lack charset info than utf8 pages that lack this
info. (Maybe an utf8 auto-detection would be nice, though.) So my argument
for iso-8859-1 is not theoretical but practical.
Chris's argument (also mine) is practical too: It intentionally breaks
pages which are missing an explicit character set specification, so
that people making this broken stuff will have to fix it or deal with
lost visitors/lost sales/etc. :)

Rich
Daniel B.
2007-04-05 01:50:17 UTC
Permalink
Post by Egmont Koblinger
...
What do you mean by a browser's default encoding? Is it the encoding to be
assumed for pages lacking charset specification?
Isn't that defined by the HTTP or HTML specification?

Daniel
--
Daniel Barclay
***@smart.net
Fredrik Jervfors
2007-03-30 12:47:55 UTC
Permalink
Post by Egmont Koblinger
Post by Rich Felker
The only thing that's not
possible in legacy codepage locales is handling text from other languages
that need characters not present in your codepage.
You say it's not possible??? Just launch firefox/opera/konqueror/whatever
modern browser with a legacy locale and see whether it displays all
foreign letters. It _does_, though you believe it's "not possible".
But let's reverse the whole story. I write a homepage in Hungarian, using
either latin2 or utf8 charset. Someone who lives in West Europe, America,
Asia, the Christmas Island... anywhere else happens to visit this page.
It's not only important for him, it's also important for me that my
accents get displayed correctly there, under a locale unknown to me. And
luckily that's how all good browsers work. I can't see why you're
reasoning that this shouldn't (or mustn't?) work.
Correct me if I'm wrong, but isn't the web server supposed to tell the
client which charset is used: Latin2 or UTF-8? This might be done by using
the <meta> element in HTML. The client (browser) renders the page using
the charset suggested by the server (if possible) regardless of what
locale the receiving user has his/her environment set to. Some browsers
might try to guess what charset to use if the server doesn't specify it,
but the "correct" solution is to configure the server (or the HTML
document) right.

In the (X)HTML document:
<meta http-equiv="content-type" content="text/html; charset=utf-8" />

HTTP header from the server:
Content-Type: text/html; charset=UTF-8

And an XML example:
<?xml version="1.0" encoding="utf-8" ?>

It's the same with MIME - the sender tells the receiver what charset it
used. Since the charset for the strings is specified, it's easy for the
program to know how to handle that string. If the sender doesn't specify
which charset to use, some/most/all sending clients assume that the
content is written using the locale set in the sender's environment (which
might be wrong if inserting text through pipes).

These solutions work since the content is tagged with the charset used.
Guessing just isn't an option (unless the content isn't tagged, which it
shouldn't be), but the basic rule is that if a string isn't tagged there's
no way to know what charset was used when writing it. (Analysing the
encoding might result in a good guess though.) If the receiving end
displays data based on the receiving user's locale things will go terribly
wrong.

The receiving end must have a font which supports the charset needed for
rendering, but that's another issue.

Every program that inputs data should make sure that it knows what
encoding is used and give an error for any input that's malformed. Then
the data is tagged with the right charset before sending it to the
receiver, who will then know what to use to render it. If the receiver
can only render using the charset in the locale - their loss.

I'm just a beginner, so I might be completely wrong. If so - please
educate me.

Sincerely,
Fredrik
Egmont Koblinger
2007-03-30 13:48:19 UTC
Permalink
Post by Fredrik Jervfors
Correct me if I'm wrong, but isn't the web server supposed to tell the
client which charset is used: Latin2 or UTF-8? [...]
You're perfectly right and understand the whole concept of encodings.

What I'm arguing with Rich is the following situation:

X writes a homepage in French, using either latin1 or utf8 encoding (but
mentions this encoding properly), and of course he uses all the French
letters, including e.g. è (e with grave accent).

Y is sitting in Poland for example, using a system configured to use a
latin2 locale by default. Latin2 lacks e with grave accent. Y visits the
homepage of X with some popular graphical web browser.

What should happen?

Rich says that his browser must (or should?) think in latin2 and hence drop
the è letters, maybe replace them with unaccented e or question marks or
similar.

I say that his browser must show è correctly, no matter what its
locale is.
--
Egmont
Fredrik Jervfors
2007-03-30 15:17:32 UTC
Permalink
Post by Egmont Koblinger
Post by Fredrik Jervfors
Correct me if I'm wrong, but isn't the web server supposed to tell the
client which charset is used: Latin2 or UTF-8? [...]
You're perfectly right and understand the whole concept of encodings.
X writes a homepage in French, using either latin1 or utf8 encoding (but
mentions this encoding properly), and of course he uses all the French
letters, including e.g. è (e with grave accent).
Y is sitting in Poland for example, using a system configured to use a
latin2 locale by default. Latin2 lacks e with grave accent. Y visits the
homepage of X with some popular graphical web browser.
What should happen?
Rich says that his browser must (or should?) think in latin2 and hence
drop the è letters, maybe replace them with unaccented e or question
marks or similar.
I say that his browser must show è correctly, no matter what its
locale is.
That depends on the configuration of the browser.

The browser should by default (programmer's choice really) think in the
encoding X used, since it's tagged with that encoding information.

If Y's computer supports the encoding X used (it doesn't have to be Y's
preferred encoding), the browser should use X's encoding when showing Y
the page (unless Y instructs, automatically (preference setting) or
manually (choosing in menu or such), the browser to convert the page to
Y's preferred encoding).

If Y's computer doesn't support the encoding X used, the browser should,
as a fallback solution, try to convert the page to Y's encoding if
possible. If the letter "è" isn't supported it should be replaced by
another letter (such as "?") or a symbol indicating that some data
couldn't be converted. It's also nice if the browser explains that a
conversion was made (maybe not as an alert (too intrusive, unless it
provides an option to install the missing encoding support), but maybe in
the information bar or status bar). Such an explanation would get the user
more interested in upgrading the system to support more encodings.

I think clipboards treat the data as bytes, so if Y wants to copy from X's
page and paste it into program P, Y has to make sure that the browser
converts the data to Y's preferred encoding before copying, since P's
input validation would (should) complain otherwise (when pasting).

Sincerely,
Fredrik
Rich Felker
2007-03-30 15:46:12 UTC
Permalink
Post by Fredrik Jervfors
Post by Egmont Koblinger
I say that his browser must show è correctly, no matter what its
locale is.
That depends on the configuration of the browser.
The browser should by default (programmer's choice really) think in the
encoding X used, since it's tagged with that encoding information.
If Y's computer supports the encoding X used (it doesn't have to be Y's
preferred encoding), the browser should use X's encoding when showing Y
What does “supports the encoding” mean? Applications cannot select the
locale they run in, aside from requesting the “C” or “POSIX” locale.
It’s the decision of the user and/or the system implementor. In fact
it would be impossible to switch locales when visiting different pages
anyway. How would you deal with multiple browser windows or tabs, or
even frames?
Post by Fredrik Jervfors
If Y's computer doesn't support the encoding X used, the browser should,
as a fallback solution, try to convert the page to Y's encoding if
possible.
This is why I’m confused about what you mean by “support the
encoding”. The app cannot switch its native encoding (the locale), so
supporting the encoding would have to mean supporting it as an option
for conversion... But then, if the system doesn’t “support” it in this
sense, how would you go about converting?

Normal implementations work either by converting all data to the
user’s encoding, or by converting it all to some representation of
Unicode (UTF-8 or UTF-32, or something nonstandard like UTF-21).
Post by Fredrik Jervfors
I think clipboards treat the data as bytes, so if Y wants to copy from X's
page and paste it into program P, Y has to make sure that the browser
converts the data to Y's preferred encoding before copying, since P's
input validation would (should) complain otherwise (when pasting).
X selection thinks in ASCII or UTF-8. Technically the ASCII mode can
also be used for Latin-1, but IMO it’s a bad idea to continue to
support this since it’s obviously a broken interface. There’s also a
nasty scheme based on ISO-2022 which should be avoided at all costs.
So, in order to communicate cleanly via the X selection, X apps need
to be able to convert their data to and from UTF-8.

In a way I think this is bad, because it makes things difficult for
apps, but the motivation seems to be at least somewhat correct.
There’s no reason to expect that other X clients are even running on
the same machine, and the machines they're running on might use
different encodings, so a universal encoding is needed for
interchange. It would be nice if xlib provided an API to convert the
data to and from the locale’s encoding automatically upon sending and
receiving it, however. (This could be a no-op on UTF-8-only systems.)

Rich
Egmont Koblinger
2007-03-30 17:06:52 UTC
Permalink
Post by Rich Felker
What does “supports the encoding” mean? Applications cannot select the
locale they run in, aside from requesting the “C” or “POSIX” locale.
This isn't so. First of all, see the manual page setlocale(3), as well as
the documentation of the newlocale() and uselocale() and *_l() functions (no man
page for them, use Google). These will show you how to switch to any
existing locale, no matter what your environment variables are.

Second, in order to perform charset conversion, you don't need locales at
all, you only need the iconv_open(3) and iconv(3) library calls. Yes, glibc
provides a function to convert between two arbitrary character sets, even if
the locale in effect uses a third, different charset.
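For example, a minimal sketch of converting a chunk of text to UTF-8
regardless of the current locale (glibc-style iconv; the restartable
conversion loop and most error handling are trimmed, and the charset name
is whatever spelling your iconv accepts):

#include <iconv.h>
#include <stddef.h>

/* Convert 'inlen' bytes of 'in' (in charset 'from') to UTF-8 in 'out'.
   Returns the number of bytes written, or (size_t)-1 on failure
   (unknown charset, invalid input, or output buffer too small). */
size_t to_utf8(const char *from, char *in, size_t inlen,
               char *out, size_t outlen)
{
    iconv_t cd = iconv_open("UTF-8", from);
    if (cd == (iconv_t)-1)
        return (size_t)-1;              /* this iconv doesn't know 'from' */

    char *outp = out;
    size_t res = iconv(cd, &in, &inlen, &outp, &outlen);
    iconv_close(cd);

    if (res == (size_t)-1)
        return (size_t)-1;
    return (size_t)(outp - out);        /* bytes of UTF-8 actually produced */
}

The whole thing works the same with the locale set to "C"; the conversion is
driven purely by the two charset names.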
Post by Rich Felker
It’s the decision of the user and/or the system implementor. In fact
it would be impossible to switch locales when visiting different pages
anyway.
No, it's not impossible, and actually it's unneeded.

Just as a curiosity: I wrote a menu generator for our distribution. This
loads the application menu from desktop files under /usr/share/applications,
and outputs menu files for various window managers, such as IceWM, Window
Maker, Enlightenment and so on. The input .desktop files contain the names
of software in multiple languages. Simple window managers expect the menu
file to contain them in only one language, the one you want to see. Hence
this program outputs plenty of configuration files, one for each window
manager and each language (icewm.en, icewm.hu, windowmaker.en,
windowmaker.hu and so on).

The entries are sorted alphabetically. But the rules of alphabetical sorting
differ from language to language. Hence I have to use many locales. Before
dumping icewm.en, I have to switch to an English locale and perform sorting
there. Before dumping icewm.hu, I need to activate the Hungarian locale. And
so on.
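In case it helps to picture it, the sorting part boils down to something like
this (a simplified sketch, not the actual uhu-menu code):

#include <locale.h>
#include <stdlib.h>
#include <string.h>

static int by_collation(const void *a, const void *b)
{
    /* strcoll() sorts according to LC_COLLATE, i.e. whatever locale
       was selected with setlocale() just before the qsort() call */
    return strcoll(*(char *const *)a, *(char *const *)b);
}

static void sort_entries_for(const char *locname, char **names, size_t n)
{
    /* e.g. locname = "hu_HU.UTF-8" before writing icewm.hu; if that
       locale isn't installed, fall back to the environment's default */
    if (!setlocale(LC_COLLATE, locname))
        setlocale(LC_COLLATE, "");
    qsort(names, n, sizeof *names, by_collation);
}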

Earlier versions of this program even included UTF-8 -> 8-bit conversions
(.desktop files use UTF-8 while our distro still used old-fashioned locales
in those early days) and this 8-bit charset yet again differed from language to
language. So for example, when dumping icewm.fr, I converted the French
descriptions to Latin1, but when dumping icewm.hu, it had to be converted to
Latin2. In newer versions this part of the code is dropped since luckily
UTF-8 is used in the generated file.

Just in case you're interested, here's the source:
ftp://ftp.uhulinux.hu/sources/uhu-menu/
Post by Rich Felker
How would you deal with multiple browser windows or tabs, or even frames?
I can't see any problem here. Can you? Browsers work correctly, don't they?
You ask me how I'd implement a feature that _is_ implemented in basically
any browser. I guess your browser handles frames and tabs with different
charsets correctly, doesn't it? Even if you run it with an 8-bit locale.

One possible way is to convert each separate input stream (e.g. html page or
frame) from its encoding to a common internal representation (most likely
UTF-8). Technically there are some minor issues that make this more
complicated (e.g. the charset info can be inside the html file), but
theoretically there's absolutely no problem.
Post by Rich Felker
Normal implementations work either by converting all data to the
user’s encoding, or by converting it all to some representation of
Unicode (UTF-8 or UTF-32, or something nonstandard like UTF-21).
Normal implementations work the 2nd way, that is, use a Unicode-compatible
internal encoding. From the user's point of view there's only one difference
between the two ways. Using the 1st way, characters not present in your
current locale are lost. Using the 2nd way, they are kept and displayed
correctly. Hence I still can't see any reason for choosing the 1st way
(except for terminal applications that have to stick to the terminal
charset).
--
Egmont
Rich Felker
2007-03-30 18:04:14 UTC
Permalink
Post by Egmont Koblinger
Post by Rich Felker
What does “supports the encoding” mean? Applications cannot select the
locale they run in, aside from requesting the “C” or “POSIX” locale.
This isn't so. First of all, see the manual page setlocale(3), as well as
The documentation of setlocale is here:
http://www.opengroup.org/onlinepubs/009695399/functions/setlocale.html

As you’ll see, the only arguments with which you can portably call
setlocale are NULL, "", "C", "POSIX", and perhaps also a string
previously returned by setlocale.

I’m interested only in portable applications, not “GNU/Linux
applications”.
Post by Egmont Koblinger
the documentation of newlocale() and uselocale() and *_l() functions (no man
page for them, use google). These will show you how to switch to arbitrary
existing locale, no matter what your environment variables are.
These are nonstandard extensions and are a horrible mistake in design
direction. Having the character encoding even be selectable at runtime
is partly a mistake, and should be seen as a temporary measure during
the adoption of UTF-8 to allow legacy apps to continue working until
they can be fixed. In the future we should have much lighter, sleeker,
more maintainable systems without runtime-selectable character
encoding.

If you look into the GNU *_l() functions, the majority of them exist
primarily or only because of LC_CTYPE. The madness of having locally
bindable locale would not be so mad if these could all be thrown out,
and if only the ones that actually depend on cultural customs instead
of on character encoding could be kept.

However, I suspect even then it’s a mistake. Applications which just
need to present data to the user in a form that’s comfortable to the
user’s cultural expectations are fine with a single global locale.
Applications which need to deal with multinational cultural
expectations simultaneously probably need much stronger functionality
than the standard library provides anyway, and would do best to use
their own (possibly in library form) specialized machinery.
Post by Egmont Koblinger
Second, in order to perform charset conversion, you don't need locales at
all, you only need the iconv_open(3) and iconv(3) library calls. Yes, glibc
provides a function to convert between two arbitrary character sets, even if
the locale in effect uses a third, different charset.
Yes, I’m well aware. This is not specific to glibc but part of the
standard. There is no standard on which character encodings should be
supported (which is a good thing, since eventually they can all be
dropped.. and even before then, non-CJK systems may wish to omit the
large tables for legacy CJK encodings), nor on the names for the
encodings (which is rather stupid; it would be very reasonable and
practical for SUS to mandate that, if an encoding is supported, it
must be supported under its standard preferred MIME name). The
standard also does not necessarily guarantee a direct conversion from
A to C, even if conversions from A to B and B to C exist.
Post by Egmont Koblinger
file to contain them in only one language, the one you want to see. Hence
this program outputs plenty of configuration file, one for each window
manager and each language (icewm.en, icewm.hu, windowmaker.en,
windowmaker.hu and so on).
It would be nice if these apps would use some sort of message catalogs
for their menus, and if they would perform the sorting themselves at
runtime.
Post by Egmont Koblinger
ftp://ftp.uhulinux.hu/sources/uhu-menu/
You could use setlocale instead of the *_l() stuff so it would be
portable to non-glibc. For a normal user application I would say this
is an abuse of locales to begin with and that it should use its own
collation data tables, but what you’re doing seems reasonable for a
system-specific maintenance script. The code looks nice. Clean use of
plain C without huge bloated frameworks.
Post by Egmont Koblinger
Post by Rich Felker
How would you deal with multiple browser windows or tabs, or even frames?
I can't see any problem here. Can you? Browsers work correctly, don't they?
You ask me how I'd implement a feature that _is_ implemented in basically
any browser. I guess your browser handles frames and tabs with different
charsets correctly, doesn't it? Even if you run it with an 8-bit locale.
I meant you run into trouble if you were going to change locale for
each page. Obviously it works if you don’t use the locale system.
Post by Egmont Koblinger
Post by Rich Felker
Normal implementations work either by converting all data to the
user’s encoding, or by converting it all to some representation of
Unicode (UTF-8 or UTF-32, or something nonstandard like UTF-21).
Normal implementations work the 2nd way, that is, use a Unicode-compatible
internal encoding.
Links works the other way: converting everything to the selected
character encoding. Crappy versions of links (including the popular
gui one) only support 8bit codepages, but recent ELinks supports
UTF-8.
Post by Egmont Koblinger
From the user's point of view there's only one difference
between the two ways. Using the 1st way characters not present in your
current locale are lost. Using the 2nd way they are kept and displayed
correctly. Hence I still can't see any reason for choosing the 1st way
(except for terminal applications that have to stick to the terminal
charset).
Also applications that want to interact with other applications on the
system expecting to receive text, e.g. an external text editor or
similar.

Rich
Egmont Koblinger
2007-04-02 11:27:59 UTC
Permalink
On Fri, Mar 30, 2007 at 02:04:14PM -0400, Rich Felker wrote:

Hi,
Post by Rich Felker
As you’ll see, the only arguments with which you can portably call
setlocale are NULL, "", "C", "POSIX", and perhaps also a string
previously returned by setlocale.
You can portably _call_ setlocale() with any argument, as long as you check
its return value and properly handle the case where it fails to fulfill your request.
The arguments you listed are probably those for which you can always assume
setlocale() to succeed. In the other cases you still might give it a chance
and see whether it succeeds.
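For example, a purely illustrative helper that probes a few candidate
spellings and reports which one the C library accepted:

#include <locale.h>
#include <stddef.h>

/* Try several spellings of a locale name; return the one that was
   accepted, or NULL if none of them is installed on this system. */
static const char *try_locales(int category, const char *const *candidates)
{
    for (size_t i = 0; candidates[i]; i++)
        if (setlocale(category, candidates[i]))
            return candidates[i];
    return NULL;
}

/* usage:
     const char *hu[] = { "hu_HU.UTF-8", "hu_HU.ISO-8859-2", "hu_HU", NULL };
     if (!try_locales(LC_COLLATE, hu))
         ... fall back to the default sort order ...                        */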
Post by Rich Felker
I’m interested only in portable applications, not “GNU/Linux
applications”.
Our goals differ. Since I'm developing a Linux distro, I'm only interested
in developing GNU/Linux applications. We don't have any resources to check
the portability of our applications, nor do we want to make our job harder by
working with only a subset of the available functions and re-implementing
what's already implemented in glibc. I don't think newer features that get
implemented in glibc are there only to make its size bigger. I think they are for
developers to use when appropriate. They might not be appropriate
for a portable application, but they usually are appropriate for our goals.
Post by Rich Felker
Post by Egmont Koblinger
the documentation of newlocale() and uselocale() and *_l() functions
These are nonstandard extensions and are a horrible mistake in design
direction. Having the character encoding even be selectable at runtime
is partly a mistake, and should be seen as a temporary measure during
the adoption of UTF-8 to allow legacy apps to continue working until
they can be fixed.
No, first of all, they are not about multiple encodings, but multiple
locales. (It seems to me that you slightly mix up locale and encoding.
Encoding is only a part of a locale and can be used independently of it.)
For example, if you create a German-French dictionary application, it's
expected that German strings are sorted according to the German alphabet
rules, while French words are sorted using the French rules. Even if your
operating system doesn't support these locales, it might be a reasonable
decision if the application tried these locales and fell back to a default
sorting if they weren't available.
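With the *_l() interface that could look roughly like this (a glibc-flavoured
sketch; the strcmp() branch is the "default sorting" fallback, and real code
would create the locale_t once and reuse it):

#define _GNU_SOURCE
#include <locale.h>
#include <string.h>

/* Compare two words by the collation rules of 'locname'
   (e.g. "de_DE.UTF-8" or "fr_FR.UTF-8"); if that locale is not
   installed, fall back to a plain byte-wise comparison. */
static int cmp_words(const char *a, const char *b, const char *locname)
{
    locale_t loc = newlocale(LC_COLLATE_MASK, locname, (locale_t)0);
    if (loc == (locale_t)0)
        return strcmp(a, b);            /* locale missing: default sorting */
    int r = strcoll_l(a, b, loc);
    freelocale(loc);
    return r;
}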
Post by Rich Felker
If you look into the GNU *_l() functions, the majority of them exist
primarily or only because of LC_CTYPE.
It seems to me that the majority of them exist because of cultural
differences, and there would still be a need for them even if only UTF-8 existed.
Different time/date formats, different alphabetical sorting, different
lowercase-uppercase mapping etc.
Post by Rich Felker
Applications which need to deal with multinational cultural
expectations simultaneously probably need much stronger functionality
than the standard library provides anyway, and would do best to use
their own (possibly in library form) specialized machinery.
So far the functionality provided by glibc has been sufficient for me and I
would hate to have to use an external library. ;)

Anyway, it would really be a bad decision if glibc didn't provide a way to
easily access the locale data that originates from glibc and is already
accessible via glibc if you set a corresponding locale. Then the external
library you'd like to see would either need to access the locale data the same
way as glibc does, or would have to provide the same information in its own form
again. Sounds terrible. An external library is a good approach if some
information cannot be extracted by glibc _at all_.

For example, glibc doesn't know how many people live in Hungary, it's not
part of the locale data. If you need it, you may pick up an external library
that tells you this.

However, glibc knows how to alphabetically sort Hungarian strings. You claim
that it shouldn't let applications access this piece of information, unless
they have their LANG/LC_* environment variables set to hu_HU or some variant
of it. You say that applications should find a different way (different
library, maybe different database) to access this data if they needed it
even if the system locale was not Hungarian. This is totally absurd.
Post by Rich Felker
There is no standard on which character encodings should be
supported (which is a good thing, since eventually they can all be
dropped.. and even before then, non-CJK systems may wish to omit the
large tables for legacy CJK encodings),
I don't think support for the current 8-bit encodings will die within the
next 50 years, and (as an application developer) if the underlying operating
system (its iconv() calls) doesn't support a particular encoding, I'd
happily blame it on the OS and not think about workarounds. Practically this
means that if I need to process data in a particular encoding, I pass this
encoding to iconv_open() and cry out loud if it fails. You're right, I don't
expect iconv() to support ISO-8859-1, but still, if I need it, I try it, use it
if available, and print an error message otherwise. I won't implement it on
my own; the application is not the right place to do it.
Post by Rich Felker
Post by Egmont Koblinger
file to contain them in only one language, the one you want to see. Hence
this program outputs plenty of configuration file, one for each window
manager and each language (icewm.en, icewm.hu, windowmaker.en,
windowmaker.hu and so on).
It would be nice if these apps would use some sort of message catalogs
for their menus, and if they would perform the sorting themselves at
runtime.
Yes, that'd be a theoretically better solution, but it would require much, much
more work, would be less compatible with other distros, and would make it much harder
to adopt new window managers...
Post by Rich Felker
You could use setlocale instead of the *_l() stuff so it would be
portable to non-glibc.
If porting ever becomes an issue, I can still re-write it (with autoconf and
compile-time conditionals). Using the *_l() functions made the code cleaner
and probably faster.
Post by Rich Felker
For a normal user application I would say this
is an abuse of locales to begin with and that it should use its own
collation data tables,
Own table? Why? What's the gain in shipping duplicated data? How are we
supposed to create collation tables for all languages? Why do you think it's
wrong if glibc allows access to these data and I use them?
Post by Rich Felker
but what you’re doing seems reasonable for a
system-specific maintenance script. The code looks nice. Clean use of
plain C without huge bloated frameworks.
Thanks :)
Post by Rich Felker
Post by Egmont Koblinger
I can't see any problem here. Can you? Browsers work correctly, don't they?
You ask me how I'd implement a feature that _is_ implemented in basically
any browser. I guess your browser handles frames and tabs with different
charsets correctly, doesn't it? Even if you run it with an 8-bit locale.
I meant you run into trouble if you were going to change locale for
each page. Obviously it works if you don’t use the locale system.
Well of course I didn't mean changing the _locale_ either, just converting
between _encodings_.
Post by Rich Felker
Links works the other way: converting everything to the selected
character encoding. Crappy versions of links (including the popular
gui one) only support 8bit codepages, but recent ELinks supports
UTF-8.
I know the mainstream version of links is crappy. I haven't checked elinks yet,
but I will soon. Does it have a GUI version? In a terminal, as I've said, it's
okay if it converts everything to the locale's charset, since in a terminal it's
not possible to display out-of-default-locale's-charset characters. (Except
for the \e%G magic...) If it _is_ possible for an application to display
out-of-default-locale's-charset characters, IMO it _has_ to do so.
Post by Rich Felker
Also applications that want to interact with other applications on the
system expecting to receive text, e.g. an external text editor or
similar.
They might convert the data back to the locale encoding before passing it to
the external application. That's no excuse for not displaying them if it's
otherwise technically possible.
--
Egmont
Egmont Koblinger
2007-03-30 16:44:49 UTC
Permalink
If Y's computer supports the encoding X used [...]
Yes, I assumed in my examples that both computers support both encodings.
Glibc has supported all well-known 8-bit character sets since 2.1 (released in
1999), and Unicode and its transformation formats since 2.2 (2000). Fonts are also
installed on any sane system.
I think clipboards treat the data as bytes,
Try copy-pasting from a latin1 application to a utf8 app or vice versa and
you'll see that luckily it's not the case. You'll get the same letters (i.e.
different byte sequences) in the two apps.
--
Egmont
Rich Felker
2007-03-30 17:16:35 UTC
Permalink
Post by Egmont Koblinger
If Y's computer supports the encoding X used [...]
Yes, I assumed in my examples that both computers support both encodings.
Glibc has supported all well-known 8-bit character sets since 2.1 (released in
1999), and Unicode and its transformation formats since 2.2 (2000). Fonts are also
installed on any sane system.
You mean the iconv in glibc?
Post by Egmont Koblinger
I think clipboards treat the data as bytes,
Try copy-pasting from a latin1 application to a utf8 app or vice versa and
you'll see that luckily it's not the case. You'll get the same letters (i.e.
different byte sequences) in the two apps.
But it doesn’t work the other way around. I’ve tried pasting from an
app respecting locale (UTF-8) into rxvt (with its head stuck in the
Latin-1 sand, no not urxvt) and the bytes of the UTF-8 get interpreted
as Latin-1 characters. :)

It should work, but Latin-1-oriented apps are usually dumb enough that
it doesn’t...

Rich
Egmont Koblinger
2007-03-30 17:10:28 UTC
Permalink
Post by Rich Felker
But it doesn’t work the other way around. I’ve tried pasting from an
app respecting locale (UTF-8) into rxvt (with its head stuck in the
Latin-1 sand, no not urxvt) and the bytes of the UTF-8 get interpreted
as Latin-1 characters. :)
It should work, but Latin-1-oriented apps are usually dumb enough that
it doesn’t...
In this case it's a perfect reason not to use rxvt. (Actually its lack of
UTF-8 support _is_ the real reason :-)))

Of course I didn't mean there are no dumb applications... You should rather
try a good application that just happens to run under a latin1 locale.
--
Egmont
Daniel B.
2007-04-05 02:21:20 UTC
Permalink
Fredrik Jervfors wrote:
...
Post by Fredrik Jervfors
Post by Egmont Koblinger
X writes a homepage in French, using either latin1 or utf8 encoding (but
mentions this encoding properly), and of course he uses all the French
letters, including e.g. è (e with grave accent).
Y is sitting in Poland for example, using a system configured to use a
latin2 locale by default. Latin2 lacks e with grave accent. Y visits the
homepage of X with some popular graphical web browser.
What should happen?
Rich says that his browser must (or should?) think in latin2 and hence
drop the è letters, maybe replace them with unaccented e or question
marks or similar.
I say that his browser must show è correctly, no matter what its
locale is.
That depends on the configuration of the browser.
The browser should by default (programmer's choice really) think in the
encoding X used, since it's tagged with that encoding information.
In what sense do you mean "think in encoding X"? (Are you talking about
internal browser operations like displaying text, or external operations
like saving files?)

For internal operations, shouldn't the browser "think" in terms of
characters? Isn't that how HTML is defined (in terms of characters, not
byte encodings)?

That is, who cares if è (e with grave accent) doesn't exist in the
system's default Latin-2 encoding? At least for internal operations,
the browser doesn't ever have to encode the character into bytes in
the system's default encoding, does it?

The browser received an entity in UTF-8, and the browser understood
the UTF-8 byte sequence and the character it represented. The browser
can get an appropriate glyph displayed without using the system's
locale-specified encoding, right? (It would use the encoding of the
font, not any system default encoding, right)?

Wouldn't it only be for external operations (e.g., a "Save Page
Source As" command, or loading files from the local system) where
the browser would care about the local system's encoding?


Daniel
--
Daniel Barclay
***@smart.net
Daniel B.
2007-04-05 02:28:34 UTC
Permalink
Fredrik Jervfors wrote:
...
Post by Fredrik Jervfors
Correct me if I'm wrong, but isn't the web server supposed to tell the
client which charset is used: Latin2 or UTF-8?
Yes.
Post by Fredrik Jervfors
This might be done by using the <meta> element in HTML.
If the client (browser) received the HTML entity via HTTP and the HTTP
headers specify a charset, it is not supposed to listen to any encoding
specified via a <meta> element.


Daniel
--
Daniel Barclay
***@smart.net
Marcin 'Qrczak' Kowalczyk
2007-03-30 12:23:48 UTC
Permalink
Post by Rich Felker
UTF-8 has been around for almost 15 years now, longer than any real
character-aware 8bit locale support on Linux. It was a mistake that
8bit locales were ever implemented on Linux. If things had been done
right from the beginning we wouldn't even be having this discussion.
In 1996 a famous message by Tomasz Kłoczko was posted to some
Polish newsgroups <http://www.man.lodz.pl/LISTY/POLIP/1996/07/0396.html>
(in Polish).

It advocated using ISO-8859-2 instead of stripping accents (which was
common on Unix) and instead of CP-1250 (which was common on Windows)
when writing in Polish on the Internet. It claimed that Linux was getting
ready to support encodings other than ASCII and ISO-8859-1, that the
amount of Linux software which needed fixing was small enough that it could
be done in a reasonable time, and it showed what to expect from web
pages, e-mailers and newsreaders.

Around that time the resources and the knowledge needed to configure a
Linux system to support Polish at all were becoming widespread enough that
the recommendation “please don't use Polish letters on the Internet,
as many systems can’t display them properly” was getting obsolete and
its supporters were becoming a minority.

In those years it was hard enough for Linux to support more than Latin1,
and some software had to be specially configured to support more
than ASCII. We had a Latin2 console font and some bitmap X fonts.
Netscape was still emitting only Latin1 PostScript, so Juliusz
Chroboczek wrote the “ogonkify” program which munged the
PostScript by substituting composed Latin2 characters instead.

UTF-8 was out of the question in those days. It was completely unsupported
by common software on both Linux and Windows. UTF-8 support on Linux
lagged about 10 years behind removing the assumption of Latin1. It's
harder than it might seem.

I switched my Linux system from ISO-8859-2 to UTF-8 in 2007. The PLD
Linux Distribution translated the *.spec files from a mixture of
encodings (different for each language) to UTF-8 in 2007. Only recent
years have brought UTF-8 support to LaTeX and ncurses.

There is still some software I have installed here which doesn’t work
with UTF-8. I switched from ekg to gaim and from a2ps to paps because
of this. UTF-8 support in some quite popular programs still relies on
unofficial patches: mc, pine, fmt. There is still work to do.
--
__("< Marcin Kowalczyk
\__/ ***@knm.org.pl
^^ http://qrnik.knm.org.pl/~qrczak/
Jan Willem Stumpel
2007-03-30 13:58:21 UTC
Permalink
Post by Marcin 'Qrczak' Kowalczyk
There is still some software I have installed here which
doesn’t work with UTF-8. I switched from ekg to gaim and from
a2ps to paps because of this. UTF-8 support in some quite
popular programs still relies on unofficial patches: mc, pine,
fmt. There is still work to do.
Yes.. for instance texmacs and maxima. And a2ps -- doomed to be
replaced by paps. But these examples are becoming rarer and rarer.

mc, for instance, is quite alright nowadays (well, in Debian it is).

Of course your point is quite correct. Until even a few years ago,
UTF-8 was only practicable for hardy pioneers. But it is different
now.

Regards, Jan
Rich Felker
2007-03-30 15:17:16 UTC
Permalink
Post by Jan Willem Stumpel
Post by Marcin 'Qrczak' Kowalczyk
There is still some software I have installed here which
doesn’t work with UTF-8. I switched from ekg to gaim and from
a2ps to paps because of this. UTF-8 support in some quite
popular programs still relies on unofficial patches: mc, pine,
fmt. There is still work to do.
Yes.. for instance texmacs and maxima. And a2ps -- doomed to be
replaced by paps. But these examples are becoming rarer and rarer.
mc, for instance, is quite alright nowadays (well, in Debian it is).
Of course your point is quite correct. Until even a few years ago,
UTF-8 was only practicable for hardy pioneers. But it is different
now.
I agree. It’s amazing how much software I still fight with not
supporting UTF-8 correctly. Even bash/readline is broken in the
presence of nonspacing characters and long lines..

My point was that, had the mistake of introducing ISO-8859 support not
been made (i.e. if bytes 128-255 had remained considered as
“unprintable” at the time), there would have been both much more
incentive to get UTF-8 working quickly, and much less of an obstacle
(the tendency of applications to treat these bytes as textual
characters).

Obviously there were plenty of people who wanted internationalization
even back in 1996 and earlier. I’m just saying they should have done
it correctly in a way that supports multilingualization rather than
taking the provincial path of ‘codepages’ some 5 years after
UCS/Unicode had obsoleted them.

Rich
David Starner
2007-03-31 12:55:25 UTC
Permalink
Post by Rich Felker
My point was that, had the mistake of introducing ISO-8859 support not
been made (i.e. if bytes 128-255 had remained considered as
"unprintable" at the time), there would have been both much more
incentive to get UTF-8 working quickly, and much less of an obstacle
(the tendency of applications to treat these bytes as textual
characters).
So people who needed to use computers in their native tongue shouldn't
have been able to do so unless they were willing to undertake a huge
project to get a huge multibyte character set working? Wow, that's
multicultural.
SrinTuar
2007-03-28 21:57:35 UTC
Permalink
Post by Egmont Koblinger
Post by SrinTuar
Does it match or not. What range of bytes in the string was matched.
Seems you didn't understand. It depends on how to interpret the byte
sequence above. If it stands in UTF-8 then it means "...) in KOI8-R and so
I wasn't talking about encoding detection really though...
The regex library can ask the locale what encoding things are in, just
like everybody else

Even then, the user and app programmer should not have to care what
encoding is being used.
Post by Egmont Koblinger
In an ideal world where no more than one character set (and one
representation) is used, a developer could expect the same from any
programming language or development environment. But our world is not ideal.
There _are_ many character sets out there, and it's _your_ job, the
programmer's job to tell the compiler/interpreter how to handle your bytes
and to hide all these charset issues from the users. Therefore you have to
be aware of the technical issues and have to be able to handle them.
If that were true then the vast majority of programs would not be i18n'd.
Luckily, there is a way to support utf-8 without having to really
worry about it:
Just think in bytes! I wish perl would let me do that -- it works so well in C.
Post by Egmont Koblinger
There are many ways to solve charset problems, and which one to choose
depends on the goals of your software too. If you only handle _texts_ then
probably the best approach is to convert every string as soon as they arrive
at your application to some Unicode representation (UTF-8 for Perl, "String"
(which uses UTF-16) for Java and so on)
Hrm, I think Java needs to be fixed. Their internal utf-16 mandate was
a mistake, imo.
They should store strings in whatever the locale says they are in.
(and the locale should always say utf-8)

Normally, you should not have to ever convert strings between
encodings. It's just
not your problem, plus it introduces a ton of potential headaches.
Just assume your input is in the encoding it's supposed to be in.
Daniel B.
2007-03-29 03:05:56 UTC
Permalink
Post by SrinTuar
...f you only handle _texts_ then
probably the best approach is to convert every string as soon as they arrive
at your application to some Unicode representation (UTF-8 for Perl, "String"
(which uses UTF-16) for Java and so on)
Hrm, I think Java needs to be fixed. Their internal utf-16 mandate was
a mistake, imo.
Are you aware that Java was created (or frozen) when Unicode required
16 bits? (It wasn't a mistake at the time.)
Post by SrinTuar
Normally, you should not have to ever convert strings between
encodings.
Then how do you process, say, a multi-part MIME body that has parts
in different character encodings?
Post by SrinTuar
It's just
not your problem, plus it introduces a ton of potential headaches.
Just assume your input is in the encoding it's supposed to be in.
You never deal with multiple inputs?

Daniel
--
Daniel Barclay
***@smart.net
SrinTuar
2007-03-29 03:22:27 UTC
Permalink
Post by Daniel B.
Are you aware that Java was created (or frozen) when Unicode required
16 bits? (It wasn't a mistake at the time.)
Yes, I remember the age when Unicode a la UCS-2 was the future, and
there was a big push to move to it. It did strike me as somewhat
misguided though, and it seemed that there would forever be huge
incompatibility issues in software. Using 16bit words in streams
seemed perilous, and highly internet unfriendly. Plus, the vast
majority of software was ascii-only, and would never join the unicode
world.

When I first saw the utf-8 encoding description, it was an epiphany of sorts.
Post by Daniel B.
Post by SrinTuar
Normally, you should not have to ever convert strings between
encodings.
Then how do you process, say, a multi-part MIME body that has parts
in different character encodings?
Excellent example. Email is absolutely something that you can work
with on a byte-by-byte basis and have no need for considering
characters. You can drop big blocks of bytes out to conversion
routines, and you don't ever have to know what the Unicode codepoints
are.

Not every tool has to worry about encodings, and if every tool had to, we'd
only end up with tons of non-i18n'd programs being written. You
should reasonably be able to write an email program in perl that drops
out to iconv and openssl etc as needed to convert things to utf-8 and
otherwise doesn't care about encoding at all, and makes no special
considerations for it.
Post by Daniel B.
Post by SrinTuar
It's just
not your problem, plus it introduces a ton of potential headaches.
Just assume your input is in the encoding it's supposed to be in.
You never deal with multiple inputs?
All the time :)
Daniel B.
2007-03-31 22:56:05 UTC
Permalink
Post by SrinTuar
...
When I first saw the utf-8 encoding description, it was an epiphany of sorts.
Yes, it does have some nice characteristics.
Post by SrinTuar
Post by Daniel B.
Post by SrinTuar
Normally, you should not have to ever convert strings between
encodings.
Then how do you process, say, a multi-part MIME body that has parts
in different character encodings?
Excellent example. Email is absolutely something that you can work
with on a byte-by-byte basis and have no need for considering
characters.
What operations are you excluding when you say "work with?" You're
being quite non-specific. Maybe that's part of the cause of our
arguing.

Certainly searching for a given character string across multiple
MIME parts requires handling different encodings for different parts.

And searching with a regular expression containing "." to match any
character requires some encoding-cognizant processing somewhere in
the processing path (either in decoding byte sequences to character
sequences, or in implementing the regular-expression matching if it
operates directly on, say, UTF-8).



Daniel
--
Daniel Barclay
***@smart.net
Rich Felker
2007-03-31 23:10:41 UTC
Permalink
Post by Daniel B.
Post by SrinTuar
Post by Daniel B.
Post by SrinTuar
Normally, you should not have to ever convert strings between
encodings.
Then how do you process, say, a multi-part MIME body that has parts
in different character encodings?
Excellent example. Email is absolutely something that you can work
with on a byte-by-byte basis and have no need for considering
characters.
What operations are you excluding when you say "work with?" You're
being quite non-specific. Maybe that's part of the cause of our
arguing.
Indeed, that would be good to clarify.
Post by Daniel B.
Certainly searching for a given character string across multiple
MIME parts requires handling different encodings for different parts.
Not if it was all converted at load-time.

Rich
Daniel B.
2007-04-05 02:05:45 UTC
Permalink
Post by Rich Felker
Post by Daniel B.
Certainly searching for a given character string across multiple
MIME parts requires handling different encodings for different parts.
Not if it was all converted at load-time.
Huh? (Converting at load time doesn't avoid the need to handle
different encodings for different parts.)


I think I see part of our communication problem.

(It seems to me that) you've read more into what I wrote than I actually
wrote, or have thought I'm arguing different points than I have been.

Above, assuming I recall correctly, I was responding to some earlier
claim about just setting a single platform-level encoding and processing
everything according to that encoding (presenting the MIME multipart case
as a counterexample--that you have to handle multiple encodings
(regardless of whether you convert at load time or at search time)).




Daniel
--
Daniel Barclay
***@smart.net
Rich Felker
2007-03-29 03:55:37 UTC
Permalink
Post by Daniel B.
Post by SrinTuar
...f you only handle _texts_ then
probably the best approach is to convert every string as soon as they arrive
at your application to some Unicode representation (UTF-8 for Perl, "String"
(which uses UTF-16) for Java and so on)
Hrm, I think Java needs to be fixed. Their internal utf-16 mandate was
a mistake, imo.
Are you aware that Java was created (or frozen) when Unicode required
16 bits? (It wasn't a mistake at the time.)
Java was introduced in May 1995. UTF-8 had existed since September 1992.
There was never any excuse for UCS-2/UTF-16 existing at all.

Read Thompson & Pike’s UTF-8 paper for details.

〜Rich
Egmont Koblinger
2007-03-29 10:24:43 UTC
Permalink
Post by SrinTuar
The regex library can ask the locale what encoding things are in, just
like everybody else
The locale tells you which encoding your system uses _by default_. This is
not necessarily the same as the data you're currently working with.
Post by SrinTuar
Even then, the user and app programmer should not have to care what
encoding is being used.
For the user: you're perfectly right.

For the programmer: how would you write a browser or a mail client if you
completely ignored the charset information in the MIME header? How would you
write a console mp3 id3v2 editor if you completely ignored the console's
charset or the charset used within the id3v2 tags? How would you write a
database frontend if you completely ignored the local charset as well as the
charset used in the database? (Someone inserts some data, someone else
queries it and receives different letters...)

If you're the programmer, you can only ignore the locale in the simplest
situations, e.g. when appending two strings that you know are encoded in the
same charset, determining the extension part of a filename, etc. For more
complicated operations you must know how your data is encoded, no matter
what programming language you use.
Post by SrinTuar
Post by Egmont Koblinger
There _are_ many character sets out there, and it's _your_ job, the
programmer's job to tell the compiler/interpreter how to handle your bytes
and to hide all these charset issues from the users. Therefore you have to
be aware of the technical issues and have to be able to handle them.
If that was true then the vast majority of programs would not be i18n'd.
That's false. Check for example the bind_textdomain_codeset call. In Gtk+-2
apps you call it with a UTF-8 argument. This happens because you _know_
that you'll need this data encoded in UTF-8. In most other applications you
omit this call because there you _know_ you need the data in the encoding of
the current locale. In both cases it's important that you _know_ what
encoding is used. (By "knowing the charset" I don't necessarily mean one
particular fixed charset known in advance; a dynamic one such as "the
charset set by our locale" or "the charset named in that variable" is a
perfect choice too.)
Post by SrinTuar
Luckily, there is a way to support utf-8 without having to really
Just think in bytes!
Seems we have a different concept of "thinking". For example, when you write
a Gtk+2 application, you of course _work_ with bytes, but at a higher level
of abstraction you know that a utf-8 encoding is used there and hence you
_think_ in characters.
Post by SrinTuar
I wish perl would let me do that- it works so well in C.
I have already written this twice. Just in case you haven't seen it, I'll write it
a third time. Perl _lets_ you think/work in bytes. Just ignore everything
related to UTF-8. Just never set the utf8 mode. You'll be back in the world
of bytes. It's so simple!
Post by SrinTuar
Hrm, I think Java needs to be fixed.
Sure. Just alter the specifications. 99% of the existing programs would work
incorrectly and would need to be fixed according to the new "fixed" language
definition. It's so simple, isn't it? :-)
Post by SrinTuar
Their internal utf-16 mandate was a mistake, imo.
That was not utf-16 but ucs-2 at that time, and imo in those days it was a
perfectly reasonable decision.
Post by SrinTuar
They should store strings in whatever the locale says they are in.
Oh yes... Sure everyone would be perfectly happy if his software wasn't able
to handle characters that don't fit in his locale. Just because someone
still uses an iso-8859-1 charset he sure wants his browser to display
question marks instead of foreign accented letters and kanjis, right?
Post by SrinTuar
(and the locale should always say utf-8)
Should, but doesn't. It's your choice to decide whether you want your
application to work everywhere, or only under utf-8 locales.
Post by SrinTuar
Normally, you should not have to ever convert strings between
encodings. Its just
not your problem, plus it indroces a ton of potential headaches.
Just assume your input is in the encoding its supposed to be in.
Ha-ha-ha. Do you know what makes my head ache? When I see accented
characters of my mother tongue displayed incorrectly. The only way they can
be displayed correctly is if you _know_ the encoding used in each file, each
data stream, each string etc. If you don't know their encoding, it's
hopeless to display them correctly.

I admit that in an ideal world everything would be encoded in UTF-8. Just
don't forget: our world is not ideal. My browser has to display web pages
encoded in Windows-1250 correctly. My e-mail client has to display messages
encoded in iso-8859-2 correctly. And so on...
--
Egmont
SrinTuar
2007-03-29 16:23:23 UTC
Permalink
Post by Egmont Koblinger
The locale tells you which encoding your system uses _by default_. This is
not necessarily the same as the data you're currently working with.
Post by SrinTuar
Even then, the user and app programmer should not have to care what
encoding is being used.
For the user: you're perfectly right.
For the programmer: how would you write a browser or a mail client if you
completely ignored the charset information in the MIME header?
Egmont, I think we are talking at cross-purposes now.

I never suggested writing non-standards compliant code, or ignoring
MIME headers.

I just think that routines such as "regex" or "NFD" should be able to
assume that the strings they are passed match the encoding of the
current locale, or failing that ask the programmer to explicitly
qualify them as one of its supported encodings. I do not think the
strings should have built in machinery that does this work behind the
scenes implicitly.

So I agree with the GTK design, while I take objection to the "utf-8"
attribute on perl scalars.

If you thought I was suggesting something else, I was not,
Cheers
Egmont Koblinger
2007-03-29 17:15:37 UTC
Permalink
On Thu, Mar 29, 2007 at 12:23:23PM -0400, SrinTuar wrote:

Hi,
Post by SrinTuar
I just think that routines such as "regex" or "NFD" should be able to
assume that the strings they are passed match the encoding of the
current locale
This might be a reasonable decision when you design a new programming
language. When "hacking" Unicode support into an existing 8-bit programming
language, this approach would have broken backwards compatibility and caused
_lots_ of old Perl code to malfunction when running under a UTF-8 locale. It
would be just as if the C folks altered the language (actually its core
libraries, to be precise) so that strlen() counted characters. No, they kept
strlen() counting bytes; you need different functions if you want to count
characters.
Post by SrinTuar
or failing that ask the programmer to explicitly qualify them as one of
its supported encodings. I do not think the strings should have built in
machinery that does this work behind the scenes implicitly.
If you have the freedom of choosing the character set you use, you need to
tell the regexp matching function what charset you use. (It's a reasonable
decision that the default is the charset of the current locale, but it has
to be overridable.) There are basically two ways I think to reach this goal.

1st: strings are just byte sequences, and you may pass the charset
information as external data.

2nd: strings are either forced to a fixed encoding (UTF-8 in Gtk+, UTF-16 in
Java) or carry meta-information about their encoding (utf8 flag in Perl).

If I understand you, you'd prefer the 1st solution. According to my
experience, the 2nd is _usually_ the cleaner way, and is likely to lead to
better software and fewer bugs. An exception is when you want to
display strings that might contain invalid byte sequences and at the same
time must preserve those byte sequences. This may be the case for text
editors, file managers etc. I think this is only a small minority of
software.
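
To make the difference concrete, a minimal perl 5.8 sketch of the two styles
(only Encode::decode is a real API here; the hash is just one way to carry
the charset out of band):

use Encode qw(decode);

my $raw = "\x41\xC1\x42";   # "AÁB" as iso-8859-2 bytes

# 1st style: keep the bytes, pass the charset alongside as ordinary data
my %msg = ( bytes => $raw, charset => "iso-8859-2" );

# 2nd style: decode once; the resulting scalar carries the utf8 flag and
# the rest of the program treats it as a character string
my $chars = decode( $msg{charset}, $msg{bytes} );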

Using the 1st approach I still can't see how you'd imagine Perl to work.
Let's go back to my earlier example. Suppose perl reads a file's content
into a variable. This file contained 4 bytes, namely: 65 195 129 66. Then
you do this:

print "Hooray\n" if $filecontents =~ m/A.B/;

Should it print Hooray or not if you run this program under an UTF-8 locale?
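
Spelled out in perl 5.8 terms (with the file contents hard-coded and the
decoding made explicit via the core Encode module), the two possible
readings of those bytes are:

use Encode qw(decode);

my $filecontents = "\x41\xC3\x81\x42";    # the four bytes 65 195 129 66

# treated as plain bytes, /A.B/ does not match: "." matches only the
# single byte \xC3, and the B is still one byte further along
print "no Hooray\n" unless $filecontents =~ m/A.B/;

# decoded as utf-8, the same data is the 3-character string "AÁB" and matches
my $chars = decode("utf-8", $filecontents);
print "Hooray\n" if $chars =~ m/A.B/;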

On one hand, when running with a Latin1 locale it didn't print it. So it
mustn't print Hooray, otherwise you break backwards compatibility.

On the other hand, we just encoded the string "AÁB" in UTF-8 since nowadays
we use UTF-8 everywhere, and of course everyone expects AÁB to match A.B.

How would you design Perl's Unicode support to overcome this contradiction?
--
Egmont
SrinTuar
2007-03-29 18:16:29 UTC
Permalink
Post by Egmont Koblinger
This might be a reasonable decision when you design a new programming
language. When "hacking" Unicode support into an existing 8-bit programming
language, this approach would have broken backwards compatibility and cause
_lots_ of old Perl code to malfunction when running under UTF-8 locale. Just
as if the C folks altered the language (actually its core libraries to be
precise) so that strlen() counted characters. No, they kept strlen()
counting bytes, you need different functions if you want to count
characters.
Umm... bad example:
strlen is supposed to count bytes. Nobody cares about the number of
unicode codepoints, because that is **almost never** useful
information. It's about as informative as the parity of the string.
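
For reference, the distinction in perl terms, on the same utf-8 bytes used
earlier in the thread (a sketch; Encode is core since 5.8):

use Encode qw(decode);

my $bytes = "\x41\xC3\x81\x42";          # utf-8 encoding of "AÁB"
my $chars = decode("utf-8", $bytes);

print length($bytes), "\n";   # 4 -- counts bytes, like strlen()
print length($chars), "\n";   # 3 -- counts codepoints on the decoded string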
Post by Egmont Koblinger
If I understand you, you'd prefer the 1st solution. According to my
experiences, _usually_ the 2nd is the cleaner way which is likely to lead to
better pieces of software and less bugs. An exception is when you want to
display strings that might contain non-valid byte sequences and in the mean
time you must keep those byte sequences. This may be the case for text
editors, file managers etc. I think this is only a small minority of
software.
I would argue that it is the correct solution for all software, even
software that
might be trivially simplified by having some things built into the language.

It maintains the correct balance:
Think about it when it matters, don't think about it when it doesn't.
Post by Egmont Koblinger
Using the 1st approach I still can't see how you'd imagine Perl to work.
Let's go back to my earlier example. Suppose perl read's a file's content
into a variable. This file contained 4 bytes, namely: 65 195 129 66. Then
print "Hooray\n" if $filecontents =~ m/A.B/;
Should it print Hooray or not if you run this program under an UTF-8 locale?
On one hand, when running with a Latin1 locale it didn't print it. So it
mustn't print Hooray otherwise you brake backwards compatibility.
On the other hand, we just encoded the string "AÁB" in UTF-8 since nowadays
we use UTF-8 everywhere, and of course everyone expects AÁB to match A.B.
How would you design Perl's Unicode support to overcome this contradiction?
Under a latin-1 locale it should not print.
Under a utf-8 locale it should print.

If a person inputs invalid latin-1, while telling everyone to expect
latin-1, this is a perfectly acceptable case of garbage in resulting
in garbage out.

There is no contradiction, nor is there any backwards compatibility issue.
If someone opened such an unqualified utf-8 file in a text editor while
in a latin-1 environment, it should show up as the binary trash that
it is in that context. I don't see how this could be considered a problem.
Rich Felker
2007-03-29 20:46:14 UTC
Permalink
Post by Egmont Koblinger
Post by SrinTuar
or failing that ask the programmer to explicitly qualify them as one of
its supported encodings. I do not think the strings should have built in
machinery that does this work behind the scenes implicitly.
If you have the freedom of choosing the character set you use, you need to
You don’t. An application should assume that there is no such freedom;
the character encoding is dictated by the user or the host
implementation, and should on all modern systems be UTF-8 (but don’t
assume this).

Any text that’s encoded with another scheme needs to be treated as
non-text (binary) data (i.e. not suitable for use with regex). It
could be converted (to the dictated encoding) or left as binary data
depending on the application.
Post by Egmont Koblinger
tell the regexp matching function what charset you use. (It's a reasonable
decision that the default is the charset of the current locale, but it has
to be overridable.) There are basically two ways I think to reach this goal.
You can get by just fine without it being overridable. For instance,
mutt does just fine using the POSIX regex routines which do not have
any way of specifying a character encoding.
Post by Egmont Koblinger
1st: strings are just byte sequences, and you may pass the charset
information as external data.
2nd: strings are either forced to a fixed encoding (UTF-8 in Gtk+, UTF-16 in
Java) or carry meta-information about their encoding (utf8 flag in Perl).
Of these (neither of which is necessary), #1 is the more unix-like and
#2 is the mac/windows approach. Unix has a strong history of
intentionally NOT assigning types to data files etc., but instead
treating everything as streams of bytes. This leads to very powerful
combinations of tools where the same data (byte sequence) is
interpreted in different ways by different tools/contexts. I am a
mathematician and I must say it’s comparable to what we do when we
allow ourselves to think of an operator on a linear space both as a
map between linear spaces and as an element of a larger linear space
of operators (and possibly also in many other ways) at the same time.

On the other hand, DOS/Windows/Mac have a strong history of assigning
fixed types to data files. On DOS/Windows it’s mostly just extensions,
but Mac goes much farther with the ‘resource fork’, not only typing
the file but also associating it with a creating application. This
sort of mechanism is, in my opinion, deceptively convenient to
ignorant new users, but also fosters an unsophisticated, uneducated,
less powerful way of thinking about data.

Of course in either case there are ways to override things and get
around the limitations. Even on unix files tend to have suffixes to
identify the ‘type’ a user will most often want to consider the file
as, and likewise on Mac you can edit the resource forks or ignore
them. Still, I think the approach you take says a lot about your
system philosophy.
Post by Egmont Koblinger
Using the 1st approach I still can't see how you'd imagine Perl to work.
Let's go back to my earlier example. Suppose perl read's a file's content
into a variable. This file contained 4 bytes, namely: 65 195 129 66. Then
print "Hooray\n" if $filecontents =~ m/A.B/;
Should it print Hooray or not if you run this program under an UTF-8 locale?
Of course.
Post by Egmont Koblinger
On one hand, when running with a Latin1 locale it didn't print it. So it
mustn't print Hooray otherwise you brake backwards compatibility.
No, the program still does the same thing if run in a Latin-1 locale,
regardless of your perl version. There’s no reason to believe that
text processing code should behave byte-identically under different
locales.
Post by Egmont Koblinger
On the other hand, we just encoded the string "AÁB" in UTF-8 since nowadays
we use UTF-8 everywhere, and of course everyone expects AÁB to match A.B.
So you need to make your data and your locale consistent. If you want
to set the locale to UTF-8, the string “AÁB” needs to be in UTF-8. If
you want to use the legacy Latin-1 data, your locale needs to be set
to something Latin-1-based.
Post by Egmont Koblinger
How would you design Perl's Unicode support to overcome this contradiction?
I don’t see it as any contradiction. The code does exactly what it’s
supposed to in either case, as long as your locale and data are
consistent.

Rich
Egmont Koblinger
2007-03-30 09:56:56 UTC
Permalink
I am a mathematician
I nearly became a mathematician, too. Just a few weeks before I had to choose
university I changed my mind and went to study informatics.

When I was younger, I had a philosophy closer to yours. Programming in
assembly, thinking of files as byte streams and similar low-level
abstractions. But then, having spent some years as a system administrator and
then as a linux distribution developer, I encountered a different, more
user-centric philosophy and probably learnt to think with the users' mind
(more or less). Users don't care about implementation details, and actually
they shouldn't need to care. They just care whether things work. They're not
expecting the theoretically correct solution if it takes ages to
implement; they're expecting a reasonably good solution within a short
time. If as a developer you can take either a nice or a grungy way and both
lead to a solution then of course you take the nice way. However, if the
disgusting way leads to a solution and the nice way doesn't, then you have to
go ahead along the disgusting way. This is probably what a mathematician
wouldn't do.

There's absolutely no way to explain to any user that his browser isn't able to
display some letters unless he quits it and sets a different locale, but
then yet other symbols won't show up and external applications started by
the browser won't behave as expected. This is not a problem if the system
uses UTF-8, just as any modern distribution does (and so does the latest release
of our distro :-)). But the expectations of the users weren't any different in
those days when software was not yet ready for UTF-8. And on the other hand
there would have been no technical reason for restricting the displayable
characters either.
--
Egmont
Rich Felker
2007-03-30 15:04:09 UTC
Permalink
Post by Egmont Koblinger
I am a mathematician
I nearly became a mathematican, too. Just a few weeks before I had to choose
university I changed my mind and went to study informatics.
When I was younger, I had a philosophy closer to yours. Programming in
I’m not sure if this is a cheap ad hominem ;) or just honest
storytelling..
Post by Egmont Koblinger
(more or less). Users don't care about implementation details, and actually
they shouldn't need to care. They just care whether things work. They're not
Users should be presented with something that’s possible for ordinary
people to understand and which has reasonable explanations. Otherwise
the computer is a disempowering black box that requires them to look
to “experts” whenever something doesn’t make sense.

Here’s an interesting article that’s somehow related (though I don’t
necessarily claim it supports either of our views and don’t care to
argue over whether it does):

http://osnews.com/story.php?news_id=6282
Post by Egmont Koblinger
There's absolutely no way to explain any user that his browser isn't able to
display some letters unless he quits it and sets a different locale, but
1. Sure there is. Simply telling the user he/she is working in an
environment that doesn’t support the character is clear and does make
sense. I’ve explained this sort of thing countless times doing user
help on IRC.

It’s much more difficult to explain to the user why they can see these
characters in their web browser but can’t paste them into a text file,
because it’s INCONSISTENT and DOESN’T MAKE SENSE. The only option
you’re left with is the Microsoft one: telling users that clean
applications which respect the standards are somehow “backwards”,
while hiding from them the fact that the standards provide a much
saner path to internationalization than hard-coding all sorts of
unicode stuff into each application.

2. You don’t have to explain anything. This is 2007 and the user’s
locale uses UTF-8. Period. Unless this is some oldschooler who already
knows the reasons and insists on using a legacy encoding anyway.

Rich
Egmont Koblinger
2007-03-30 17:23:31 UTC
Permalink
Post by Rich Felker
I’m not sure if this is a cheap ad hominem ;) or just an honest
storytelling..
The latter, of course. Sorry if my intent wasn't clear.
Post by Rich Felker
1. Sure there is. Simply telling the user he/she is working in an
environment that doesn’t support the character is clear and does make
sense. I’ve explained this sort of thing countless times doing user
help on IRC.
It’s much more difficult to explain to the user why they can see these
characters in their web browser but can’t paste them into a text file,
because it’s INCONSISTENT and DOESN’T MAKE SENSE.
Interesting point. Yes, we have to choose: either have a consistent system
with absolutely no support for out-of-locale characters, or an inconsistent
system with partial support for out-of-locale chars. Still, I'd choose the
latter one. Why?

First, because probably the supported subset is just sufficient for the
users, e.g. they want to see the web page, but not copy-paste it. Then why
not let them do it?

Second, because I still can easily explain that "those characters are not
_fully_ supported, only partially". I don't think it's harder to explain
this than to explain if they were not supported at all.

Third, users can easily discover that the locale is not a system-wide but
rather a process-wide setting. At this point your reasoning most likely
becomes invalid in their eyes. And actually, don't forget, web browsers
don't work the way you wanted them to work; they work the way that I think
is correct. It's not a matter of taste, it's a fact: all well-known
graphical browsers display all glyphs, even if they run under a latin1
locale. You can't tell your users that it doesn't work, since they see at
least one place where it works.
Post by Rich Felker
2. You don’t have to explain anything. This is 2007 and the user’s
locale uses UTF-8. Period.
Yes. Luckily it is the case nowadays. (At least for those who use good
distributions.)
--
Egmont
Rich Felker
2007-03-29 17:27:45 UTC
Permalink
Post by Egmont Koblinger
Post by SrinTuar
The regex library can ask the locale what encoding things are in, just
like everybody else
The locale tells you which encoding your system uses _by default_. This is
not necessarily the same as the data you're currently working with.
The word “default” does not appear in any standard regarding LC_CTYPE.
It determines THE encoding of text. Foreign character data from other
systems obviously cannot be treated directly as text under this view.
Post by Egmont Koblinger
write a console mp3 id3v2 editor if you completely ignored the console's
charset
The console charset uses text and text is encoded according to
LC_CTYPE. The tags are encoded according to the encoding specified by
the file and may be converted via iconv or similar library calls.
Post by Egmont Koblinger
or the charset used within the id3v2 tags? How would you write a
database frontend if you completely ignored the local charset as well as the
charset used in the database? (Someone inserts some data, someone else
queries it and receives different letters...)
The same problem exists on the filesystem. The solution locally is to
mandate a policy of a single encoding for all users sharing data. For
remote protocols, the protocol usually specifies an encoding by which
the data is delivered, so again you convert according to iconv or
similar.

Nowhere have SrinTuar or I said that encoding is always
something you can ignore. My point is that consideration of it can be
fully isolated to the point at which badly-encoded data is received
(from text embedded in a binary file, from http, from mime mail, etc.)
such that the other 99% of your software never has to think about it.
Post by Egmont Koblinger
Post by SrinTuar
Post by Egmont Koblinger
There _are_ many character sets out there, and it's _your_ job, the
programmer's job to tell the compiler/interpreter how to handle your bytes
and to hide all these charset issues from the users. Therefore you have to
be aware of the technical issues and have to be able to handle them.
If that was true then the vast majority of programs would not be i18n'd..
That's false. Check for example the bind_textdomain_codeset call. In Gtk+-2
apps you call it with an UTF-8 argument. This happens because you _know_
that you'll need this data encoded in UTF-8.
Then what do you do when you want to print text to stdout, or generate
filenames, etc.? You can’t use your localized text anymore because the
encoding may not match. This is evidence that gtk’s approach is
flawed.
Post by Egmont Koblinger
Post by SrinTuar
I wish perl would let me do that- it works so well in C.
I already wrote twice. Just in case you haven't seen it, I write it for the
third time. Perl _lets_ you think/work in bytes. Just ignore everything
related to UTF-8. Just never set the utf8 mode. You'll be back at the world
of bytes. It's so simple!
I don’t know about SrinTuar but this is not what I meant at all. I
want (NEED!) regex to work correctly, etc. Thus Perl needs to respect
the character encoding, which thankfully matches the host encoding,
UTF-8. No problem so far. However, as soon as I try to send these Perl
character strings (which are equally valid as host character strings)
to stdout, it spews warnings, and does so in an inconsistent way!
(i.e. it complains about characters above 255 but not characters
128-255)
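
A small sketch of that inconsistency (perl 5.8; the commented-out binmode
is the usual workaround, and it is exactly the part that needs PerlIO):

my $latin = "caf\x{e9}";     # e-acute, codepoint 233
my $wide  = "\x{263a}";      # U+263A, codepoint 9786

print "$latin\n";            # prints the byte 0xE9, no warning
print "$wide\n";             # "Wide character in print" warning

# binmode(STDOUT, ":utf8");  # silences the warning, but only with PerlIO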
Post by Egmont Koblinger
Post by SrinTuar
Their internal utf-16 mandate was a mistake, imo.
That was not utf-16 but ucs-2 at that time and imo those days it was a
perfectly reasonable decision.
It was not. UCS-2 was already obsolete at the time Java was released
to the public in 1995. UTF-8 was invented in September 1992.
Post by Egmont Koblinger
Post by SrinTuar
(and the locale should always say utf-8)
Should, but doesn't. It's your choice to decide whether you want your
application to work everywhere, or only under utf-8 locales.
Having limited functionality (displaying ??? for all characters not
available in the locale) under broken legacy locales is perfectly
acceptable behavior. If someone wants to use/display/write a
character, they need to use a character encoding where that character
is encoded!!!
Post by Egmont Koblinger
I admit that in an ideal world everything would be encoded in UTF-8. Just
don't forget: our world is not ideal. My browser has to display web pages
encoded in Windows-1250 correctly. My e-mail client has to display messages
encoded in iso-8859-2 correctly. And so on...
As you can read above, none of this is contrary to what I said. My
system does all of this quite well.

Rich
Daniel B.
2007-03-28 02:07:11 UTC
Permalink
Post by SrinTuar
Post by Egmont Koblinger
That would be contradictory to the whole concept of Unicode. A
human-readable string should never be considered an array of bytes, it is an
array of characters!
Hrm, that statement I think I would object to. For the overwhelming
vast majority of programs, strings are simply arrays of bytes.
(regardless of encoding) The only time source code needs to care about
characters is when it has to layout or format them for display.
What about when it breaks a string into substrings at some delimiter,
say, using a regular expression? It has to break the underlying byte
string at a character boundary.

In fact, what about interpreting an underlying string of bytes as
the right individual characters in that regular expression?

Any time a program uses the underlying byte string as a character
string other than simply a whole string (e.g., breaking it apart,
interpreting it), it needs to consider it at the character level,
not the byte level.
Post by SrinTuar
When I write a basic little perl script that reads in lines from a
file, does trivial string operations on them, then prints them back
out, there should be absolutely no need for my code to make any
special considerations for encoding.
It depends how trivial the operations are.

(Offhand, the only things I think would be safe are copying and
appending.)


Daniel
--
Daniel Barclay
***@smart.net
Rich Felker
2007-03-28 03:40:44 UTC
Permalink
Post by Daniel B.
Post by SrinTuar
Post by Egmont Koblinger
That would be contradictory to the whole concept of Unicode. A
human-readable string should never be considered an array of bytes, it is an
array of characters!
Hrm, that statement I think I would object to. For the overwhelming
vast majority of programs, strings are simply arrays of bytes.
(regardless of encoding) The only time source code needs to care about
characters is when it has to layout or format them for display.
What about when it breaks a string into substrings at some delimiter,
say, using a regular expression? It has to break the underlying byte
string at a character boundary.
Searching for the delimiter already gives you a character boundary.
There is no need to think further about it.

For example, the unix "cut" program works automatically with UTF-8
text as long as the delimiter is a single byte, and if you want
multibyte delimiters, all you need to do is make it accept a multibyte
delimiter character and then do a substring search instead of a byte
search. There is no need to ever treat the input string as characters,
and in fact doing so just makes it slow and bloated.
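
In perl the same byte-oriented approach looks something like this (a sketch;
the delimiter here is the two utf-8 bytes of "¡", and nothing is ever decoded):

my $delim = "\xC2\xA1";                  # utf-8 bytes of U+00A1

while (my $line = <STDIN>) {
    chomp $line;
    # a valid multibyte sequence can never start or end in the middle of
    # some other character's sequence, so this always cuts on character
    # boundaries even though we only ever look at bytes
    my @fields = split /\Q$delim\E/, $line;
    print join("\t", @fields), "\n";
}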
Post by Daniel B.
In fact, what about interpreting an underlying string of bytes as
as the right individual characters in that regular expression?
Any time a program uses the underlying byte string as a character
string other than simply a whole string (e.g., breaking it apart,
interpreting it), it needs to consider it at the character level,
not the byte level.
You're mistaken. Most times, you can avoid thinking about characters
totally. Not always, but much more often than you think.
Post by Daniel B.
Post by SrinTuar
When I write a basic little perl script that reads in lines from a
file, does trivial string operations on them, then prints them back
out, there should be absolutely no need for my code to make any
special considerations for encoding.
It depends how trivial the operations are.
(Offhand, the only things I think would be safe are copying and
appending.)
This is because you don't understand UTF-8..

Rich
Daniel B.
2007-03-29 02:46:01 UTC
Permalink
...
Post by Rich Felker
Post by Daniel B.
What about when it breaks a string into substrings at some delimiter,
say, using a regular expression? It has to break the underlying byte
string at a character boundary.
Searching for the delimeter already gives you a character boundary.
There is no need to think further about it.
As long as you specified the delimiter properly (a whole character,
not a partial byte sequence).
Post by Rich Felker
For example, the unix "cut" program works automatically with UTF-8
text as long as the delimiter is a single byte,
By "single byte," do you mean a character whose UTF-8 representation
is a single byte? (If you gave it the byte 0xBF, would it reject it
as an invalid UTF-8 sequence, or would it then possibly cut in the middle
of the byte sequence for a character (e.g., 0xEF 0xBF 0x00)?)
Post by Rich Felker
Post by Daniel B.
Post by SrinTuar
When I write a basic little perl script that reads in lines from a
file, does trivial string operations on them, then prints them back
out, there should be absolutely no need for my code to make any
special considerations for encoding.
It depends how trivial the operations are.
(Offhand, the only things I think would be safe are copying and
appending.)
This is because you don't understand UTF-8..
Bull. Try providing some real information (a couple of counterexamples).


Daniel
--
Daniel Barclay
***@smart.net
Rich Felker
2007-03-29 03:57:43 UTC
Permalink
Post by Daniel B.
Post by Rich Felker
For example, the unix "cut" program works automatically with UTF-8
text as long as the delimiter is a single byte,
By "single byte," do you mean a character whose UTF-8 representation
is a single byte? (If you gave it the byte 0xBF, would it reject it
as an invalid UTF-8 sequence, or would it then possibly cut in the middle
of the byte sequence for a character (e.g., 0xEF 0xBF 0x00)?)
Apologies for omitting the word “character” after single byte. Yes, I
meant ASCII.

Rich
Daniel B.
2007-04-05 02:34:11 UTC
Permalink
Post by Rich Felker
For example, the unix "cut" program works automatically with UTF-8
text as long as the delimiter is a single byte, and if you want
multibyte delimiters, all you need to do is make it accept a multibyte
delimeter character and then do a substring search instead of a byte
search. There is no need to ever treat the input string as characters,
and in fact doing so just makes it slow and bloated.
cut -c2-3 ...



Daniel
--
Daniel Barclay
***@smart.net
SrinTuar
2007-03-28 03:53:15 UTC
Permalink
Post by Daniel B.
What about when it breaks a string into substrings at some delimiter,
say, using a regular expression? It has to break the underlying byte
string at a character boundary.
Unless you pass invalid utf-8 sequences to your regular expression library,
that should be impossible. Breaking strings works great as long as you
pattern match for boundaries.

The only time it fails is if you break it at arbitrary byte
indexes. Note that breaking utf-32 strings at arbitrary indices also
destroys the text.
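
A small sketch of the difference, on the same kind of data:

my $s = "\x41\xC3\x81\x42\x43";      # utf-8 bytes of "AÁBC"

# breaking at an arbitrary byte index can land mid-sequence:
my $broken = substr($s, 0, 2);       # "A\xC3" -- half of the Á, invalid utf-8

# breaking where a pattern matched never does, because a match against
# valid sequences always starts and ends on character boundaries:
my ($left, $right) = split /B/, $s;  # "A\xC3\x81" and "C", both valid utf-8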
Post by Daniel B.
In fact, what about interpreting an underlying string of bytes as
as the right individual characters in that regular expression?
The regular expression engine should be utf-8 aware. The code that
uses and calls it has no need to.
Post by Daniel B.
Any time a program uses the underlying byte string as a character
string other than simply a whole string (e.g., breaking it apart,
interpreting it), it needs to consider it at the character level,
not the byte level.
Only the most fancy interpretations require any knowledge of unicode
code points. Any substring match on valid sequences will produce valid
boundaries in utf-8, and that's the whole point.
Rich Felker
2007-03-28 05:05:53 UTC
Permalink
Post by SrinTuar
Post by Daniel B.
What about when it breaks a string into substrings at some delimiter,
say, using a regular expression? It has to break the underlying byte
string at a character boundary.
Unless you pass invalid utf-8 
sequences to your regular 
Haha, was it your intent to use this huge japanese wide ascii? :)
Sadly I don't think Daniel can read anything but Latin-1...
Here's an ascii transliteration...
~Rich
Post by SrinTuar
Post by Daniel B.
What about when it breaks a string into substrings at some delimiter,
say, using a regular expression? It has to break the underlying byte
string at a character boundary.
Unless you pass invalid utf-8 sequences to your regular expression
library, that should be impossible. breaking strings works great as
long as you pattern match for boundaries.
The only time it fails is if you break it at arbitrary byte
indexes.note that breaking utf-32 strings at arbirtrary indicies also
destroys the text.
Post by Daniel B.
In fact, what about interpreting an underlying string of bytes as
as the right individual characters in that regular expression?
The regular expression engine should be utf-8 aware. The code that
uses and calls it has no need to.
Post by Daniel B.
Any time a program uses the underlying byte string as a character
string other than simply a whole string (e.g., breaking it apart,
interpreting it), it needs to consider it at the character level,
not the byte level.
Only the most fancy intepretations require any knowledge of unicode
code points.Any substring match on valid sequences will produce valid
boundaries in utf-8,and thats the whole point.
Daniel B.
2007-03-29 02:56:11 UTC
Permalink
...
Unless you pass invalid utf-8
Post by Rich Felker
sequences to your regular
Haha, was it your intent to use this huge japanese wide ascii? :)
Sadly I don't think Daniel can read anything but Latin-1...
Well, actually _I_ can (slowly), but right, my mailer can't.
Post by Rich Felker
Here's an ascii transliteration...
~Rich
...
Post by Rich Felker
The regular expression engine should be utf-8 aware.
Of course! That's why I've been arguing that byte-based regular
expression processing won't work with UTF-8. (E.g., if the match-
any-character symbol "." matches any byte, it will break multi-byte
sequences at places that aren't character boundaries.
Marcin 'Qrczak' Kowalczyk
2007-03-28 22:33:24 UTC
Permalink
On Mon, 26-03-2007, at 17:28 -0400, SrinTuar
Post by SrinTuar
I frequenty run into problems with utf-8 in perl,
Last time I checked, Unicode support in Perl was not well thought out.

The major source of problems is that chr($n) is ambiguous between
character $n in the current locale (i.e. byte $n in the default locale
encoding) and character $n in Unicode. It depends on pragmas in effect
and implicit assumptions of particular packages.

Python OTOH clearly distinguishes one from the other, which is better.
--
__("< Marcin Kowalczyk
\__/ ***@knm.org.pl
^^ http://qrnik.knm.org.pl/~qrczak/
Rich Felker
2007-03-29 04:01:49 UTC
Permalink
Post by SrinTuar
I frequenty run into problems with utf-8 in perl, and I was wondering
if anyone else
had encountered similar things.
[...]

Can we get back on-topic with this, and look for solutions to the
problems? Maybe Larry has some thoughts for us?

~Rich
Larry Wall
2007-03-29 07:50:20 UTC
Permalink
On Thu, Mar 29, 2007 at 12:01:49AM -0400, Rich Felker wrote:
: On Mon, Mar 26, 2007 at 05:28:43PM -0400, SrinTuar wrote:
: > I frequenty run into problems with utf-8 in perl, and I was wondering
: > if anyone else
: > had encountered similar things.
: [...]
:
: Can we get back on-topic with this, and look for solutions to the
: problems? Maybe Larry has some thoughts for us?

Heh, well, the short answer is "Perl 6"... :)

My thoughts are that all languages screw this up in various ways
(including Perl 5), and that my fond hope is that Perl 6 will
allow people to program at the appropriate abstraction level for
each given task. Perl 6 is really designed to be many languages,
not one language, and this is considered to be wonderfulness as long
as the exact pedigree of your language is properly specified from a
universal root. Sort of a URL for languages. Every lexical scope
is written in some language or another. So your desired Unicode
abstraction level is declared lexically, but that really just sets
up the defaults for that dialect. I think no language can succeed
at modern string processing without a type system that knows what it
knows, and more importantly doesn't make assumptions about what it
doesn't know without the approval of the programmer.

Although you can choose to write in some particular dialect of Perl 6,
and dialectical differences are most naturally lexically scoped, the
typology must be dynamic to allow meaningful interchange of data.
Perl 6 has four main Unicode levels it deals with. You can use a
dialect that specializes in bytes, codepoints, language-independent
graphemes, or language-dependent graphemes for a known language.
But a mere dialect cannot and should not overrule the actual typology,
nor should any dialect impose its view on a different dialect.
This is why dialects are lexically scoped.

The actual strings must know whether they are text or binary, and
each string must know which levels of API it is willing to work with.
A binary buffer has some aspects of stringiness but is treated more
like C strings insofar as a buffer type is really just an array of
identical elements. Perl 6 lets you access a byte buffer either as an array
or as a rudimentary string as long as you don't ask it to assume any
semantics it doesn't know typologically. Basically, ASCII is about all
you can assume otherwise.

A text string may have a minimum and maximum abstraction level.
Your string type may well be aware that you have a bunch of Swahili
encoded in UTF-8, so it could choose to let you deal with that string
on the byte level, the codepoint level, the grapheme level, or the
Swahili level. Or it might not choose to allow all those levels. Some
strings might only provide a codepoint or grapheme API for instance.
Perl 6 allows the lower level encoding to be encapsulated, in other
words. You don't have to care whether something is represented in
UTF-whatever as long as the Unicode abstraction is correct at the
level you want to deal with it. The underlying encoding might not
even be a "UTF". All Perl 6 requires is that the semantics be Unicode
semantics. How that gets mapped to the underlying representation
doesn't need to concern the programmer unless they want it to.
Mostly the programmer just needs to make sure all the portals between
the outside world and the program are typologically safe.

I'm sure I've left out a few important details, but that's the gist of
it. For more on the design and development of Perl 6, some good places
to start are:

http://perlcabal.org/syn
http://www.pugscode.org/
http://www.parrotcode.org/

(If you start playing with the pugs prototype, note that it's
still somewhat hardwired to support just the codepoints dialect.
We're currently concentrating on the meta-object support, and better
Unicode support will follow from that.)

Perl 6 itself is wholeheartedly a Unicode language in the abstract;
one of the big benefits is that there is almost no pressure to overload
existing operators with completely unrelated meanings. Compare with
C++'s (mis)treatment of <<, say. If something looks strange in Perl 6,
it probably *is* strange. And if you program in the APL subset of
Perl 6, expect to get a few strange looks yourself. :-)

And after reading much of the earlier discussion, I must say that,
while I love UTF-8 dearly, it's usually the wrong abstraction level
to be working at for most text-processing jobs. Ordinary people
not steeped in systems programming and C culture just want to think
in graphemes, and so that's what the standard dialect of Perl 6 will
default to. A small nudge will push it into supporting the graphemes
of a particular human language. The people who want to think more
like a computer will also find ways to do that. It's just not the
sweet spot we're aiming for.

Larry
SrinTuar
2007-03-29 16:10:10 UTC
Permalink
Post by Larry Wall
And after reading much of the earlier discussion, I must say that,
while I love UTF-8 dearly, it's usually the wrong abstraction level
to be working at for most text-processing jobs. Ordinary people
not steeped in systems programming and C culture just want to think
in graphemes, and so that's what the standard dialect of Perl 6 will
default to. A small nudge will push it into supporting the graphemes
of a particular human language. The people who want to think more
like a computer will also find ways to do that. It's just not the
sweet spot we're aiming for.
Very interesting, though I must admit I'm sad to hear that. Over the years
I have come to find that what I see as the sweet spot for string processing
would be this definition of a utf-8 "string":

"a null terminated series of bytes, some of which are parts of
valid utf-8 sequences, and others which are treated as individial
binary values"

Effectively, it's a bare-minimum update of what we had with ascii. The
only time I want to depart from this paradigm is when I have to. But
in general, I want to avoid conversion and keep my strings in this
format as much as possible. (This is much like the way tools such as
readline, vim, curses, etc. handle utf-8 strings.)

Most code should not have to care which parts are valid and which are
not. If they call a function which requires a specific level of
validation, that function should be free to complain when that is not
the case. But I don't see a reason why a plain "print" should ever
need to care or complain about what it's printing. All it has to do is
catenate and dump bytes out; I don't think it should be a bouncer of
what is kosher or not for printing.

I think a regex engine should, for example, match one binary byte to a
"." the same way it would match a valid sequence of unicode characters
and composing characters as a single grapheme. This is a best effort to
work with the string as provided, and someone who does not want such
behavior would not run regex's over such strings.

When a program needs to take in data from various different encodings,
it should be its job to convert that data into the locale's native
encoding (by reading mime headers or whatever mechanism). I don't
think a programming language should have built-ins that track the
status of a string, as that strikes me as an attempt to DWIM and not
DWIS.
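
Something along these lines is all the program itself would need (a sketch;
I18N::Langinfo, POSIX and Encode all ship with perl 5.8, to_native is a
made-up name, and $charset stands in for whatever the MIME header said):

use POSIX qw(setlocale LC_CTYPE);
use I18N::Langinfo qw(langinfo CODESET);
use Encode qw(from_to);

setlocale(LC_CTYPE, "");                 # respect the user's environment

# $body is raw bytes from one input, $charset is its declared encoding;
# the result is bytes in the locale's native encoding (e.g. "UTF-8")
sub to_native
{
    my ($body, $charset) = @_;
    from_to($body, $charset, langinfo(CODESET()));
    return $body;
}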

Taking that trend to its logical conclusion, I would not want every
scalar value to track every possible kind of validation that has
happened to a string: "utf-8, validated NFD, turkish + korean". If
someone wants to do language-specific case folding they can either
default to the locale's language+encoding, or else specify which ones
they want to use. If someone wants to make sure their string is valid
utf-8 in NFKC, they can pass it to a validation routine such as
Unicode::Normalize::NFKC. But the input and output of that routine
should be a plain old scalar, with no special knowledge of what has
happened to it.
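
i.e. something like this, where only the explicit calls know or care what
the scalar has been through (a sketch; note that in current perl 5 the real
Unicode::Normalize wants a decoded character string, so the byte round-trip
has to be spelled out):

use Encode qw(decode encode);
use Unicode::Normalize qw(NFKC);

my $bytes  = "\xEF\xAC\x81";     # utf-8 for U+FB01, the "fi" ligature
my $normal = encode("utf-8",
                    NFKC(decode("utf-8", $bytes)));   # plain bytes "fi" come back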

This minimal approach is much like what happens in C/C++, and I don't
see any reason why a scripting language should do more than it is
asked to and in the process potentially do the wrong thing despite its
best intentions. Admittedly, in perl 5 these are trivial annoyances
with readily available workarounds. From your post I guess I can
assume that perl 6 will be similar.


On a separate topic:
Java seems to have a much worse problem. Forcing conversion to utf-16
causes you to lose information, since utf-16 cannot represent all the
possible invalid utf-8 sequences. It forces you to treat your strings as
binary blobs and lose access to all the functions that operate on
strings, and/or take a performance hit for conversion where none is
actually needed. (If the design goal of Java was to force utf-16 on
the world, they are unlikely to succeed at it, as utf-8 has largely
usurped its place.)
Larry Wall
2007-03-29 18:53:01 UTC
Permalink
: On Thu, Mar 29, 2007 at 12:10:10PM -0400, SrinTuar wrote:
: >And after reading much of the earlier discussion, I must say that,
: >while I love UTF-8 dearly, it's usually the wrong abstraction level
: >to be working at for most text-processing jobs. Ordinary people
: >not steeped in systems programming and C culture just want to think
: >in graphemes, and so that's what the standard dialect of Perl 6 will
: >default to. A small nudge will push it into supporting the graphemes
: >of a particular human language. The people who want to think more
: >like a computer will also find ways to do that. It's just not the
: >sweet spot we're aiming for.
:
: Very interesting, though I must admit I'm sad to hear that. Over the years
: I have come to find that what I see as the sweet spot for string processing
: would be this definition of a utf-8 "string":
:
: "a null terminated series of bytes, some of which are parts of
: valid utf-8 sequences, and others which are treated as individial
: binary values"

I think that definition is essentially insane.

: Effectively, its a bare minimum update of what we had with ascii. The
: only time I want to depart from this paradigm is when I have to. But
: in general, I want to avoid conversion and keep my strings in this
: format as much as possible. (this is much like the way tools such as
: readline, vim, curses, etc handle utf-8 strings, etc )

We tried the bare minimum with Perl 5, and it was insane. Then we tried
a little more than the bare minimum, and it was a little less insane.

: Most code should not have to care which parts are valid and which are
: not. If they call a function which requires a specific level of
: validation, that function should be free to complain when that is not
: the case. But I don't see a reason why a plain "print" should ever
: need to care or complain about what its printing. All it has to do is
: catenate and dump bytes out, I don't think it should be a bouncer of
: what is kosher or not for printing.

The way to get most of your functions to not have to care is to be
very careful to validate at the boundaries of your program, and not
throw away type information between the parts of your program.

: I think a regex engine should, for example, match one binary byte to a
: "." the same way it would match a valid sequence of unicode characters
: and composing characters as a singe grapheme. This is a best effort to
: work with the string as provided, and someone who does not want such
: behavior would not run regex's over such strings.

How can it possibly know whether to match a binary byte or a grapheme
if you've mixed UTF-8 and binary in the same string? The short answer
is: "It can't." The long answer is "It can't unless you supply it
with type information out of band." Which for historical reasons C
programmers don't seem to mind doing, since C basically doesn't have
a clue what a string is, let alone what it might contain or how long
it might be. And null termination has turned out to be a terrible
workaround (in security terms as well as efficiency) for not knowing
the length. C's head-in-the-sand approach to string processing is
directly responsible for many of the security breaks on the net.

There's a good place for byte-oriented serialization and deserialization,
and it's usually at the boundaries of your program, in well-crafted
protocol stacks. If you find yourself doing that sort of thing in
the middle of a program, it's often a code smell that says you should
be refactoring and maybe dereinventing some wheel or other.

: When a program needs to take in data from various different encodings,
: it should be their job to convert that data into their locale's native
: encoding. (by reading mime headers or whatever mechanism) I don't
: think a programming language should have built-ins that track the
: status of a string- as that strikes me as an attempt to DWIM and not
: DWIS.

I'd much rather have a language where I can say what I mean.

: Taking that trend to its logical conclusion, I would not want every
: scalar value to track every possible kind of validation that has
: happned to a string: "utf-8, validated NFD, turkish + korean". If
: someone wants to do language specific case folding they can either
: default to the locale's language+encoding, or else specify which one's
: they want to use. If someone wants to make sure their string is valid
: utf-8 in NFKC, they can pass it to a validation routine such as
: Unicode::Normalize::NFKC. But the input and output of that routine
: should be a plain old scalar, with no special knowledge of what has
: happened to it.

I think this attitude is just sweeping the problem under someone
else's carpet.

: This minimal approach is much like what happens in C/C++, and i don't
: see any reason why a scripting language should do more than it is
: asked to and in the process potintially do the wrong thing despite its
: best intentions.

Then by all means do your scripting in a language that more closely
resembles C or C++. But I see no reason for Perl to be just another
version of C. We already have lots of those... :-)

It's just my gut-level feeling that traditional world of C, Unix,
locales, etc. simply does not provide appropriate abstractions to deal
with internationalization. Yes, you can get there if you throw enough
libraries and random functions and macros and pipes and filters at it,
but the basic abstractions leak like a sieve. It's time to clean it
all up.

: Admittedly, in perl 5 these are trivial annoyances
: with readily available workarounds. From your post I guess I can
: assume that perl 6 will be similar.

I certainly believe in giving people enough rope to shoot themselves in
the foot. But with Perl 6 you at least have to ask nicely for the rope. :-)

: On a separate topic:
: Java seems to have a much worse problem. Forcing conversion to utf-16
: causes you to lose information, since utf-16 cannot represent all the
: possible invalid utf-8 sequences. It forces you treat your strings as
: binary blobs and lose access to all the functions that operate on
: strings, and/or take a performance hit for conversion where none is
: actually needed. (If the design goal of Java was to force utf-16 on
: the world they are unlikely to succeed at it, as utf-8 has largely
: ursurped it's place)

I don't think it's Perl 6's place to force either utf-8 or utf-16 or
utf-whatever on anyone. If the abstractions are sane and properly
encapsulated, the implementors can do whatever makes sense behind
the scenes, and that very likely means different things in different
contexts.

I try hard not to be a linguistic imperialist (when I try at all). :-)

Anyway, if anyone wants to give me specific feedback on the current
design of Perl 6, that'd be cool. Though perl6-***@perl.org would
probably be a better forum for that.

Larry
Rich Felker
2007-03-29 21:17:29 UTC
Permalink
Post by Larry Wall
: I think a regex engine should, for example, match one binary byte to a
: "." the same way it would match a valid sequence of unicode characters
: and composing characters as a singe grapheme. This is a best effort to
: work with the string as provided, and someone who does not want such
: behavior would not run regex's over such strings.
How can it possibly know whether to match a binary byte or a grapheme
if you've mixed UTF-8 and binary in the same string?
I agree that SrinTuar’s idea of matching . to a byte is insane. While
NFA/DFA is sometimes a nice tool even with binary data, using regex
character syntax for it is maybe a bit dubious. And surely, like you
said, they should not be mixed in the same string.

With that in mind, though, I think your emphasis on graphemes is also
a bit misplaced. The idea of a “grapheme” as the fundamental unit of
editing, instead of a character, is pretty much only appropriate when
writing Latin, Greek, and Cyrillic based languages with NFD. In most
Indian scripts, whole syllables get counted as “graphemes” for visual
presentation, yet users still expect to be able to edit, search, etc.
individual characters.

Even if you’re just considering a “grapheme” to be a base character
followed by a sequence of combining marks (Mn/Me/Cf), it’s
inappropriate for Tibetan where letters stack vertically (via
combining forms of class Mn) and yet each is considered a letter for
the purposes of editing, character counting, etc. A similar situation
applies for Hangul Jamo.

IMO, a regex pattern to match whole graphemes could be useful, but I
suspect character matching is almost always what’s wanted except for
NFD with European scripts.
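
For concreteness, a minimal Perl 5 sketch of the distinction (the
decomposed string here is just an illustration): "." matches a single
character, while \X matches a whole extended grapheme cluster.

use strict;
use warnings;

# "e" followed by U+0301 COMBINING ACUTE ACCENT: two characters, one grapheme.
my $s = "e\x{301}";

print $s =~ /\A.\z/  ? "one character\n" : "more than one character\n";
print $s =~ /\A\X\z/ ? "one grapheme\n"  : "more than one grapheme\n";
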
Post by Larry Wall
it might be. And null termination has turned out to be a terrible
workaround (in security terms as well as efficiency) for not knowing
Null termination is not the security problem. Broken languages that
DON'T use null-termination are the security problem, particularly
mixing them with C.
Post by Larry Wall
the length. C's head-in-the-sand approach to string processing is
directly responsible for many of the security breaks on the net.
No, the incompetence of people writing C code is what’s directly
responsible for them. C’s approach might be indirectly responsible,
for being difficult or something, but certainly not directly. There
are examples of real-world C programs which are absolutely secure,
such as vsftpd.
Post by Larry Wall
It's just my gut-level feeling that the traditional world of C, Unix,
locales, etc. simply does not provide appropriate abstractions to deal
with internationalization. Yes, you can get there if you throw enough
libraries and random functions and macros and pipes and filters at it,
but the basic abstractions leak like a sieve. It's time to clean it
all up.
Mutt works right without any of that.. It’s as close as you’ll find to
the pinnacle of correct C application coding.
Post by Larry Wall
I don't think it's Perl 6's place to force either utf-8 or utf-16 or
utf-whatever on anyone. If the abstractions are sane and properly
encapsulated, the implementors can do whatever makes sense behind
the scenes, and that very likely means different things in different
contexts.
But the corner-case of handling “text” data with malformed sequences
in it will be very difficult and painful, no? With C and byte strings
it’s very easy..
Post by Larry Wall
I try hard not to be a linguistic imperialist (when I try at all). :-)
☺ ☻ ☺ ☻ (happy multiracial smileys)
Post by Larry Wall
Anyway, if anyone wants to give me specific feedback on the current
design of Perl 6, that'd be cool. Though perl6-***@perl.org would
probably be a better forum for that.
The only feedback I'd like to give is to ask that if the nasty warning
messages are kept, they should be applied to characters in the range
128-255 as well, not just characters >255.

Also.. is there a clean way to deal with the issue (aside from just
disabling warnings) on a perl build without PerlIO (and thus no
working binmode)?
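
One hedged possibility, not from the thread: encode explicitly before
printing, so that only bytes ever reach the byte-oriented stdio layer
and the warning never triggers (no PerlIO layer needed).

use Encode qw(encode_utf8);

my $msg = "smiley: \x{263A}\n";    # a wide string (contains a character > 255)
print encode_utf8($msg);           # prints UTF-8 bytes; no "Wide character" warning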

Finally, I must admit I’m not at all a Perl fan, so maybe take what I
say with a grain of salt. I just wish Perl scripts I obtain from
others would work more comfortably without making me have to think
about the nonstandard (compared to the rest of a unix system)
treatment they’re giving to character encoding.

Rich
Daniel B.
2007-04-05 03:56:35 UTC
Permalink
Rich Felker wrote:
...
Post by Rich Felker
Null termination is not the security problem. Broken languages that
DON'T use null-termination are the security problem, particularly
mixing them with C.
C is the language that handles one out of 256 possible byte values
inconsistently (with respect to the other 255) (in C strings).

The other languages handle all 256 byte values consistently.

Why isn't it C that is a bit broken (that has an irregular limitation)?



Daniel
--
Daniel Barclay
***@smart.net
Rich Felker
2007-04-05 06:04:04 UTC
Permalink
Post by Daniel B.
....
Post by Rich Felker
Null termination is not the security problem. Broken languages that
DON'T use null-termination are the security problem, particularly
mixing them with C.
C is the language that handles one out of 256 possible byte values
inconsistently (with respect to the other 255) (in C strings).
Having a standard designated byte that can be treated specially is
very useful in practice. If there weren't such a powerful force
establishing NUL as the one, we'd have all sorts of different
conventions. Just look how much that already happens anyway... the use
of : as a separator in PATH-type strings, the use of spaces to
separate command line arguments, the use of = to separate environment
variable names from values, etc.. Having a character you know can't
occur in text (not just by arbitrary rules, but because it's actually
impossible for it to be passed in a C string) is nice because there's
at least one character you know is always safe to use for app-internal
in-band signalling. Notice also how GNU find/xargs use NUL to cleanly
separate filenames, relying on the fact that it could never occur
embedded in a filename.
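
As a concrete sketch of the same convention from the consuming side
(assuming GNU find's -print0, as mentioned above): set Perl's input
record separator to NUL and read one filename per record.

open my $fh, '-|', 'find', '.', '-print0' or die "find: $!";
{
    local $/ = "\0";               # records are NUL-terminated
    while (my $name = <$fh>) {
        chomp $name;               # strips the trailing NUL
        print "got: $name\n";
    }
}
close $fh;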

You can ask what would have happened if C had used pascal-style
strings. I suspect we would have been forced to deal with ridiculously
small length limits, controversial ABI changes to correct for it, etc.
Certainly for many types of applications it's beneficial to use smarter
data structures for text internally (more complex even than just
pascal style strings), but I think C made a very good choice in using
the simplest possible representation for communicating reasonable-size
strings between the application, the system, and all the various
libraries that have followed the convention.
Post by Daniel B.
The other languages handle all 256 byte values consistently.
Which ones? Now I think you're being hypocritical. One moment you're
applauding treating text as a sequence of Unicode codepoints in a way
that's not binary-clean for files containing invalid sequences, and
then you're complaining about C strings not being binary-clean because
NUL is a terminator. NUL is not text. Arguably other control
characters aside from newline (and perhaps tab) are not text either.
If you want to talk about binary data instead of text, then C isn't
doing anything inconsistent. The functions for dealing with binary
data (memcpy/memmove/memcmp/etc.) don't treat NUL specially of course.
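
For reference, a tiny Perl sketch of the "all 256 byte values as
ordinary data" handling under discussion, as it looks from Perl:

my $blob = "foo\0bar";
print length($blob), "\n";         # 7 -- the NUL is counted like any other byte
print index($blob, "\0"), "\n";    # 3 -- and can be searched for explicitly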

There are plenty of languages which can't handle control characters in
strings well at all, much less NUL. I suspect most of the ones that
handle NUL the way you'd like them to also clobber invalid sequences
due to using UTF-16 internally.
Post by Daniel B.
Why isn't it C that is a bit broken (that has irregular limitation)?
Because C was there first and C is essentially the only standardized
language. When your applications run on top of a system built upon C
and POSIX you have to play by the C and POSIX rules. Ignoring this
necessity is what got Firefox burned.

Rich

P.S. If you really want to debate what I said about C being the only
standardized language/the authority/whatever, let's take it off-list
because we've gotten way off-topic from utf-8 handling already. I have
reasons for what I say, but I really don't want to burden this list
with more off-topic sub-thread spinoffs.
Marcin 'Qrczak' Kowalczyk
2007-04-05 10:54:54 UTC
Permalink
Post by Rich Felker
Just look how much that already happens anyway... the use
of : as a separator in PATH-type strings, the use of spaces to
separate command line arguments, the use of = to separate environment
variable names from values, etc..
Do you propose to replace them with NULs? This would make no sense.
A single environment variable can contain a whole PATH-type string.
You can’t use NUL to delimit the whole string *and* its components
at the same time. Different contexts require different delimiters
if a string from one context is to be able to contain a sequence of
another one.
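
For concreteness, the nesting being described, sketched in Perl: the PATH
value is one environment string (NUL-terminated at the C level), and ':'
delimits the components inside that one value.

my @dirs = split /:/, $ENV{PATH};
print "$_\n" for @dirs;
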
Post by Rich Felker
Having a character you know can't
occur in text (not just by arbitrary rules, but because it's actually
impossible for it to be passed in a C string) is nice because there's
at least one character you know is always safe to use for app-internal
in-band signalling.
Notice also how GNU find/xargs use NUL to cleanly
separate filenames, relying on the fact that it could never occur
embedded in a filename.
because you show an example where NUL *is* used in text, and it’s used
not internally but in communication between two programs.
Post by Rich Felker
Post by Daniel B.
The other languages handle all 256 byte values consistently.
Which ones?
All languages besides C, except toy interpreters written in C by some
students.
Post by Rich Felker
There are plenty of languages which can't handle control characters in
strings well at all, much less NUL.
I don’t know any such language.
Post by Rich Felker
Because C was there first and C is essentially the only standardized
language.
Nonsense.
Post by Rich Felker
When your applications run on top of a system built upon C
and POSIX you have to play by the C and POSIX rules.
Only during communication with the system.

The only influence of C on string representation in other languages
is that it’s common to redundantly have NUL stored after the string
*in addition* to storing the length explicitly, so in cases the string
doesn’t contain NUL itself it’s possible to pass the string to a C
function without copying its contents.
--
__("< Marcin Kowalczyk
\__/ ***@knm.org.pl
^^ http://qrnik.knm.org.pl/~qrczak/
Rich Felker
2007-04-05 15:32:33 UTC
Permalink
Post by Marcin 'Qrczak' Kowalczyk
Post by Rich Felker
Just look how much that already happens anyway... the use
of : as a separator in PATH-type strings, the use of spaces to
separate command line arguments, the use of = to separate environment
variable names from values, etc..
Do you propose to replace them with NULs? This would make no sense.
Of course not.
Post by Marcin 'Qrczak' Kowalczyk
A single environment variable can contain a whole PATH-type string.
You can’t use NUL to delimit the whole string *and* its components
at the same time. Different contexts require different delimiters
if a string from one context is to be able to contain a sequence of
another one.
My point is that the first level of in-band signalling is already
standardized, making for one less.
Post by Marcin 'Qrczak' Kowalczyk
Post by Rich Felker
Having a character you know can't
occur in text (not just by arbitrary rules, but because it's actually
impossible for it to be passed in a C string) is nice because there's
at least one character you know is always safe to use for app-internal
in-band signalling.
No. Inflammatory accusations like this are rather hasty and
inappropriate...
Post by Marcin 'Qrczak' Kowalczyk
Post by Rich Felker
Notice also how GNU find/xargs use NUL to cleanly
separate filenames, relying on the fact that it could never occur
embedded in a filename.
because you show an example where NUL *is* used in text, and it’s used
not internally but in communication between two programs.
That's not text. It's binary data containing a sequence of text
strings. The assumption that pipes==text is one of the most common
incorrect perceptions about unix, caused most likely by bad experience
with DOS pipes.
Post by Marcin 'Qrczak' Kowalczyk
Post by Rich Felker
Post by Daniel B.
The other languages handle all 256 byte values consistently.
Which ones?
All languages besides C, except toy interpreters written in C by some
students.
False.
Post by Marcin 'Qrczak' Kowalczyk
Post by Rich Felker
There are plenty of languages which can't handle control characters in
strings well at all, much less NUL.
I don’t know any such language.
sed, awk, bourne shell, ....
Post by Marcin 'Qrczak' Kowalczyk
Post by Rich Felker
Because C was there first and C is essentially the only standardized
language.
Nonsense.
Like I said, if you want to debate this, email me off-list. It's quite
true, but mostly unrelated to the practical issues being discussed
here.
Post by Marcin 'Qrczak' Kowalczyk
Post by Rich Felker
When your applications run on top of a system built upon C
and POSIX you have to play by the C and POSIX rules.
Only during communication with the system.
The only influence of C on string representation in other languages
is that it’s common to redundantly have NUL stored after the string
*in addition* to storing the length explicitly, so in cases the string
doesn’t contain NUL itself it’s possible to pass the string to a C
function without copying its contents.
This is bad design that leads to the sort of bugs seen in Firefox. If
we were living back in the 8bit codepage days, it might make sense for
these languages to try to unify byte arrays and character strings, but
we're not. There's no practical reason a character string needs to
store the NUL character (it's already not binary-clean due to UTF-8)
and thus no reason to introduce this blatant incompatibility (which
almost always turns into bugs and vulnerabilities) with the underlying
system.

Also note that there's nothing "backwards" about using termination
instead of length+data. For example it's the natural way a string
would be represented in a pure (without special string type) lisp-like
language. (Of course using a list is still binary clean because the
terminator is in the cdr rather than the car.) And like with lists, C
strings have the advantage that a terminal substring of the original
string is already a string in-place, without copying.

Rich
Marcin 'Qrczak' Kowalczyk
2007-04-07 11:46:22 UTC
Permalink
Post by Rich Felker
My point is that the first level of in-band signalling is already
standardized, making for one less.
The issue was whether NUL characters should be excluded from the string
type in a programming language, and whether the internal representation
of strings should rely on NUL as the terminator (as opposed to storing
the length separately).

I claim that it would be a bad idea. The interface for working with
strings can be just as convenient when NUL is not excluded, and there are
cases where programs deal with strings containing NULs, so excluding them
is harmful. Even if NUL is special in some OS API, it's not a reason to
make it special in core string handling.
Post by Rich Felker
Post by Marcin 'Qrczak' Kowalczyk
Post by Rich Felker
There are plenty of languages which can't handle control characters in
strings well at all, much less NUL.
I don’t know any such language.
sed, awk, bourne shell, ....
True for mawk, bash, and ksh, but not true for GNU sed, gawk, and zsh,
which are capable of storing NULs in user strings, and these NULs are
correctly passed to and processed by internal shell commands.

Anyway, these are exceptions rather than the rule.
Post by Rich Felker
Post by Marcin 'Qrczak' Kowalczyk
The only influence of C on string representation in other languages
is that it’s common to redundantly have NUL stored after the string
*in addition* to storing the length explicitly, so in cases the string
doesn’t contain NUL itself it’s possible to pass the string to a C
function without copying its contents.
This is bad design that leads to the sort of bugs seen in Firefox. If
we were living back in the 8bit codepage days, it might make sense for
these languages to try to unify byte arrays and character strings, but
we're not.
This is another issue. I’m for distinguishing character strings from
byte strings. I’m against making U+0000 or 0 a special case in either
of them.

For example in my language Kogut a string is a sequence of Unicode code
points. My implementation uses two string representations internally:
if it contains no characters above U+00FF, then it’s stored as a
sequence of bytes, otherwise it’s a sequence of 32-bit integers.
This variation is not visible in the language. The narrow case has
a redundant NUL appended. When a string is passed to some C function
and the function expects the default encoding (normally taken from
the locale), then — under the assumption that a default encoding
is ASCII-compatible — if the string contains only ASCII characters
excluding NUL, a pointer to the string data is passed. Otherwise
a recoded array of bytes is created. This is quite a practical reason
to store the redundant NULs, even though NUL is not special as far as
the string type is concerned. Most strings manipulated by average
programs are ASCII-only.
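
As an aside for comparison (not from the thread): Perl 5 keeps a similar
pair of internal representations, a narrow byte form and an upgraded
UTF-8 form, which can be observed even though it is not supposed to
matter to the program.

my $s = "abc";
print utf8::is_utf8($s) ? "internal UTF-8 form\n" : "narrow byte form\n";
utf8::upgrade($s);        # switch representation; the text itself is unchanged
print utf8::is_utf8($s) ? "internal UTF-8 form\n" : "narrow byte form\n";
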
Post by Rich Felker
Also note that there's nothing "backwards" about using termination
instead of length+data. For example it's the natural way a string
would be represented in a pure (without special string type) lisp-like
language. (Of course using a list is still binary clean because the
terminator is in the cdr rather than the car.)
The parenthesized remark is crucial. Lisp lists use an out-of-band
terminator, not in-band.
Post by Rich Felker
And like with lists, C
strings have the advantage that a terminal substring of the original
string is already a string in-place, without copying.
This is too small an advantage to overcome the inability to store NULs
and the lack of O(1) length check (which rules out bounds checking on
indexing), and it’s impractical with garbage collection anyway.
--
__("< Marcin Kowalczyk
\__/ ***@knm.org.pl
^^ http://qrnik.knm.org.pl/~qrczak/
Rich Felker
2007-04-07 14:21:13 UTC
Permalink
Post by Marcin 'Qrczak' Kowalczyk
For example in my language Kogut a string is a sequence of Unicode code
points. My implementation uses two string representations internally:
if it contains no characters above U+00FF, then it’s stored as a
sequence of bytes, otherwise it’s a sequence of 32-bit integers.
This variation is not visible in the language. The narrow case has
a redundant NUL appended. When a string is passed to some C function
and the function expects the default encoding (normally taken from
the locale), then — under the assumption that a default encoding
is ASCII-compatible — if the string contains only ASCII characters
excluding NUL, a pointer to the string data is passed. Otherwise
I hope you generate an exception (or whatever the appropriate error
behavior is) if the string contains a NUL byte other than the
terminator when it's passed to C functions. Otherwise you risk the
same sort of vuln that burned Firefox. Passing a string with embedded
NULs where a NUL-terminated string is expected is an incompatible type
error, and a self-respecting HLL should catch and protect you from
this.
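
A minimal sketch of the kind of check being asked for, in Perl (the
helper name is made up):

sub assert_no_nul {
    my ($s, $what) = @_;
    die "embedded NUL passed to $what\n" if index($s, "\0") >= 0;
    return $s;
}

# e.g. open(my $fh, '<', assert_no_nul($path, 'open')) or die "open: $!";
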
Post by Marcin 'Qrczak' Kowalczyk
a recoded array of bytes is created. This is quite a practical reason
to store the redundant NULs, even though NUL is not special as far as
the string type is concerned. Most strings manipulated by average
programs are ASCII-only.
Using UTF-8 would have accomplished the same thing without
special-casing. Then even non-ASCII strings would use less memory. As
discussed recently on this list, most if not all of the advantages of
UTF-32 over UTF-8 are mythical.
Post by Marcin 'Qrczak' Kowalczyk
Post by Rich Felker
Also note that there's nothing "backwards" about using termination
instead of length+data. For example it's the natural way a string
would be represented in a pure (without special string type) lisp-like
language. (Of course using a list is still binary clean because the
terminator is in the cdr rather than the car.)
The parenthesized remark is crucial. Lisp lists use an out-of-band
terminator, not in-band.
Indeed, the point was more about the O(n) thing not being a problem.
Post by Marcin 'Qrczak' Kowalczyk
Post by Rich Felker
And like with lists, C
strings have the advantage that a terminal substring of the original
string is already a string in-place, without copying.
This is too small an advantage to overcome the inability to store NULs
and the lack of O(1) length check (which rules out bounds checking on
indexing), and it’s impractical with garbage collection anyway.
C strings are usually used for small strings for which O(n) is O(1)
because n is bounded by, say, 4096 (PATH_MAX). Whenever discussing
these issues, it's essential to be aware that a plain string, whether
NUL-terminated or pascal-style, is unsuitable for a large class of
uses including any data that will frequently be edited. This is
because insertion or deletion is O(n). A reasonable program working
with large textual datasets will keep small strings (maybe lines, or
maybe arbitrary chunks of a certain max size) in a list structure with
a higher level data structure indexing them.
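
A toy Perl sketch of that structure, with lines as the chunks: an edit
moves array entries around rather than rewriting one huge scalar.

my @lines = ("first line\n", "second line\n", "third line\n");
splice @lines, 1, 0, "inserted line\n";    # touches the index, not every byte
print @lines;                              # flatten only when output is needed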

Certainly very high level languages could do the same with ordinary
strings, providing efficient primitives for insertion, deletion,
splitting, etc., but for the majority of tiny strings the overhead may
be a net loss. I kinda prefer the Emacs Lisp approach of having
immutable string objects for small jobs and full-fledged emacs buffers
for heavyweight text processing.

At this point I'm not sure to what degree this thread is
off-/on-topic. If any list members are offended by its continuation,
please say so.

~Rich
Marcin 'Qrczak' Kowalczyk
2007-04-07 18:21:25 UTC
Permalink
Post by Rich Felker
I hope you generate an exception (or whatever the appropriate error
behavior is) if the string contains a NUL byte other than the
terminator when it's passed to C functions.
I do.
Post by Rich Felker
Using UTF-8 would have accomplished the same thing without
special-casing.
Then iterating over strings and specifying string fragments could not be
done by code point indices, and it's not obvious what a good interface
should look like. Operations like splitting on whitespace would no
longer have simple implementations based on examining successive code
points.
Post by Rich Felker
Post by Marcin 'Qrczak' Kowalczyk
This is too small an advantage to overcome the inability to store NULs
and the lack of O(1) length check (which rules out bounds checking on
indexing), and it’s impractical with garbage collection anyway.
C strings are usually used for small strings for which O(n) is O(1)
because n is bounded by, say, 4096 (PATH_MAX).
This still rules out bounds checking. If each s[i] among 4096 indexing
operations has the cost of 4096-i, then 8M might become noticeable.
--
__("< Marcin Kowalczyk
\__/ ***@knm.org.pl
^^ http://qrnik.knm.org.pl/~qrczak/
Rich Felker
2007-04-07 20:04:47 UTC
Permalink
Post by Marcin 'Qrczak' Kowalczyk
Post by Rich Felker
Using UTF-8 would have accomplished the same thing without
special-casing.
Then iterating over strings and specifying string fragments could not be
done by code point indices, and it's not obvious what a good interface
should look like.
One idea is to have a 'point' in a string be an abstract data type
rather than just an integer index. In reality it would just be a UTF-8
byte offset.
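
A rough sketch of that idea in Perl, assuming well-formed UTF-8 held as
bytes (the helper name is made up): the "point" is a byte offset, and
stepping it forward just decodes the lead byte.

# $buf holds UTF-8 bytes (not an upgraded Perl character string).
sub next_point {
    my ($buf, $point) = @_;
    my $lead = ord substr($buf, $point, 1);
    return $point + ($lead < 0x80 ? 1      # ASCII
                   : $lead < 0xE0 ? 2      # 2-byte sequence
                   : $lead < 0xF0 ? 3      # 3-byte sequence
                   :                4);    # 4-byte sequence
}
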
Post by Marcin 'Qrczak' Kowalczyk
Operations like splitting on whitespace would no
longer have simple implementations based on examining successive code
points.
Sure it would. Accessing a character at a point would still evaluate
to a character. Instead of if/else for 8bit/32bit string, you'd just
have a UTF-8 operation.
Post by Marcin 'Qrczak' Kowalczyk
This still rules out bounds checking. If each s[i] among 4096 indexing
operations has the cost of 4096-i, then 8M might become noticeable.
Indeed this is true. Here's a place where you're very right: a HLL
which does bounds checking will want to know (at the implementation
level) the size of arrays. On the other hand this information is
useless to C, and if you're writing C, it's your responsibility to
know whether the offset you access is safe before you access it.
Different languages and target audiences/domains have different
requirements.

Rich

SrinTuar
2007-03-29 21:50:16 UTC
Permalink
Post by Larry Wall
It's just my gut-level feeling that the traditional world of C, Unix,
locales, etc. simply does not provide appropriate abstractions to deal
with internationalization. Yes, you can get there if you throw enough
libraries and random functions and macros and pipes and filters at it,
but the basic abstractions leak like a sieve. It's time to clean it
all up.
Beyond encoding, locales aren't all that bad. I think toupper is broken for
language-specific case folding. gettext isn't a beauty queen, but it works
well enough considering the job it has to do.
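
For example (a hedged illustration, not from the original post): Turkish
orthography uppercases "i" to dotted U+0130, which a locale-blind toupper
or uc never produces.

binmode STDOUT, ':encoding(UTF-8)';
print uc("i"), "\n";        # "I" -- the default, language-independent mapping
print "\x{130}", "\n";      # "İ" -- what Turkish-aware case folding would give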

I'd be interested to know which parts you find to be of poor design?
Post by Larry Wall
Anyway, if anyone wants to give me specific feedback on the current
design of Perl 6, that'd be cool.
I did take a peek, and I would like to ask a few questions, actually.

What is a "Unicode abstraction level"? It seems to only be mentioned
in the context of perl6.
I assume it means the default way to operate on strings (by bytes,
code points, characters, etc.)

Also, if "A Str is a Unicode string object.", which encoding is it internally?
I.e., if I have a Str $A and I say "$A.bytes", should I have any
expectation for what to get for a given string literal? It seems perl
will effectively be enforcing this encoding. (If that is utf-8, I have
no objections, though.)

Lastly, just as an overall comment: The Str concept does seem to be a
bit heavy, but at the same time if most of the operations are "lazy"
then it would seem to be a simple thing for me to set my default such
that all operations would happen in byte mode. As long as the library
routines do not make assumptions about what my global default is, that
should work fine. (But if setting the UAL to "bytes" breaks all
unicode-aware library functions, that would make it useless, though.)

For example:
If I read in a list of filenames into "Str" scalars, some of the
filenames will be valid utf-8 and some will have garbagy parts. I'd like
to be able to perform an operation, such as appending some other Str to
them, then creating a new file with the new name, and perhaps printing
out said name.

If that happens without my program having to jump through any hoops,
or getting scolded with some warnings to stderr, then I'll be
satisfied. I won't want to know or care how many morphemes are in the
strings, or whether or not they contain invalid sequences, and if the
Str class is truly lazy, then it won't care either. No code will ever
bother to investigate the number of glyphs until someone asks. That,
I could agree with :)
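
For what it's worth, a Perl 5 sketch of the filename workflow described
above, staying in byte mode throughout (the ".bak" suffix is made up):
as long as nothing decodes the names, malformed sequences pass through
untouched.

opendir my $dh, '.' or die "opendir: $!";
for my $name (readdir $dh) {
    next if $name eq '.' or $name eq '..';
    my $new = "$name.bak";             # appending is plain byte concatenation
    open my $out, '>', $new or die "open $new: $!";
    close $out;
    print "created: $new\n";           # the bytes go out exactly as they came in
}
closedir $dh;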

I know I definitely want substr and its ilk to work on the bytes, but
one tricky part might be regexes: what I would want is the regex
engine to work in a utf-8-aware mode, but to gracefully handle the
presence of bad sequences, and not just croak when it sees them
(similar to how "Encode" allows me to specify whether to croak or ride it out).

If you think the above comments and questions would be better placed
there, I'd be happy to repost in that list.