Discussion:
Unicode, ISO/IEC 10646 Synchronization Issues for UTF-8
Christopher Fynn
2007-04-24 11:49:40 UTC
Permalink
"ISO/IEC 10646 JTC1/SC2/WG2 N3248 Synchronization Issues for UTF-8"
see: <http://std.dkuug.dk/jtc1/sc2/wg2/docs/n3248.doc>

The document referenced above proposes changes to the specification of
UTF-8 in Annex D of the ISO/IEC 10646 Standard to make it effectively
the same as the specification of UTF-8 in the Unicode Standard.

- Chris
SrinTuar
2007-04-24 20:43:59 UTC
Permalink
Basically, it's a proposal to cap at U+10FFFF.

I see no reason to cap utf-8 and utf-32 just to deal with the
limitations of utf-16.

As long as you don't attempt to convert to utf-16, it should not be a
problem. (and eventually, utf-16 should be phased out)
Post by Christopher Fynn
"ISO/IEC 10646 JTC1/SC2/WG2 N3248 Synchronization Issues for UTF-8"
see: <http://std.dkuug.dk/jtc1/sc2/wg2/docs/n3248.doc>
The document referenced above proposes changes to the specification of
UTF-8 in Annex D of the ISO/IEC 10646 Standard to make it effectively
the same as the specification of UTF-8 in the Unicode Standard.
Rich Felker
2007-04-24 21:23:13 UTC
Permalink
Post by SrinTuar
Basically, it's a proposal to cap at U+10FFFF.
I see no reason to cap utf-8 and utf-32 just to deal with the
limitations of utf-16.
As long as you don't attempt to convert to utf-16, it should not be a
problem. (and eventually, utf-16 should be phased out)
Capping is a good thing, and 21 bits is exactly the point you want to
cap at. Not only does it ensure that required table indices for UCS
support can't grow unmanageably large; it also ensures that UTF-8 is no
larger than UTF-32, so that conversion can be done in-place in
situations where storage space is limited.
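
(A rough sketch of the size argument, with hypothetical helper names
and assuming the cap is respected: every scalar value up to U+10FFFF
fits in at most 4 UTF-8 bytes, i.e. within the space of one UTF-32 code
unit, so a UTF-32 buffer can be overwritten front-to-back.)

#include <stdint.h>
#include <stddef.h>

/* Encode one scalar value (assumed <= U+10FFFF and not a surrogate)
 * as UTF-8; returns the number of bytes written, which is at most 4
 * -- never more than one UTF-32 code unit occupies. */
static size_t put_utf8(uint32_t c, unsigned char *buf)
{
    if (c < 0x80) {
        buf[0] = c;
        return 1;
    } else if (c < 0x800) {
        buf[0] = 0xC0 | (c >> 6);
        buf[1] = 0x80 | (c & 0x3F);
        return 2;
    } else if (c < 0x10000) {
        buf[0] = 0xE0 | (c >> 12);
        buf[1] = 0x80 | ((c >> 6) & 0x3F);
        buf[2] = 0x80 | (c & 0x3F);
        return 3;
    } else { /* c <= 0x10FFFF thanks to the cap */
        buf[0] = 0xF0 | (c >> 18);
        buf[1] = 0x80 | ((c >> 12) & 0x3F);
        buf[2] = 0x80 | ((c >> 6) & 0x3F);
        buf[3] = 0x80 | (c & 0x3F);
        return 4;
    }
}

/* UTF-32 -> UTF-8 in place: each code point is read before anything is
 * written over it, and the write cursor never overtakes the read
 * cursor, because every code point shrinks from 4 bytes to <= 4. */
static size_t utf32_to_utf8_inplace(uint32_t *buf, size_t n)
{
    unsigned char *out = (unsigned char *)buf;
    size_t len = 0, i;
    for (i = 0; i < n; i++)
        len += put_utf8(buf[i], out + len);
    return len;
}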

Almost all present-day scripts have already been encoded, and plenty
of historical ones too. Even 18 or 19 bits would have been plenty. I
see no legitimate practical argument against a 21-bit limit; it just
increases the potential for implementation complexity with no
benefits.

Rich
Marcin 'Qrczak' Kowalczyk
2007-04-25 00:37:53 UTC
Permalink
On Tue, 24-04-2007 at 16:43 -0400, SrinTuar
Post by SrinTuar
Basically, it's a proposal to cap at U+10FFFF.
I see no reason to cap utf-8 and utf-32 just to deal with the
limitations of utf-16.
Too late, it has already been decided.

And it’s harmless. There aren’t going to be any characters above
U+10FFFF ever allocated by ISO/IEC 10646 or Unicode. So it’s better
to conform to the standards, to improve consistency and reliability.
--
__("< Marcin Kowalczyk
\__/ ***@knm.org.pl
^^ http://qrnik.knm.org.pl/~qrczak/
Christopher Fynn
2007-04-26 09:44:33 UTC
Permalink
N3266

UCS Transformation Formats summary, non-error and error sequences –
feedback on N3248

<http://std.dkuug.dk/jtc1/sc2/wg2/docs/N3266.doc>

- c
Rich Felker
2007-04-26 23:33:52 UTC
Permalink
Post by Christopher Fynn
N3266
UCS Transformation Formats summary, non-error and error sequences –
feedback on N3248
<http://std.dkuug.dk/jtc1/sc2/wg2/docs/N3266.doc>
I must say this is a rather stupid looking proposal. The C0 controls
already have application-defined semantics; trying to give them a
universal meaning like this is a very bad idea. Keep in mind that
U+001A is ^Z, so for example if a terminal emulator converted bogus
UTF-8 from an X11 paste into this character, it would send (possibly
many) suspend commands to the application. Certainly not what the user
had in mind!!

Moreover, C0 and C1 control codes (minus newline and perhaps tab),
along with Unicode line/paragraph separator, should be considered
INVALID in plain text themselves. So generating them as a means of
error replacement is counterproductive as the ^Z's could be seen as
errors in themselves.

Also note that ^Z is DOS EOF. I bet some bad Windows software would
truncate files at the first ^Z...

Finally, I think the fact that this document was submitted in MS Word
form speaks for the author's qualifications (or lack thereof) to
design such a specification...

Rich
McDonald, Ira
2007-04-27 04:02:12 UTC
Permalink
Hi,

One comment - ALL ISO working documents are now
submitted in MS Word form (for better or worse),
so the criticism below is unfounded.

Now, I agree that re-assigning semantics to the C0
controls is thoroughly unwise.

Cheers,
- Ira

Ira McDonald (Musician / Software Architect)
Chair - Linux Foundation Open Printing WG
Blue Roof Music / High North Inc
PO Box 221 Grand Marais, MI 49839
phone: +1-906-494-2434
email: ***@sharplabs.com

-----Original Message-----
From: linux-utf8-***@nl.linux.org
[mailto:linux-utf8-***@nl.linux.org] On Behalf Of Rich Felker
Sent: Thursday, April 26, 2007 6:34 PM
To: linux-***@nl.linux.org
Subject: Re: Unicode, ISO/IEC 10646 Synchronization Issues for UTF-8
Post by Christopher Fynn
N3266
UCS Transformation Formats summary, non-error and error sequences –
feedback on N3248
<http://std.dkuug.dk/jtc1/sc2/wg2/docs/N3266.doc>
I must say this is a rather stupid looking proposal. The C0 controls
already have application-defined semantics; trying to give them a
universal meaning like this is a very bad idea. Keep in mind that
U+001A is ^Z, so for example if a terminal emulator converted bogus
UTF-8 from an X11 paste into this character, it would send (possibly
many) suspend commands to the application. Certainly not what the user
had in mind!!

Moreover, C0 and C1 control codes (minus newline and perhaps tab),
along with Unicode line/paragraph separator, should be considered
INVALID in plain text themselves. So generating them as a means of
error replacement is counterproductive as the ^Z's could be seen as
errors in themselves.

Also note that ^Z is DOS EOF. I bet some bad Windows software would
truncate files at the first ^Z...

Finally, I think the fact that this document was submitted in MS Word
form speaks for the author's qualifications (or lack thereof) to
design such a specification...

Rich
Christopher Fynn
2007-04-27 11:15:16 UTC
Permalink
Post by Rich Felker
Post by Christopher Fynn
N3266
UCS Transformation Formats summary, non-error and error sequences –
feedback on N3248
<http://std.dkuug.dk/jtc1/sc2/wg2/docs/N3266.doc>
I must say this is a rather stupid looking proposal. The C0 controls
already have application-defined semantics; trying to give them a
universal meaning like this is a very bad idea. Keep in mind that
U+001A is ^Z, so for example if a terminal emulator converted bogus
UTF-8 from an X11 paste into this character, it would send (possibly
many) suspend commands to the application. Certainly not what the user
had in mind!!
Moreover, C0 and C1 control codes (minus newline and perhaps tab),
along with Unicode line/paragraph separator, should be considered
INVALID in plain text themselves. So generating them as a means of
error replacement is counterproductive as the ^Z's could be seen as
errors in themselves.
Also note that ^Z is DOS EOF. I bet some bad Windows software would
truncate files at the first ^Z...
N3266 was discussed and rejected by WG2 yesterday. As you pointed out
there are all sorts of problems with this proposal, and accepting it
would break many existing implementations.
Post by Rich Felker
Finally, I think the fact that this document was submitted in MS Word
form speaks for the author's qualifications (or lack thereof) to
design such a specification...
WG2 documents are all supposed to be submitted in MS Word .doc format -
fortunately OO.o Writer can also generate this file format. I got away
with submitting N3240 in PDF format generated by OO.o.

- Chris
Rich Felker
2007-04-27 17:52:17 UTC
Permalink
Post by Christopher Fynn
N3266 was discussed and rejected by WG2 yesterday. As you pointed out
there are all sorts of problems with this proposal, and accepting it
would break many existing implementations.
That's good to hear. As a follow-up, I think the whole idea of trying to
standardize error handling is flawed. What you should do when
encountering invalid data varies a lot depending on the application.
For filenames or text file contents you probably want to avoid
corrupting them at all costs, even if they contain illegal sequences,
to avoid catastrophic data loss or vulnerabilities. On the other hand,
when presenting or converting data, there are many approaches that are
all acceptable. These include dropping the corrupt data, replacing it
with U+FFFD, or even interpreting the individual bytes according to a
likely legacy codepage. This last option is popular for example in IRC
clients and works well to deal with the stragglers who refuse to
upgrade their clients to use UTF-8. Also, some applications may wish
to give fatal errors and refuse to process data at all unless it's
valid to begin with.
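
As a sketch of the U+FFFD option (hypothetical helper; the recovery
policy here, roughly one replacement per undecodable byte, is just one
of the acceptable choices mentioned above):

#include <stdint.h>
#include <stddef.h>

/* Decode UTF-8 to UTF-32, writing U+FFFD wherever the input cannot be
 * decoded. Overlong forms, surrogates and values above U+10FFFF are
 * rejected. 'out' must have room for at least n code points. */
static size_t decode_with_replacement(const unsigned char *s, size_t n,
                                      uint32_t *out)
{
    static const uint32_t minval[] = { 0, 0x80, 0x800, 0x10000 };
    size_t i = 0, o = 0;
    while (i < n) {
        uint32_t c = s[i];
        size_t need, got;
        if (c < 0x80)                { out[o++] = c; i++; continue; }
        else if ((c & 0xE0) == 0xC0) { need = 1; c &= 0x1F; }
        else if ((c & 0xF0) == 0xE0) { need = 2; c &= 0x0F; }
        else if ((c & 0xF8) == 0xF0) { need = 3; c &= 0x07; }
        else                         { out[o++] = 0xFFFD; i++; continue; }
        for (got = 0; got < need && i + 1 + got < n
                   && (s[i + 1 + got] & 0xC0) == 0x80; got++)
            c = (c << 6) | (s[i + 1 + got] & 0x3F);
        if (got < need || c < minval[need]
            || (c >= 0xD800 && c <= 0xDFFF) || c > 0x10FFFF) {
            out[o++] = 0xFFFD; /* replace, then resync on the next byte */
            i++;
        } else {
            out[o++] = c;
            i += 1 + need;
        }
    }
    return o;
}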

Rich
Christopher Fynn
2007-04-27 18:13:27 UTC
Permalink
Post by Rich Felker
Post by Christopher Fynn
N3266 was discussed and rejected by WG2 yesterday. As you pointed out
there are all sorts of problems with this proposal, and accepting it
would break many existing implementations.
That's good to hear. As a follow-up, I think the whole idea of trying to
standardize error handling is flawed. What you should do when
encountering invalid data varies a lot depending on the application.
For filenames or text file contents you probably want to avoid
corrupting them at all costs, even if they contain illegal sequences,
to avoid catastrophic data loss or vulnerabilities. On the other hand,
when presenting or converting data, there are many approaches that are
all acceptable. These include dropping the corrupt data, replacing it
with U+FFFD, or even interpreting the individual bytes according to a
likely legacy codepage. This last option is popular for example in IRC
clients and works well to deal with the stragglers who refuse to
upgrade their clients to use UTF-8. Also, some applications may wish
to give fatal errors and refuse to process data at all unless it's
valid to begin with.
Rich
Yes. Someone who was there tells me the main reason it was rejected was
that it was considered out of scope for ISO 10646 or even Unicode to
dictate what a process should do in an error condition (throw an
exception, etc.). The UTF-8 validity specification is expressed in
terms of what constitutes a valid string or substring rather than what a
process needs to do in a given condition. Neither standard wants to get
into the game of standardizing API-type things like what processes
should do.

- Chris
Ben Wiley Sittler
2007-04-27 19:41:22 UTC
Permalink
glad it was rejected. the only really sensible approach i have yet
seen is utf-8b (see my take on it here:
http://bsittler.livejournal.com/10381.html and another implementation
here: http://hyperreal.org/~est/utf-8b/ )

the utf-8b approach is superior to many others in that binary is
preserved, but it does not inject control characters. instead it is an
extension to utf-8 that allows all byte sequences, both those that are
valid utf-8 and those that are not. when converting utf-8 <-> utf-16,
the bytes in invalid utf-8 sequences <-> unpaired utf-16 surrogates.
the correspondence is 1-1, so data is never lost. valid paired
surrogates are unaffected (and are used for characters outside the
bmp.)
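
a minimal sketch of the byte <-> code unit mapping, assuming the usual
utf-8b convention that an undecodable byte b is carried as the unpaired
low surrogate U+DC00 + b (so bytes 0x80..0xFF land in U+DC80..U+DCFF);
the helper names and exact code points here are assumptions, not quoted
from the pages above:

#include <stdint.h>

/* only bytes 0x80..0xFF ever need escaping; bytes below 0x80 are
 * always valid utf-8 on their own */
static uint16_t escape_byte(unsigned char b)
{
    return 0xDC00 + b;          /* invalid byte -> unpaired surrogate */
}

/* returns 1 and stores the original byte if u is an escaped byte;
 * returns 0 for ordinary code units (including real surrogate pairs,
 * which are handled by the normal utf-16 path) */
static int unescape_unit(uint16_t u, unsigned char *b)
{
    if (u >= 0xDC80 && u <= 0xDCFF) {
        *b = (unsigned char)(u - 0xDC00);
        return 1;
    }
    return 0;
}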

i realize i've mentioned this before, but i feel i should mention it
whenever someone mentions a non-data-preserving proposal (like
converting everything invalid to U+FFFD REPLACEMENT CHARACTER) or an
actively harmful proposal (like converting invalid bytes into U+001A
SUB which has well-defined and sometimes-destructive semantics.)
Post by Christopher Fynn
Post by Rich Felker
Post by Christopher Fynn
N3266 was discussed and rejected by WG2 yesterday. As you pointed out
there are all sorts of problems with this proposal, and accepting it
would break many existing implementations.
That's good to hear. As a follow-up, I think the whole idea of trying to
standardize error handling is flawed. What you should do when
encountering invalid data varies a lot depending on the application.
For filenames or text file contents you probably want to avoid
corrupting them at all costs, even if they contain illegal sequences,
to avoid catastrophic data loss or vulnerabilities. On the other hand,
when presenting or converting data, there are many approaches that are
all acceptable. These include dropping the corrupt data, replacing it
with U+FFFD, or even interpreting the individual bytes according to a
likely legacy codepage. This last option is popular for example in IRC
clients and works well to deal with the stragglers who refuse to
upgrade their clients to use UTF-8. Also, some applications may wish
to give fatal errors and refuse to process data at all unless it's
valid to begin with.
Rich
Yes. Someone who was there tells me the main reason it was rejected was
that it was considered out of scope for ISO 10646 or even Unicode to
dictate what a process should do in an error condition (throw an
exception, etc.). The UTF-8 validity specification is expressed in
terms of what constitutes a valid string or substring rather than what a
process needs to do in a given condition. Neither standard wants to get
into the game of standardizing API-type things like what processes
should do.
- Chris
Rich Felker
2007-04-27 21:01:20 UTC
Permalink
Post by Ben Wiley Sittler
glad it was rejected. the only really sensible approach i have yet
seen is utf-8b (see my take on it here:
http://bsittler.livejournal.com/10381.html and another implementation
here: http://hyperreal.org/~est/utf-8b/ )
the utf-8b approach is superior to many others in that binary is
preserved, but it does not inject control characters. instead it is an
extension to utf-8 that allows all byte sequences, both those that are
valid utf-8 and those that are not. when converting utf-8 <-> utf-16,
the bytes in invalid utf-8 sequences <-> unpaired utf-16 surrogates.
the correspondence is 1-1, so data is never lost. valid paired
surrogates are unaffected (and are used for characters outside the
bmp.)
this approach is perhaps reasonable for applications that want to use
utf-16 internally without corrupting invalid sequences in utf-8, but
it has problems too. for example it's not stable under string
concatenation or substring operations.

the whole reason utf-8 is usable comes from its self-synchronizing
property and the property that one character is never a substring of
another character. this necessarily forces the encoding to treat some
strings as invalid; that is, it's provably impossible to make an
encoding with the required properties where all strings are valid. as
a consequence, any treatment of invalid sequences as if they were
'special characters', like utf-8b does, will break all of the
essential properties. for some applications this may not matter; for
others it would be disastrous. it's certainly not possible to do such
a thing at the C library level (mb*towc family) without causing all
sorts of breakage.
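
for example, assuming a utf-8b-style decode() that escapes an invalid
byte b as U+DC00+b (hypothetical helper, same convention as above):

/*
 *   decode("\xC3")       ->  U+DCC3   (escaped byte)
 *   decode("\xA9")       ->  U+DCA9   (escaped byte)
 *   decode("\xC3\xA9")   ->  U+00E9   (e with acute; valid utf-8)
 *
 * so decode(a) followed by decode(b) is not the same as decode(a+b):
 * splitting or joining byte strings at arbitrary points changes the
 * result, which is the instability described above.
 */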

my view is that it's best to just leave the data in its original utf-8
form and not do conversions until 'just in time', for presentation,
character identification, etc. caching this 'presentation' form
alongside the data may be appropriate for many applications.

rich
Ben Wiley Sittler
2007-04-27 21:34:58 UTC
Permalink
yes, i agree. the utf-8b approach is useful mainly when sending binary
data through a utf-16 channel with the hope of recovering it at the
far side. once byte string or character string manipulations are
performed, all bets are off.
Post by Rich Felker
Post by Ben Wiley Sittler
glad it was rejected. the only really sensible approach i have yet
seen is utf-8b (see my take on it here:
http://bsittler.livejournal.com/10381.html and another implementation
here: http://hyperreal.org/~est/utf-8b/ )
the utf-8b approach is superior to many others in that binary is
preserved, but it does not inject control characters. instead it is an
extension to utf-8 that allows all byte sequences, both those that are
valid utf-8 and those that are not. when converting utf-8 <-> utf-16,
the bytes in invalid utf-8 sequences <-> unpaired utf-16 surrogates.
the correspondence is 1-1, so data is never lost. valid paired
surrogates are unaffected (and are used for characters outside the
bmp.)
this approach is perhaps reasonable for applications that want to use
utf-16 internally without corrupting invalid sequences in utf-8, but
it has problems too. for example it's not stable under string
concatenation or substring operations.
the whole reason utf-8 is usable comes from its self-synchronizing
property and the property that one character is never a substring of
another character. this necessarily forces the encoding to treat some
strings as invalid; that is, it's provably impossible to make an
encoding with the required properties where all strings are valid. as
a consequence, any treatment of invalid sequences as if they were
'special characters', like utf-8b does, will break all of the
essential properties. for some applications this may not matter; for
others it would be disastrous. it's certainly not possible to do such
a thing at the C library level (mb*towc family) without causing all
sorts of breakage.
my view is that it's best to just leave the data in its original utf-8
form and not do conversions until 'just in time', for presentation,
character identification, etc. caching this 'presentation' form
alongside the data may be appropriate for many applications.
rich