Discussion:
High-Speed UTF-8 to UTF-16 Conversion
Rob Cameron
2007-03-14 21:01:04 UTC
Permalink
As part of my research program into high-speed XML/Unicode/text
processing using SIMD techniques, I have experimented extensively
with the UTF-8 to UTF-16 conversion problem. I've generally been
comparing performance of my software with that of iconv under
Linux and Mac OS X. Are there any substantially faster implementations
available? Currently our u8u16-0.9 software runs about 3X to 25X faster
than iconv depending on platform and data characteristics.
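
For concreteness, the iconv baseline we compare against is essentially
a loop around a call like the following (a minimal sketch, not our
actual benchmark harness; buffer management and error recovery are
abbreviated):

#include <iconv.h>
#include <stdio.h>
#include <string.h>

int main(void)
{
    char in[] = "UTF-8 caf\xC3\xA9";          /* UTF-8 input          */
    char out[64];                             /* UTF-16LE output      */
    char *inp = in, *outp = out;
    size_t inleft = strlen(in), outleft = sizeof out;

    iconv_t cd = iconv_open("UTF-16LE", "UTF-8");
    if (cd == (iconv_t)-1) { perror("iconv_open"); return 1; }
    if (iconv(cd, &inp, &inleft, &outp, &outleft) == (size_t)-1)
        perror("iconv");
    printf("produced %zu output bytes\n", sizeof out - outleft);
    iconv_close(cd);
    return 0;
}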

u8u16-0.9 is available as open source software under an OSL 3.0 license
at http://u8u16.costar.sfu.ca/
--
Robert D. Cameron, Ph.D.
Professor of Computing Science
Simon Fraser University
Rich Felker
2007-03-15 17:06:54 UTC
Permalink
Post by Rob Cameron
As part of my research program into high-speed XML/Unicode/text
processing using SIMD techniques, I have experimented extensively
with the UTF-8 to UTF-16 conversion problem. I've generally been
comparing performance of my software with that of iconv under
Linux and Mac OS X. Are there any substantially faster implementations
available? Currently our u8u16-0.9 software runs about 3X to 25X faster
than iconv depending on platform and data characteristics.
GNU iconv is an extremely bad implementation to test for performance.
It has high overhead per call (so it will only be remotely fast on
very large runs, not on individual character conversions), and even then
I suspect it would not be very fast.

Why not just write the naive conversion algorithm yourself? For the
UTF-8 decoding, refer to uClibc's implementation of mbrtowc for UTF-8
locales, which is probably the fastest I've seen. I also have an
implementation in i386 asm which might be slightly faster.
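
For reference, the naive byte-at-a-time loop I have in mind is roughly
the following (a quick sketch, not uClibc's code; overlong-sequence and
surrogate-range checks are omitted for brevity):

#include <stdint.h>
#include <stddef.h>

/* Convert UTF-8 in src[0..n) to UTF-16 code units in dst.  Returns the
 * number of UTF-16 units written, or (size_t)-1 on malformed input. */
size_t u8_to_u16(const unsigned char *src, size_t n, uint16_t *dst)
{
    size_t i = 0, j = 0;
    while (i < n) {
        uint32_t c = src[i];
        if (c < 0x80) {                          /* ASCII fast path */
            dst[j++] = (uint16_t)c;
            i++;
            continue;
        }
        size_t len = (c & 0xE0) == 0xC0 ? 2 :
                     (c & 0xF0) == 0xE0 ? 3 :
                     (c & 0xF8) == 0xF0 ? 4 : 0;
        if (len == 0 || i + len > n)
            return (size_t)-1;
        c &= 0xFFu >> (len + 1);                 /* payload bits of the lead byte */
        for (size_t k = 1; k < len; k++) {
            if ((src[i + k] & 0xC0) != 0x80)
                return (size_t)-1;
            c = (c << 6) | (src[i + k] & 0x3F);
        }
        if (c < 0x10000) {                       /* BMP: one UTF-16 unit */
            dst[j++] = (uint16_t)c;
        } else {                                 /* supplementary plane: surrogate pair */
            c -= 0x10000;
            dst[j++] = (uint16_t)(0xD800 | (c >> 10));
            dst[j++] = (uint16_t)(0xDC00 | (c & 0x3FF));
        }
        i += len;
    }
    return j;
}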
Post by Rob Cameron
u8u16-0.9 is available as open source software under an OSL 3.0 license
at http://u8u16.costar.sfu.ca/
Thanks. I'll take a look.

Rich
Rich Felker
2007-03-15 17:27:33 UTC
Permalink
Post by Rob Cameron
u8u16-0.9 is available as open source software under an OSL 3.0 license
at http://u8u16.costar.sfu.ca/
On second thought, I will not offer any further advice on this. The
website refers to "patent-pending technology". Software patents are
fundamentally wrong and unless you withdraw this nonsense you are an
enemy of Free Software, of programmers, and users in general, and
deserve to be ostracized by the community. Even if you intend to
license the patents obtained freely for use in Free Software, it's
still wrong to obtain them because it furthers a precedent that
software patents are valid, particularly stupid patents like "applying
vectorization in the obvious way to existing problem X".

Sorry for the harsh language but this is what you should expect when
you ask for advice from the Linux/Free Software community on
leveraging patents against them.

Sincerely,
Rich Felker
Simon Josefsson
2007-03-15 17:44:49 UTC
Permalink
Post by Rich Felker
Post by Rob Cameron
u8u16-0.9 is available as open source software under an OSL 3.0 license
at http://u8u16.costar.sfu.ca/
On second thought, I will not offer any further advice on this. The
website refers to "patent-pending technology". Software patents are
fundamentally wrong and unless you withdraw this nonsense you are an
enemy of Free Software, of programmers, and users in general, and
deserve to be ostracized by the community. Even if you intend to
license the patents obtained freely for use in Free Software, it's
still wrong to obtain them because it furthers a precedent that
software patents are valid, particularly stupid patents like "applying
vectorization in the obvious way to existing problem X".
Sorry for the harsh language but this is what you should expect when
you ask for advice from the Linux/Free Software community on
leveraging patents against them.
I downloaded the source, but I agree with your concerns and will not
look further. Note that the OSL 3.0 license contains language about
patents and gives us some rights to the patented technology. It is
not clear to me whether the author intended this.

I have to wonder where this technology would be relevant. Converting
between UTF-8 and UTF-16 is a very quick operation even if implemented
fairly poorly. Does anyone know where this operation is a serious
bottleneck worth optimizing? I'd like to see such analysis.
Remember, "premature optimization is the root of all evil"...

/Simon
Rob Cameron
2007-03-15 20:28:58 UTC
Permalink
Simon,

You asked about relevance. The UTF-8 to UTF-16 bottleneck
is widely cited in literature on XML processing performance.

For example, in SAX processing, Psaila finds that transcoding
takes > 50% of XML processing time:
Giuseppe Psaila, "On the Problem of Coupling Java Algorithms and XML
Parsers", 17th Int. Conf. on Database and Expert Systems Applications
(DEXA), 2006.
Post by Simon Josefsson
I downloaded the source, but I agree with your concerns and will not
look further. Note that the OSL 3.0 license contains language about
patents and gives us some rights to the patented technology. It is
not clear to me whether the author intended this.
I have to wonder where this technology would be relevant. Converting
between UTF-8 and UTF-16 is a very quick operation even if implemented
fairly poorly. Does anyone know where this operation is a serious
bottleneck worth optimizing? I'd like to see such analysis.
Remember, "premature optimization is the root of all evil"...
/Simon
--
Robert D. Cameron, Ph.D.
Professor of Computing Science
Simon Fraser University
William J Poser
2007-03-15 20:31:36 UTC
Permalink
Post by Rob Cameron
For example, in SAX processing, Psaila finds that transcoding
takes > 50% of XML processing time.
Granting that this means that UTF8 <-> UTF16 conversion is
the place to look if one wants to speed up SAX, is SAX
processing too slow? Another interpretation of such data is
that both transcoding and other aspects of XML processing
are sufficiently fast that optimizing them is not a concern.
Rich Felker
2007-03-15 21:47:42 UTC
Permalink
Post by Rob Cameron
Simon,
You asked about relevance. The UTF-8 to UTF-16 bottleneck
is widely cited in literature on XML processing performance.
And why would you do this? Simply keep the data as UTF-8. There's no
good reason for using UTF-16 at all; it's just a bad implementation
choice. IIRC either HTML or XML (yes I know they're different but I
forget which does it..) specifies that everything is UTF-16
internally, which is naturally a stupid thing to specify, but this can
in almost all cases simply be ignored because it's an internal
implementation detail that's not testably different.
Post by Rob Cameron
For example, in SAX processing, Psaila finds that transcoding
takes > 50% of XML processing time.
But isn't XML processing something like 1-5% of your total time for
whatever you're ultimately trying to do?

Rich
Colin Paul Adams
2007-03-16 21:59:06 UTC
Permalink
Rich> UTF-8. There's no good reason for using UTF-16 at all; it's
Rich> just a bad implementation choice. IIRC either HTML or XML
Rich> (yes I know they're different but I forget which does it..)

I don't ever recall seeing this in HTML, but it certainly isn't in
XML.
The only thing XML has to say on the subject is that XML parsers must
be able to read both.
--
Colin Adams
Preston Lancashire
Ben Wiley Sittler
2007-03-17 02:16:55 UTC
Permalink
I believe it's more "DHTML" that is the problem.

DOMString is specified to be UTF-16. Likewise for ECMAScript strings,
IIRC, although they may still be officially UCS-2.

In practice ECMAScript specifies (and implementations provide) such
minimal Unicode support that applications are basically on their own:
no canonicalization or character-class primitives (combining classes,
etc.), no way to work with characters rather than UCS-2 code units /
surrogate halves, no access to codecs other than UTF-8 <-> UTF-16
(often buggy and incomplete, and rarely able to deal with errors in
any way other than throwing exceptions), and no access to the Unicode
names database or the Unihan database.

Post by Colin Paul Adams
Rich> UTF-8. There's no good reason for using UTF-16 at all; it's
Rich> just a bad implementation choice. IIRC either HTML or XML
Rich> (yes I know they're different but I forget which does it..)
I don't ever recall seeing this in HTML, but it certainly isn't in
XML.
The only thing XML has to say on the subject is that XML parsers must
be able to read both.
--
Colin Adams
Preston Lancashire
Rich Felker
2007-03-17 07:45:23 UTC
Permalink
Post by Ben Wiley Sittler
I believe it's more "DHTML" that is the problem.
DOMString is specified to be UTF-16. Likewise for ECMAScript strings,
IIRC, although they may still be officially UCS-2.
Indeed, this was what I was thinking of. Thanks for clarifying. BTW,
any idea WHY they brought the UTF-16 nonsense to DOM/DHTML/etc.? As
far as I can tell there's no reason JS and such were restricted to
16bit types for characters; changing it to 32bit (or 21bit or
whatever) shouldn't be able to break anything... It's not like JS is a
systems programming language with pointers and type casts between
pointer types. It's just working with abstract character numbers.

I wonder if there's any hope that the madness will eventually be
fixed, or if we'll be stuck with UTF-16 forever here..

Rich
Colin Paul Adams
2007-03-17 08:05:23 UTC
Permalink
Rich> Indeed, this was what I was thinking of. Thanks for
Rich> clarifying. BTW, any idea WHY they brought the UTF-16
Rich> nonsense to DOM/DHTML/etc.?

I don't know for certain, but I can speculate well, I think.

DOM was a micros**t invention (and how it shows!). NT was UCS-2
(effectively).
--
Colin Adams
Preston Lancashire
Christopher Fynn
2007-03-17 12:25:59 UTC
Permalink
Post by Colin Paul Adams
Rich> Indeed, this was what I was thinking of. Thanks for
Rich> clarifying. BTW, any idea WHY they brought the UTF-16
Rich> nonsense to DOM/DHTML/etc.?
I don't know for certain, but I can speculate well, I think.
DOM was a micros**t invention (and how it shows!). NT was UCS-2
(effectively).
AFAIK Unicode was originally planned to be only a 16-bit encoding.
The Unicode Consortium and ISO 10646 then agreed to synchronize the
two standards - though originally Unicode was only going to be a 16-bit
subset of the UCS. A little after that, Unicode decided to support UCS
characters beyond plane 0.

Anyway, at the time NT was being designed (late eighties) Unicode was
supposed to be limited to < 65536 characters and UTF-8 hadn't been
thought of, so 16 bits probably seemed like a good idea.
Ben Wiley Sittler
2007-03-17 16:49:24 UTC
Permalink
just a hypothesis, but i'm guessing that at the time they put this
together, both major platforms (win32 and java) dealing with DOM used
ucs-2 (and now use utf-16) internally.

even today, win32 and java mostly do not use utf-8. the only form
widely supported outside of linux and unix systems is utf-8 with a
fictitious "byte order mark" (obviously as a byte-oriented encoding
this is useless) which is of course incompatible with tools used on
unix and linux, and with many web browsers. Notepad uses this form,
and Java uses a bunch of incompatible utf-8 "extensions" in its
serializations (incorrect encoding of NUL and incorrect encoding of
plane 1 ... plane 16 using utf-8 sequences corresponding to individual
surrogate codes). unfortunately this is perpetuated in several network
protocols, and is e.g. what one has to deal with when interfacing with
Oracle or MySQL.
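
for concreteness, the two incompatible forms i mean look like this
(byte values are from the unicode and java specs, shown here purely as
illustration, not taken from any of the systems above):

/* U+0000 and U+10000 in standard UTF-8 vs. Java "modified UTF-8" */
const unsigned char nul_utf8[]      = { 0x00 };                    /* U+0000, standard      */
const unsigned char nul_mutf8[]     = { 0xC0, 0x80 };              /* U+0000, overlong form */
const unsigned char u10000_utf8[]   = { 0xF0, 0x90, 0x80, 0x80 };  /* U+10000, standard     */
const unsigned char u10000_cesu8[]  = { 0xED, 0xA0, 0x80,          /* U+D800 high surrogate */
                                        0xED, 0xB0, 0x80 };        /* U+DC00 low surrogate  */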

even on mac os x, where utf-8 is the encoding used for unix-type
filesystem access, it's still not the default text encoding in
TextEdit, and utf-8 text files "don't work" (i.e. they open as
MacRoman or whatever Mac* encoding is paired with the OS language).
fortunately this is configurable; unfortunately changing it breaks all
sorts of other stuff (apps frequently still ship with macroman README
files, etc.)

so basically, if you want it to work i recommend switching to linux,
unix, plan 9, or similar :(
Post by Christopher Fynn
Post by Colin Paul Adams
Rich> Indeed, this was what I was thinking of. Thanks for
Rich> clarifying. BTW, any idea WHY they brought the UTF-16
Rich> nonsense to DOM/DHTML/etc.?
I don't know for certain, but I can speculate well, I think.
DOM was a micros**t invention (and how it shows!). NT was UCS-2
(effectively).
AFAIK Unicode was originally planned to be only a 16-bit encoding.
The Unicode Consortium and ISO 10646 then agreed to synchronize the
two standards - though originally Unicode was only going to be a 16-bit
subset of the UCS. A little after that, Unicode decided to support UCS
characters beyond plane 0.
Anyway, at the time NT was being designed (late eighties) Unicode was
supposed to be limited to < 65536 characters and UTF-8 hadn't been
thought of, so 16 bits probably seemed like a good idea.
Rich Felker
2007-03-18 01:47:33 UTC
Permalink
Post by Christopher Fynn
Post by Colin Paul Adams
Rich> Indeed, this was what I was thinking of. Thanks for
Rich> clarifying. BTW, any idea WHY they brought the UTF-16
Rich> nonsense to DOM/DHTML/etc.?
I don't know for certain, but I can speculate well, I think.
DOM was a micros**t invention (and how it shows!). NT was UCS-2
(effectively).
AFAIK Unicode was originally planned to be only a 16-bit encoding.
The Unicode Consortium and ISO 10646 then agreed to synchronize the
two standards - though originally Unicode was only going to be a 16-bit
subset of the UCS. A little after that, Unicode decided to support UCS
characters beyond plane 0.
Anyway, at the time NT was being designed (late eighties) Unicode was
supposed to be limited to < 65536 characters and UTF-8 hadn't been
thought of, so 16 bits probably seemed like a good idea.
While this is probably true, it's also beside the point. I wasn't
asking why Windows used UCS-2, but why JavaScript remained stuck on
the 16-bit idea even after the character set expanded -- since JS is a
pretty high-level language and the size of types is largely irrelevant,
redefining characters to be 32-bit integers shouldn't have broken
anything.

Rich

Daniel B.
2007-03-17 16:59:20 UTC
Permalink
Post by Rich Felker
Post by Ben Wiley Sittler
I believe it's more "DHTML" that is the problem.
DOMString is specified to be UTF-16. Likewise for ECMAScript strings,
IIRC, although they may still be officially UCS-2.
Indeed, this was what I was thinking of. Thanks for clarifying. BTW,
any idea WHY they brought the UTF-16 nonsense to DOM/DHTML/etc.?
Was it like Java, which was defined when Unicode was just 16 bits,
before the code space was expanded beyond 16 bits and 16-bit Unicode
effectively became UTF-16?


Daniel
--
Daniel Barclay
***@smart.net
SrinTuar
2007-03-15 21:46:35 UTC
Permalink
Post by Rob Cameron
You asked about relevance. The UTF-8 to UTF-16 bottleneck
is widely cited in literature on XML processing performance.
The problem then appears to be UTF-16. You have to spend an awful lot
less time transcoding if you simply leave the XML in utf-8, as it
should be. (Even if this means either dropping Java or proposing a
fix to the language.) UTF-16 is the worst possible choice of unicode
encoding, having all the downsides of the other choices and none of
their upsides.

This type of transcoding is the height of triviality, and I can only
imagine it being significant if implemented in a highly inefficient
manner. (Perhaps a naive implementation in native Java is not so
swift.)
Post by Rob Cameron
Giuseppe Psaila, "On the Problem of Coupling Java Algorithms and XML
Parsers", 17th Int. Conf. on Database and Expert Systems Applications
(DEXA), 2006.
The article does not seem to be freely readable.


In any case, a patent-pending vector implementation is not exactly interesting.
Rob Cameron
2007-03-15 18:43:51 UTC
Permalink
Rich,

I would agree that the abuse of software patents is fundamentally
wrong and that patent reform is highly overdue. I am doing
something about it.

However, I would prefer to see that discussion taken back to Groklaw.
We have already had two rounds with respect to International Characters'
draft Covenant Not to Assert, as well as my patent-based open source model:
http://www.cs.sfu.ca/~cameron/tech-transfer.html

The "obvious" application of vectorization to UTF-8 doesn't work,
because UTF-8 comes in variable-length chunks. Anyway, comments
on obviousness and prior art are invited as part of the Community Patent
Review process to which we've applied:
dotank.nyls.edu/communitypatent/
Post by Rich Felker
Post by Rob Cameron
u8u16-0.9 is available as open source software under an OSL 3.0 license
at http://u8u16.costar.sfu.ca/
On second thought, I will not offer any further advice on this. The
website refers to "patent-pending technology". Software patents are
fundamentally wrong and unless you withdraw this nonsense you are an
enemy of Free Software, of programmers, and users in general, and
deserve to be ostracized by the community. Even if you intend to
license the patents obtained freely for use in Free Software, it's
still wrong to obtain them because it furthers a precedent that
software patents are valid, particularly stupid patents like "applying
vectorization in the obvious way to existing problem X".
Sorry for the harsh language but this is what you should expect when
you ask for advice from the Linux/Free Software community on
leveraging patents against them.
Sincerely,
Rich Felker
--
Robert D. Cameron, Ph.D.
Professor of Computing Science
Simon Fraser University
Rich Felker
2007-03-15 19:46:58 UTC
Permalink
Post by Rob Cameron
Rich,
I would agree that the abuse of software patents is fundamentally
wrong and that patent reform is highly overdue. I am doing
something about it.
The use of software patents is the abuse of software patents. There is
no difference. Any software patent is fundamentally wrong.
Post by Rob Cameron
However, I would prefer to see that discussion taken back to Groklaw.
We have already had two rounds with respect to International Characters
draft Covenant Not to Assert as well as my patent-based open source model.
http://www.cs.sfu.ca/~cameron/tech-transfer.html
No matter how much you try to "help" 'open source' (a movement to
which I do not belong and do not want to belong) with patents,
reinforcing the software patent system and giving it legitimacy will
only hurt Free Software (and all programmers) more in the long run.
Why are you pursuing patents anyway? Do you even have a reason that
you're willing to share with us?
Post by Rob Cameron
The "obvious" application of vectorization to UTF-8 doesn't work,
because UTF-8 comes in variable length chunks.
Without reading your source, my "obvious" implementation would be a
sort of nondeterministic model of computing the decoding in all
alignments at once, and ensuring that an error flag accumulates for
invalid ones to allow only the valid ones to be kept. While UTF-8
chunks are variable length, UTF-8 has the very nice property that a
misaligned decode will never emulate a valid sequence. I thought this
out in under a minute, being moderately experienced in writing UTF-8
decoders. It's not rocket science.
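
A rough scalar sketch of what I mean (written just now and untested;
overlong and surrogate checks omitted) -- each position is classified
independently, so the same tests map directly onto one SIMD lane per
byte:

#include <stddef.h>

/* Mark each position that begins a structurally valid UTF-8 sequence.
 * valid[i] gets the sequence length (1..4), or 0 otherwise.  Misaligned
 * starts (continuation bytes) always get 0, which is the property that
 * makes "decode at every alignment, keep only the valid ones" work. */
void classify_starts(const unsigned char *s, size_t n, unsigned char *valid)
{
    for (size_t i = 0; i < n; i++) {
        unsigned char b = s[i];
        unsigned char len =
            b < 0x80           ? 1 :
            (b & 0xE0) == 0xC0 ? 2 :
            (b & 0xF0) == 0xE0 ? 3 :
            (b & 0xF8) == 0xF0 ? 4 : 0;
        for (size_t k = 1; len && k < len; k++)
            if (i + k >= n || (s[i + k] & 0xC0) != 0x80)
                len = 0;                 /* error flag accumulates for bad alignments */
        valid[i] = len;
    }
}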

I'm sure there are other approaches too. Maybe you have a somewhat
better one. In any case the madness of patenting "applying
vectorization to problem X", or in general "applying known tool Y to
problem X", has got to stop!

Rich
Dale Gulledge
2007-03-16 15:36:49 UTC
Permalink
Post by Rob Cameron
As part of my research program into high-speed XML/Unicode/text
processing using SIMD techniques, I have experimented extensively
with the UTF-8 to UTF-16 conversion problem. I've generally been
comparing performance of my software with that of iconv under
Linux and Mac OS X. Are there any substantially faster implementations
available? Currently our u8u16-0.9 software runs about 3X to 25X faster
than iconv depending on platform and data characteristics.
u8u16-0.9 is available as open source software under an OSL 3.0 license
at http://u8u16.costar.sfu.ca/
I can't vouch for the relative performance because I haven't used it,
but you might want to try ICU. The documentation for its character set
converters is here:

http://www.icu-project.org/userguide/codepageConverters.html
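
For the specific UTF-8 to UTF-16 case, a call would look roughly like
this (an untested sketch based on the ICU4C headers; I haven't
benchmarked it):

#include <unicode/ustring.h>
#include <unicode/utypes.h>
#include <stdio.h>

int main(void)
{
    const char *src = "UTF-8 caf\xC3\xA9";   /* UTF-8 input          */
    UChar dst[64];                           /* UTF-16 output buffer */
    int32_t dstLen = 0;
    UErrorCode status = U_ZERO_ERROR;

    u_strFromUTF8(dst, 64, &dstLen, src, -1, &status);
    if (U_FAILURE(status)) {
        fprintf(stderr, "conversion failed: %s\n", u_errorName(status));
        return 1;
    }
    printf("converted to %d UTF-16 code units\n", (int)dstLen);
    return 0;
}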