Discussion:
c++ strings and UTF-8 (other charsets)
Julien Claassen
2007-02-19 17:47:31 UTC
Permalink
Hello!
I've got one question. I'm writing a library in c++, which needs to handle
different character sets. I suppose for internal purposes UTF-8 is quite
sufficient. So is there a standard string class in the libstdc++ which
supports it?
Can I use something like:
printw(0,0,"%s",my_utf8_string.c_str());
with it?
Is there some kind of good, small example code of how to use libiconv most
efficiently with strings in c++?
Any good hints are appreciated! Thanks!
Kindest regards
Julien

--------
Music was my first love and it will be my last (John Miles)

======== FIND MY WEB-PROJECT AT: ========
http://ltsb.sourceforge.net
the Linux TextBased Studio guide
======= AND MY PERSONAL PAGES AT: =======
http://www.juliencoder.de
Rich Felker
2007-02-21 00:40:28 UTC
Permalink
Post by Julien Claassen
Hello!
I've got one question. I'm writing a library in c++, which needs to handle
different character sets. I suppose for internal purposes UTF-8 is quite
sufficient. So is there a standard string class in the libstdc++ which
supports it?
printw(0,0,"%s",my_utf8_string.c_str());
with it?
The whole point of UTF-8 is that it's usable directly as a normal
string. You don't need any special classes, just a normal string
class. If you want to add extra UTF-8-specific functionality you could
perhaps make a derived class.
Post by Julien Claassen
Is there some kind of good, small example code of how to use libiconv most
efficiently with strings in c++?
Not sure what you mean by most efficiently. If you're converting from
another encoding to UTF-8, I would just initially allocate some small
constant times the original size in the legacy encoding (3 times
should be sufficient; 4 times surely is), then use iconv to convert
into the allocated buffer, and subsequently resize it to free the
unused space if you care about space.
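Rich's sizing rule can be sketched in C++ roughly as follows. This is a minimal sketch assuming the POSIX iconv interface; `to_utf8` is a hypothetical helper name and error handling is kept to a bare minimum:

```cpp
#include <iconv.h>
#include <string>
#include <stdexcept>

// Convert `in` from the encoding named by `fromcode` to UTF-8,
// allocating 4x the input size up front (always enough for UTF-8)
// and shrinking the result afterwards, as described above.
std::string to_utf8(const std::string& in, const char* fromcode) {
    iconv_t cd = iconv_open("UTF-8", fromcode);
    if (cd == (iconv_t)-1)
        throw std::runtime_error("iconv_open failed");

    std::string out(in.size() * 4, '\0');
    char* inp = const_cast<char*>(in.data());
    size_t inleft = in.size();
    char* outp = &out[0];
    size_t outleft = out.size();

    size_t rc = iconv(cd, &inp, &inleft, &outp, &outleft);
    iconv_close(cd);
    if (rc == (size_t)-1)
        throw std::runtime_error("iconv conversion failed");

    out.resize(out.size() - outleft);  // free the unused tail
    return out;
}
```

The resize at the end only matters if the strings are long-lived; for short-lived conversions the over-allocation is harmless.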

Sorry my suggestions aren't very C++-specific. I only use C and am not
very fond of C++ so I'm not particularly familiar with it.
Post by Julien Claassen
Any good hints are appreciated! Thanks!
Hope this helps a little bit. If you have more specific questions feel
free to ask (on-list please).

Rich
Julien Claassen
2007-02-24 17:13:37 UTC
Permalink
Hi!
What I meant about UTF-8 strings in c++: I mean in c and c++ they're not
standard like in Java. I think UTF-8 is a variable-width multibyte charset, so
there are specific problems in handling them and allocating the right space. I
mean Glib contains something like ustring and Qt has its QString, which
I think are also UTF-8 capable.
Kindest regards and thanks for the hints so far.
Julien

Rich Felker
2007-02-25 23:57:57 UTC
Permalink
Post by Julien Claassen
Hi!
What I meant about UTF-8-strings in c++: I mean in c and c++ they're not
standard like in Java.
UTF-16, used by Java, is also variable-width. It can be either 2 bytes
or 4 bytes per character. Support for the characters that use 4 bytes
is generally very poor due to the misconception that it's
fixed-width.. :(
Post by Julien Claassen
I think UTF-8 is a variable-width multibyte charset, so
there are specific problems in handling them and allocating the right space. I
mean Glib contains something like ustring and Qt has its QString, which
I think are also UTF-8 capable.
All strings are UTF-8 capable; the unit of data is simply bytes
instead of characters. If you're looking for a class that treats
strings as a sequence of abstract characters rather than a sequence of
bytes, you could look for a library to do this or write your own.
However I suspect the most useful way to do this on C++ would be to
extend whatever standard byte-based string class you're using with a
derived class.

Maybe there's something like this built in to the C++ STL classes
already that I'm not aware of. As I said I don't know much of (modern)
C++. Can someone who knows the language better provide an answer?

It would also be easier to provide you answers if we knew better what
you're trying to do with the strings, i.e. whether you just need to
store them and spit them back in output, or whether you need to do
higher-level unicode processing like line breaks, collation,
rendering, etc.

Rich
Marcel Ruff
2007-02-26 07:10:59 UTC
Permalink
Post by Rich Felker
Post by Julien Claassen
Hi!
What I meant about UTF-8-strings in c++: I mean in c and c++ they're not
standard like in Java.
UTF-16, used by Java, is also variable-width. It can be either 2 bytes
or 4 bytes per character. Support for the characters that use 4 bytes
is generally very poor due to the misconception that it's
fixed-width.. :(
Post by Julien Claassen
I think UTF-8 is a variable-width multibyte charset, so
there are specific problems in handling them and allocating the right space. I
mean Glib contains something like ustring and Qt has its QString, which
I think are also UTF-8 capable.
As far as i know:

Using UTF-8 in C or C++ is very simple:
As UTF-8 may not contain '\0' you can simply use all
functions as before (strcmp(), std::string etc.).
Old code doesn't need to be ported.

The only place to take care is when interfacing with other libraries
that use wchar_t and such (UTF-16, UTF-32); there
you need to convert using functions like wcsrtombs(), mbsrtowcs(),
mbrtowc() and such.
This works well on Linux, Windows and other OSes.
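A minimal sketch of that conversion direction, using only the standard mbrtowc() interface; `widen_utf8` is a hypothetical helper name, and it assumes a UTF-8 locale has already been selected with setlocale():

```cpp
#include <cwchar>
#include <string>

// Decode a UTF-8 std::string into a std::wstring one character at a
// time with mbrtowc(), for handing to wchar_t-based interfaces.
std::wstring widen_utf8(const std::string& s) {
    std::wstring out;
    std::mbstate_t st = std::mbstate_t();
    const char* p = s.data();
    std::size_t left = s.size();
    while (left > 0) {
        wchar_t wc;
        std::size_t n = std::mbrtowc(&wc, p, left, &st);
        if (n == (std::size_t)-1 || n == (std::size_t)-2)
            break;          // invalid or truncated sequence
        if (n == 0) n = 1;  // an embedded NUL consumes one byte
        out.push_back(wc);
        p += n;
        left -= n;
    }
    return out;
}
```

The same loop in reverse (wcrtomb per character) covers the other direction when a wide-char API hands data back.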

Marcel
Post by Rich Felker
All strings are UTF-8 capable; the unit of data is simply bytes
instead of characters. If you're looking for a class that treats
strings as a sequence of abstract characters rather than a sequence of
bytes, you could look for a library to do this or write your own.
However I suspect the most useful way to do this on C++ would be to
extend whatever standard byte-based string class you're using with a
derived class.
Maybe there's something like this built in to the C++ STL classes
already that I'm not aware of. As I said I don't know much of (modern)
C++. Can someone who knows the language better provide an answer?
It would also be easier to provide you answers if we knew better what
you're trying to do with the strings, i.e. whether you just need to
store them and spit them back in output, or whether you need to do
higher-level unicode processing like line breaks, collation,
rendering, etc.
Rich
--
Linux-UTF8: i18n of Linux on all levels
Archive: http://mail.nl.linux.org/linux-utf8/
Stephane Bortzmeyer
2007-02-26 14:35:05 UTC
Permalink
On Mon, Feb 26, 2007 at 08:10:59AM +0100,
As UTF-8 may not contain '\0' you can simply use all functions as
before (strcmp(), std::string etc.).
As long as you just store or retrieve strings. If you compare them
(strcmp), you HAVE TO take normalization into account. If you measure
them (strlen), you HAVE TO use a character semantic, not a byte
semantic. And so on.
Old code doesn't need to be ported.
Very strange advice, indeed.
Rich Felker
2007-02-27 02:14:27 UTC
Permalink
Post by Stephane Bortzmeyer
On Mon, Feb 26, 2007 at 08:10:59AM +0100,
As UTF-8 may not contain '\0' you can simply use all functions as
before (strcmp(), std::string etc.).
As long as you just store or retrieve strings. If you compare them
(strcmp), you HAVE TO take normalization into account.
No you don't. Nothing in Unicode says that you must treat canonically
equivalent strings as identical, and in fact doing so is a bad idea in
most of the situations I've worked with. Unicode only says that you
should not assume that another process (in the Unicode sense of the
word "process") will treat them as being distinct.

If your particular application has a special need for normalization,
then yes you need to take it into account. But if you're doing
something like passing around filenames you most surely should not be
normalizing anything.
Post by Stephane Bortzmeyer
If you measure
them (strlen), you HAVE TO use a character semantic, not a byte
semantic. And so on.
Huh? Length in characters is basically useless to know. Length in
bytes and width of the text when rendered to a visual presentation are
both useful, but the only place where knowing length in number of
characters is useful is for fields that are limited to a fixed number
of characters. If the limit is for the sake of using a fixed-size
storage object, then this limit should just be changed to a limit in
bytes instead of in characters..
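A byte limit of that kind can be enforced without splitting a multibyte sequence in half. A sketch, assuming the input is valid UTF-8 (`truncate_utf8` is a hypothetical helper):

```cpp
#include <string>

// Truncate a UTF-8 string to at most `limit` bytes without cutting a
// character: back up past any continuation bytes (10xxxxxx) at the cut.
std::string truncate_utf8(const std::string& s, std::size_t limit) {
    if (s.size() <= limit) return s;
    std::size_t end = limit;
    while (end > 0 && (static_cast<unsigned char>(s[end]) & 0xC0) == 0x80)
        --end;  // don't leave a dangling partial sequence
    return s.substr(0, end);
}
```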
Post by Stephane Bortzmeyer
Old code doesn't need to be ported.
Very strange advice, indeed.
?? Hardly strange.. It depends on what the code does. See Markus
Kuhn's UTF-8 FAQ.

But Marcel is right about a lot of old code (just not all). Most code
doesn't care at all about the contents of the text, just that it's a
string.

Rich
Marcel Ruff
2007-02-27 11:10:16 UTC
Permalink
Post by Rich Felker
Post by Stephane Bortzmeyer
On Mon, Feb 26, 2007 at 08:10:59AM +0100,
As UTF-8 may not contain '\0' you can simply use all functions as
before (strcmp(), std::string etc.).
As long as you just store or retrieve strings. If you compare them
(strcmp), you HAVE TO take normalization into account.
No you don't. Nothing in Unicode says that you must treat canonically
equivalent strings as identical, and in fact doing so is a bad idea in
most of the situations I've worked with. Unicode only says that you
should not assume that another process (in the Unicode sense of the
word "process") will treat them as being distinct.
If your particular application has a special need for normalization,
then yes you need to take it into account. But if you're doing
something like passing around filenames you most surely should not be
normalizing anything.
Post by Stephane Bortzmeyer
If you measure
them (strlen), you HAVE TO use a character semantic, not a byte
semantic. And so on.
Huh? Length in characters is basically useless to know. Length in
bytes and width of the text when rendered to a visual presentation are
both useful, but the only place where knowing length in number of
characters is useful is for fields that are limited to a fixed number
of characters. If the limit is for the sake of using a fixed-size
storage object, then this limit should just be changed to a limit in
bytes instead of in characters..
Post by Stephane Bortzmeyer
Old code doesn't need to be ported.
Very strange advice, indeed.
?? Hardly strange.. It depends on what the code does. See Markus
Kuhn's UTF-8 FAQ.
But Marcel is right about a lot of old code (just not all). Most code
doesn't care at all about the contents of the text, just that it's a
string.
Thanks for all those details.

I can only tell that when i started to port a C and a C++ library to
support unicode
on Linux/Unix/Windows/WindowsCE I was totally lost with the heaps of
complicated
and confusing advice found on the internet (the reason why i joined this
mailing list).

But in the end everything was very simple:

1. UTF-8 does not contain zero bytes
2. Doing everything in UTF-8 and keeping my std::string and char* was a
very simple solution
3. I would need to define my own data types if i want to support UTF-16
(similar to xerces and all the others)
This would be a major effort.
4. Take care when passing the strings to other libraries / GUIs as
mentioned in my first post

Getting to the above *simple* insight took me several confused days,
after that the porting effort was done in one day.

I just wanted to share this to save others all the confusion,

Marcel
Post by Rich Felker
Rich
SrinTuar
2007-02-27 14:49:50 UTC
Permalink
Post by Stephane Bortzmeyer
Post by Marcel Ruff
Old code doesn't need to be ported.
Very strange advice, indeed.
You might want to read up on the history of UTF-8.
Not needing to make any code changes at all in most applications was in
fact one of the primary design goals of the encoding.
Post by Stephane Bortzmeyer
If you measure them (strlen), you HAVE TO use a character semantic,
not a byte semantic.

I have yet to encounter a case where a "character" count is useful.
Display length is sometimes useful, mostly in graphics or UI code, but
even then it has little to do with character count. 99.5% of the
times, strlen is used to determine storage requirements or buffer
length.
Post by Stephane Bortzmeyer
If you compare them (strcmp), you HAVE TO take normalization into account.
Hrm, I would say that is incorrect. You don't want to normalize input
most of the time.
When you are going to case-fold, perhaps for searching, it's almost
always all right to normalize. If you are a big fat word processor, or
an import/conversion tool, it's also okay. Most other programs are
better off not normalizing or even being aware of the concept, and
are better off assuming that their input is in a suitable format for
storage or output.
Rich Felker
2007-02-27 21:06:12 UTC
Permalink
Post by SrinTuar
Post by Stephane Bortzmeyer
Post by Marcel Ruff
Old code doesn't need to be ported.
Very strange advice, indeed.
You might want to read up on the history of UTF-8.
Here are some references for anyone wanting to do so:
http://www.cl.cam.ac.uk/~mgk25/ucs/UTF-8-Plan9-paper.pdf
http://www.cl.cam.ac.uk/~mgk25/ucs/utf-8-history.txt
Post by SrinTuar
Not needing to make any code changes at all in most applications was in
fact one of the primary design goals of the encoding.
I'd like to expand on and strengthen this statement a bit: the goal
was not just to avoid making code changes, but to avoid requirements
on text that would be fundamentally incompatible with some of the most
powerful tools in the unix model. UTF-16 (or at that time, UCS-2) not
only broke the API of standard C and unix; it also broke the
statelessness and robustness of text and the ability to treat it as
binary byte streams in pipes, etc. due to byte order issues and BOM.
This could have been avoided only by redefining the atomic data unit
(byte) to be 16 (or later 21 :) bits, which would in turn have
required scrapping and replacing every octet-based internet protocol..

Hopefully a good understanding of the history and motivations behind
UTF-8 makes it clear that UTF-8 is not (as Windows and Java fans try
to portray it) a backwards-compatibility hack, but instead a
fundamentally better encoding scheme which allows powerful unix data
processing principles to continue to be used with text. It's a shame
the history isn't better-known.

Rich
Daniel B.
2007-03-09 03:18:55 UTC
Permalink
SrinTuar wrote:
...
Post by SrinTuar
I have yet to encounter a case where a "character" count is useful.
Well, if an an editor the user tries to move forward three characters,
you probably want to increment a character count (an offset from
the beginning of the string).

(No, I don't know how dealing with glyphs instead of just characters
adds to that.)

Daniel
--
Daniel Barclay
***@smart.net
Rich Felker
2007-03-09 04:34:53 UTC
Permalink
Post by Daniel B.
....
Post by SrinTuar
I have yet to encounter a case where a "character" count is useful.
Well, if in an editor the user tries to move forward three characters,
you probably want to increment a character count (an offset from
the beginning of the string).
1. Normally you want to move locally by a (very) small integer number
of characters, e.g. 1, not to a particular character offset a long way
away. While the latter is a valid operation and is expensive in UTF-8
it has no practical applications that I know of except when all
characters occupy exactly one column and you’re trying to line up
columns. Relative seeking by n characters in UTF-8 is O(n),
independent of string length, so no problem for small relative cursor
motion like your example.
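Such a relative seek can be sketched by skipping UTF-8 continuation bytes (bytes of the form 10xxxxxx); `advance_chars` is a hypothetical helper and assumes valid UTF-8 input:

```cpp
#include <cstddef>
#include <string>

// Move a byte offset forward by `n` characters in a UTF-8 string.
// Cost is proportional to the distance moved, not the string length.
std::size_t advance_chars(const std::string& s, std::size_t pos, std::size_t n) {
    while (n > 0 && pos < s.size()) {
        ++pos;  // step past the lead byte
        while (pos < s.size() &&
               (static_cast<unsigned char>(s[pos]) & 0xC0) == 0x80)
            ++pos;  // skip continuation bytes of the same character
        --n;
    }
    return pos;
}
```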

2. Even in such an editor, normally the unit by which you want to move
by is “graphemes” and not “characters”. That is, if the cursor is
positioned prior to ‘ã’ (LATIN LETTER SMALL A + COMBINING TILDE) and
you press the right arrow, you probably want it to move past both
characters and not “between” the two. The concept of graphemes is
slightly more complex in Indic scripts. There’s also the cases of
Korean (decomposed Jamo), Tibetan (stacking letters), etc. which can
be treated logically just like the A-TILDE example above.
Post by Daniel B.
(No, I don't know how dealing with glyphs instead of just characters
adds to that.)
Hopefully the above answers a little bit of that uncertainty..

Rich

Keld Jørn Simonsen
2007-02-27 22:55:55 UTC
Permalink
Post by Stephane Bortzmeyer
On Mon, Feb 26, 2007 at 08:10:59AM +0100,
As UTF-8 may not contain '\0' you can simply use all functions as
before (strcmp(), std::string etc.).
As long as you just store or retrieve strings. If you compare them
(strcmp), you HAVE TO take normalization into account. If you measure
them (strlen), you HAVE TO use a character semantic, not a byte
semantic. And so on.
No you do not have to normalize the data to compare. That is, if you
follow ISO 14651/Unicode to compare at some precision, different from
absolute equality, the comparison will work for unnormalized data.
And that is the normal way of comparison anyway. Eg for looking after
a phrase in a document, you would normally do a case insensitive
comparison. And even if you do a case sensitive comparison you could use
14651 data or the data for your locale on unnormalized data.

The only catch is that 14882 does not provide an API for doing 14651
collating on different levels of precision. Maybe we could make such an
API, but probably in a future library TR.

best regards
keld
Daniel B.
2007-02-28 00:49:17 UTC
Permalink
Marcel Ruff wrote:
...
As UTF-8 may not contain '\0' ...
Yes it can.

Are you thinking of Java's _modified_ version of UTF-8
(http://en.wikipedia.org/wiki/UTF-8#Java)?


Daniel
Marcel Ruff
2007-02-28 09:52:56 UTC
Permalink
Post by Daniel B.
...
As UTF-8 may not contain '\0' ...
Yes it can.
Are you thinking of Java's _modified_ version of UTF-8
(http://en.wikipedia.org/wiki/UTF-8#Java)?
Oi oi oi, this complicates things again.

1. Serializing UTF-8 in Java over a socket and reading it in C/C++ as
UTF-8 could cause problems?
-> Is there a Java-UTF-8-standard conversion utility?

2. Using C UTF-8: When/how can it happen that a char* contains a '\0'
that is a character instead of
the end of the string?

thanks for some enlightenment,

Marcel
Post by Daniel B.
Daniel
Keld Jørn Simonsen
2007-02-28 10:10:52 UTC
Permalink
Post by Daniel B.
...
As UTF-8 may not contain '\0' ...
Yes it can.
yes, it can, but then it represents the character NULL.
And strings in C/C++ are not supposed to contain the NULL character.

best regards
keld
Daniel B.
2007-03-09 03:23:56 UTC
Permalink
Post by Keld Jørn Simonsen
Post by Daniel B.
...
As UTF-8 may not contain '\0' ...
Yes it can.
yes, it can, but then it represents the character NULL.
And strings in C/C++ are not supposed to contain the NULL character.
True, C strings can't contain a null byte other than the terminating
byte, so, since they can't contain a(ny other) null byte, they can't
represent the character NUL/NULL (in ASCII or standard UTF-8 encoding).

However, make sure you don't neglect to handle the fact that a
UTF-8 input stream (just like an ASCII input stream) can contain a
null byte (representing a NUL character).

(I don't know if this mailing list deals only with files in general
(which could contain null-byte representations of NULL characters)
or deals with restricted strings (e.g., strings used to name files,
which strings are defined to never contain a NULL character).)

Daniel
Rich Felker
2007-02-28 22:32:32 UTC
Permalink
Post by Daniel B.
....
As UTF-8 may not contain '\0' ...
Yes it can.
No, I think he just meant to say "a string of non-NUL _characters_ may
not contain a 0 _byte_". The NUL character is not valid "text" or a
valid part of a "string" in the POSIX sense of "text" or the C/POSIX
sense of "string".
Post by Daniel B.
Are you thinking of Java's _modified_ version of UTF-8
(http://en.wikipedia.org/wiki/UTF-8#Java)?
Uhg, disgusting...

BTW, note that ill-advised programs allowing NUL characters in text
where they do not belong often leads to vulnerabilities, like the
Firefox vuln just a few days ago.

Rich
William J Poser
2007-03-01 02:23:23 UTC
Permalink
Although a zero byte may not be part of a C string, it may
be part of a "character string literal". See section 6.4.5,
p. 62, of the C99 standard. "character string literals"
need not be strings.

Bill
Marcel Ruff
2007-03-01 08:38:49 UTC
Permalink
Post by William J Poser
Although a zero byte may not be part of a C string, it may
be part of a "character string literal". See section 6.4.5,
p. 62, of the C99 standard. "character string literals"
need not be strings.
Ok, so no danger here.

Thanks
Marcel
Post by William J Poser
Bill
Marcel Ruff
2007-03-01 08:40:19 UTC
Permalink
Post by Rich Felker
Post by Daniel B.
....
As UTF-8 may not contain '\0' ...
Yes it can.
No, I think he just meant to say "a string of non-NUL _characters_ may
not contain a 0 _byte_". The NUL character is not valid "text" or a
valid part of a "string" in the POSIX sense of "text" or the C/POSIX
sense of "string".
Yes, you describe my issue more precisely.

thanks
Marcel
Marcel Ruff
2007-03-01 08:41:44 UTC
Permalink
Post by Rich Felker
Post by Daniel B.
Are you thinking of Java's _modified_ version of UTF-8
(http://en.wikipedia.org/wiki/UTF-8#Java)?
Uhg, disgusting...
Yes - this is an open & serious issue for my approach!

Has anybody some practical advice on this?

Marcel
Rich Felker
2007-03-01 15:37:21 UTC
Permalink
Post by Marcel Ruff
Post by Rich Felker
Post by Daniel B.
Are you thinking of Java's _modified_ version of UTF-8
(http://en.wikipedia.org/wiki/UTF-8#Java)?
Uhg, disgusting...
Yes - this is an open & serious issue for my approach!
Has anybody some practical advice on this?
Just treat the sequence c0 80 according to the spec, as an invalid
sequence. Neither it (because it's illegal utf-8) nor a real NUL
(because it's illegal in text) should appear. If your problem is more
specific and there's a real reason you need to handle such data
differently, please describe what you're doing so we can offer better
advice.
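A strict decoder can detect the overlong two-byte case described above with a check like this (a sketch; `is_overlong_2byte` is a hypothetical helper):

```cpp
// A valid two-byte UTF-8 sequence is 110xxxxx 10xxxxxx with a lead
// byte of 0xC2 or higher; lead bytes 0xC0 and 0xC1 can only encode
// code points below U+0080 (e.g. Java's C0 80 for NUL), which real
// UTF-8 forbids as overlong.
bool is_overlong_2byte(unsigned char b1, unsigned char b2) {
    return (b1 == 0xC0 || b1 == 0xC1) && (b2 & 0xC0) == 0x80;
}
```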

Rich
Marcel Ruff
2007-03-01 18:53:52 UTC
Permalink
Post by Rich Felker
Post by Marcel Ruff
Post by Rich Felker
Post by Daniel B.
Are you thinking of Java's _modified_ version of UTF-8
(http://en.wikipedia.org/wiki/UTF-8#Java)?
Uhg, disgusting...
Yes - this is an open & serious issue for my approach!
Has anybody some practical advice on this?
Just treat the sequence c0 80 according to the spec, as an invalid
sequence. Neither it (because it's illegal utf-8) nor a real NUL
(because it's illegal in text) should appear. If your problem is more
specific and there's a real reason you need to handle such data
differently, please describe what you're doing so we can offer better
advice.
The first sentence from the above wiki says:

"In normal usage, the Java programming language
<http://en.wikipedia.org/wiki/Java_%28programming_language%29> supports
standard UTF-8 when reading and writing strings through
|InputStreamReader
<http://java.sun.com/javase/6/docs/api/java/io/InputStreamReader.html>|
and |OutputStreamWriter
<http://java.sun.com/javase/6/docs/api/java/io/OutputStreamWriter.html>"|

and this is what i do to access sockets, so no problems here.

But then it states that 'Supplementary multilingual plane' is encoded
incompatibly.
So must i assume if i send 'mathematical alphanumeric symbols'
http://en.wikipedia.org/wiki/Mathematical_alphanumeric_symbols
like 'ℝ' from C to java they will be corrupted?
Both applications work with what they think is 'UTF-8' ...

Marcel
Post by Rich Felker
Rich
Rich Felker
2007-03-02 07:43:32 UTC
Permalink
Post by Marcel Ruff
Post by Daniel B.
Are you thinking of Java's _modified_ version of UTF-8
(http://en.wikipedia.org/wiki/UTF-8#Java)?
"In normal usage, the Java programming language
<http://en.wikipedia.org/wiki/Java_%28programming_language%29> supports
standard UTF-8 when reading and writing strings through
|InputStreamReader
<http://java.sun.com/javase/6/docs/api/java/io/InputStreamReader.html>|
and |OutputStreamWriter
<http://java.sun.com/javase/6/docs/api/java/io/OutputStreamWriter.html>"|
and this is what i do to access sockets, so no problems here.
But then it states that 'Supplementary multilingual plane' is encoded
incompatibly.
Oh, you're talking about that part, not the NUL issue. Then yes, it's
a major problem. Java generates and processes bogus illegal UTF-8
(surrogates). I don't know if there are any easy workarounds except to
flame Sun to hell for being so stupid..
Post by Marcel Ruff
So must i assume if i send 'mathematical alphanumeric symbols'
http://en.wikipedia.org/wiki/Mathematical_alphanumeric_symbols
like 'ℝ' from C to java they will be corrupted?
ℝ is in the BMP, so no problem with it. It's just the huge pages of
random letters in every single font/style imaginable that are outside
the BMP. Of course various important CJK characters (needed for
writing certain names) and historical scripts are also outside the
BMP.
Post by Marcel Ruff
Both applications work with what they think is 'UTF-8' ...
Yes. And Java is wrong. However, according to the Wikipedia article
referenced, Java _does_ do the right thing in input and output
streams. It's only the object serialization stuff that uses the bogus
UTF-8. So I don't think you're likely to have problems in practice as
long as you don't try to pass this data off (which would be in binary
files anyway, I think...?) as UTF-8.

Rich
Marcel Ruff
2007-03-02 07:43:16 UTC
Permalink
Post by Rich Felker
Post by Marcel Ruff
Post by Daniel B.
Are you thinking of Java's _modified_ version of UTF-8
(http://en.wikipedia.org/wiki/UTF-8#Java)?
"In normal usage, the Java programming language
<http://en.wikipedia.org/wiki/Java_%28programming_language%29> supports
standard UTF-8 when reading and writing strings through
|InputStreamReader
<http://java.sun.com/javase/6/docs/api/java/io/InputStreamReader.html>|
and |OutputStreamWriter
<http://java.sun.com/javase/6/docs/api/java/io/OutputStreamWriter.html>"|
and this is what i do to access sockets, so no problems here.
But then it states that 'Supplementary multilingual plane' is encoded
incompatibly.
Oh, you're talking about that part, not the NUL issue. Then yes, it's
a major problem. Java generates and processes bogus illegal UTF-8
(surrogates). I don't know if there are any easy workarounds except to
flame Sun to hell for being so stupid..
Post by Marcel Ruff
So must i assume if i send 'mathematical alphanumeric symbols'
http://en.wikipedia.org/wiki/Mathematical_alphanumeric_symbols
like 'ℝ' from C to java they will be corrupted?
ℝ is in the BMP, so no problem with it. It's just the huge pages of
random letters in every single font/style imaginable that are outside
the BMP. Of course various important CJK characters (needed for
writing certain names) and historical scripts are also outside the
BMP.
Post by Marcel Ruff
Both applications work with what they think is 'UTF-8' ...
Yes. And Java is wrong. However, according to the Wikipedia article
referenced, Java _does_ do the right thing in input and output
streams. It's only the object serialization stuff that uses the bogus
UTF-8. So I don't think you're likely to have problems in practice as
long as you don't try to pass this data off (which would be in binary
files anyway, I think...?) as UTF-8.
Ok, thanks, so porting legacy C/C++ to unicode UTF-8 is simple :-)

Marcel