Markus Kuhn
2007-05-31 18:23:02 UTC
Let's say I live in a completely ISO 8859/etc.-free world, that I don't
care about the existence of any other character representation than
UTF-8, and that I am therefore absolutely not interested in any form of
character encoding conversion function.
How can I then switch between a "byte string" and a "character string"
in Perl without ever actually touching the stored bytes of the string?
All I want to change is the UTF-8 flag associated with a string, which
tells the regular expression engine, for example, whether /./ matches
just a single byte or an entire UTF-8 character.
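To make the goal concrete: the internal helpers Encode::_utf8_on() and
Encode::_utf8_off() (see the "Messing with Perl's Internals" section of
the Encode manpage) appear to do exactly this flag-only toggle, leaving
the stored bytes alone, so the effect I am after is:

  use Encode ();  # loaded only for the internal flag helpers

  my $s = "caf\xc3\xa9";     # the 5 UTF-8 octets of "café", flag off
  print length($s), "\n";    # 5 -- counted as bytes

  Encode::_utf8_on($s);      # flip the flag on; stored bytes unchanged
  print length($s), "\n";    # 4 -- now counted as UTF-8 characters

  Encode::_utf8_off($s);     # flag off again; still the same 5 bytes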
It seems the low-level Perl functions utf8::upgrade(),
utf8::downgrade(), utf8::encode(), and utf8::decode() (see "man 3 utf8")
are not usable, because they interpret and convert any binary string as
if it were an ISO 8859-1 string. I don't want to load any huge encoding
packages such as "use encoding 'utf8';" or "use Encode;", because I
don't need or want any character encoding conversion functions. All I
want to change is a simple flag. Unfortunately, the documentation is far
from clear on how to do this, and my experimentation leads to strange
results that look like strings going through several ISO 8859-1 to UTF-8
conversion steps (whereas I want zero of these).
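For example, if I read the documentation correctly, utf8::upgrade()
performs exactly the unwanted conversion:

  my $s = "caf\xc3\xa9";   # already the UTF-8 octets of "café"
  utf8::upgrade($s);       # treats each byte as an ISO 8859-1 character:
                           # \xc3 becomes \xc3\x83 and \xa9 becomes
                           # \xc2\xa9 internally, i.e. the string is now
                           # the UTF-8 encoding of "cafÃ©" -- one
                           # spurious ISO 8859-1 -> UTF-8 conversion step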
Any help?
Markus
--
Markus Kuhn, Computer Laboratory, University of Cambridge
http://www.cl.cam.ac.uk/~mgk25/ || CB3 0FD, Great Britain