Conversion-free switching between binary and character strings in Perl

Discussion:

Markus Kuhn

2007-05-31 18:23:02 UTC

Let's say I live in a completely ISO 8859/etc.-free world, that I don't
care about the existence of any character representation other than
UTF-8, and that I am therefore absolutely not interested in any form of
character encoding conversion function.

How can I then switch between a "byte string" and a "character string"
in Perl without ever actually touching the stored bytes of the string?
All I want to change is the UTF-8 flag associated with a string that
tells the regular expression engine, for example, whether /./ matches
just a single byte or an entire UTF-8 character.

It seems the low-level Perl functions utf8::upgrade(),
utf8::downgrade(), utf8::encode(), and utf8::decode() (see "man 3 utf8")
are not usable, because they interpret and convert any binary string as
if it were an ISO 8859-1 string. I don't want to load any huge encoding
packages such as "use encoding 'utf8';" or "use Encode;", because I
neither need nor want any character encoding conversion functions. All I
want to change is a simple flag. Unfortunately, the documentation is far
from clear on how to do this, and my experimentation leads to strange
results that look like strings going through several ISO 8859-1 to UTF-8
conversion steps (whereas I want zero of these).

Any help?

Markus

--
Markus Kuhn, Computer Laboratory, University of Cambridge
http://www.cl.cam.ac.uk/~mgk25/ || CB3 0FD, Great Britain

Egmont Koblinger

2007-05-31 18:35:10 UTC

Post by Markus Kuhn
How can I then switch between a "byte string" and a "character string"

I guess you're looking for Encode::_utf8_{on,off}
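
A minimal sketch of what these two functions do, assuming the stored
bytes are already well-formed UTF-8. The Encode documentation marks both
as internal, and _utf8_on() does not validate its argument, so this is an
illustration rather than a recommendation:

```perl
use strict;
use warnings;
use Encode ();

# Two raw bytes: the UTF-8 encoding of U+00A9 COPYRIGHT SIGN.
my $s = pack("C2", 0xc2, 0xa9);
print length($s), "\n";        # 2 (counted as bytes)

Encode::_utf8_on($s);          # set the flag; the stored bytes are untouched
print length($s), "\n";        # 1 (counted as characters)
printf "U+%04X\n", ord($s);    # U+00A9

Encode::_utf8_off($s);         # clear the flag again
print length($s), "\n";        # 2
```

Only the interpretation of the buffer changes between the calls, which
is why length() alternates between 2 and 1.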

--
Egmont

Markus Kuhn

2007-05-31 19:43:13 UTC

Post by Egmont Koblinger

Post by Markus Kuhn
How can I then switch between a "byte string" and a "character string"

I guess you're looking for Encode::_utf8_{on,off}

Looks good, but can't get this to work either:

#!/usr/bin/perl
use Encode;
$s = pack("C2", 0xc2, 0xa9); # binary string containing COPYRIGHT SIGN
print "length=", length($s),"\n"; # gives 2
print "utf8=", Encode::is_utf8($s),"\n"; # gives false
# Convert non-ASCII UTF-8 into XML numeric character reference
$s =~ s/([\xc0-\xdf][\x80-\xbf]|[\xe0-\xef][\x80-\xbf]{2}|[\xf0-\xf7][\x80-\xbf]{3})/Encode::_utf8_on($1),sprintf("&#x%02X;", ord($1))/ge;
print "$s\n"; # we want to see here: &#xA9;

$ ./test.pl
length=2
utf8=
&#xC2;

Is there something special about $1 inside a s/.../.../ge expression
that prevents the application of Encode::_utf8_on($1)?

Seems so, since

$s =~ s/([\xc0-\xdf][\x80-\xbf]|[\xe0-\xef][\x80-\xbf]{2}|[\xf0-\xf7][\x80-\xbf]{3})/$a = $1,Encode::_utf8_on($a),sprintf("&#x%02X;", ord($a))/ge;

does the trick.
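
The same workaround can be sketched more readably with a lexical copy
instead of the global $a (which sort() also uses). The helper name
utf8_to_ncr is hypothetical, and the pattern is the loose one from above,
so it assumes the input really is well-formed UTF-8:

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical helper: replace each multi-byte UTF-8 sequence in a byte
# string with an XML numeric character reference.
sub utf8_to_ncr {
    my ($bytes) = @_;
    $bytes =~ s{([\xc0-\xdf][\x80-\xbf]
               |[\xe0-\xef][\x80-\xbf]{2}
               |[\xf0-\xf7][\x80-\xbf]{3})}{
        my $char = $1;             # copy into an ordinary scalar first
        Encode::_utf8_on($char);   # reinterpret the copied bytes as UTF-8
        sprintf("&#x%02X;", ord($char));
    }gex;
    return $bytes;
}

print utf8_to_ncr(pack("C2", 0xc2, 0xa9)), "\n";   # &#xA9;
```

The copy matters because the flag is set on a scalar that owns its own
value, rather than on the match variable.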

Markus

--
Markus Kuhn, Computer Laboratory, University of Cambridge
http://www.cl.cam.ac.uk/~mgk25/ || CB3 0FD, Great Britain

Larry Wall

2007-05-31 20:22:59 UTC

On Thu, May 31, 2007 at 08:43:13PM +0100, Markus Kuhn wrote:
: Is there something special about $1 inside a s/.../.../ge expression
: that prevents the application of Encode::_utf8_on($1)?
:
: Seems so, since
:
: $s =~ s/([\xc0-\xdf][\x80-\xbf]|[\xe0-\xef][\x80-\xbf]{2}|[\xf0-\xf7][\x80-\xbf]{3})/$a = $1,Encode::_utf8_on($a),sprintf("&#x%02X;", ord($a))/ge;
:
: does the trick.

Yes, in Perl 5 a magical variable like $1 is essentially a tied
reference into the middle of another string, and not a real value
in its own right, so when you read its value it copies out the
substring and ignores any flags you might have set on the original
scalar variable, since it thinks $1 is a read-only variable. (And,
in fact, assigning to $1 complains about what it sees as an attempt
to modify a read-only variable, but _utf8_on() is not checking to
see if the scalar is considered writeable.) But if it didn't simply
ignore the flag when copying out the value, you would have succeeded
in setting the utf8 flag for *all* $1 in your program, because Perl 5
only has one global $1 variable that interrogates the "current match"
every time you read it.
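
A small sketch of the difference between $1 and a copy of it, under the
assumption that the subject string is a plain byte string. It only reads
$1 and never calls _utf8_on() on it, since what that call does to $1
itself may depend on the Perl version:

```perl
use strict;
use warnings;
use Encode ();

my $bytes = "\xc2\xa9 \xc3\xa9";   # UTF-8 bytes for U+00A9, a space, U+00E9

$bytes =~ /([\xc0-\xdf][\x80-\xbf])/ or die "no match";
my $copy = $1;                     # snapshot of the first match
Encode::_utf8_on($copy);           # the flag lives on $copy only

$bytes =~ / ([\xc0-\xdf][\x80-\xbf])/ or die "no match";

# $1 is derived afresh from the most recent successful match on each read.
printf "copy: U+%04X, length %d\n", ord($copy), length($copy);
printf "\$1:   0x%02X, length %d\n", ord($1), length($1);
```

The copy keeps both its value and its flag (one character, U+00A9),
while $1 now yields the two raw bytes of the second match.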

In theory this should all work better in Perl 6, where match variables
are properly lexically scoped, and $1 is just an alias into the list of
matches contained in the current match variable, so the identity of
each match can be preserved. (Along with the fact that Perl 6 treats
byte strings and character strings as fundamentally different types
that must not be confused with each other.)

Larry