On Thu, Mar 29, 2007 at 12:10:10PM -0400, SrinTuar wrote:
: >And after reading much of the earlier discussion, I must say that,
: >while I love UTF-8 dearly, it's usually the wrong abstraction level
: >to be working at for most text-processing jobs. Ordinary people
: >not steeped in systems programming and C culture just want to think
: >in graphemes, and so that's what the standard dialect of Perl 6 will
: >default to. A small nudge will push it into supporting the graphemes
: >of a particular human language. The people who want to think more
: >like a computer will also find ways to do that. It's just not the
: >sweet spot we're aiming for.
:
: Very interesting, though I must admit I'm sad to hear that. Over the years
: I have come to find that what I see as the sweet spot for string processing
: would be this definition of a utf-8 "string":
:
: "a null terminated series of bytes, some of which are parts of
: valid utf-8 sequences, and others which are treated as individial
: binary values"
I think that definition is essentially insane.
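To make the abstraction levels concrete, here is a small sketch (in Python, purely for illustration; the point is language-independent) of one user-perceived character that is two codepoints and three bytes, and of how a single stray binary byte makes the whole buffer undecodable:

```python
# "e" + COMBINING ACUTE ACCENT: one grapheme, stored in NFD form
e_acute = "e\u0301"
print(len(e_acute))                  # 2 codepoints
print(len(e_acute.encode("utf-8")))  # 3 bytes
# A grapheme-level language would call this length 1.

# Append one raw binary byte and the buffer stops being valid UTF-8:
mixed = e_acute.encode("utf-8") + b"\xff"
try:
    mixed.decode("utf-8")
except UnicodeDecodeError as err:
    print("not valid UTF-8:", err.reason)
```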
: Effectively, it's a bare-minimum update of what we had with ascii. The
: only time I want to depart from this paradigm is when I have to. But
: in general, I want to avoid conversion and keep my strings in this
: format as much as possible. (This is much like the way tools such as
: readline, vim, and curses handle utf-8 strings.)
We tried the bare minimum with Perl 5, and it was insane. Then we tried
a little more than the bare minimum, and it was a little less insane.
: Most code should not have to care which parts are valid and which are
: not. If they call a function which requires a specific level of
: validation, that function should be free to complain when that is not
: the case. But I don't see a reason why a plain "print" should ever
: need to care or complain about what it's printing. All it has to do is
: catenate and dump bytes out; I don't think it should be a bouncer of
: what is kosher or not for printing.
The way to get most of your functions to not have to care is to be
very careful to validate at the boundaries of your program, and not
throw away type information between the parts of your program.
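That boundary discipline can be sketched in a few lines of Python (the function names here are hypothetical, not from any Perl design): decode and validate exactly once on the way in, keep genuine text in the middle, encode exactly once on the way out.

```python
def read_message(raw: bytes) -> str:
    # Boundary: validate and decode exactly once.  Bad input fails
    # loudly here, not deep inside the program.
    return raw.decode("utf-8")

def shout(msg: str) -> str:
    # Interior code works on real text; no byte-level caveats needed.
    return msg.upper()

def write_message(msg: str) -> bytes:
    # Boundary: encode exactly once on the way out.
    return msg.encode("utf-8")

wire_in = "caf\u00e9".encode("utf-8")
wire_out = write_message(shout(read_message(wire_in)))
print(wire_out)  # b'CAF\xc3\x89'
```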
: I think a regex engine should, for example, match one binary byte to a
: "." the same way it would match a valid sequence of unicode characters
: and composing characters as a single grapheme. This is a best effort to
: work with the string as provided, and someone who does not want such
: behavior would not run regexes over such strings.
How can it possibly know whether to match a binary byte or a grapheme
if you've mixed UTF-8 and binary in the same string? The short answer
is: "It can't." The long answer is "It can't unless you supply it
with type information out of band." Which for historical reasons C
programmers don't seem to mind doing, since C basically doesn't have
a clue what a string is, let alone what it might contain or how long
it might be. And null termination has turned out to be a terrible
workaround (in security terms as well as efficiency) for not knowing
the length. C's head-in-the-sand approach to string processing is
directly responsible for many of the security breaches on the net.
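The ambiguity is easy to demonstrate. In this Python sketch, the same two bytes are valid UTF-8 for one character and valid Latin-1 for two; nothing in the bytes themselves records which reading was intended:

```python
data = b"\xc3\xa9"  # UTF-8 for "é", or Latin-1 for "Ã©"?

print(data.decode("utf-8"))    # é    -- one character
print(data.decode("latin-1"))  # Ã©   -- two characters
# Both decodes succeed; the type information has to live out of band.
```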
There's a good place for byte-oriented serialization and deserialization,
and it's usually at the boundaries of your program, in well-crafted
protocol stacks. If you find yourself doing that sort of thing in
the middle of a program, it's often a code smell that says you should
be refactoring and maybe dereinventing some wheel or other.
: When a program needs to take in data from various different encodings,
: it should be their job to convert that data into their locale's native
: encoding. (by reading mime headers or whatever mechanism) I don't
: think a programming language should have built-ins that track the
: status of a string, as that strikes me as an attempt to DWIM and not
: DWIS.
I'd much rather have a language where I can say what I mean.
: Taking that trend to its logical conclusion, I would not want every
: scalar value to track every possible kind of validation that has
: happened to a string: "utf-8, validated NFD, turkish + korean". If
: someone wants to do language specific case folding they can either
: default to the locale's language+encoding, or else specify which ones
: they want to use. If someone wants to make sure their string is valid
: utf-8 in NFKC, they can pass it to a validation routine such as
: Unicode::Normalize::NFKC. But the input and output of that routine
: should be a plain old scalar, with no special knowledge of what has
: happened to it.
I think this attitude is just sweeping the problem under someone
else's carpet.
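For reference, the kind of explicit, pass-it-yourself normalization the quoted text describes looks like this with Python's stdlib, unicodedata.normalize playing the role of Unicode::Normalize::NFKC:

```python
import unicodedata

s = "\ufb01le"  # "file" spelled with the U+FB01 "fi" ligature
print(unicodedata.is_normalized("NFKC", s))  # False
print(unicodedata.normalize("NFKC", s))      # file
# The result is a plain string; nothing on it records that it was
# normalized, exactly as the quoted text advocates.
```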
: This minimal approach is much like what happens in C/C++, and I don't
: see any reason why a scripting language should do more than it is
: asked to and in the process potentially do the wrong thing despite its
: best intentions.
Then by all means do your scripting in a language that more closely
resembles C or C++. But I see no reason for Perl to be just another
version of C. We already have lots of those... :-)
It's just my gut-level feeling that the traditional world of C, Unix,
locales, etc. simply does not provide appropriate abstractions to deal
with internationalization. Yes, you can get there if you throw enough
libraries and random functions and macros and pipes and filters at it,
but the basic abstractions leak like a sieve. It's time to clean it
all up.
: Admittedly, in perl 5 these are trivial annoyances
: with readily available workarounds. From your post I guess I can
: assume that perl 6 will be similar.
I certainly believe in giving people enough rope to shoot themselves in
the foot. But with Perl 6 you at least have to ask nicely for the rope. :-)
: On a separate topic:
: Java seems to have a much worse problem. Forcing conversion to utf-16
: causes you to lose information, since utf-16 cannot represent all the
: possible invalid utf-8 sequences. It forces you to treat your strings as
: binary blobs and lose access to all the functions that operate on
: strings, and/or take a performance hit for conversion where none is
: actually needed. (If the design goal of Java was to force utf-16 on
: the world, they are unlikely to succeed at it, as utf-8 has largely
: usurped its place.)
I don't think it's Perl 6's place to force either utf-8 or utf-16 or
utf-whatever on anyone. If the abstractions are sane and properly
encapsulated, the implementors can do whatever makes sense behind
the scenes, and that very likely means different things in different
contexts.
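As a footnote to the lossiness complaint quoted above: Python's surrogateescape error handler (PEP 383) shows one way a language can round-trip invalid bytes through its text type, by smuggling them as lone surrogates and re-emitting them byte-for-byte on output:

```python
raw = b"ok\xff\xfe"  # not valid UTF-8
s = raw.decode("utf-8", "surrogateescape")
print(repr(s))                                      # 'ok\udcff\udcfe'
print(s.encode("utf-8", "surrogateescape") == raw)  # True
```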
I try hard not to be a linguistic imperialist (when I try at all). :-)
Anyway, if anyone wants to give me specific feedback on the current
design of Perl 6, that'd be cool. Though perl6-***@perl.org would
probably be a better forum for that.
Larry