Discussion:
Questions about Unicode-aware C programs under Linux
Ali Majdzadeh
2007-04-16 08:03:26 UTC
Permalink
Hello All
Sorry if my questions are elementary. As far as I know, the size of the wchar_t data
type (in glibc) is compiler and platform dependent. What is the best practice
for writing portable Unicode-aware C programs? Is it good practice to use
Unicode literals directly in a C program? I have experienced some problems
with glibc's wide character string functions, and I want to know whether there is any
standard way of programming, or a standard template, for writing a Unicode-aware C
program. By the way, my native language is Persian. I am working on a C
program which reads a Persian text file, parses it and generates an XML
document. For this, there are lots of tasks that require library
functions (e.g. wcscpy(), wcsstr(), wcscmp(), fgetws(), fwprintf(), ...),
and, as I mentioned earlier, I have experienced some odd problems using
them (e.g. wcsstr() never succeeds in matching two wchar_t * Persian
strings).

Best Regards
Ali
Rich Felker
2007-04-16 17:23:57 UTC
Permalink
Post by Ali Majdzadeh
Hello All
Sorry if my questions are elementary. As far as I know, the size of the wchar_t data
type (in glibc) is compiler and platform dependent. What is the best practice
for writing portable Unicode-aware C programs? Is it good practice to use
Unicode literals directly in a C program?
It depends on the degree of portability you want. Using them in wide
strings is not entirely portable (it depends on the character encoding
the compiler uses at translation time), but using them in UTF-8 strings
is (they're just byte sequences).
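For example, a minimal sketch of the difference (assuming the source file itself is saved as UTF-8; the Persian word is only illustrative):

#include <stdio.h>
#include <wchar.h>

int main (void)
{
    /* UTF-8 literal spelled out as bytes ("salam", U+0633 U+0644 U+0627 U+0645):
       portable, because it is just a byte sequence no matter what the compiler
       thinks the source encoding is. */
    const char *utf8 = "\xd8\xb3\xd9\x84\xd8\xa7\xd9\x85";

    /* Wide literal of the same word: the wchar_t values depend on the
       compiler's wide execution character set, so this form is less
       portable (on glibc it happens to be UCS-4). */
    const wchar_t *wide = L"\u0633\u0644\u0627\u0645";

    printf ("%s\n", utf8);
    (void) wide;
    return 0;
}
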
Post by Ali Majdzadeh
I have experienced some problems
with glibc's wide character string functions, and I want to know whether there is any
standard way of programming, or a standard template, for writing a Unicode-aware C
program. By the way, my native language is Persian. I am working on a C
program which reads a Persian text file, parses it and generates an XML
document.
If your application is Persian-specific, then you're completely
entitled to assume the text encoding is UTF-8 and that the system is
capable of dealing with UTF-8 and Unicode. Will there be any Persian-specific
text processing, though, or do you just want to pass Persian text
through?
Post by Ali Majdzadeh
For this, there are lots of tasks that require library
functions (e.g. wcscpy(), wcsstr(), wcscmp(), fgetws(), fwprintf(), ...),
and, as I mentioned earlier, I have experienced some odd problems using
them (e.g. wcsstr() never succeeds in matching two wchar_t * Persian
strings).
wcsstr doesn't care about encoding or Unicode semantics or anything.
It just looks for binary substring matches, just like strstr but using
wchar_t instead of char as the unit.

Overall I'd suggest ignoring the wchar_t functions; the wide stdio
functions in particular are problematic. Using UTF-8 is just as easy, and
then your strings are directly usable for input and output to/from
text files, the command line, etc.

Rich
Ali Majdzadeh
2007-04-17 06:16:44 UTC
Permalink
Hello Rich
Thanks for your response.
About your question, I should say "yes", I need some text processing
capabilities.
Do you mean that I should use the common stdio functions (like fgets(), ...)?
And what about UTF-8 strings? Do you mean that these strings should be
stored in common char*
variables? And what about the character size difference (Unicode and ASCII)?
And what about the string functions (like strtok())?
Sorry, I am new to the issue.

Best Regards
Ali
Rich Felker
2007-04-17 06:29:13 UTC
Permalink
Post by Ali Majdzadeh
Hello Rich
Thanks for your response.
About your question, I should say "yes", I need some text processing
capabilities.
OK.
Post by Ali Majdzadeh
Do you mean that I should use the common stdio functions (like fgets(), ...)?
Yes, they'll work fine.
Post by Ali Majdzadeh
And what about UTF-8 strings? Do you mean that these strings should be
stored in common char*
Yes.
Post by Ali Majdzadeh
variables? And what about the character size difference (Unicode and ASCII)?
And what about the string functions (like strtok())?
strtok, strsep, strchr, strrchr, strpbrk, strspn, and strcspn will all
work just fine on UTF-8 strings as long as the separator characters
you're looking for are ASCII.

strstr always works on UTF-8, and can be used in place of strchr to
search for single non-ASCII characters or longer substrings.
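A small sketch of both points, assuming a UTF-8 locale and a source file saved as UTF-8 (the Persian words are only examples):

#include <stdio.h>
#include <string.h>

int main (void)
{
    /* The separators (space and comma) are ASCII bytes, which can never
       occur inside a UTF-8 multibyte sequence, so strtok splits safely. */
    char line[] = "سلام دنیا, خداحافظ";
    char *tok;
    for (tok = strtok (line, " ,"); tok; tok = strtok (NULL, " ,"))
        printf ("token: %s\n", tok);

    /* strstr is a plain byte-wise substring search, so it also finds
       non-ASCII substrings. */
    if (strstr ("خداحافظ دنیا", "دنیا"))
        puts ("substring found");

    return 0;
}
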

Rich
Ali Majdzadeh
2007-04-17 08:28:36 UTC
Permalink
Hi Rich
Thanks a lot for your response.
I am going to test it. Thanks.

Best Regards
Ali
Ali Majdzadeh
2007-04-17 08:47:19 UTC
Permalink
Hello Rich
Sorry again.
I wrote a simple C program following your guidelines, but unfortunately it does
not work correctly. The program is as follows:

#include <stdio.h>
#include <errno.h>
#include <stdlib.h>
#include <string.h>
#include <locale.h>
#include <langinfo.h>


int main (int argc, char *argv[])
{
    FILE *input_file;
    char buffer[1024];

    if (!setlocale (LC_CTYPE, ""))
    {
        fprintf (stderr, "Locale not specified. Check LC_ALL, LC_CTYPE or LANG.\n");
        return EXIT_FAILURE;
    }

    if (!(input_file = fopen ("./in.txt", "r")))
    {
        fprintf (stderr, "Could not open file : %s\n", strerror (errno));
        return EXIT_FAILURE;
    }

    fgets (buffer, sizeof (buffer), input_file);
    fprintf (stdout, "%s", buffer);

    return EXIT_SUCCESS;
}

The program does not print the line read from the file to stdout correctly
(only junk is printed). I also used "cat ./persian.txt | iconv -t utf-8 > in.txt" to
produce a UTF-8 encoded file.

Best Regards
Ali
Ali Majdzadeh
2007-04-17 10:55:47 UTC
Permalink
Hi Rich
Sorry, I managed to solve the problem. You were right.
There are still some minor problems: string literals do not always match
the strings read from a file exactly, so the string comparison functions
fail. I am going to investigate it.
Thanks a lot

Best Regards
Ali
Rich Felker
2007-04-17 15:00:33 UTC
Permalink
Post by Ali Majdzadeh
The program does not print the line read from the file to stdout correctly
(only junk is printed). I also used "cat ./persian.txt | iconv -t utf-8 > in.txt" to
produce a UTF-8 encoded file.
If your native encoding is not UTF-8 then of course sending UTF-8 to
stdout is not going to result in something directly legible. I was
assuming you were using UTF-8 everywhere, which you should be doing on
any modern unix system...
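As a rough sketch, a program can check this up front and warn instead of silently printing junk (nl_langinfo is POSIX, not plain ISO C):

#include <langinfo.h>
#include <locale.h>
#include <stdio.h>
#include <string.h>

int main (void)
{
    setlocale (LC_CTYPE, "");
    const char *charset = nl_langinfo (CODESET);
    if (strcmp (charset, "UTF-8") != 0)
        fprintf (stderr, "warning: locale charset is %s, not UTF-8; "
                 "UTF-8 output will not display correctly\n", charset);
    return 0;
}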

Rich
Ali Majdzadeh
2007-04-17 15:17:48 UTC
Permalink
Hi Rich
Thanks for your attention. I do use UTF-8, but the files I am dealing with
are encoded in a strange encoding, so I used iconv to convert them
into UTF-8. By the way, another question: if all those stdio.h and
string.h functions work well with UTF-8 strings, as they actually do,
what would be the reason to use wchar_t and the wchar_t-aware functions?

Best Regards
Ali
Rich Felker
2007-04-17 15:48:30 UTC
Permalink
Post by Ali Majdzadeh
Hi Rich
Thanks for your attention. I do use UTF-8, but the files I am dealing with
are encoded in a strange encoding, so I used iconv to convert them
into UTF-8. By the way, another question: if all those stdio.h and
string.h functions work well with UTF-8 strings, as they actually do,
what would be the reason to use wchar_t and the wchar_t-aware functions?
There is a mix of reasons, but most stem from the fact that the
Japanese designed some really bad encodings for their language prior
to UTF-8, which are almost impossible to use in a standard C
environment. At the time, the ANSI/ISO C committee thought that it
would be necessary to avoid using char strings directly for
multilingual text purposes, and was setting up to transition to
wchar_t strings; however, this was very incomplete. Note that C has no
support for using wchar_t strings as filenames, and likewise POSIX has
no support for using them for anything having to do with interfacing
with the system or library in places where strings are needed. Thus
there was going to be a dichotomy where multilingual text would be a
special case only available in some places, while system stuff,
filenames, etc. would have to be ASCII. UTF-8 does away with that
dichotomy.

The main remaining use of wchar_t is that, if you wish to write
portable C applications which work on many different text encodings
(both UTF-8 and legacy) depending on the system's or user's locale,
you can use mbrtowc/wcrtomb and related functions when it's necessary
to determine the identity of a particular character in a char string,
then use the isw* functions to ask questions like: Is it alphabetic?
Is it printable? etc.
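A rough sketch of that approach, reading its text from the command line so it stays independent of any particular encoding:

#include <locale.h>
#include <stdio.h>
#include <string.h>
#include <wchar.h>
#include <wctype.h>

int main (int argc, char *argv[])
{
    setlocale (LC_CTYPE, "");

    /* Command-line arguments arrive in the locale's encoding, whatever it is. */
    const char *s = argc > 1 ? argv[1] : "abc 123";
    size_t len = strlen (s);
    size_t i = 0;
    mbstate_t st;
    memset (&st, 0, sizeof st);

    while (i < len) {
        wchar_t wc;
        size_t n = mbrtowc (&wc, s + i, len - i, &st);
        if (n == (size_t) -1 || n == (size_t) -2)
            break;                      /* invalid or incomplete sequence */
        printf ("%d-byte character: alphabetic=%d printable=%d\n",
                (int) n, iswalpha (wc) != 0, iswprint (wc) != 0);
        i += n;
    }
    return 0;
}
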

On modern C systems (indicated by the presence of the
__STDC_ISO_10646__ preprocessor symbol), wchar_t will be Unicode
UCS-4, so if you're willing to sacrifice some degree of portability,
you can use the values from the mb/wc conversions functions directly
as Unicode character numbers for lookup in fonts, character data
tables, etc.
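For example, a small sketch that relies on that guarantee to print a character's Unicode code point (the two bytes shown assume a UTF-8 locale):

#include <locale.h>
#include <stdio.h>
#include <stdlib.h>

int main (void)
{
    setlocale (LC_CTYPE, "");
#ifdef __STDC_ISO_10646__
    wchar_t wc;
    /* In a UTF-8 locale these bytes decode to U+0633 (ARABIC LETTER SEEN). */
    const char seen[] = "\xd8\xb3";
    if (mbtowc (&wc, seen, sizeof seen - 1) > 0)
        printf ("U+%04lX\n", (unsigned long) wc);
#else
    puts ("wchar_t values are not guaranteed to be Unicode on this system");
#endif
    return 0;
}
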

Another option is to use the iconv() API (part of the Single Unix
Specification, not plain ISO C) to convert between the locale's
encoding and UTF-8 or UCS-4 if you need to make sure your data is in a
particular form.
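A rough sketch of the call sequence (error handling kept minimal; the input text is just a placeholder):

#include <iconv.h>
#include <langinfo.h>
#include <locale.h>
#include <stdio.h>
#include <string.h>

int main (void)
{
    setlocale (LC_CTYPE, "");

    /* Convert from whatever the locale uses to UTF-8. */
    iconv_t cd = iconv_open ("UTF-8", nl_langinfo (CODESET));
    if (cd == (iconv_t) -1) {
        perror ("iconv_open");
        return 1;
    }

    char in[] = "placeholder text in the locale's encoding";
    char out[256];
    char *inp = in, *outp = out;
    size_t inleft = strlen (in), outleft = sizeof out - 1;

    if (iconv (cd, &inp, &inleft, &outp, &outleft) == (size_t) -1)
        perror ("iconv");
    *outp = '\0';
    printf ("%s\n", out);

    iconv_close (cd);
    return 0;
}
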

However, for natural language work where there's a reasonable
expectation that any user of the software would be using UTF-8 as
their encoding already, IMO it makes sense to just assume you're
working in a UTF-8 environment. Some may disagree on this.

Hope this helps.

Rich
Ali Majdzadeh
2007-04-17 18:55:52 UTC
Permalink
Hello Rich
Thanks a lot. That was really a nice clarification of different aspects of
the issue, and sorry again if my questions were so elementary. But for me,
that was a nice discussion and I learned a lot.
Thank you so much.

Best Regards
Ali
SrinTuar
2007-04-16 17:16:11 UTC
Permalink
The best advice you can get is to steer clear of wide characters;
you should never need any of the wide character functions.
Keep the data in your program internally represented as UTF-8.
The standard byte-oriented "strlen", "strcpy", "strstr", "printf", etc.
work fine with UTF-8.

XML uses UTF-8 by default as well, so little if any conversion between
encodings should be needed. You may have to convert your input from a
legacy encoding to UTF-8, or you could just convert it externally with
something such as this:

cat inputfile | iconv -t utf-8 | myprogram

Being "unicode aware" is trivial in this fashion.