Discussion:
Questions about Unicode-aware C programs under Linux
Ali Majdzadeh
2007-04-16 08:03:26 UTC
Permalink
Hello All
Sorry if my questions are elementary. As far as I know, the size of the wchar_t data
type (in glibc) is compiler and platform dependent. What is the best practice
for writing portable Unicode-aware C programs? Is it good practice to use
Unicode literals directly in a C program? I have experienced some problems
with glibc's wide character string functions, and I want to know whether there is any
standard way of programming, or a standard template, for writing a Unicode-aware C
program. By the way, my native language is Persian. I am working on a C
program which reads a Persian text file, parses it and generates an XML
document. For this, there are lots of tasks that require library
functions (e.g. wcscpy(), wcsstr(), wcscmp(), fgetws(), fwprintf(), ...),
and, as I mentioned earlier, I have experienced some odd problems using
them (e.g. wcsstr() never succeeds in matching two wchar_t * Persian
strings).

Best Regards
Ali
Rich Felker
2007-04-16 17:23:57 UTC
Permalink
Post by Ali Majdzadeh
Hello All
Sorry if my questions are elementary. As far as I know, the size of the wchar_t data
type (in glibc) is compiler and platform dependent. What is the best practice
for writing portable Unicode-aware C programs? Is it good practice to use
Unicode literals directly in a C program?
It depends on the degree of portability you want. Using them in wide
strings is not entirely portable (it depends on the character encoding
the compiler uses at translation time), but using them in UTF-8 strings
is (they're just byte sequences).
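For example, a minimal sketch of the difference (assuming the source file itself is saved as UTF-8; the Persian word is only illustrative):

#include <stdio.h>
#include <wchar.h>

int main (void)
{
    /* UTF-8 literal spelled out as bytes ("salam", U+0633 U+0644 U+0627 U+0645):
       portable, because it is just a byte sequence no matter what the compiler
       thinks the source encoding is. */
    const char *utf8 = "\xd8\xb3\xd9\x84\xd8\xa7\xd9\x85";

    /* Wide literal of the same word: the wchar_t values depend on the
       compiler's wide execution character set, so this form is less
       portable (on glibc it happens to be UCS-4). */
    const wchar_t *wide = L"\u0633\u0644\u0627\u0645";

    printf ("%s\n", utf8);
    (void) wide;
    return 0;
}
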
Post by Ali Majdzadeh
I have experienced some problems
with glibc's wide character string functions, and I want to know whether there is any
standard way of programming, or a standard template, for writing a Unicode-aware C
program. By the way, my native language is Persian. I am working on a C
program which reads a Persian text file, parses it and generates an XML
document.
If your application is Persian-specific, then you're completely
entitled to assume the text encoding is UTF-8 and that the system is
capable of dealing with UTF-8 and Unicode. Will there be any Persian-specific
text processing, though, or do you just want to pass Persian text
through?
Post by Ali Majdzadeh
For this, there are lots of tasks that require library
functions (e.g. wcscpy(), wcsstr(), wcscmp(), fgetws(), fwprintf(), ...),
and, as I mentioned earlier, I have experienced some odd problems using
them (e.g. wcsstr() never succeeds in matching two wchar_t * Persian
strings).
wcsstr doesn't care about encoding or Unicode semantics or anything.
It just looks for binary substring matches, just like strstr but using
wchar_t instead of char as the unit.

Overall I'd suggest ignoring the wchar_t functions; the wide stdio
functions in particular are problematic. Using UTF-8 is just as easy, and
then your strings are directly usable for input and output to/from
text files, the command line, etc.

Rich
Ali Majdzadeh
2007-04-17 06:16:44 UTC
Permalink
Hello Rich
Thanks for your response.
About your question, I should say "yes", I need some text processing
capabilities.
Do you mean that I should use the common stdio functions (like fgets(), ...)?
And what about UTF-8 strings? Do you mean that these strings should be
stored in common char*
variables? And what about the character size difference (Unicode and ASCII)?
And what about the string functions (like strtok())?
Sorry, I am new to the issue.

Best Regards
Ali
Rich Felker
2007-04-17 06:29:13 UTC
Permalink
Post by Ali Majdzadeh
Hello Rich
Thanks for your response.
About your question, I should say "yes", I need some text processing
capabilities.
OK.
Post by Ali Majdzadeh
Do you mean that I should use the common stdio functions (like fgets(), ...)?
Yes, they'll work fine.
Post by Ali Majdzadeh
And what about UTF-8 strings? Do you mean that these strings should be
stored in common char*
Yes.
Post by Ali Majdzadeh
variables? And what about the character size difference (Unicode and ASCII)?
And what about the string functions (like strtok())?
strtok, strsep, strchr, strrchr, strpbrk, strspn, and strcspn will all
work just fine on UTF-8 strings as long as the separator characters
you're looking for are ASCII.

strstr always works on UTF-8, and can be used in place of strchr to
search for single non-ASCII characters or longer substrings.
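A small sketch of both points, assuming a UTF-8 locale and a source file saved as UTF-8 (the Persian words are only examples):

#include <stdio.h>
#include <string.h>

int main (void)
{
    /* The separators (space and comma) are ASCII bytes, which can never
       occur inside a UTF-8 multibyte sequence, so strtok splits safely. */
    char line[] = "سلام دنیا, خداحافظ";
    char *tok;
    for (tok = strtok (line, " ,"); tok; tok = strtok (NULL, " ,"))
        printf ("token: %s\n", tok);

    /* strstr is a plain byte-wise substring search, so it also finds
       non-ASCII substrings. */
    if (strstr ("خداحافظ دنیا", "دنیا"))
        puts ("substring found");

    return 0;
}
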

Rich
Ali Majdzadeh
2007-04-17 08:28:36 UTC
Permalink
Hi Rich
Thanks a lot for your response.
I am going to test it. Thanks.

Best Regards
Ali
Ali Majdzadeh
2007-04-17 08:47:19 UTC
Permalink
Hello Rich
Sorry again.
I wrote a simple C program following your guidelines, but unfortunately it does
not work correctly. The program is as follows:

#include <stdio.h>
#include <errno.h>
#include <stdlib.h>
#include <string.h>
#include <locale.h>
#include <langinfo.h>


int main (int argc, char *argv[])
{
    FILE *input_file;
    char buffer[1024];

    if (!setlocale (LC_CTYPE, ""))
    {
        fprintf (stderr, "Locale not specified. Check LC_ALL, LC_CTYPE or LANG.\n");
        return EXIT_FAILURE;
    }

    if (!(input_file = fopen ("./in.txt", "r")))
    {
        fprintf (stderr, "Could not open file : %s\n", strerror (errno));
        return EXIT_FAILURE;
    }

    fgets (buffer, sizeof (buffer), input_file);
    fprintf (stdout, "%s", buffer);

    return EXIT_SUCCESS;
}

The program does not print the line read from the file to stdout correctly
(only junk is printed). I also used "cat ./persian.txt | iconv -t utf-8 > in.txt" to
produce a UTF-8 encoded file.

Best Regards
Ali
Ali Majdzadeh
2007-04-17 10:55:47 UTC
Permalink
Hi Rich
Sorry, I managed to solve the problem. You were right.
There are still some minor problems: string literals do not always match
the strings read from a file exactly, so the string comparison functions
fail. I am going to investigate it.
Thanks a lot

Best Regards
Ali
Rich Felker
2007-04-17 15:00:33 UTC
Permalink
Post by Ali Majdzadeh
The program does not print the line read from the file to stdout correctly
(only junk is printed). I also used "cat ./persian.txt | iconv -t utf-8 > in.txt" to
produce a UTF-8 encoded file.
If your native encoding is not UTF-8 then of course sending UTF-8 to
stdout is not going to result in something directly legible. I was
assuming you were using UTF-8 everywhere, which you should be doing on
any modern unix system...
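As a rough sketch, a program can check this up front and warn instead of silently printing junk (nl_langinfo is POSIX, not plain ISO C):

#include <langinfo.h>
#include <locale.h>
#include <stdio.h>
#include <string.h>

int main (void)
{
    setlocale (LC_CTYPE, "");
    const char *charset = nl_langinfo (CODESET);
    if (strcmp (charset, "UTF-8") != 0)
        fprintf (stderr, "warning: locale charset is %s, not UTF-8; "
                 "UTF-8 output will not display correctly\n", charset);
    return 0;
}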

Rich
Ali Majdzadeh
2007-04-17 15:17:48 UTC
Permalink
Hi Rich
Thanks for your attention. I do use UTF-8, but the files I am dealing with
are encoded in a strange encoding, so I used iconv to convert them
into UTF-8. By the way, another question: if all those stdio.h and
string.h functions work well with UTF-8 strings, as they actually do,
what would be the reason to use wchar_t and the wchar_t-aware functions?

Best Regards
Ali
Rich Felker
2007-04-17 15:48:30 UTC
Permalink
Post by Ali Majdzadeh
Hi Rich
Thanks for your attention. I do use UTF-8, but the files I am dealing with
are encoded in a strange encoding, so I used iconv to convert them
into UTF-8. By the way, another question: if all those stdio.h and
string.h functions work well with UTF-8 strings, as they actually do,
what would be the reason to use wchar_t and the wchar_t-aware functions?
There is a mix of reasons, but most stem from the fact that the
Japanese designed some really bad encodings for their language prior
to UTF-8, which are almost impossible to use in a standard C
environment. At the time, the ANSI/ISO C committee thought that it
would be necessary to avoid using char strings directly for
multilingual text purposes, and was setting up to transition to
wchar_t strings; however, this was very incomplete. Note that C has no
support for using wchar_t strings as filenames, and likewise POSIX has
no support for using them for anything having to do with interfacing
with the system or library in places where strings are needed. Thus
there was going to be a dichotomy where multilingual text would be a
special case only available in some places, while system stuff,
filenames, etc. would have to be ASCII. UTF-8 does away with that
dichotomy.

The main remaining use of wchar_t is that, if you wish to write
portable C applications which work on many different text encodings
(both UTF-8 and legacy) depending on the system's or user's locale,
you can use mbrtowc/wcrtomb and related functions when it's necessary
to determine the identity of a particular character in a char string,
then use the isw* functions to ask questions like: Is it alphabetic?
Is it printable? etc.
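A rough sketch of that approach, reading its text from the command line so it stays independent of any particular encoding:

#include <locale.h>
#include <stdio.h>
#include <string.h>
#include <wchar.h>
#include <wctype.h>

int main (int argc, char *argv[])
{
    setlocale (LC_CTYPE, "");

    /* Command-line arguments arrive in the locale's encoding, whatever it is. */
    const char *s = argc > 1 ? argv[1] : "abc 123";
    size_t len = strlen (s);
    size_t i = 0;
    mbstate_t st;
    memset (&st, 0, sizeof st);

    while (i < len) {
        wchar_t wc;
        size_t n = mbrtowc (&wc, s + i, len - i, &st);
        if (n == (size_t) -1 || n == (size_t) -2)
            break;                      /* invalid or incomplete sequence */
        printf ("%d-byte character: alphabetic=%d printable=%d\n",
                (int) n, iswalpha (wc) != 0, iswprint (wc) != 0);
        i += n;
    }
    return 0;
}
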

On modern C systems (indicated by the presence of the
__STDC_ISO_10646__ preprocessor symbol), wchar_t will be Unicode
UCS-4, so if you're willing to sacrifice some degree of portability,
you can use the values from the mb/wc conversions functions directly
as Unicode character numbers for lookup in fonts, character data
tables, etc.
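For example, a small sketch that relies on that guarantee to print a character's Unicode code point (the two bytes shown assume a UTF-8 locale):

#include <locale.h>
#include <stdio.h>
#include <stdlib.h>

int main (void)
{
    setlocale (LC_CTYPE, "");
#ifdef __STDC_ISO_10646__
    wchar_t wc;
    /* In a UTF-8 locale these bytes decode to U+0633 (ARABIC LETTER SEEN). */
    const char seen[] = "\xd8\xb3";
    if (mbtowc (&wc, seen, sizeof seen - 1) > 0)
        printf ("U+%04lX\n", (unsigned long) wc);
#else
    puts ("wchar_t values are not guaranteed to be Unicode on this system");
#endif
    return 0;
}
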

Another option is to use the iconv() API (part of the Single Unix
Specification, not plain ISO C) to convert between the locale's
encoding and UTF-8 or UCS-4 if you need to make sure your data is in a
particular form.
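A rough sketch of the call sequence (error handling kept minimal; the input text is just a placeholder):

#include <iconv.h>
#include <langinfo.h>
#include <locale.h>
#include <stdio.h>
#include <string.h>

int main (void)
{
    setlocale (LC_CTYPE, "");

    /* Convert from whatever the locale uses to UTF-8. */
    iconv_t cd = iconv_open ("UTF-8", nl_langinfo (CODESET));
    if (cd == (iconv_t) -1) {
        perror ("iconv_open");
        return 1;
    }

    char in[] = "placeholder text in the locale's encoding";
    char out[256];
    char *inp = in, *outp = out;
    size_t inleft = strlen (in), outleft = sizeof out - 1;

    if (iconv (cd, &inp, &inleft, &outp, &outleft) == (size_t) -1)
        perror ("iconv");
    *outp = '\0';
    printf ("%s\n", out);

    iconv_close (cd);
    return 0;
}
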

However, for natural language work where there's a reasonable
expectation that any user of the software would be using UTF-8 as
their encoding already, IMO it makes sense to just assume you're
working in a UTF-8 environment. Some may disagree on this.

Hope this helps.

Rich
Ali Majdzadeh
2007-04-17 18:55:52 UTC
Permalink
Hello Rich
Thanks a lot. That was really a nice clarification of different aspects of
the issue, and sorry again if my questions were so elementary. But for me,
that was a nice discussion and I learned a lot.
Thank you so much.

Best Regards
Ali
SrinTuar
2007-04-16 17:16:11 UTC
Permalink
The best advice you can get is to steer clear of wide characters;
you should never need any of the wide character functions.
Keep the data in your program internally represented as UTF-8.
The standard byte-oriented "strlen", "strcpy", "strstr", "printf", etc.
work fine with UTF-8.

XML uses UTF-8 by default as well, so little if any conversion between
encodings should be needed. You may have to convert your input from a
legacy encoding to UTF-8, or you could just convert it externally with
something such as this:

cat inputfile | iconv -t utf-8 | myprogram

Being "unicode aware" is trivial in this fashion.