Re: Unicode support - Mailing list pgsql-odbc

From	Marko Ristola
Subject	Re: Unicode support
Date	September 8, 2005 14:34:37
Msg-id	4320736A.10308@kolumbus.fi
In response to	Re: Unicode support ("Dave Page" <dpage@vale-housing.co.uk>)
List	pgsql-odbc

Marc Herbert wrote:

>Marko Ristola <Marko.Ristola@kolumbus.fi> writes:
>
>
>>So I ask you, how you have thought about these things:
>>
>>If I have understood Windows correctly, it uses UCS-2 as its internal
>>Unicode character set. Linux prefers UTF-8.
>>
>>
>
>I am not sure what you mean by "internal UNICODE character set", but I
>understand that Linux does prefer UTF-32, NOT UTF-8 !
>
>
>

If you want to know the details of the UTF-8 encoding, the following
Linux manual page is recommended reading :)

man utf-8

It gives a good explanation of how UTF-8 encodes characters.

UTF-8 uses from one to four bytes per character.
It can encode every Unicode character, and Unicode covers almost all
of the character sets in the world.
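
To make the byte counts concrete, here is a minimal sketch, assuming
Python 3.8 or later. The helper name utf8_length and the sample
characters are chosen only for illustration.

```python
def utf8_length(ch: str) -> int:
    """Return how many bytes UTF-8 needs to encode one character."""
    return len(ch.encode("utf-8"))


# One sample from each length class: ASCII, Latin-1 range,
# rest of the Basic Multilingual Plane, and beyond it.
samples = ["A", "\u00e4", "\u20ac", "\U0001F600"]

for ch in samples:
    encoded = ch.encode("utf-8")
    print(f"U+{ord(ch):04X}: {utf8_length(ch)} byte(s), hex {encoded.hex(' ')}")
```

Plain ASCII stays at one byte, and only characters outside the Basic
Multilingual Plane need all four.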

Because the task is so huge, there are variants and bugs in
the implementations. That is what I read in the Samba filesystem
FAQ.

So, if you stick with the Windows implementation, you will not notice
any bugs, but when you move the file to another operating system,
the file might look different :(

UCS-2 is a 16-bit fixed-width Unicode encoding, which matches the
16-bit wchar_t type on Windows. According to the Linux manuals,
wchar_t is not the same on all implementations; with glibc on Linux
it is 32 bits wide. The manuals recommend that C programs store
UTF-8 strings in binary files and convert them to wchar_t at
runtime. The Java language is another story, though the same
problems might occur there. The code point number remains the same,
but if you draw the character in a window with different
implementations, you might get different drawings.
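
The platform dependence of wchar_t, and the idea of storing UTF-8 and
converting at runtime, can be sketched in Python with the standard
ctypes module. This is only an illustration under that assumption, not
ODBC driver code; the function name decode_stored_utf8 is hypothetical.

```python
import ctypes

# Width in bytes of the C wchar_t type on the platform running this script.
# It differs between platforms, which is why it is measured, not assumed.
WCHAR_BYTES = ctypes.sizeof(ctypes.c_wchar)


def decode_stored_utf8(raw: bytes) -> str:
    """Convert UTF-8 bytes, as they would sit in a file, to a string."""
    return raw.decode("utf-8")


stored = "\u00e4\u20ac".encode("utf-8")     # portable on-disk form, 5 bytes
text = decode_stored_utf8(stored)           # runtime form
wide = ctypes.create_unicode_buffer(text)   # wchar_t array plus terminator

print("wchar_t width in bytes:", WCHAR_BYTES)
print("UTF-8 bytes stored:", len(stored))
print("wchar_t buffer size in bytes:", ctypes.sizeof(wide))
```

The stored bytes are identical everywhere; only the size of the
in-memory wchar_t buffer changes with the platform.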

>On all platforms I had a look at, variable-length encodings are only
>for disk and network, never used in memory.
>
>Don't you agree?
>
>
> locale
LANG=fi_FI.UTF-8@euro
LC_CTYPE="fi_FI.UTF-8@euro"
LC_NUMERIC="fi_FI.UTF-8@euro"
LC_TIME="fi_FI.UTF-8@euro"
LC_COLLATE="fi_FI.UTF-8@euro"
LC_MONETARY="fi_FI.UTF-8@euro"
LC_MESSAGES="fi_FI.UTF-8@euro"
LC_PAPER="fi_FI.UTF-8@euro"
LC_NAME="fi_FI.UTF-8@euro"
LC_ADDRESS="fi_FI.UTF-8@euro"
LC_TELEPHONE="fi_FI.UTF-8@euro"
LC_MEASUREMENT="fi_FI.UTF-8@euro"
LC_IDENTIFICATION="fi_FI.UTF-8@euro"
LC_ALL=

So, UTF-8 is used a great deal under Linux nowadays.
Just as Windows recommends that everybody move to
native Windows Unicode characters (UCS-2), under Linux
the recommendation is to move to UTF-8. Both are Unicode
character encodings. UCS-2 is just simpler: each character is
one fixed-width integer holding its numerical value.
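
As a rough sketch of that simplicity, assuming Python 3: for
characters in the Basic Multilingual Plane the 16-bit unit is the code
point itself. The function name ucs2_units is made up for this
example, and the sketch does not check for the reserved surrogate
range.

```python
import struct


def ucs2_units(text: str) -> list:
    """Return one 16-bit unit per character; fail outside the BMP."""
    units = []
    for ch in text:
        code_point = ord(ch)
        if code_point > 0xFFFF:
            raise ValueError(f"U+{code_point:X} does not fit in UCS-2")
        units.append(code_point)
    return units


text = "\u00e4\u20ac"
units = ucs2_units(text)

# For BMP characters, UTF-16 little-endian bytes hold the same 16-bit values.
from_utf16 = list(struct.unpack(f"<{len(text)}H", text.encode("utf-16-le")))
print([hex(u) for u in units], units == from_utf16)
```

A character above U+FFFF cannot be written this way at all, which is
the price of the fixed width.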

The reason for the popularity of UTF-8 under Linux is that each
program needs very little adjustment to move from a
LATIN1-style encoding to UTF-8.
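
One reason the adjustment is small is that ASCII text has the same
bytes in both encodings; only the characters above 0x7F change. A
minimal sketch, assuming Python 3, with the hypothetical helper name
latin1_to_utf8:

```python
def latin1_to_utf8(raw: bytes) -> bytes:
    """Re-encode LATIN1 (ISO 8859-1) bytes as UTF-8."""
    return raw.decode("latin-1").encode("utf-8")


ascii_bytes = b"SELECT 1"
accented = b"\xe4"  # the letter a with diaeresis in LATIN1

# ASCII bytes pass through unchanged; the accented letter grows to two bytes.
print(latin1_to_utf8(ascii_bytes) == ascii_bytes)
print(latin1_to_utf8(accented).hex(" "))
```

Code that only looks for ASCII bytes such as quotes, slashes or
newlines therefore keeps working on UTF-8 data.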

Happy studying about Unicode character sets :)

Regards,
Marko Ristola
