Thread: Encoding, Unicode, locales, etc.

Encoding, Unicode, locales, etc.

From
Carlos Moreno
Date:
Hi,

Even though I *think* I have a fairly clear understanding of encoding
and locale principles, I'm somewhat unsure of how some of the tools
available with PostgreSQL apply  (or rather, how should they be used).

1)  The way I understand it, encoding (character set) and locale are two
different things.  Yet, I see that initdb allows me to specify the default
encoding (I can override it when creating the database) for the databases
that I create later, and also the locale(s).  Why is it that the database
cluster is resrticted to a single locale (or single set of locales) instead
of being configurable on a per-database basis?

2)  On the same token (more or less), I have a test database, for which
I ran initdb without specifying encoding or locale;  then, I create a
database with UTF8 encoding.  Then, from a psql console on the same
Linux machine that is running the server, I try lower of a string that
contains characters with accents  (e.g., Spanish or French characters),
and it works as it should according to Spanish or French rules --- it
returns a string with the same characters in lowecase, with the same
accent.  Why did that work?  My Linux machine has all en_US.UTF-8
locales, and en_US is not even aware of characters with accents, so it
doesn't seem like it's taking by default the encoding from the OS.  (is it
simply that the case is "too obvious" so by default case conversion
does the "obvious" thing?)

I have several other details in which I'm not too clear, but perhaps
with any responses or pointers that I might get for the above, it would
clarify the whole confusion?

BTW, I did read the online PG documentation --- the section
localization;  the thing is, with everything that I read in there, my
reaction was more or less "ok, I knew that";  that is, it states facts
for which I know (or at least I think I know) the "theory", but it did
not clarify how to use the given tools.

Thanks,

Carlos
--


Re: Encoding, Unicode, locales, etc.

From
Tom Lane
Date:
Carlos Moreno <moreno_pg@mochima.com> writes:
> Why is it that the database
> cluster is resrticted to a single locale (or single set of locales) instead
> of being configurable on a per-database basis?

Because we depend on libc's locale support, which (on many platforms)
isn't designed to switch between locales cheaply.  The fact that we
allow a per-database encoding spec at all was probably a bad idea in
hindsight --- it's out front of what the code can really deal with.
My recollection is that the Japanese contingent argued for it on the
grounds that they needed to deal with multiple encodings and didn't
care about encoding/locale mismatch because they were going to use
C locale anyway.  For everybody else though, it's a gotcha waiting
to happen.

This stuff is certainly far from ideal, but the amount of work involved
to fix it is daunting; see many past pg-hackers discussions.

> 2)  On the same token (more or less), I have a test database, for which
> I ran initdb without specifying encoding or locale;  then, I create a
> database with UTF8 encoding.

There's no such thing as "you didn't specify a locale".  If you didn't
specify one on the initdb command line, then it was taken from the
environment.  Try "show lc_collate" and "show lc_ctype" to see what
got used.

> I try lower of a string that
> contains characters with accents  (e.g., Spanish or French characters),
> and it works as it should according to Spanish or French rules --- it
> returns a string with the same characters in lowecase, with the same
> accent.  Why did that work?  My Linux machine has all en_US.UTF-8
> locales, and en_US is not even aware of characters with accents,

You sure?  I'd sort of expect a UTF8 locale to know this stuff anyway.
In any case, Postgres doesn't know anything about case conversion
beyond what toupper/tolower tell it, so your experimental result is
sufficient proof that that locale includes these conversions.

            regards, tom lane

Re: Encoding, Unicode, locales, etc.

From
Karsten Hilbert
Date:
On Tue, Oct 31, 2006 at 11:47:56PM -0500, Tom Lane wrote:

> Because we depend on libc's locale support, which (on many platforms)
> isn't designed to switch between locales cheaply.  The fact that we
> allow a per-database encoding spec at all was probably a bad idea in
> hindsight --- it's out front of what the code can really deal with.
> My recollection is that the Japanese contingent argued for it on the
> grounds that they needed to deal with multiple encodings and didn't
> care about encoding/locale mismatch because they were going to use
> C locale anyway.  For everybody else though, it's a gotcha waiting
> to happen.

Could this paragraph be put into the docs and/or the FAQ,
please ? Along with the recommendation that if you require
multiple encodings for your databases you better had your OS
locale configured properly for UTF8 and use UNICODE
databases or do initdb with the C-locale.

> This stuff is certainly far from ideal, but the amount of work involved
> to fix it is daunting; see many past pg-hackers discussions.

Here are a few data points from my Debian/Testing system in
favour of not worrying too much about installed ICU size as
it is being used by other packages anyways:

libicu36
Reverse Depends:
  openoffice.org-writer                * OOo
  openoffice.org-filter-so52
  openoffice.org-core
  libxerces27                        * Xerces XML parser (Apache camp)
  libboost-regex1.33.1
  libboost-dbg

icu
Reverse Depends:
  libicu36
  libicu36
  libxercesicu26                    * Xerces, again
  libxercesicu25
  libicu28-dev
  libicu28
  libicu21c102
  icu-i18ndata
  icu-data
  libwine                            * Wine

This, of course, does not decrease the work required to get
this going in PostgreSQL.

Thanks for the great work,
Karsten
--
GPG key ID E4071346 @ wwwkeys.pgp.net
E167 67FD A291 2BEA 73BD  4537 78B9 A9F9 E407 1346

Re: Encoding, Unicode, locales, etc.

From
Carlos Moreno
Date:
Thanks Tom, for your reply.

Tom Lane wrote:

>Carlos Moreno <moreno_pg@mochima.com> writes:
>
>
>>Why is it that the database
>>cluster is resrticted to a single locale (or single set of locales) instead
>>of being configurable on a per-database basis?
>>
>>
>
>Because we depend on libc's locale support, which (on many platforms)
>isn't designed to switch between locales cheaply  [...]
>
>This stuff is certainly far from ideal, but the amount of work involved
>to fix it is daunting; see many past pg-hackers discussions.
>
>

Fair enough --- and good to know.

>>2)  On the same token (more or less), I have a test database, for which
>>I ran initdb without specifying encoding or locale;  then, I create a
>>database with UTF8 encoding.
>>
>>
>
>There's no such thing as "you didn't specify a locale".  If you didn't
>specify one on the initdb command line, then it was taken from the
>environment.  Try "show lc_collate" and "show lc_ctype" to see what
>got used.
>
>

Yes, that's what I meant --- I meant that I did not use the --locale or
-E command-
line switches for the initdb command.  Both lc_ctype and lc_collate show
en_US.UTF-8

>>I try lower of a string that
>>contains characters with accents  (e.g., Spanish or French characters),
>>and it works as it should according to Spanish or French rules --- it
>>returns a string with the same characters in lowecase, with the same
>>accent.  Why did that work?  My Linux machine has all en_US.UTF-8
>>locales, and en_US is not even aware of characters with accents,
>>
>>
>
>You sure?  I'd sort of expect a UTF8 locale to know this stuff anyway.
>In any case, Postgres doesn't know anything about case conversion
>beyond what toupper/tolower tell it, so your experimental result is
>sufficient proof that that locale includes these conversions.
>
>

Are you sure there's nothing about the way PostgreSQL interacts with C
conversion functions?   I ask because, as part of a "sanity check", I
repeated
the tests --- now with two machines;  one that has PG 8.1.4, and the
other one
has 7.4.14, and they behave differently.

The one that does the case conversion "correctly" (read:  as I expect it
as per
Spanish or French rules) is 8.1.4 with en_US locale (LC_CTYPE and
LC_COLLATE both showing en_US.UTF-8).  PG 7.4.14, *even with
locale es_ES*, does not do the case conversion  (characters with accent
or tilde are left untouched).

I wonder if someone could shed some light on this little mystery....???
Perhaps to add more confusion to my experimental/informal tests, PG 8.1.4
is running on a FC4 AMD64 X2 box  (the command "locale" at the shell
prompt shows all en_US.utf8), and PG 7.4.14 is running on a laptop with
FC5 on an Intel Celeron M  (the command locale shows exactly the same
in that case).   Does this perhaps account for the difference?

Thanks,

Carlos
--


Re: Encoding, Unicode, locales, etc.

From
Tom Lane
Date:
Karsten Hilbert <Karsten.Hilbert@gmx.net> writes:
> Could this paragraph be put into the docs and/or the FAQ,
> please ?

Section 21.2. Character Set Support already has something about it:

     Important: Although you can specify any encoding you want for a
     database, it is unwise to choose an encoding that is not what is
     expected by the locale you have selected. The LC_COLLATE and
     LC_CTYPE settings imply a particular encoding, and locale-dependent
     operations (such as sorting) are likely to misinterpret data that
     is in an incompatible encoding.

     Since these locale settings are frozen by initdb, the apparent
     flexibility to use different encodings in different databases of a
     cluster is more theoretical than real. It is likely that these
     mechanisms will be revisited in future versions of PostgreSQL.

     One way to use multiple encodings safely is to set the locale to C
     or POSIX during initdb, thus disabling any real locale awareness.

            regards, tom lane

Re: Encoding, Unicode, locales, etc.

From
Martijn van Oosterhout
Date:
On Wed, Nov 01, 2006 at 11:41:43AM +0100, Karsten Hilbert wrote:
> Could this paragraph be put into the docs and/or the FAQ,
> please ? Along with the recommendation that if you require
> multiple encodings for your databases you better had your OS
> locale configured properly for UTF8 and use UNICODE
> databases or do initdb with the C-locale.

Err, multiple encodings don't work full-stop. Any particular locale (as
defined by POSIX) is only really designed to work with one encoding.
The fact that the C locale produces an order when sorting UTF8 text is
really just luck.

In hindsight the people in POSIX who decided to tie locale and encoding
into one variable should probably be shot, but it's a bit late now.

> > This stuff is certainly far from ideal, but the amount of work involved
> > to fix it is daunting; see many past pg-hackers discussions.
>
> Here are a few data points from my Debian/Testing system in
> favour of not worrying too much about installed ICU size as
> it is being used by other packages anyways:

We'd need a suitable patch first before we start worrying about that. I
think diskspace is less of an issue now. There are discussions going on
about having the clog and the xlog taking dozens of megabytes. At the
end of the day I don't think 10MB for the Unicode data it going to be
that big a deal, *if* the patch solves all the problems in this area in
a reasonably clean way...

Have a nice day,
--
Martijn van Oosterhout   <kleptog@svana.org>   http://svana.org/kleptog/
> From each according to his ability. To each according to his ability to litigate.

Attachment

Re: Encoding, Unicode, locales, etc.

From
Karsten Hilbert
Date:
On Wed, Nov 01, 2006 at 08:50:30PM +0100, Martijn van Oosterhout wrote:

> > Could this paragraph be put into the docs and/or the FAQ,
> > please ? Along with the recommendation that if you require
> > multiple encodings for your databases you better had your OS
> > locale configured properly for UTF8 and use UNICODE
> > databases or do initdb with the C-locale.
>
> Err, multiple encodings don't work full-stop.
Well, yes, I was thinking of multiple client encodings which
can be supported either via a C-locale-initdb with the
databases set to the encoding you require (but sorting/etc
won't work, I know) or by doing a unicode-initdb and using
unicode databases. In each case the client encodings can be
"multiple" ones - as long as conversion is possible. Sorting
etc may still be wrong, but at least the proper characters
are going in and coming back.

> Any particular locale (as
> defined by POSIX) is only really designed to work with one encoding.
Sure. What I meant is that if you have a unicode database
you can use several client encodings and get back the
properly encoded characters.

> The fact that the C locale produces an order when sorting UTF8 text is
> really just luck.
Yes.

> > Here are a few data points from my Debian/Testing system in
> > favour of not worrying too much about installed ICU size as
> > it is being used by other packages anyways:
>
> We'd need a suitable patch first before we start worrying about that. I
> think diskspace is less of an issue now.
Well, size did come up in a "recent" discussion so I thought
I'd mention the above facts.

Karsten
--
GPG key ID E4071346 @ wwwkeys.pgp.net
E167 67FD A291 2BEA 73BD  4537 78B9 A9F9 E407 1346

Re: Encoding, Unicode, locales, etc.

From
Tom Lane
Date:
Carlos Moreno <moreno_pg@mochima.com> writes:
> ... The one that does the case conversion "correctly" (read: as I
> expect it as per Spanish or French rules) is 8.1.4 with en_US locale
> (LC_CTYPE and LC_COLLATE both showing en_US.UTF-8).  PG 7.4.14, *even
> with locale es_ES*, does not do the case conversion (characters with
> accent or tilde are left untouched).

IIRC, 7.4 has no chance of doing upper/lower sanely with multibyte UTF8
characters, because it only knows about the old-line toupper/tolower
<ctype.h> functions.  8.0 and up know about <wctype.h> and can do this
as you'd expect.  See the CVS history at
http://developer.postgresql.org/cvsweb.cgi/pgsql/src/backend/utils/adt/oracle_compat.c

            regards, tom lane