Re: Encoding, Unicode, locales, etc. - Mailing list pgsql-general

From: Tom Lane
Subject: Re: Encoding, Unicode, locales, etc.
Msg-id: 12356.1162356476@sss.pgh.pa.us
In response to: Encoding, Unicode, locales, etc.  (Carlos Moreno <moreno_pg@mochima.com>)
Responses: Re: Encoding, Unicode, locales, etc.  (Karsten Hilbert <Karsten.Hilbert@gmx.net>)
           Re: Encoding, Unicode, locales, etc.  (Carlos Moreno <moreno_pg@mochima.com>)
List: pgsql-general

Carlos Moreno <moreno_pg@mochima.com> writes:
> Why is it that the database
> cluster is restricted to a single locale (or single set of locales) instead
> of being configurable on a per-database basis?

Because we depend on libc's locale support, which (on many platforms)
isn't designed to switch between locales cheaply.  The fact that we
allow a per-database encoding spec at all was probably a bad idea in
hindsight --- it's out front of what the code can really deal with.
My recollection is that the Japanese contingent argued for it on the
grounds that they needed to deal with multiple encodings and didn't
care about encoding/locale mismatch because they were going to use
C locale anyway.  For everybody else though, it's a gotcha waiting
to happen.

This stuff is certainly far from ideal, but the amount of work involved
to fix it is daunting; see the many past pgsql-hackers discussions.
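
As a rough illustration of the libc limitation mentioned above (this is
not PostgreSQL code): the C library keeps the locale as process-wide
state, so using a different collation means calling setlocale(), doing
the work, and switching back. The sketch below does that through
Python's locale module, which wraps setlocale(). The helper names
temporary_collation and sort_in_locale are hypothetical, and how
expensive each switch is depends on the platform.

```python
import locale
from contextlib import contextmanager


@contextmanager
def temporary_collation(name):
    """Switch the process-wide LC_COLLATE setting, then restore it."""
    previous = locale.setlocale(locale.LC_COLLATE)   # query only, no change
    locale.setlocale(locale.LC_COLLATE, name)        # raises locale.Error if not installed
    try:
        yield
    finally:
        locale.setlocale(locale.LC_COLLATE, previous)


def sort_in_locale(words, name):
    """Sort words using the collation rules of the named locale."""
    with temporary_collation(name):
        # strxfrm transforms each string so that plain comparison
        # follows the current LC_COLLATE rules.
        return sorted(words, key=locale.strxfrm)
```

sort_in_locale(words, "C") should work everywhere; any other locale
name only works if that locale is installed on the machine. Because the
setting is global to the process, every other caller of locale-aware
functions is affected while the switch is in effect.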

> 2)  By the same token (more or less), I have a test database, for which
> I ran initdb without specifying encoding or locale;  then, I create a
> database with UTF8 encoding.

There's no such thing as "you didn't specify a locale".  If you didn't
specify one on the initdb command line, then it was taken from the
environment.  Try "show lc_collate" and "show lc_ctype" to see what
got used.
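
The environment fallback described above is ordinary libc behavior:
passing an empty string to setlocale() selects whatever LC_ALL, the
per-category variable, or LANG names. A minimal sketch of that lookup,
assuming Python's locale module as a stand-in for the C call (the
function names are hypothetical). It reports what a program started
from the current shell would inherit, which is not necessarily what
initdb recorded when the cluster was created; the SHOW commands remain
the authoritative check.

```python
import locale


def environment_locale(category):
    """Return the locale name libc picks from the environment for this
    category, or None if the environment names a locale that is not
    installed."""
    previous = locale.setlocale(category)       # query the current setting
    try:
        return locale.setlocale(category, "")   # "" means "use the environment"
    except locale.Error:
        return None
    finally:
        locale.setlocale(category, previous)    # leave the process unchanged


def environment_locales():
    """Collect the two categories that the SHOW commands above report."""
    return {
        "lc_collate": environment_locale(locale.LC_COLLATE),
        "lc_ctype": environment_locale(locale.LC_CTYPE),
    }
```

Printing environment_locales() in the shell that ran initdb gives a
hint about which values were picked up, provided the environment has
not changed since.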

> I try lower() on a string that
> contains characters with accents  (e.g., Spanish or French characters),
> and it works as it should according to Spanish or French rules --- it
> returns a string with the same characters in lowercase, with the same
> accent.  Why did that work?  My Linux machine has all en_US.UTF-8
> locales, and en_US is not even aware of characters with accents,

You sure?  I'd sort of expect a UTF8 locale to know this stuff anyway.
In any case, Postgres doesn't know anything about case conversion
beyond what toupper/tolower tell it, so your experimental result is
sufficient proof that that locale includes these conversions.
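
One way to check what the C library itself does, independent of
Postgres, is to call its case-conversion routine directly. The sketch
below is an assumption-laden stand-in: it uses Python's ctypes to call
libc's towlower() (the wide-character sibling of tolower) one character
at a time, assumes a Unix-like system where wint_t fits in an unsigned
int, and uses a hypothetical helper name, libc_lower. It does not
reproduce the server's code path.

```python
import ctypes
import ctypes.util

# Load the C library; on Linux, CDLL(None) also exposes libc symbols
# if find_library() cannot locate it.
libc = ctypes.CDLL(ctypes.util.find_library("c"))
libc.towlower.argtypes = [ctypes.c_uint]
libc.towlower.restype = ctypes.c_uint


def libc_lower(text):
    """Lowercase text one character at a time with libc's towlower(),
    which follows the process's current LC_CTYPE setting."""
    return "".join(chr(libc.towlower(ord(ch))) for ch in text)
```

After locale.setlocale(locale.LC_CTYPE, "en_US.UTF-8") (if that locale
is installed), calling libc_lower() on a string with accented capitals
repeats the experiment from the quoted message. What comes back for
non-ASCII characters depends on the locale data shipped with the
platform.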

            regards, tom lane
