Thread: Locale, encoding, sort order confusion

Locale, encoding, sort order confusion

From
John Gunther
Date:
I've been reading about locales, encodings, sort orders, the to_ascii
function but I'm more confused than enlightened.

What I want is very simple:
1) I want the database to correctly accept, store, and display
alphabetic characters, including European accented characters, entered
and viewed in HTML
forms.
2) I want sorting to ignore the diacritical marks so that, for example,
u, u-accent, and u-umlaut are all sorted as if they were plain u.
3) I want sorting to ignore non-alphanumerics, letter case, and white
space.

To illustrate, the following data is in sorted order:

St-Émile
stendahl
st ènders
St. Epson

Can someone tell me what combination of PostgreSQL and Linux settings I
need for this? Or point me somewhere that it's well explained. It seems
like a very basic question, but I'm just dense,
I guess. I've tried a half dozen time-consuming configs without success.

Thank you.

John Gunther



Re: Locale, encoding, sort order confusion

From
Peter Eisentraut
Date:
John Gunther wrote:
> Can someone tell me what combination of PostgreSQL and Linux settings
> I need for this? Or point me somewhere that it's well explained. It
> seems like a very basic question, but I'm just dense,
> I guess.

Pretty much any locale (say, en_US for you) with a matching character
set should work.  Unless you go out of your way, this should be the
default setting.

> I've tried a half dozen time-consuming configs without
> success.

Like what?

--
Peter Eisentraut
http://developer.postgresql.org/~petere/

Re: Locale, encoding, sort order confusion

From
John Gunther
Date:
>Pretty much any locale (say, en_US for you) with a matching character
>set should work.  Unless you go out of your way, this should be the
>default setting.

>> I've tried a half dozen time-consuming configs without
>> success.

>Like what?

For example, with all LC_* parameters set to "en_US.UTF8", I get the following incorrect "order:by":

Béarn
Bécancour
Beaupré

Bécancour should be last.

Incidentally, in psql via a putty.exe session, the only character set translation I can find that displays the accented
charactersis CP437. Does this seem right? 


Re: Locale, encoding, sort order confusion

From
Alvaro Herrera
Date:
John Gunther wrote:
> >Pretty much any locale (say, en_US for you) with a matching character
> >set should work.  Unless you go out of your way, this should be the
> >default setting.
>
> >>I've tried a half dozen time-consuming configs without
> >>success.
>
> >Like what?
>
> For example, with all LC_* parameters set to "en_US.UTF8", I get the
> following incorrect "order:by":
>
> Béarn
> Bécancour
> Beaupré

Did you initdb with locale en_US.UTF8, and also createdb with encoding
UTF8?  While you can certainly choose mismatching values in createdb and
initdb, you shouldn't because it doesn't work.  See the docs here:

http://www.postgresql.org/docs/8.1/static/multibyte.html

    Important:  Although you can specify any encoding you want for a
    database, it is unwise to choose an encoding that is not what is
    expected by the locale you have selected. The LC_COLLATE and
    LC_CTYPE settings imply a particular encoding, and locale-dependent
    operations (such as sorting) are likely to misinterpret data that is
    in an incompatible encoding.

    Since these locale settings are frozen by initdb, the apparent
    flexibility to use different encodings in different databases of a
    cluster is more theoretical than real. It is likely that these
    mechanisms will be revisited in future versions of PostgreSQL.


> Incidentally, in psql via a putty.exe session, the only character set
> translation I can find that displays the accented characters is CP437. Does
> this seem right?

You should probably set client_encoding in the psql session (using
\encoding) if you want to change the charset in putty.

--
Alvaro Herrera                                http://www.CommandPrompt.com/
PostgreSQL Replication, Consulting, Custom Development, 24x7 support

Re: Locale, encoding, sort order confusion

From
John Gunther
Date:
Alvaro Herrera wrote: <blockquote cite="mid20060821001057.GB1130@alvh.no-ip.org" type="cite"><blockquote
type="cite"><prewrap="">For example, with all LC_* parameters set to "en_US.UTF8", I get the 
 
following incorrect "order:by":

Béarn
Bécancour
Beaupré   </pre></blockquote><pre wrap="">
Did you initdb with locale en_US.UTF8, and also createdb with encoding
UTF8?  While you can certainly choose mismatching values in createdb and
initdb, you shouldn't because it doesn't work.</pre></blockquote> That's an interesting question. Are the LC_*
variablesset by initdb or createdb? In other words, does their value indicate what initdb settings I used? If I do a
defaultcreatedb, will the new database automatically be consistent with the cluster's initdb?<br />