Re: Per-column collation - Mailing list pgsql-hackers

From: Peter Eisentraut
Subject: Re: Per-column collation
Msg-id: 1289936463.31200.19.camel@vanquo.pezone.net
In response to: Re: Per-column collation (Pavel Stehule <pavel.stehule@gmail.com>)
List: pgsql-hackers
On Tue, 2010-11-16 at 20:00 +0100, Pavel Stehule wrote:
> yes - my first question is: why do we need to specify an encoding when
> only one encoding is supported?  I can't use cs_CZ.iso88592 when my
> database uses UTF8 - by the way, the message is wrong:
> 
> yyy=# select * from jmena order by jmeno collate "cs_CZ.iso88592";
> ERROR:  collation "cs_CZ.iso88592" for current database encoding
> "UTF8" does not exist
> LINE 1: select * from jmena order by jmeno collate "cs_CZ.iso88592";
>                                            ^

Sorry, is there some mistake in that message?

> I don't know why, but the preferred encoding for Czech is iso88592 now -
> yet I can't use it - so I can't use the names "czech" or "cs_CZ".  I
> always have to use the full name "cs_CZ.utf8".  That's wrong.  What's
> more, from that moment on my application depends on whichever encoding
> was used first - I can't change the encoding without refactoring my SQL
> statements, because the encoding is hard-coded there (in the COLLATE
> clause).

I can only look at the locales that the operating system provides.  We
could conceivably make some simplifications like stripping off the
".utf8", but then how far do we go and where do we stop?  Locale names
on Windows look different too.  But in general, how do you suppose we
should map an operating system locale name to an "acceptable" SQL
identifier?  You might hope, for example, that we could look through the
list of operating system locale names and map, say,

cs_CZ           -> "czech"
cs_CZ.iso88592  -> "czech"
cs_CZ.utf8      -> "czech"
czech           -> "czech"

but we have no way to actually know that these are semantically similar,
so this illustrated mapping is AI-complete.  We have to take the locale
names as they are, and they may or may not carry encoding information.
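
To illustrate the consequence (a sketch only - the actual names depend
entirely on which locales the operating system provides; here I'm
assuming typical glibc locale names on a UTF8 database):

    SELECT * FROM jmena ORDER BY jmeno COLLATE "cs_CZ.utf8";
        -- works: locale encoding matches the database encoding
    SELECT * FROM jmena ORDER BY jmeno COLLATE "cs_CZ.iso88592";
        -- fails: encoding mismatch, as in your example above
    SELECT * FROM jmena ORDER BY jmeno COLLATE "cs_CZ";
        -- on glibc the bare name usually implies ISO-8859-2,
        -- so this fails too

So the "full name" requirement you describe falls directly out of taking
the operating system's names as is.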

> So I don't understand why you fill the pg_collation table with
> thousands of collations that are impossible to use.  If I use utf8,
> then there should be only utf8-based collations.  And if you need to
> work with a wide range of collations, then I am for preferring utf8 -
> at least for the central European region.  If somebody wants to use
> collations here, they will use a combination of cs, de, en - so they
> must use latin2 and latin1, or utf8.  I think the encoding should not
> be part of a collation when that is possible.

Different databases can have different encodings, but the pg_collation
catalog is copied from the template database in any case.  We can't make
changes to the system catalogs as we create new databases, so the
"useless" collations have to be there.  There are only a few hundred,
actually, so it's not really a lot of wasted space.
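
For what it's worth, assuming the catalog ends up with a per-collation
encoding column (call it collencoding, with -1 meaning "usable with any
encoding" - whether the final layout looks exactly like this is an
assumption), the collations actually usable in a given database could be
listed with something like:

    -- sketch: collencoding and its -1 convention are assumed here
    SELECT collname
      FROM pg_collation
     WHERE collencoding IN (-1, (SELECT encoding
                                   FROM pg_database
                                  WHERE datname = current_database()));

The rest would just sit in the catalog unused.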



