Re: Per-column collation - Mailing list pgsql-hackers

From Peter Eisentraut
Subject Re: Per-column collation
Date
Msg-id 1291663608.10677.28.camel@vanquo.pezone.net
Whole thread Raw
In response to Re: Per-column collation  (Itagaki Takahiro <itagaki.takahiro@gmail.com>)
Responses Re: Per-column collation
List pgsql-hackers
On mån, 2010-12-06 at 21:06 +0900, Itagaki Takahiro wrote:
> On Sun, Dec 5, 2010 at 01:04, Peter Eisentraut <peter_e@gmx.net> wrote:
> > Here is an updated patch to address the issues discussed during this
> > commitfest.
> 
> Here are comments and questions after I tested the latest patch:
> 
> ==== Issues ====
> * initdb itself seems to be succeeded, but it says "could not determine
> encoding for locale" messages for any combination of encoding=utf8/eucjp
> and locale=ja_JP.utf8/ja_JP.eucjp/C. Is it an expected behavior?
> ----
> creating collations ...initdb: locale name has non-ASCII characters,
> skipped: bokm虱
> initdb: locale name has non-ASCII characters, skipped: fran軋is
> could not determine encoding for locale "hy_AM.armscii8": codeset is "ARMSCII-8"
> ... (a dozen of lines) ...
> could not determine encoding for locale "vi_VN.tcvn": codeset is "TCVN5712-1"
> ok
> ----

What this does it take the output of locale -a and populate the
pg_collation catalog with the locales it finds.  When it finds an
operating system locale that uses an encoding that is not recognized,
you will see this warning.

I understand that that would probably annoy users.  We could hide the
warning and silently skip those locales.  But then could that hide
genuine configuration problems?

> * contrib/citext raises an encoding error when COLLATE is specified
> even if it is the collation as same as the database default.
> We might need some special treatment for C locale.
> =# SHOW lc_collate;  ==> C
> =# SELECT ('A'::citext) = ('a'::citext);  ==> false
> =# SELECT ('A'::citext) = ('a'::citext) COLLATE "C";
> ERROR:  invalid multibyte character for locale
> HINT:  The server's LC_CTYPE locale is probably incompatible with the
> database encoding.

OK, I can reproduce that.  That's fallout from the lc_ctype_is_c()
optimization that I removed, as explained in another email.  I'll have
to think about that again.

> * pg_dump would generate unportable files for different platforms
> because collation names

pg_dump can already produce unportable files for a number of other
reasons, including per-database locale, tablespaces, OS-dependent
configuration settings.

The way I imagine this working is that someone who wants to design a
genuinely portable application using this feature would create their own
collation based on the existing, OS-specific collation (using a
to-be-added CREATE COLLATION command).  As mentioned earlier, however,
we can't actually solve the problem that the OS locales may not behave
the same across systems.

> ==== Source codes ====
> * PG_GETARG_COLLATION might be a better name rather than PG_GET_COLLATION.

It's not the collation of a function argument, it's the collation of a
function call.  (You could conceivably also fetch the collation of a
function argument, but that isn't used in any way.)

> * What is the different between InvalidOid and DEFAULT_COLLATION_OID
> for collation oids? The patch replaces DirectFunctionCall to
> DirectFunctionCallC in some places, but we could shrink the diff size
> if we can use InvalidOid instead of DEFAULT_COLLATION_OID,

Think of DEFAULT_COLLATION_OID as analogous to UKNOWNOID.  A long time
ago we used InvalidOid for all kinds of types, including unknown,
pseudotypes, cstring, and no type at all.  The reason we changed this
was that this masked errors and made processing of the unknown type
difficult/impossible.  I know this makes the code bigger, but it's
necessary.  I originally coded the patch using InvalidOid for
everything, but that wasn't very robust.

This also ties into the next question ...

> * I still think an explicit passing collations from-function-to-function
> is horrible because we might forget it in some places, and almost existing
> third party module won't work.  Is it possible to make it a global variable,
> and push/pop the state when changed? Sorry I'm missing something, but
> I think we could treat the collation setting as like as GUC settings.

A collation is a property of a datum or an expression.  You might as
well argue that we don't keep track of types of expressions and instead
store it globally.  Doesn't make sense.

Extensions are not required to support collations.  Those that might
want to will usually end up calling one of the locale-enabled functions
such as varstr_cmp(), and there the function prototype will ensure that
specifying a collation cannot be missed.

Additionally, the distinction of InvalidOid and DEFAULT_COLLATION_OID
does a great deal to ensure that in case a collation is unspecified or
missing in some new code, you will get a proper error message instead of
unspecified behavior.




pgsql-hackers by date:

Previous
From: "David E. Wheeler"
Date:
Subject: Re: pg_execute_from_file review
Next
From: Peter Eisentraut
Date:
Subject: Re: Per-column collation