Re: Per-column collation - Mailing list pgsql-hackers
From | Peter Eisentraut |
---|---|
Subject | Re: Per-column collation |
Date | |
Msg-id | 1291663608.10677.28.camel@vanquo.pezone.net Whole thread Raw |
In response to | Re: Per-column collation (Itagaki Takahiro <itagaki.takahiro@gmail.com>) |
Responses |
Re: Per-column collation
|
List | pgsql-hackers |
On mån, 2010-12-06 at 21:06 +0900, Itagaki Takahiro wrote: > On Sun, Dec 5, 2010 at 01:04, Peter Eisentraut <peter_e@gmx.net> wrote: > > Here is an updated patch to address the issues discussed during this > > commitfest. > > Here are comments and questions after I tested the latest patch: > > ==== Issues ==== > * initdb itself seems to be succeeded, but it says "could not determine > encoding for locale" messages for any combination of encoding=utf8/eucjp > and locale=ja_JP.utf8/ja_JP.eucjp/C. Is it an expected behavior? > ---- > creating collations ...initdb: locale name has non-ASCII characters, > skipped: bokm虱 > initdb: locale name has non-ASCII characters, skipped: fran軋is > could not determine encoding for locale "hy_AM.armscii8": codeset is "ARMSCII-8" > ... (a dozen of lines) ... > could not determine encoding for locale "vi_VN.tcvn": codeset is "TCVN5712-1" > ok > ---- What this does it take the output of locale -a and populate the pg_collation catalog with the locales it finds. When it finds an operating system locale that uses an encoding that is not recognized, you will see this warning. I understand that that would probably annoy users. We could hide the warning and silently skip those locales. But then could that hide genuine configuration problems? > * contrib/citext raises an encoding error when COLLATE is specified > even if it is the collation as same as the database default. > We might need some special treatment for C locale. > =# SHOW lc_collate; ==> C > =# SELECT ('A'::citext) = ('a'::citext); ==> false > =# SELECT ('A'::citext) = ('a'::citext) COLLATE "C"; > ERROR: invalid multibyte character for locale > HINT: The server's LC_CTYPE locale is probably incompatible with the > database encoding. OK, I can reproduce that. That's fallout from the lc_ctype_is_c() optimization that I removed, as explained in another email. I'll have to think about that again. > * pg_dump would generate unportable files for different platforms > because collation names pg_dump can already produce unportable files for a number of other reasons, including per-database locale, tablespaces, OS-dependent configuration settings. The way I imagine this working is that someone who wants to design a genuinely portable application using this feature would create their own collation based on the existing, OS-specific collation (using a to-be-added CREATE COLLATION command). As mentioned earlier, however, we can't actually solve the problem that the OS locales may not behave the same across systems. > ==== Source codes ==== > * PG_GETARG_COLLATION might be a better name rather than PG_GET_COLLATION. It's not the collation of a function argument, it's the collation of a function call. (You could conceivably also fetch the collation of a function argument, but that isn't used in any way.) > * What is the different between InvalidOid and DEFAULT_COLLATION_OID > for collation oids? The patch replaces DirectFunctionCall to > DirectFunctionCallC in some places, but we could shrink the diff size > if we can use InvalidOid instead of DEFAULT_COLLATION_OID, Think of DEFAULT_COLLATION_OID as analogous to UKNOWNOID. A long time ago we used InvalidOid for all kinds of types, including unknown, pseudotypes, cstring, and no type at all. The reason we changed this was that this masked errors and made processing of the unknown type difficult/impossible. I know this makes the code bigger, but it's necessary. I originally coded the patch using InvalidOid for everything, but that wasn't very robust. This also ties into the next question ... > * I still think an explicit passing collations from-function-to-function > is horrible because we might forget it in some places, and almost existing > third party module won't work. Is it possible to make it a global variable, > and push/pop the state when changed? Sorry I'm missing something, but > I think we could treat the collation setting as like as GUC settings. A collation is a property of a datum or an expression. You might as well argue that we don't keep track of types of expressions and instead store it globally. Doesn't make sense. Extensions are not required to support collations. Those that might want to will usually end up calling one of the locale-enabled functions such as varstr_cmp(), and there the function prototype will ensure that specifying a collation cannot be missed. Additionally, the distinction of InvalidOid and DEFAULT_COLLATION_OID does a great deal to ensure that in case a collation is unspecified or missing in some new code, you will get a proper error message instead of unspecified behavior.
pgsql-hackers by date: