Thread: Collation rules and multi-lingual databases
My understanding is that the entire set of localization parameters needs to be decided upon when the initdb is done and can never be changed later. Is that right? I have a multi-lingual web site, I want to be able to sort using collation rules for en_US, en_CA, and fr_CA depending on the current web site user. There's no way I can do a database query to do these sorts is there? -- greg
Greg Stark <gsstark@mit.edu> writes: > My understanding is that the entire set of localization parameters needs to be > decided upon when the initdb is done and can never be changed later. Is that > right? No, not all of them are frozen. Unfortunately, the one you care about (LC_COLLATE) is. The reason for this is that it determines index ordering for textual columns, and so changing LC_COLLATE on the fly produces instant corrupt indexes :-( A solution for this is on the TODO list, but don't hold your breath ... regards, tom lane
Tom Lane <tgl@sss.pgh.pa.us> writes: > Greg Stark <gsstark@mit.edu> writes: > > My understanding is that the entire set of localization parameters needs to be > > decided upon when the initdb is done and can never be changed later. Is that > > right? > > No, not all of them are frozen. Unfortunately, the one you care about > (LC_COLLATE) is. The reason for this is that it determines index > ordering for textual columns, and so changing LC_COLLATE on the fly > produces instant corrupt indexes :-( Yeah, I really would be perfectly happy to have indexes be in "C" order and have my queries have to specifically specify the sort order in the ORDER BY clause, knowing they won't use the index. Well, ok, not quite perfectly happy. But in this case there are no indexes on the columns anyways so... > A solution for this is on the TODO list, but don't hold your breath ... -- greg
I think it would be nice, and I may write it eventually, to have a function called: COLLATION_VALUE( 'string', 'encoding' ) Which could be used like: SELECT field_a, field_b FROM table_a GROUP BY COLLATION_VALUE( field_a ) ORDER BY COLLATION_VALUE( field_b ); or in other creative ways. Greg Stark wrote: >Tom Lane <tgl@sss.pgh.pa.us> writes: > > > >>Greg Stark <gsstark@mit.edu> writes: >> >> >>>My understanding is that the entire set of localization parameters needs to be >>>decided upon when the initdb is done and can never be changed later. Is that >>>right? >>> >>> >>No, not all of them are frozen. Unfortunately, the one you care about >>(LC_COLLATE) is. The reason for this is that it determines index >>ordering for textual columns, and so changing LC_COLLATE on the fly >>produces instant corrupt indexes :-( >> >> > >Yeah, I really would be perfectly happy to have indexes be in "C" order and >have my queries have to specifically specify the sort order in the ORDER BY >clause, knowing they won't use the index. > >Well, ok, not quite perfectly happy. But in this case there are no indexes on >the columns anyways so... > > > >>A solution for this is on the TODO list, but don't hold your breath ... >> >> > > >
Dennis Gearon <gearond@fireserve.net> writes: > I think it would be nice, and I may write it eventually, to have a function > called: > > COLLATION_VALUE( 'string', 'encoding' ) Indeed that would be really nice. I wish I had that and a pony. Unfortunately my understanding is that the collation rules are simply too complex to allow such a function in general. It's too bad because it would indeed eliminate a lot of the problems in a single swoop. Hm. But perhaps I can come up with such a function for the few collation domains I care about. I think English is just a matter of making it case insensitive, and fr_xx is just a matter of squashing accents. In which case I should be able to do it myself. -- greg
Greg Stark <gsstark@MIT.EDU> writes: > Dennis Gearon <gearond@fireserve.net> writes: > > > I think it would be nice, and I may write it eventually, to have a function > > called: > > > > COLLATION_VALUE( 'string', 'encoding' ) > > Indeed that would be really nice. I wish I had that and a pony. > > Unfortunately my understanding is that the collation rules are simply too > complex to allow such a function in general. It's too bad because it would > indeed eliminate a lot of the problems in a single swoop. Uh, so apparently I'm on crack and this is *precisely* how the l10n collation rules work. Sorry for jumping in with an uninformed opinion. > Effectively, the way these functions work is by applying a mapping to > transform the characters in a string to a byte sequence that represents > the string's position in the collating sequence of the current locale. > Comparing two such byte sequences in a simple fashion is equivalent to > comparing the strings with the locale's collating sequence. > > The functions `strcoll' and `wcscoll' perform this translation > implicitly, in order to do one comparison. By contrast, `strxfrm' and > `wcsxfrm' perform the mapping explicitly. If you are making multiple > comparisons using the same string or set of strings, it is likely to be > more efficient to use `strxfrm' or `wcsxfrm' to transform all the > strings just once, and subsequently compare the transformed strings > with `strcmp' or `wcscmp'. Given this it should be easy to write a collation_value(string,locale) C function that switches the collation order, calls strxfrm and then restores the collation order. I fear memory leaks or performance losses on frequent locale switches like this but it should be easy enough to try out. I don't see any problems with postgres as long as it's possible to ensure the locale is always switched back properly. It might not be thread-safe though. At worst I could always call strxfrm in the application for each locale I care about when inserting the data. That would bloat my tables for nothing though. So it's looking like I might get my pony after all. -- greg
I was thinking of INGNORING locale, since it is basically fixed for a DB for long periods of time. If a table/column HAD it's own locale, that could be used, but I was more interested in a function taht would allow the explicit declaration of the encoding(s) to look for. BTW, what is l10n Greg Stark wrote: >Greg Stark <gsstark@MIT.EDU> writes: > > > >>Dennis Gearon <gearond@fireserve.net> writes: >> >> >> >>>I think it would be nice, and I may write it eventually, to have a function >>>called: >>> >>>COLLATION_VALUE( 'string', 'encoding' ) >>> >>> >>Indeed that would be really nice. I wish I had that and a pony. >> >>Unfortunately my understanding is that the collation rules are simply too >>complex to allow such a function in general. It's too bad because it would >>indeed eliminate a lot of the problems in a single swoop. >> >> > >Uh, so apparently I'm on crack and this is *precisely* how the l10n collation >rules work. Sorry for jumping in with an uninformed opinion. > > > >> Effectively, the way these functions work is by applying a mapping to >>transform the characters in a string to a byte sequence that represents >>the string's position in the collating sequence of the current locale. >>Comparing two such byte sequences in a simple fashion is equivalent to >>comparing the strings with the locale's collating sequence. >> >> The functions `strcoll' and `wcscoll' perform this translation >>implicitly, in order to do one comparison. By contrast, `strxfrm' and >>`wcsxfrm' perform the mapping explicitly. If you are making multiple >>comparisons using the same string or set of strings, it is likely to be >>more efficient to use `strxfrm' or `wcsxfrm' to transform all the >>strings just once, and subsequently compare the transformed strings >>with `strcmp' or `wcscmp'. >> >> > >Given this it should be easy to write a collation_value(string,locale) C >function that switches the collation order, calls strxfrm and then restores >the collation order. > >I fear memory leaks or performance losses on frequent locale switches like >this but it should be easy enough to try out. I don't see any problems with >postgres as long as it's possible to ensure the locale is always switched back >properly. It might not be thread-safe though. > >At worst I could always call strxfrm in the application for each locale I care >about when inserting the data. That would bloat my tables for nothing though. > >So it's looking like I might get my pony after all. > > >
Dennis Gearon <gearond@fireserve.net> writes: > I was thinking of INGNORING locale, since it is basically fixed for a DB for > long periods of time. > > If a table/column HAD it's own locale, that could be used, > but I was more interested in a function taht would allow the explicit > declaration of the encoding(s) to look for. Indeed for my purposes that's what I'll have to do. but the strxfrm function uses the current application locale, so I'll have to call setlocale to set it, call strxfrm, then call setlocale to set it back. I fear that some implementations might do a lot of work when setlocale is called loading large data files and might leak memory expecting it to only be called once at program initialization. That would suck > BTW, what is l10n l10n = localization i18n = internationalization arguably i should have said i18n actually. -- greg
NO No no, do not mess with set locale, You will need to hack your own version of strxfrm which takes an aragument for locale. If you mess with locale,you will corrupt the indexes I've been told. Greg Stark wrote: >Dennis Gearon <gearond@fireserve.net> writes: > > > >>I was thinking of INGNORING locale, since it is basically fixed for a DB for >>long periods of time. >> >>If a table/column HAD it's own locale, that could be used, >>but I was more interested in a function taht would allow the explicit >>declaration of the encoding(s) to look for. >> >> > >Indeed for my purposes that's what I'll have to do. > >but the strxfrm function uses the current application locale, so I'll have to >call setlocale to set it, call strxfrm, then call setlocale to set it back. > >I fear that some implementations might do a lot of work when setlocale is >called loading large data files and might leak memory expecting it to only be >called once at program initialization. That would suck > > > >>BTW, what is l10n >> >> > >l10n = localization >i18n = internationalization > >arguably i should have said i18n actually. > > >