Thread: Collation rules and multi-lingual databases

Collation rules and multi-lingual databases

From
Greg Stark
Date:
My understanding is that the entire set of localization parameters needs to be
decided upon when the initdb is done and can never be changed later. Is that
right?

I have a multi-lingual web site, I want to be able to sort using collation
rules for en_US, en_CA, and fr_CA depending on the current web site user.
There's no way I can do a database query to do these sorts is there?

--
greg

Re: Collation rules and multi-lingual databases

From
Tom Lane
Date:
Greg Stark <gsstark@mit.edu> writes:
> My understanding is that the entire set of localization parameters needs to be
> decided upon when the initdb is done and can never be changed later. Is that
> right?

No, not all of them are frozen.  Unfortunately, the one you care about
(LC_COLLATE) is.  The reason for this is that it determines index
ordering for textual columns, and so changing LC_COLLATE on the fly
produces instant corrupt indexes :-(

A solution for this is on the TODO list, but don't hold your breath ...

            regards, tom lane

Re: Collation rules and multi-lingual databases

From
Greg Stark
Date:
Tom Lane <tgl@sss.pgh.pa.us> writes:

> Greg Stark <gsstark@mit.edu> writes:
> > My understanding is that the entire set of localization parameters needs to be
> > decided upon when the initdb is done and can never be changed later. Is that
> > right?
>
> No, not all of them are frozen.  Unfortunately, the one you care about
> (LC_COLLATE) is.  The reason for this is that it determines index
> ordering for textual columns, and so changing LC_COLLATE on the fly
> produces instant corrupt indexes :-(

Yeah, I really would be perfectly happy to have indexes be in "C" order and
have my queries have to specifically specify the sort order in the ORDER BY
clause, knowing they won't use the index.

Well, ok, not quite perfectly happy. But in this case there are no indexes on
the columns anyways so...

> A solution for this is on the TODO list, but don't hold your breath ...

--
greg

Re: Collation rules and multi-lingual databases

From
Dennis Gearon
Date:
I think it would be nice, and I may write it eventually, to have a
function called:

COLLATION_VALUE( 'string', 'encoding' )

Which could be used like:

SELECT field_a, field_b
FROM table_a
GROUP BY COLLATION_VALUE( field_a )
ORDER BY COLLATION_VALUE( field_b );

or in other creative ways.

Greg Stark wrote:

>Tom Lane <tgl@sss.pgh.pa.us> writes:
>
>
>
>>Greg Stark <gsstark@mit.edu> writes:
>>
>>
>>>My understanding is that the entire set of localization parameters needs to be
>>>decided upon when the initdb is done and can never be changed later. Is that
>>>right?
>>>
>>>
>>No, not all of them are frozen.  Unfortunately, the one you care about
>>(LC_COLLATE) is.  The reason for this is that it determines index
>>ordering for textual columns, and so changing LC_COLLATE on the fly
>>produces instant corrupt indexes :-(
>>
>>
>
>Yeah, I really would be perfectly happy to have indexes be in "C" order and
>have my queries have to specifically specify the sort order in the ORDER BY
>clause, knowing they won't use the index.
>
>Well, ok, not quite perfectly happy. But in this case there are no indexes on
>the columns anyways so...
>
>
>
>>A solution for this is on the TODO list, but don't hold your breath ...
>>
>>
>
>
>


Re: Collation rules and multi-lingual databases

From
Greg Stark
Date:
Dennis Gearon <gearond@fireserve.net> writes:

> I think it would be nice, and I may write it eventually, to have a function
> called:
>
> COLLATION_VALUE( 'string', 'encoding' )

Indeed that would be really nice. I wish I had that and a pony.

Unfortunately my understanding is that the collation rules are simply too
complex to allow such a function in general. It's too bad because it would
indeed eliminate a lot of the problems in a single swoop.

Hm. But perhaps I can come up with such a function for the few collation
domains I care about. I think English is just a matter of making it case
insensitive, and fr_xx is just a matter of squashing accents. In which case I
should be able to do it myself.

--
greg

Re: Collation rules and multi-lingual databases

From
Greg Stark
Date:
Greg Stark <gsstark@MIT.EDU> writes:

> Dennis Gearon <gearond@fireserve.net> writes:
>
> > I think it would be nice, and I may write it eventually, to have a function
> > called:
> >
> > COLLATION_VALUE( 'string', 'encoding' )
>
> Indeed that would be really nice. I wish I had that and a pony.
>
> Unfortunately my understanding is that the collation rules are simply too
> complex to allow such a function in general. It's too bad because it would
> indeed eliminate a lot of the problems in a single swoop.

Uh, so apparently I'm on crack and this is *precisely* how the l10n collation
rules work. Sorry for jumping in with an uninformed opinion.

>    Effectively, the way these functions work is by applying a mapping to
> transform the characters in a string to a byte sequence that represents
> the string's position in the collating sequence of the current locale.
> Comparing two such byte sequences in a simple fashion is equivalent to
> comparing the strings with the locale's collating sequence.
>
>    The functions `strcoll' and `wcscoll' perform this translation
> implicitly, in order to do one comparison.  By contrast, `strxfrm' and
> `wcsxfrm' perform the mapping explicitly.  If you are making multiple
> comparisons using the same string or set of strings, it is likely to be
> more efficient to use `strxfrm' or `wcsxfrm' to transform all the
> strings just once, and subsequently compare the transformed strings
> with `strcmp' or `wcscmp'.

Given this it should be easy to write a collation_value(string,locale) C
function that switches the collation order, calls strxfrm and then restores
the collation order.

I fear memory leaks or performance losses on frequent locale switches like
this but it should be easy enough to try out. I don't see any problems with
postgres as long as it's possible to ensure the locale is always switched back
properly. It might not be thread-safe though.

At worst I could always call strxfrm in the application for each locale I care
about when inserting the data. That would bloat my tables for nothing though.

So it's looking like I might get my pony after all.

--
greg

Re: Collation rules and multi-lingual databases

From
Dennis Gearon
Date:
I was thinking of INGNORING locale, since it is basically fixed for a DB
for long periods of time.

If a table/column HAD it's own locale, that could be used,
but I was more interested in a function taht would allow the explicit
declaration of the encoding(s) to look for.

BTW, what is l10n

Greg Stark wrote:

>Greg Stark <gsstark@MIT.EDU> writes:
>
>
>
>>Dennis Gearon <gearond@fireserve.net> writes:
>>
>>
>>
>>>I think it would be nice, and I may write it eventually, to have a function
>>>called:
>>>
>>>COLLATION_VALUE( 'string', 'encoding' )
>>>
>>>
>>Indeed that would be really nice. I wish I had that and a pony.
>>
>>Unfortunately my understanding is that the collation rules are simply too
>>complex to allow such a function in general. It's too bad because it would
>>indeed eliminate a lot of the problems in a single swoop.
>>
>>
>
>Uh, so apparently I'm on crack and this is *precisely* how the l10n collation
>rules work. Sorry for jumping in with an uninformed opinion.
>
>
>
>>   Effectively, the way these functions work is by applying a mapping to
>>transform the characters in a string to a byte sequence that represents
>>the string's position in the collating sequence of the current locale.
>>Comparing two such byte sequences in a simple fashion is equivalent to
>>comparing the strings with the locale's collating sequence.
>>
>>   The functions `strcoll' and `wcscoll' perform this translation
>>implicitly, in order to do one comparison.  By contrast, `strxfrm' and
>>`wcsxfrm' perform the mapping explicitly.  If you are making multiple
>>comparisons using the same string or set of strings, it is likely to be
>>more efficient to use `strxfrm' or `wcsxfrm' to transform all the
>>strings just once, and subsequently compare the transformed strings
>>with `strcmp' or `wcscmp'.
>>
>>
>
>Given this it should be easy to write a collation_value(string,locale) C
>function that switches the collation order, calls strxfrm and then restores
>the collation order.
>
>I fear memory leaks or performance losses on frequent locale switches like
>this but it should be easy enough to try out. I don't see any problems with
>postgres as long as it's possible to ensure the locale is always switched back
>properly. It might not be thread-safe though.
>
>At worst I could always call strxfrm in the application for each locale I care
>about when inserting the data. That would bloat my tables for nothing though.
>
>So it's looking like I might get my pony after all.
>
>
>


Re: Collation rules and multi-lingual databases

From
Greg Stark
Date:
Dennis Gearon <gearond@fireserve.net> writes:

> I was thinking of INGNORING locale, since it is basically fixed for a DB for
> long periods of time.
>
> If a table/column HAD it's own locale, that could be used,
> but I was more interested in a function taht would allow the explicit
> declaration of the encoding(s) to look for.

Indeed for my purposes that's what I'll have to do.

but the strxfrm function uses the current application locale, so I'll have to
call setlocale to set it, call strxfrm, then call setlocale to set it back.

I fear that some implementations might do a lot of work when setlocale is
called loading large data files and might leak memory expecting it to only be
called once at program initialization. That would suck

> BTW, what is l10n

l10n = localization
i18n = internationalization

arguably i should have said i18n actually.

--
greg

Re: Collation rules and multi-lingual databases

From
Dennis Gearon
Date:
NO No no, do not mess with set locale,
    You will need to hack your own version of strxfrm which takes an
aragument for locale. If you mess with locale,you will corrupt the
indexes I've been told.

Greg Stark wrote:

>Dennis Gearon <gearond@fireserve.net> writes:
>
>
>
>>I was thinking of INGNORING locale, since it is basically fixed for a DB for
>>long periods of time.
>>
>>If a table/column HAD it's own locale, that could be used,
>>but I was more interested in a function taht would allow the explicit
>>declaration of the encoding(s) to look for.
>>
>>
>
>Indeed for my purposes that's what I'll have to do.
>
>but the strxfrm function uses the current application locale, so I'll have to
>call setlocale to set it, call strxfrm, then call setlocale to set it back.
>
>I fear that some implementations might do a lot of work when setlocale is
>called loading large data files and might leak memory expecting it to only be
>called once at program initialization. That would suck
>
>
>
>>BTW, what is l10n
>>
>>
>
>l10n = localization
>i18n = internationalization
>
>arguably i should have said i18n actually.
>
>
>