Thread: ORDER BY collation order

ORDER BY collation order

From

Joe

Date:

18 September 2008, 22:49:42

Hi,

I just found that two identical queries on two PG 8.2.7 databases with 
the same data and same encoding, one running on Debian and the other on 
FreeBSD, returned rows in a different order, even though both queries 
had an ORDER BY clause.  Essentially, on FreeBSD a varchar starting with 
a double-quote character came first, ahead of a string starting with a 
number and a string starting with 'A', whereas on Debian the 
double-quote came last.

Some research led to the following paragraph in the documentation:

Character-string data is sorted according to the locale-specific 
collation order that was established when the database cluster was 
initialized.

I guess that means the encoding of the respective template0 database is 
what determines the sort order (UTF8 on FreeBSD and LATIN1 on Debian), 
right?  Unfortunately, I'm unable to change the FreeBSD cluster since 
it's shared with others at our hosting provider.  Is there some way to 
override the cluster setting, or plans to allow for database-specific 
collation orders?

Joe

Re: ORDER BY collation order

From

"Scott Marlowe"

Date:

18 September 2008, 23:26:37

On Thu, Sep 18, 2008 at 6:48 PM, Joe <dev@freedomcircle.net> wrote:
> Hi,
>
> I just found that two identical queries on two PG 8.2.7 databases with the
> same data and same encoding, one running on Debian and the other on FreeBSD,
> returned rows in a different order, even though both queries had an ORDER BY
> clause.  Essentially, on FreeBSD a varchar starting with a double-quote
> character came first, ahead of a string starting with a number and a string
> starting with 'A', whereas on Debian the double-quote came last.
>
> Some research led to the following paragraph in the documentation:
>
> Character-string data is sorted according to the locale-specific collation
> order that was established when the database cluster was initialized.
>
> I guess that means the encoding of the respective template0 database is what
> determines the sort order (UTF8 on FreeBSD and LATIN1 on Debian), right?
>  Unfortunately, I'm unable to change the FreeBSD cluster since it's shared
> with others at our hosting provider.  Is there some way to override the
> cluster setting, or plans to allow for database-specific collation orders?

no, not encoding, locale, such as en_US or C determine sort order.

You can use varchar_pattern_ops and ~*~ operator.

Search for those in the docs.

Re: ORDER BY collation order

From

Joe

Date:

19 September 2008, 02:35:54

Hi Scott,

Scott Marlowe wrote:
> no, not encoding, locale, such as en_US or C determine sort order.
>   

OK, so I guess you're saying that whatever was in the LC_COLLATE 
environment variable at the time the template0 database was created 
determines the collation/sort order?  Is that stored and visible somewhere?

> You can use varchar_pattern_ops and ~*~ operator.
>
> Search for those in the docs.
>   

What I found 
(http://www.postgresql.org/docs/8.2/static/indexes-opclass.html), talks 
about creating an index with varchar_pattern_ops but that presumably 
won't affect an ORDER BY result.  I'm not quite sure where to find the 
"~*~" operator, although I did find similar ones in 9.7 Pattern 
Matching.  In any case, I'm not sure how an operator helps in changing 
an ORDER BY result from

"quoted"
123
Abc

to

123
Abc
"quoted"

It's even trickier than this simple example, because on Debian which is 
using the en_US locale, the double quotes are disregarded for ordering 
purposes, e.g.,

Medical
"Meet"
Message

Joe

Re: ORDER BY collation order

From

Petr Jelinek

Date:

19 September 2008, 05:10:02

Joe wrote:
> 
> I guess that means the encoding of the respective template0 database is 
> what determines the sort order (UTF8 on FreeBSD and LATIN1 on Debian), 
> right?  Unfortunately, I'm unable to change the FreeBSD cluster since 
> it's shared with others at our hosting provider.  Is there some way to 
> override the cluster setting, or plans to allow for database-specific 
> collation orders?

The problem I see here is not just locale settings as Scott Marlowe 
pointed out but also the fact that the server runs on FreeBSD with UTF8 
encoding.

PostgreSQL relies on system (in this case libc) functions to handle 
locale stuff and FreeBSD does not support UTF8. This led to FreeBSD 
specific patch (which is in ports) that uses ICU library to handle UTF8 
encoding. Sadly, it is not available for PostgreSQL 8.2 (original author 
made patches only up to 8.1 and I ported it to 8.3 later). Also that 
patch has to be enabled (which is not default iirc) when installing 
PostgreSQL package.

So, since you say your cluster is hosted by some hosting company then 
your only options are either to convince them to upgrade to 8.3 
(provided the system locale is ok for your purposes) or wait until per 
database collation is added (it was one of GSoC 08 projects and might 
get into 8.4) and then convince them to upgrade.

-- 
Regards
Petr Jelinek (PJMODOS)

Re: ORDER BY collation order

From

Petr Jelinek

Date:

19 September 2008, 10:59:24

Joe wrote:
> Thanks for the info.   The hosting company already offers 8.3 
> databases.  It's just that other things (we use tsearch2) have taken 
> precedence over migrating to 8.3.  How can I tell if the patch you 
> mention is installed?
As admin (or somebody with shell access) by running "pg_config 
--configure" should show "--with-icu" as one of configure parameters.
As user you could try SELECTing upper() or lower() on some text with 
more exotic chars (anything non-ASCII should do) in database which has 
UTF8 encoding. If it works then ICU support is there.

-- 
Regards
Petr Jelinek (PJMODOS)

Re: ORDER BY collation order

From

"Scott Marlowe"

Date:

19 September 2008, 11:26:12

On Thu, Sep 18, 2008 at 10:35 PM, Joe <dev@freedomcircle.net> wrote:
> Hi Scott,
>
> Scott Marlowe wrote:
>>
>> no, not encoding, locale, such as en_US or C determine sort order.
>>
>
> OK, so I guess you're saying that whatever was in the LC_COLLATE environment
> variable at the time the template0 database was created determines the
> collation/sort order?  Is that stored and visible somewhere?

It's set at the time of init.d and can't be changed without a
dump/initdb/restore cycle.

You can see by typing
show lc_collate ;lc_collate
------------en_US

in psql.

>> You can use varchar_pattern_ops and ~*~ operator.
>> Search for those in the docs.
>
> What I found
> (http://www.postgresql.org/docs/8.2/static/indexes-opclass.html), talks
> about creating an index with varchar_pattern_ops but that presumably won't
> affect an ORDER BY result.  I'm not quite sure where to find the "~*~"
> operator, although I did find similar ones in 9.7 Pattern Matching.  In any
> case, I'm not sure how an operator helps in changing an ORDER BY result from
>
> "quoted"
> 123
> Abc
>
> to
>
> 123
> Abc
> "quoted"

Sorry, I mentioned the wrong operator before, it's ~>~ and ~<~ (asc
versus desc):

smarlowe=# create table col_test (a text);
CREATE TABLE
smarlowe=# insert into col_test (a) values ('"quoted"'),('Abc'),('123');
INSERT 0 3
smarlowe=# select * from col_test order by a;   a
----------123Abc"quoted"
(3 rows)

smarlowe=# select * from col_test order by a using ~<~;   a
----------"quoted"123Abc
(3 rows)

smarlowe=# select * from col_test order by a using ~>~;   a
----------Abc123"quoted"
(3 rows)

Re: ORDER BY collation order

From

Joe

Date:

21 September 2008, 11:59:19

Hi Scott,

Scott Marlowe wrote:
> Sorry, I mentioned the wrong operator before, it's ~>~ and ~<~ (asc
> versus desc):
>
> smarlowe=# create table col_test (a text);
> CREATE TABLE
> smarlowe=# insert into col_test (a) values ('"quoted"'),('Abc'),('123');
> INSERT 0 3
> smarlowe=# select * from col_test order by a;
>     a
> ----------
>  123
>  Abc
>  "quoted"
> (3 rows)
>
> smarlowe=# select * from col_test order by a using ~<~;
>     a
> ----------
>  "quoted"
>  123
>  Abc
> (3 rows)
>
> smarlowe=# select * from col_test order by a using ~>~;
>     a
> ----------
>  Abc
>  123
>  "quoted"
> (3 rows)

Those operators give me "C"-style collation in the database that is 
using "en_US" collation, but what I would really prefer is the reverse.  
BTW, where are those operators documented?  Neither Google nor Yahoo nor 
postgresql.org search return anything.

Joe