Thread: ORDER BY collation order
Hi, I just found that two identical queries on two PG 8.2.7 databases with the same data and same encoding, one running on Debian and the other on FreeBSD, returned rows in a different order, even though both queries had an ORDER BY clause. Essentially, on FreeBSD a varchar starting with a double-quote character came first, ahead of a string starting with a number and a string starting with 'A', whereas on Debian the double-quote came last. Some research led to the following paragraph in the documentation: Character-string data is sorted according to the locale-specific collation order that was established when the database cluster was initialized. I guess that means the encoding of the respective template0 database is what determines the sort order (UTF8 on FreeBSD and LATIN1 on Debian), right? Unfortunately, I'm unable to change the FreeBSD cluster since it's shared with others at our hosting provider. Is there some way to override the cluster setting, or plans to allow for database-specific collation orders? Joe
On Thu, Sep 18, 2008 at 6:48 PM, Joe <dev@freedomcircle.net> wrote: > Hi, > > I just found that two identical queries on two PG 8.2.7 databases with the > same data and same encoding, one running on Debian and the other on FreeBSD, > returned rows in a different order, even though both queries had an ORDER BY > clause. Essentially, on FreeBSD a varchar starting with a double-quote > character came first, ahead of a string starting with a number and a string > starting with 'A', whereas on Debian the double-quote came last. > > Some research led to the following paragraph in the documentation: > > Character-string data is sorted according to the locale-specific collation > order that was established when the database cluster was initialized. > > I guess that means the encoding of the respective template0 database is what > determines the sort order (UTF8 on FreeBSD and LATIN1 on Debian), right? > Unfortunately, I'm unable to change the FreeBSD cluster since it's shared > with others at our hosting provider. Is there some way to override the > cluster setting, or plans to allow for database-specific collation orders? no, not encoding, locale, such as en_US or C determine sort order. You can use varchar_pattern_ops and ~*~ operator. Search for those in the docs.
Hi Scott, Scott Marlowe wrote: > no, not encoding, locale, such as en_US or C determine sort order. > OK, so I guess you're saying that whatever was in the LC_COLLATE environment variable at the time the template0 database was created determines the collation/sort order? Is that stored and visible somewhere? > You can use varchar_pattern_ops and ~*~ operator. > > Search for those in the docs. > What I found (http://www.postgresql.org/docs/8.2/static/indexes-opclass.html), talks about creating an index with varchar_pattern_ops but that presumably won't affect an ORDER BY result. I'm not quite sure where to find the "~*~" operator, although I did find similar ones in 9.7 Pattern Matching. In any case, I'm not sure how an operator helps in changing an ORDER BY result from "quoted" 123 Abc to 123 Abc "quoted" It's even trickier than this simple example, because on Debian which is using the en_US locale, the double quotes are disregarded for ordering purposes, e.g., Medical "Meet" Message Joe
Joe wrote: > > I guess that means the encoding of the respective template0 database is > what determines the sort order (UTF8 on FreeBSD and LATIN1 on Debian), > right? Unfortunately, I'm unable to change the FreeBSD cluster since > it's shared with others at our hosting provider. Is there some way to > override the cluster setting, or plans to allow for database-specific > collation orders? The problem I see here is not just locale settings as Scott Marlowe pointed out but also the fact that the server runs on FreeBSD with UTF8 encoding. PostgreSQL relies on system (in this case libc) functions to handle locale stuff and FreeBSD does not support UTF8. This led to FreeBSD specific patch (which is in ports) that uses ICU library to handle UTF8 encoding. Sadly, it is not available for PostgreSQL 8.2 (original author made patches only up to 8.1 and I ported it to 8.3 later). Also that patch has to be enabled (which is not default iirc) when installing PostgreSQL package. So, since you say your cluster is hosted by some hosting company then your only options are either to convince them to upgrade to 8.3 (provided the system locale is ok for your purposes) or wait until per database collation is added (it was one of GSoC 08 projects and might get into 8.4) and then convince them to upgrade. -- Regards Petr Jelinek (PJMODOS)
Joe wrote: > Thanks for the info. The hosting company already offers 8.3 > databases. It's just that other things (we use tsearch2) have taken > precedence over migrating to 8.3. How can I tell if the patch you > mention is installed? As admin (or somebody with shell access) by running "pg_config --configure" should show "--with-icu" as one of configure parameters. As user you could try SELECTing upper() or lower() on some text with more exotic chars (anything non-ASCII should do) in database which has UTF8 encoding. If it works then ICU support is there. -- Regards Petr Jelinek (PJMODOS)
On Thu, Sep 18, 2008 at 10:35 PM, Joe <dev@freedomcircle.net> wrote: > Hi Scott, > > Scott Marlowe wrote: >> >> no, not encoding, locale, such as en_US or C determine sort order. >> > > OK, so I guess you're saying that whatever was in the LC_COLLATE environment > variable at the time the template0 database was created determines the > collation/sort order? Is that stored and visible somewhere? It's set at the time of init.d and can't be changed without a dump/initdb/restore cycle. You can see by typing show lc_collate ;lc_collate ------------en_US in psql. >> You can use varchar_pattern_ops and ~*~ operator. >> Search for those in the docs. > > What I found > (http://www.postgresql.org/docs/8.2/static/indexes-opclass.html), talks > about creating an index with varchar_pattern_ops but that presumably won't > affect an ORDER BY result. I'm not quite sure where to find the "~*~" > operator, although I did find similar ones in 9.7 Pattern Matching. In any > case, I'm not sure how an operator helps in changing an ORDER BY result from > > "quoted" > 123 > Abc > > to > > 123 > Abc > "quoted" Sorry, I mentioned the wrong operator before, it's ~>~ and ~<~ (asc versus desc): smarlowe=# create table col_test (a text); CREATE TABLE smarlowe=# insert into col_test (a) values ('"quoted"'),('Abc'),('123'); INSERT 0 3 smarlowe=# select * from col_test order by a; a ----------123Abc"quoted" (3 rows) smarlowe=# select * from col_test order by a using ~<~; a ----------"quoted"123Abc (3 rows) smarlowe=# select * from col_test order by a using ~>~; a ----------Abc123"quoted" (3 rows)
Hi Scott, Scott Marlowe wrote: > Sorry, I mentioned the wrong operator before, it's ~>~ and ~<~ (asc > versus desc): > > smarlowe=# create table col_test (a text); > CREATE TABLE > smarlowe=# insert into col_test (a) values ('"quoted"'),('Abc'),('123'); > INSERT 0 3 > smarlowe=# select * from col_test order by a; > a > ---------- > 123 > Abc > "quoted" > (3 rows) > > smarlowe=# select * from col_test order by a using ~<~; > a > ---------- > "quoted" > 123 > Abc > (3 rows) > > smarlowe=# select * from col_test order by a using ~>~; > a > ---------- > Abc > 123 > "quoted" > (3 rows) Those operators give me "C"-style collation in the database that is using "en_US" collation, but what I would really prefer is the reverse. BTW, where are those operators documented? Neither Google nor Yahoo nor postgresql.org search return anything. Joe