Thread: Very puzzling sort behavior

Very puzzling sort behavior

From
Ken Tanzer
Date:
Hi.  In a table that includes these columns:

my_db=> \d tbl_client
...
 name_last                   | character varying(40)          | not null
 name_first                  | character varying(30)          | not null
...

I am extremely puzzled by the sorting of the "CLARKE"s in this list:

my_db=> SELECT name_last,length(name_last),name_last='CLARK' AS clark,name_last='CLARKE' AS clarke FROM tbl_client WHERE name_last ILIKE 'CLARK%' OR name_last ILIKE 'ADAMS%' ORDER BY name_last || ', ' || name_first;
 name_last | length | clark | clarke 
-----------+--------+-------+--------
 ADAMS     |      5 | f     | f
 ADAMS     |      5 | f     | f
 ADAMS     |      5 | f     | f
 ADAMS     |      5 | f     | f
 ADAMS     |      5 | f     | f
 ADAMS     |      5 | f     | f
 ADAMSON   |      7 | f     | f
 CLARK     |      5 | t     | f
 CLARK     |      5 | t     | f
 CLARKE    |      6 | f     | t
 CLARKE    |      6 | f     | t
 CLARKE    |      6 | f     | t
 CLARK     |      5 | t     | f
 CLARK     |      5 | t     | f
 CLARK     |      5 | t     | f
 CLARK     |      5 | t     | f
 CLARK     |      5 | t     | f
(17 rows)

The ADAMS are included just to show a similar example is ordering correctly.  I put the length and equality test columns in to try to make sure there weren't some bizarre characters in the data.  This is only happening on one particular database.  I did a reindex on the table just for good measure.  If I remove the name_first piece of the ORDER BY (which doesn't seem like it should matter), it sorts as expected:

my_db=> SELECT name_last,length(name_last),name_last='CLARK' AS clark,name_last='CLARKE' AS clarke FROM tbl_client WHERE name_last ILIKE 'CLARK%' OR name_last ILIKE 'ADAMS%' ORDER BY name_last || ', '               
;
 name_last | length | clark | clarke 
-----------+--------+-------+--------
 ADAMS     |      5 | f     | f
 ADAMS     |      5 | f     | f
 ADAMS     |      5 | f     | f
 ADAMS     |      5 | f     | f
 ADAMS     |      5 | f     | f
 ADAMS     |      5 | f     | f
 ADAMSON   |      7 | f     | f
 CLARK     |      5 | t     | f
 CLARK     |      5 | t     | f
 CLARK     |      5 | t     | f
 CLARK     |      5 | t     | f
 CLARK     |      5 | t     | f
 CLARK     |      5 | t     | f
 CLARK     |      5 | t     | f
 CLARKE    |      6 | f     | t
 CLARKE    |      6 | f     | t
 CLARKE    |      6 | f     | t
(17 rows)

I tried selecting those 17 rows from tbl_client into a new table, and get the same odd behavior.  However, if I run with new data I get an expected order:

CREATE TEMP TABLE test (
  name_first VARCHAR(40),
  name_last VARCHAR(30)

);

INSERT INTO test VALUES ('JOE','CLARKE'),('BILL','CLARK');

SELECT * FROM test ORDER BY name_last;
SELECT * FROM test ORDER BY name_last || ', ' || name_first;

Any thoughts about what's going on, what to do about it, or what obvious point I missing?  Thanks in advance!

my_db=> SELECT version();
                                                    version                                                     
----------------------------------------------------------------------------------------------------------------
 PostgreSQL 9.2.13 on x86_64-unknown-linux-gnu, compiled by gcc (GCC) 4.4.7 20120313 (Red Hat 4.4.7-11), 64-bit
(1 row)

Ken



--
AGENCY Software  
A Free Software data system
By and for non-profits
(253) 245-3801

learn more about AGENCY or
follow the discussion.

Re: Very puzzling sort behavior

From
Peter Geoghegan
Date:
On Thu, Sep 10, 2015 at 12:35 PM, Ken Tanzer <ken.tanzer@gmail.com> wrote:
> Any thoughts about what's going on, what to do about it, or what obvious point I missing?  Thanks in advance!

This is the expected behavior. Locale rules will weigh the punctuation
character and space you added after primary alphabetical ordering.
Therefore, the 'E' in 'CLARKE' is more or less compared with the first
character of the first name in the case that you show, because those
code points are both what Unicode calls "primary weights".

You're not displaying the string you're ordering by anyway, so you can
just not concatenate the ', ' string (so name_last || name_first), and
it will work as you expect, I believe.

--
Regards,
Peter Geoghegan


Re: Very puzzling sort behavior

From
Steve Crawford
Date:
Any null values in first name??

-Steve

On Thu, Sep 10, 2015 at 12:35 PM, Ken Tanzer <ken.tanzer@gmail.com> wrote:
Hi.  In a table that includes these columns:

my_db=> \d tbl_client
...
 name_last                   | character varying(40)          | not null
 name_first                  | character varying(30)          | not null
...

I am extremely puzzled by the sorting of the "CLARKE"s in this list:

my_db=> SELECT name_last,length(name_last),name_last='CLARK' AS clark,name_last='CLARKE' AS clarke FROM tbl_client WHERE name_last ILIKE 'CLARK%' OR name_last ILIKE 'ADAMS%' ORDER BY name_last || ', ' || name_first;
 name_last | length | clark | clarke 
-----------+--------+-------+--------
 ADAMS     |      5 | f     | f
 ADAMS     |      5 | f     | f
 ADAMS     |      5 | f     | f
 ADAMS     |      5 | f     | f
 ADAMS     |      5 | f     | f
 ADAMS     |      5 | f     | f
 ADAMSON   |      7 | f     | f
 CLARK     |      5 | t     | f
 CLARK     |      5 | t     | f
 CLARKE    |      6 | f     | t
 CLARKE    |      6 | f     | t
 CLARKE    |      6 | f     | t
 CLARK     |      5 | t     | f
 CLARK     |      5 | t     | f
 CLARK     |      5 | t     | f
 CLARK     |      5 | t     | f
 CLARK     |      5 | t     | f
(17 rows)

The ADAMS are included just to show a similar example is ordering correctly.  I put the length and equality test columns in to try to make sure there weren't some bizarre characters in the data.  This is only happening on one particular database.  I did a reindex on the table just for good measure.  If I remove the name_first piece of the ORDER BY (which doesn't seem like it should matter), it sorts as expected:

my_db=> SELECT name_last,length(name_last),name_last='CLARK' AS clark,name_last='CLARKE' AS clarke FROM tbl_client WHERE name_last ILIKE 'CLARK%' OR name_last ILIKE 'ADAMS%' ORDER BY name_last || ', '               
;
 name_last | length | clark | clarke 
-----------+--------+-------+--------
 ADAMS     |      5 | f     | f
 ADAMS     |      5 | f     | f
 ADAMS     |      5 | f     | f
 ADAMS     |      5 | f     | f
 ADAMS     |      5 | f     | f
 ADAMS     |      5 | f     | f
 ADAMSON   |      7 | f     | f
 CLARK     |      5 | t     | f
 CLARK     |      5 | t     | f
 CLARK     |      5 | t     | f
 CLARK     |      5 | t     | f
 CLARK     |      5 | t     | f
 CLARK     |      5 | t     | f
 CLARK     |      5 | t     | f
 CLARKE    |      6 | f     | t
 CLARKE    |      6 | f     | t
 CLARKE    |      6 | f     | t
(17 rows)

I tried selecting those 17 rows from tbl_client into a new table, and get the same odd behavior.  However, if I run with new data I get an expected order:

CREATE TEMP TABLE test (
  name_first VARCHAR(40),
  name_last VARCHAR(30)

);

INSERT INTO test VALUES ('JOE','CLARKE'),('BILL','CLARK');

SELECT * FROM test ORDER BY name_last;
SELECT * FROM test ORDER BY name_last || ', ' || name_first;

Any thoughts about what's going on, what to do about it, or what obvious point I missing?  Thanks in advance!

my_db=> SELECT version();
                                                    version                                                     
----------------------------------------------------------------------------------------------------------------
 PostgreSQL 9.2.13 on x86_64-unknown-linux-gnu, compiled by gcc (GCC) 4.4.7 20120313 (Red Hat 4.4.7-11), 64-bit
(1 row)

Ken



--
AGENCY Software  
A Free Software data system
By and for non-profits
(253) 245-3801

learn more about AGENCY or
follow the discussion.

Re: Very puzzling sort behavior

From
Tom Lane
Date:
Peter Geoghegan <peter.geoghegan86@gmail.com> writes:
> On Thu, Sep 10, 2015 at 12:35 PM, Ken Tanzer <ken.tanzer@gmail.com> wrote:
>> Any thoughts about what's going on, what to do about it, or what obvious point I missing?  Thanks in advance!

> This is the expected behavior.

If you don't like it, sort in C locale ...

            regards, tom lane


Re: Very puzzling sort behavior

From
Ken Tanzer
Date:
Alright never mind, I guess I see what's going on.  Thanks!

Ken

On Thu, Sep 10, 2015 at 12:51 PM, Ken Tanzer <ken.tanzer@gmail.com> wrote:


On Thu, Sep 10, 2015 at 12:47 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Peter Geoghegan <peter.geoghegan86@gmail.com> writes:
> On Thu, Sep 10, 2015 at 12:35 PM, Ken Tanzer <ken.tanzer@gmail.com> wrote:
>> Any thoughts about what's going on, what to do about it, or what obvious point I missing?  Thanks in advance!

> This is the expected behavior.

If you don't like it, sort in C locale ...

                        regards, tom lane

OK, can one of you help me out in understanding this?  I would have thought that given "CLARK," and "CLARKE" that the comma would get compared against the E and come first.  End of story, before we even get to anything farther in the string.  What am I missing?

--
learn more about AGENCY or
follow the discussion.



--
AGENCY Software  
A Free Software data system
By and for non-profits
(253) 245-3801

learn more about AGENCY or
follow the discussion.

Re: Very puzzling sort behavior

From
Peter Geoghegan
Date:
On Thu, Sep 10, 2015 at 12:51 PM, Ken Tanzer <ken.tanzer@gmail.com> wrote:
> OK, can one of you help me out in understanding this?  I would have thought that given "CLARK," and "CLARKE" that the
commawould get compared against the E and come first.  End of story, before we even get to anything farther in the
string. What am I missing? 

That's only how it works with the C locale. Otherwise, there are
complicated rules to weigh things like space and punctuation (and
accents/diacritics) less prominently than primary alphabetical
ordering. This is often useful. Anyway, based on what you say here, I
think you should actually "ORDER BY name_last, name_first".

--
Regards,
Peter Geoghegan


Re: Very puzzling sort behavior

From
Ken Tanzer
Date:


On Thu, Sep 10, 2015 at 12:47 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Peter Geoghegan <peter.geoghegan86@gmail.com> writes:
> On Thu, Sep 10, 2015 at 12:35 PM, Ken Tanzer <ken.tanzer@gmail.com> wrote:
>> Any thoughts about what's going on, what to do about it, or what obvious point I missing?  Thanks in advance!

> This is the expected behavior.

If you don't like it, sort in C locale ...

                        regards, tom lane

OK, can one of you help me out in understanding this?  I would have thought that given "CLARK," and "CLARKE" that the comma would get compared against the E and come first.  End of story, before we even get to anything farther in the string.  What am I missing?

--
AGENCY Software  
A Free Software data system
By and for non-profits
(253) 245-3801

learn more about AGENCY or
follow the discussion.

Re: Very puzzling sort behavior

From
Ken Tanzer
Date:
On Thu, Sep 10, 2015 at 12:56 PM, Peter Geoghegan <peter.geoghegan86@gmail.com> wrote:
On Thu, Sep 10, 2015 at 12:51 PM, Ken Tanzer <ken.tanzer@gmail.com> wrote:
> OK, can one of you help me out in understanding this?  I would have thought that given "CLARK," and "CLARKE" that the comma would get compared against the E and come first.  End of story, before we even get to anything farther in the string.  What am I missing?

That's only how it works with the C locale. Otherwise, there are
complicated rules to weigh things like space and punctuation (and
accents/diacritics) less prominently than primary alphabetical
ordering. This is often useful. Anyway, based on what you say here, I
think you should actually "ORDER BY name_last, name_first".

--
Regards,
Peter Geoghegan


Thanks.  A little more help would be appreciated.  First a little context:

What I mailed out what a boiled down example.  In reality, what I have is a ton of tables with a client_id in them, and a convenience function client_name(client_id) that returns the name_last, name_first string (plus an alias if it exists).  client_name is used all over the place in both views and in an app that uses the database.  There is a similar, also heavily used, staff_name function.  Eliminating the use of these functions is a non-starter for me--I'd much rather live with the existing sort behavior, especially as no one has ever pointed this out despite over a decade of use.

I'm hoping to sort change this behavior with as minimal a change as possible (e.g., minimal potential for unexpected side effects or breakage).  I was hoping to just add a COLLATE "C" within the function:

CREATE OR REPLACE FUNCTION client_name( cid int4 ) RETURNS text AS $$
     -- client is a view that draws from tbl_client.  name_full is the field with the name_last, name_first data in it
     SELECT name_full COLLATE "C" FROM client WHERE client_id=$1;
$$

but that seems to have no effect.  And sure enough the documentation seems to back that up. ("The collation assigned to a function or operator's combined input expressions is also considered to apply to the function or operator's result, if the function or operator delivers a result of a collatable data type.")  So this may be wishful thinking, but is there any other way to specify the collation of a function result?  Specifying the collation every time the function is used is likely a no-go for me too.

Alternatively, it seems I could create new databases with a C collation and then move my data into them.  This seems a bit drastic, although possible.  I'd again be worried about the breakage/side effects.  And actually, will this work?  (i.e., can you use pg_dump to populate a new database with a different locale?)

Are there any other potential solutions, pitfalls or considerations that come to mind?  Any thoughts welcome.  And as I said, if there's not a good way to do this I'll probably leave it alone.  Thanks.

Ken



--
AGENCY Software  
A Free Software data system
By and for non-profits
(253) 245-3801

learn more about AGENCY or
follow the discussion.

Re: Very puzzling sort behavior

From
Alvaro Herrera
Date:
Ken Tanzer wrote:

> Are there any other potential solutions, pitfalls or considerations that
> come to mind?  Any thoughts welcome.  And as I said, if there's not a good
> way to do this I'll probably leave it alone.

In part, it boils down to what you use the in ORDER BY clause.  If you
concatenate the last name and first name, they will be considered as a
single string and run afoul of funny behavior of dictionary sorting,
which ignores non-alphanumeric chars in the first pass.  But if you keep
them separate by using "ORDER BY last_name, first_name" then sorting
will consider the last name separately from first name, and you'd get
the results you want (I think).

--
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: Very puzzling sort behavior

From
Ken Tanzer
Date:
On Thu, Sep 10, 2015 at 2:02 PM, Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
Ken Tanzer wrote:

> Are there any other potential solutions, pitfalls or considerations that
> come to mind?  Any thoughts welcome.  And as I said, if there's not a good
> way to do this I'll probably leave it alone.

In part, it boils down to what you use the in ORDER BY clause.  If you
concatenate the last name and first name, they will be considered as a
single string and run afoul of funny behavior of dictionary sorting,
which ignores non-alphanumeric chars in the first pass.  But if you keep
them separate by using "ORDER BY last_name, first_name" then sorting
will consider the last name separately from first name, and you'd get
the results you want (I think).

Thanks, but I guess I should have been clearer.  Thanks to y'all wonderful mailing list folks, I get it now as to why the two sorts are not the same.  I'm hoping for practical suggestions or advice about how to get C locale sorting without having to rewrite all my existing queries.  I still might be clinging to futile hope, but is there really no way to specify a collation for the return value of a function?  And are there any pitfalls associated with pg_dump-ing into a new database with a different collation?

Cheers,
Ken


--
AGENCY Software  
A Free Software data system
By and for non-profits
(253) 245-3801

learn more about AGENCY or
follow the discussion.

Re: Very puzzling sort behavior

From
Andrew Sullivan
Date:
On Thu, Sep 10, 2015 at 02:54:31PM -0700, Ken Tanzer wrote:
> Thanks, but I guess I should have been clearer.  Thanks to y'all wonderful
> mailing list folks, I get it now as to why the two sorts are not the same.
> I'm hoping for practical suggestions or advice about how to get C locale
> sorting without having to rewrite all my existing queries.

Why not change the collation for the column?
http://www.postgresql.org/docs/9.4/interactive/sql-altertable.html

> be clinging to futile hope, but is there really no way to specify a
> collation for the return value of a function?

I don't believe so.  I think you need to specify the collation for the
data itself.

It strikes me that you might be able to similate this with a
materialized view or something like that, which has a different
collation than the source table.  That seems like it'd be pretty
awkward, but if there's some reason you can't use C collation on the
source table itself that might work.  I guess you could do the same
thing with a temporary table inside the function, thereby getting a
different collation on the data than in the source table.  These both
seem like hacks, but if you need a bugfix prior to a real solution
they'd give you a path.

A

--
Andrew Sullivan
ajs@crankycanuck.ca


Re: Very puzzling sort behavior

From
Joe Conway
Date:
On 09/10/2015 03:03 PM, Andrew Sullivan wrote:
> On Thu, Sep 10, 2015 at 02:54:31PM -0700, Ken Tanzer wrote:
>> Thanks, but I guess I should have been clearer.  Thanks to y'all wonderful
>> mailing list folks, I get it now as to why the two sorts are not the same.
>> I'm hoping for practical suggestions or advice about how to get C locale
>> sorting without having to rewrite all my existing queries.
>
> Why not change the collation for the column?
> http://www.postgresql.org/docs/9.4/interactive/sql-altertable.html
>
>> be clinging to futile hope, but is there really no way to specify a
>> collation for the return value of a function?
>
> I don't believe so.  I think you need to specify the collation for the
> data itself.

I have not read the entire thread, and not sure I am following the issue
correctly, but isn't this what you want?

8<---------------
create table t2(a text);
insert into t2 values('abc,def'),('abcd');

show lc_collate;
 lc_collate
-------------
 en_US.UTF-8
(1 row)

select a from t2 order by a;
    a
---------
 abcd
 abc,def
(2 rows)

select a from t2 order by a COLLATE "C";
    a
---------
 abc,def
 abcd
(2 rows)
8<---------------

HTH,

Joe

--
Crunchy Data - http://crunchydata.com
PostgreSQL Support for Secure Enterprises
Consulting, Training, & Open Source Development


Attachment

Re: Very puzzling sort behavior

From
Ken Tanzer
Date:
On Thu, Sep 10, 2015 at 3:03 PM, Andrew Sullivan <ajs@crankycanuck.ca> wrote:
On Thu, Sep 10, 2015 at 02:54:31PM -0700, Ken Tanzer wrote:
> Thanks, but I guess I should have been clearer.  Thanks to y'all wonderful
> mailing list folks, I get it now as to why the two sorts are not the same.
> I'm hoping for practical suggestions or advice about how to get C locale
> sorting without having to rewrite all my existing queries.

Why not change the collation for the column?
http://www.postgresql.org/docs/9.4/interactive/sql-altertable.html


Hmm... that sounds promising.  And less drastic than rewriting the whole database.  I'll have to ponder that one, and thanks for the suggestion!

Cheers,
Ken