Thread: BUG #15651: Collation setting en_US.utf8 breaking sort order

BUG #15651: Collation setting en_US.utf8 breaking sort order

From

PG Bug reporting form

Date:

22 February 2019, 16:34:43

The following bug has been logged on the website:

Bug reference:      15651
Logged by:          Kaleb Akalework
Email address:      kaleb.akalework@asg.com
PostgreSQL version: 11.0
Operating system:   Linux
Description:

I have PostgresSQL database on Windows. I created database with Collation of
en_US.utf8. Then I created table (The steps to reproduce are below). I
inserted a few rows into this table one of which was row with special
characters "~!@#$^&(". The insert worked fine but then when I do a select on
the column for values >=' ' (Space), I get back all the rows except for the
row that contains 
"~!@#$^&(" . In the UTF8 table
https://www.utf8-chartable.de/unicode-utf8-table.pl, I can see that space is
the lowest printable character so technically every printable character
showed be greater than space but it isn't in this case. I create another
database and set collation to 'C'. then the same select query returns
"~!@#$^&(". The problem is only apparent in a Linux environment where
postgreSQL  is running and the database has a collation of en_US.utf8. Can
someone help please?

create table test (
    name_c varchar(14)
)

insert into test (name_c) values ('AAA')
insert into test (name_c) values ('BAA')
insert into test (name_c) values ('CAA')
insert into test (name_c) values ('DAA')
insert into test (name_c) values ('~!@#$^&(')

select * from test where name_c >= ' '

Re: BUG #15651: Collation setting en_US.utf8 breaking sort order

From

Tom Lane

Date:

22 February 2019, 18:03:26

PG Bug reporting form <noreply@postgresql.org> writes:
> I have PostgresSQL database on Windows. I created database with Collation of
> en_US.utf8.

Really?  AFAIK, Windows doesn't support collation names that look like
that.

> Then I created table (The steps to reproduce are below). I
> inserted a few rows into this table one of which was row with special
> characters "~!@#$^&(". The insert worked fine but then when I do a select on
> the column for values >=' ' (Space), I get back all the rows except for the
> row that contains 
> "~!@#$^&(" .

This appears to be the intended behavior of en_US sorting.
On a Linux machine I can reproduce it outside Postgres:

$ LANG=C sort stuff.txt 
 
AAA
BAA
CAA
DAA
~!@#$^&(
$ LANG=en_US sort stuff.txt 
~!@#$^&(
 
AAA
BAA
CAA
DAA

(The first line in my test file contains one space.)

            regards, tom lane

RE: BUG #15651: Collation setting en_US.utf8 breaking sort order

From

Kaleb Akalework

Date:

22 February 2019, 18:18:16

--> Really?  AFAIK, Windows doesn't support collation names that look like that.
I meant to say Linux. Not Windows.


-----Original Message-----
From: Tom Lane <tgl@sss.pgh.pa.us>
Sent: Friday, February 22, 2019 1:03 PM
To: Kaleb Akalework <kaleb.akalework@asg.com>
Cc: pgsql-bugs@lists.postgresql.org
Subject: Re: BUG #15651: Collation setting en_US.utf8 breaking sort order

*** External email: Verify sender before opening attachments or links ***


PG Bug reporting form <noreply@postgresql.org> writes:
> I have PostgresSQL database on Windows. I created database with
> Collation of en_US.utf8.

Really?  AFAIK, Windows doesn't support collation names that look like that.

> Then I created table (The steps to reproduce are below). I inserted a
> few rows into this table one of which was row with special characters
> "~!@#$^&(". The insert worked fine but then when I do a select on the
> column for values >=' ' (Space), I get back all the rows except for
> the row that contains "~!@#$^&(" .

This appears to be the intended behavior of en_US sorting.
On a Linux machine I can reproduce it outside Postgres:

$ LANG=C sort stuff.txt

AAA
BAA
CAA
DAA
~!@#$^&(
$ LANG=en_US sort stuff.txt
~!@#$^&(

AAA
BAA
CAA
DAA

(The first line in my test file contains one space.)

                        regards, tom lane

Re: BUG #15651: Collation setting en_US.utf8 breaking sort order

From

Peter Geoghegan

Date:

22 February 2019, 18:40:58

On Fri, Feb 22, 2019 at 10:03 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> This appears to be the intended behavior of en_US sorting.

Right. UCA style algorithms tend to give the least weight of all to
whitespace characters. They're tertiary weight, if memory serves.

-- 
Peter Geoghegan

RE: BUG #15651: Collation setting en_US.utf8 breaking sort order

From

Kaleb Akalework

Date:

22 February 2019, 19:17:06

Ok so if this is intended behavior of UTF8 then I understand. My last question then would be if I use a collation
settingof C, does it mean I won't be able to support multiple languages?

-----Original Message-----
From: Peter Geoghegan <pg@bowt.ie> 
Sent: Friday, February 22, 2019 1:41 PM
To: Tom Lane <tgl@sss.pgh.pa.us>
Cc: Kaleb Akalework <kaleb.akalework@asg.com>; PostgreSQL mailing lists <pgsql-bugs@lists.postgresql.org>
Subject: Re: BUG #15651: Collation setting en_US.utf8 breaking sort order

*** External email: Verify sender before opening attachments or links ***

On Fri, Feb 22, 2019 at 10:03 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> This appears to be the intended behavior of en_US sorting.

Right. UCA style algorithms tend to give the least weight of all to whitespace characters. They're tertiary weight, if
memoryserves.

--
Peter Geoghegan

Re: BUG #15651: Collation setting en_US.utf8 breaking sort order

From

Tom Lane

Date:

22 February 2019, 19:53:58

Kaleb Akalework <kaleb.akalework@asg.com> writes:
> Ok so if this is intended behavior of UTF8 then I understand. My last question then would be if I use a collation
settingof C, does it mean I won't be able to support multiple languages? 

Collation only determines sort order, I believe.

            regards, tom lane

Re: BUG #15651: Collation setting en_US.utf8 breaking sort order

From

Peter Eisentraut

Date:

23 February 2019, 08:36:39

On 2019-02-22 19:03, Tom Lane wrote:
> $ LANG=en_US sort stuff.txt 
> ~!@#$^&(
>  
> AAA
> BAA
> CAA
> DAA

With ICU (COLLATE "und-x-icu"), I get the line with the space first.  I
took a bit of a look around the various Unicode documents and I don't
find anything that would defend the glibc behavior.

<obscure detail>
However, since some of those special characters are variable collating
elements and some are not, there might well be an explanation.
</obscure detail>

So, maybe try ICU.

-- 
Peter Eisentraut              http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: BUG #15651: Collation setting en_US.utf8 breaking sort order

From

"Daniel Verite"

Date:

23 February 2019, 17:30:01

    Peter Eisentraut wrote:

> With ICU (COLLATE "und-x-icu"), I get the line with the space first.  I
> took a bit of a look around the various Unicode documents and I don't
> find anything that would defend the glibc behavior.

The glibc result is now version-dependent. With glibc-2.28,
on Debian buster:

  buster$ LC_COLLATE=en_US.utf8 sort stuff.txt

  ~!@#$^&(
  AAA
  BAA
  CAA

  buster$ apt-cache show libc-bin | grep Version
  Version: 2.28-7

As opposed to the current Debian stable, with glibc-2.24:

  stretch$ LC_COLLATE=en_US.utf8 sort stuff.txt
  ~!@#$^&(

  AAA
  BAA
  CAA
  DAA

  stretch$ apt-cache show libc-bin|grep Version
  Version: 2.24-11+deb9u3


Best regards,
--
Daniel Vérité
PostgreSQL-powered mailer: http://www.manitou-mail.org
Twitter: @DanielVerite

RE: BUG #15651: Collation setting en_US.utf8 breaking sort order

From

"Daniel Verite"

Date:

23 February 2019, 17:48:55

    Kaleb Akalework wrote:

> Ok so if this is intended behavior of UTF8 then I understand. My last
> question then would be if I use a collation setting of C, does it mean I
> won't be able to support multiple languages?

You seem to want to the sort order of C, but be aware that you might
have to decide whether you want this:

=> select upper('é' collate "C");
 upper
-------
 é
(1 row)

or that:

=> select upper('é' collate "en_US");
 upper
-------
 É
(1 row)

To get the sort order of C but the interpretation of characters closer
to what you'd expect from Unicode, it's possible for the database
to have LC_COLLATE to "C", and LC_CTYPE to, say, en_US.UTF-8.
See CREATE DATABASE.

Best regards,
--
Daniel Vérité
PostgreSQL-powered mailer: http://www.manitou-mail.org
Twitter: @DanielVerite