Thread: BUG #15651: Collation setting en_US.utf8 breaking sort order
The following bug has been logged on the website: Bug reference: 15651 Logged by: Kaleb Akalework Email address: kaleb.akalework@asg.com PostgreSQL version: 11.0 Operating system: Linux Description: I have PostgresSQL database on Windows. I created database with Collation of en_US.utf8. Then I created table (The steps to reproduce are below). I inserted a few rows into this table one of which was row with special characters "~!@#$^&(". The insert worked fine but then when I do a select on the column for values >=' ' (Space), I get back all the rows except for the row that contains "~!@#$^&(" . In the UTF8 table https://www.utf8-chartable.de/unicode-utf8-table.pl, I can see that space is the lowest printable character so technically every printable character showed be greater than space but it isn't in this case. I create another database and set collation to 'C'. then the same select query returns "~!@#$^&(". The problem is only apparent in a Linux environment where postgreSQL is running and the database has a collation of en_US.utf8. Can someone help please? create table test ( name_c varchar(14) ) insert into test (name_c) values ('AAA') insert into test (name_c) values ('BAA') insert into test (name_c) values ('CAA') insert into test (name_c) values ('DAA') insert into test (name_c) values ('~!@#$^&(') select * from test where name_c >= ' '
PG Bug reporting form <noreply@postgresql.org> writes: > I have PostgresSQL database on Windows. I created database with Collation of > en_US.utf8. Really? AFAIK, Windows doesn't support collation names that look like that. > Then I created table (The steps to reproduce are below). I > inserted a few rows into this table one of which was row with special > characters "~!@#$^&(". The insert worked fine but then when I do a select on > the column for values >=' ' (Space), I get back all the rows except for the > row that contains > "~!@#$^&(" . This appears to be the intended behavior of en_US sorting. On a Linux machine I can reproduce it outside Postgres: $ LANG=C sort stuff.txt AAA BAA CAA DAA ~!@#$^&( $ LANG=en_US sort stuff.txt ~!@#$^&( AAA BAA CAA DAA (The first line in my test file contains one space.) regards, tom lane
--> Really? AFAIK, Windows doesn't support collation names that look like that. I meant to say Linux. Not Windows. -----Original Message----- From: Tom Lane <tgl@sss.pgh.pa.us> Sent: Friday, February 22, 2019 1:03 PM To: Kaleb Akalework <kaleb.akalework@asg.com> Cc: pgsql-bugs@lists.postgresql.org Subject: Re: BUG #15651: Collation setting en_US.utf8 breaking sort order *** External email: Verify sender before opening attachments or links *** PG Bug reporting form <noreply@postgresql.org> writes: > I have PostgresSQL database on Windows. I created database with > Collation of en_US.utf8. Really? AFAIK, Windows doesn't support collation names that look like that. > Then I created table (The steps to reproduce are below). I inserted a > few rows into this table one of which was row with special characters > "~!@#$^&(". The insert worked fine but then when I do a select on the > column for values >=' ' (Space), I get back all the rows except for > the row that contains "~!@#$^&(" . This appears to be the intended behavior of en_US sorting. On a Linux machine I can reproduce it outside Postgres: $ LANG=C sort stuff.txt AAA BAA CAA DAA ~!@#$^&( $ LANG=en_US sort stuff.txt ~!@#$^&( AAA BAA CAA DAA (The first line in my test file contains one space.) regards, tom lane
On Fri, Feb 22, 2019 at 10:03 AM Tom Lane <tgl@sss.pgh.pa.us> wrote: > This appears to be the intended behavior of en_US sorting. Right. UCA style algorithms tend to give the least weight of all to whitespace characters. They're tertiary weight, if memory serves. -- Peter Geoghegan
Ok so if this is intended behavior of UTF8 then I understand. My last question then would be if I use a collation settingof C, does it mean I won't be able to support multiple languages? -----Original Message----- From: Peter Geoghegan <pg@bowt.ie> Sent: Friday, February 22, 2019 1:41 PM To: Tom Lane <tgl@sss.pgh.pa.us> Cc: Kaleb Akalework <kaleb.akalework@asg.com>; PostgreSQL mailing lists <pgsql-bugs@lists.postgresql.org> Subject: Re: BUG #15651: Collation setting en_US.utf8 breaking sort order *** External email: Verify sender before opening attachments or links *** On Fri, Feb 22, 2019 at 10:03 AM Tom Lane <tgl@sss.pgh.pa.us> wrote: > This appears to be the intended behavior of en_US sorting. Right. UCA style algorithms tend to give the least weight of all to whitespace characters. They're tertiary weight, if memoryserves. -- Peter Geoghegan
Kaleb Akalework <kaleb.akalework@asg.com> writes: > Ok so if this is intended behavior of UTF8 then I understand. My last question then would be if I use a collation settingof C, does it mean I won't be able to support multiple languages? Collation only determines sort order, I believe. regards, tom lane
On 2019-02-22 19:03, Tom Lane wrote: > $ LANG=en_US sort stuff.txt > ~!@#$^&( > > AAA > BAA > CAA > DAA With ICU (COLLATE "und-x-icu"), I get the line with the space first. I took a bit of a look around the various Unicode documents and I don't find anything that would defend the glibc behavior. <obscure detail> However, since some of those special characters are variable collating elements and some are not, there might well be an explanation. </obscure detail> So, maybe try ICU. -- Peter Eisentraut http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Peter Eisentraut wrote: > With ICU (COLLATE "und-x-icu"), I get the line with the space first. I > took a bit of a look around the various Unicode documents and I don't > find anything that would defend the glibc behavior. The glibc result is now version-dependent. With glibc-2.28, on Debian buster: buster$ LC_COLLATE=en_US.utf8 sort stuff.txt ~!@#$^&( AAA BAA CAA buster$ apt-cache show libc-bin | grep Version Version: 2.28-7 As opposed to the current Debian stable, with glibc-2.24: stretch$ LC_COLLATE=en_US.utf8 sort stuff.txt ~!@#$^&( AAA BAA CAA DAA stretch$ apt-cache show libc-bin|grep Version Version: 2.24-11+deb9u3 Best regards, -- Daniel Vérité PostgreSQL-powered mailer: http://www.manitou-mail.org Twitter: @DanielVerite
Kaleb Akalework wrote: > Ok so if this is intended behavior of UTF8 then I understand. My last > question then would be if I use a collation setting of C, does it mean I > won't be able to support multiple languages? You seem to want to the sort order of C, but be aware that you might have to decide whether you want this: => select upper('é' collate "C"); upper ------- é (1 row) or that: => select upper('é' collate "en_US"); upper ------- É (1 row) To get the sort order of C but the interpretation of characters closer to what you'd expect from Unicode, it's possible for the database to have LC_COLLATE to "C", and LC_CTYPE to, say, en_US.UTF-8. See CREATE DATABASE. Best regards, -- Daniel Vérité PostgreSQL-powered mailer: http://www.manitou-mail.org Twitter: @DanielVerite