Unicode FFFF Special Codepoint should always collate high. - Mailing list pgsql-bugs

From Telford Tendys
Subject Unicode FFFF Special Codepoint should always collate high.
Date
Msg-id 20210622083918.GA12063@mail
Whole thread Raw
Responses Re: Unicode FFFF Special Codepoint should always collate high.  (Thomas Munro <thomas.munro@gmail.com>)
List pgsql-bugs
Here is an example.

    SELECT ENCODE(x::bytea, 'hex')
    FROM (
        SELECT UNNEST(ARRAY['x', 'x/', U&'x\+02FFFF', U&'x\FFFF',
                             U&'x\+01FFFF', 'x '])::text AS x
    ) z
    ORDER BY x COLLATE "en_US.utf8";


Which gives the following:

   encode
------------
 78
 78efbfbf
 78f09fbfbf
 78f0afbfbf
 7820
 782f
(6 rows)


The real character codepoints (e.g. 0x20 space, or 0x2f slash) are sorting
after the non-character codepoint 0xffff, which is supposed to always have
the highest possible primary weight in all locales, and it is the only
codepoint available to serve this purpose. The other 4-byte non-character
codepoints also incorrectly sort lower than real characters.

Tested on Centos 8 with the following package.

    postgresql12-server-12.7-2PGDG.rhel8.x86_64


The "locale -a" command shows that "en_US.utf8" is in the system,
although, for this purpose, all UTF-8 locales should do the same thing.

Database default locale is also en_US.utf8 and system /etc/locale.conf has
the same LANG="en_US.UTF-8".

Testing the same thing with a simple file and using the "sort" command
from command line prompt gives similar results.


$ od -tx1c /tmp/test.utf8
0000000  78  0a  78  f0  af  bf  bf  0a  78  f0  9f  bf  bf  0a  78  2f
          x  \n   x 360 257 277 277  \n   x 360 237 277 277  \n   x   /
0000020  0a  78  ef  bf  bf  0a  78  20  0a
         \n   x 357 277 277  \n   x      \n
0000031


$ LC_ALL=C sort /tmp/test.utf8 | od -tx1c
0000000  78  0a  78  20  0a  78  2f  0a  78  ef  bf  bf  0a  78  f0  9f
          x  \n   x      \n   x   /  \n   x 357 277 277  \n   x 360 237
0000020  bf  bf  0a  78  f0  af  bf  bf  0a
        277 277  \n   x 360 257 277 277  \n
0000031


$ LC_ALL=en_US.UTF-8 sort /tmp/test.utf8 | od -tx1c
0000000  78  0a  78  ef  bf  bf  0a  78  f0  af  bf  bf  0a  78  f0  9f
          x  \n   x 357 277 277  \n   x 360 257 277 277  \n   x 360 237
0000020  bf  bf  0a  78  20  0a  78  2f  0a
        277 277  \n   x      \n   x   /  \n
0000031


You can see that it once again sorts the non-character codepoint lower
than real characters. For reference, these are the Centos 8 glibc packages.

    glibc-2.28-151.el8.x86_64
    glibc-langpack-en-2.28-151.el8.x86_64


Tested the same thing on sqlfiddle.

    http://sqlfiddle.com/#!15/1841e2/2/0


That is using PostgreSQL 9.3 and the system underneath is not made visible.
It gives the sort order that I would expect, special codepoints are sorting
on the high side of real characters.

I then went back to an old Centos 7 machine and tested the same SQL query.
The result is like this:

   encode
------------
 78
 7820
 782f
 78efbfbf
 78f09fbfbf
 78f0afbfbf
(6 rows)

That's what I expect the correct answer to be, also a correct match to the
sqlfiddle test. For reference these are the centos packages where I got it
working properly.

    postgresql12-server-12.7-1PGDG.rhel7.x86_64

    glibc-common-2.17-260.el7_6.6.x86_64
    glibc-2.17-260.el7_6.6.x86_64

Same postgres version, but running on an older system. Perhaps this
would implicate a bug introduced in the system itself during the
transition from Centos 7 to Centos 8. The old glibc did not have separate
langpacks and all locales were installed by default.

I'm willing to take this up with Redhat if that's useful.



pgsql-bugs by date:

Previous
From: Neil Chen
Date:
Subject: Re: BUG #17066: Cache lookup failed when null (iso-8859-1) is passed as anycompatiblemultirange
Next
From: Heikki Linnakangas
Date:
Subject: Re: BUG #17064: Parallel VACUUM operations cause the error "global/pg_filenode.map contains incorrect checksum"