Here is an example.
SELECT ENCODE(x::bytea, 'hex')
FROM (
SELECT UNNEST(ARRAY['x', 'x/', U&'x\+02FFFF', U&'x\FFFF',
U&'x\+01FFFF', 'x '])::text AS x
) z
ORDER BY x COLLATE "en_US.utf8";
Which gives the following:
encode
------------
78
78efbfbf
78f09fbfbf
78f0afbfbf
7820
782f
(6 rows)
The real character codepoints (e.g. 0x20 space, or 0x2f slash) are sorting
after the non-character codepoint 0xffff, which is supposed to always have
the highest possible primary weight in all locales, and it is the only
codepoint available to serve this purpose. The other 4-byte non-character
codepoints also incorrectly sort lower than real characters.
Tested on Centos 8 with the following package.
postgresql12-server-12.7-2PGDG.rhel8.x86_64
The "locale -a" command shows that "en_US.utf8" is in the system,
although, for this purpose, all UTF-8 locales should do the same thing.
Database default locale is also en_US.utf8 and system /etc/locale.conf has
the same LANG="en_US.UTF-8".
Testing the same thing with a simple file and using the "sort" command
from command line prompt gives similar results.
$ od -tx1c /tmp/test.utf8
0000000 78 0a 78 f0 af bf bf 0a 78 f0 9f bf bf 0a 78 2f
x \n x 360 257 277 277 \n x 360 237 277 277 \n x /
0000020 0a 78 ef bf bf 0a 78 20 0a
\n x 357 277 277 \n x \n
0000031
$ LC_ALL=C sort /tmp/test.utf8 | od -tx1c
0000000 78 0a 78 20 0a 78 2f 0a 78 ef bf bf 0a 78 f0 9f
x \n x \n x / \n x 357 277 277 \n x 360 237
0000020 bf bf 0a 78 f0 af bf bf 0a
277 277 \n x 360 257 277 277 \n
0000031
$ LC_ALL=en_US.UTF-8 sort /tmp/test.utf8 | od -tx1c
0000000 78 0a 78 ef bf bf 0a 78 f0 af bf bf 0a 78 f0 9f
x \n x 357 277 277 \n x 360 257 277 277 \n x 360 237
0000020 bf bf 0a 78 20 0a 78 2f 0a
277 277 \n x \n x / \n
0000031
You can see that it once again sorts the non-character codepoint lower
than real characters. For reference, these are the Centos 8 glibc packages.
glibc-2.28-151.el8.x86_64
glibc-langpack-en-2.28-151.el8.x86_64
Tested the same thing on sqlfiddle.
http://sqlfiddle.com/#!15/1841e2/2/0
That is using PostgreSQL 9.3 and the system underneath is not made visible.
It gives the sort order that I would expect, special codepoints are sorting
on the high side of real characters.
I then went back to an old Centos 7 machine and tested the same SQL query.
The result is like this:
encode
------------
78
7820
782f
78efbfbf
78f09fbfbf
78f0afbfbf
(6 rows)
That's what I expect the correct answer to be, also a correct match to the
sqlfiddle test. For reference these are the centos packages where I got it
working properly.
postgresql12-server-12.7-1PGDG.rhel7.x86_64
glibc-common-2.17-260.el7_6.6.x86_64
glibc-2.17-260.el7_6.6.x86_64
Same postgres version, but running on an older system. Perhaps this
would implicate a bug introduced in the system itself during the
transition from Centos 7 to Centos 8. The old glibc did not have separate
langpacks and all locales were installed by default.
I'm willing to take this up with Redhat if that's useful.