Thread: Unicode FFFF Special Codepoint should always collate high.

Unicode FFFF Special Codepoint should always collate high.

From

Telford Tendys

Date:

22 June 2021, 08:39:18

Here is an example.

    SELECT ENCODE(x::bytea, 'hex')
    FROM (
        SELECT UNNEST(ARRAY['x', 'x/', U&'x\+02FFFF', U&'x\FFFF',
                             U&'x\+01FFFF', 'x '])::text AS x
    ) z
    ORDER BY x COLLATE "en_US.utf8";


Which gives the following:

   encode
------------
 78
 78efbfbf
 78f09fbfbf
 78f0afbfbf
 7820
 782f
(6 rows)


The real character codepoints (e.g. 0x20 space, or 0x2f slash) are sorting
after the non-character codepoint 0xffff, which is supposed to always have
the highest possible primary weight in all locales, and it is the only
codepoint available to serve this purpose. The other 4-byte non-character
codepoints also incorrectly sort lower than real characters.

Tested on Centos 8 with the following package.

    postgresql12-server-12.7-2PGDG.rhel8.x86_64


The "locale -a" command shows that "en_US.utf8" is in the system,
although, for this purpose, all UTF-8 locales should do the same thing.

Database default locale is also en_US.utf8 and system /etc/locale.conf has
the same LANG="en_US.UTF-8".

Testing the same thing with a simple file and using the "sort" command
from command line prompt gives similar results.


$ od -tx1c /tmp/test.utf8
0000000  78  0a  78  f0  af  bf  bf  0a  78  f0  9f  bf  bf  0a  78  2f
          x  \n   x 360 257 277 277  \n   x 360 237 277 277  \n   x   /
0000020  0a  78  ef  bf  bf  0a  78  20  0a
         \n   x 357 277 277  \n   x      \n
0000031


$ LC_ALL=C sort /tmp/test.utf8 | od -tx1c
0000000  78  0a  78  20  0a  78  2f  0a  78  ef  bf  bf  0a  78  f0  9f
          x  \n   x      \n   x   /  \n   x 357 277 277  \n   x 360 237
0000020  bf  bf  0a  78  f0  af  bf  bf  0a
        277 277  \n   x 360 257 277 277  \n
0000031


$ LC_ALL=en_US.UTF-8 sort /tmp/test.utf8 | od -tx1c
0000000  78  0a  78  ef  bf  bf  0a  78  f0  af  bf  bf  0a  78  f0  9f
          x  \n   x 357 277 277  \n   x 360 257 277 277  \n   x 360 237
0000020  bf  bf  0a  78  20  0a  78  2f  0a
        277 277  \n   x      \n   x   /  \n
0000031


You can see that it once again sorts the non-character codepoint lower
than real characters. For reference, these are the Centos 8 glibc packages.

    glibc-2.28-151.el8.x86_64
    glibc-langpack-en-2.28-151.el8.x86_64


Tested the same thing on sqlfiddle.

    http://sqlfiddle.com/#!15/1841e2/2/0


That is using PostgreSQL 9.3 and the system underneath is not made visible.
It gives the sort order that I would expect, special codepoints are sorting
on the high side of real characters.

I then went back to an old Centos 7 machine and tested the same SQL query.
The result is like this:

   encode
------------
 78
 7820
 782f
 78efbfbf
 78f09fbfbf
 78f0afbfbf
(6 rows)

That's what I expect the correct answer to be, also a correct match to the
sqlfiddle test. For reference these are the centos packages where I got it
working properly.

    postgresql12-server-12.7-1PGDG.rhel7.x86_64

    glibc-common-2.17-260.el7_6.6.x86_64
    glibc-2.17-260.el7_6.6.x86_64

Same postgres version, but running on an older system. Perhaps this
would implicate a bug introduced in the system itself during the
transition from Centos 7 to Centos 8. The old glibc did not have separate
langpacks and all locales were installed by default.

I'm willing to take this up with Redhat if that's useful.

Re: Unicode FFFF Special Codepoint should always collate high.

From

Thomas Munro

Date:

22 June 2021, 11:17:15

On Tue, Jun 22, 2021 at 9:39 PM Telford Tendys <psql@lnx-bsp.net> wrote:
> The real character codepoints (e.g. 0x20 space, or 0x2f slash) are sorting
> after the non-character codepoint 0xffff, which is supposed to always have
> the highest possible primary weight in all locales, and it is the only
> codepoint available to serve this purpose. The other 4-byte non-character
> codepoints also incorrectly sort lower than real characters.

Not an expert in this subject (and to make things more interesting,
unicode.org has temporarily fallen off the internet, as mentioned in
another thread nearby), but definitely curious...  I guess this might
refer to TR35:

  U+FFFF: This code point is tailored to have a primary weight higher
than all other characters. This allows the reliable specification of a
range, such as “Sch” ≤ X ≤ “Sch\uFFFF”, to include all strings
starting with "sch" or equivalent.
  U+FFFE: This code point produces a CE with minimal, unique weights
on primary and identical levels. For details see the CLDR Collation
Algorithm above.

Considering the squirrelly definition of noncharacters and their
status as special values for internal use (internal to what?) and not
for data interchange, and the specification of that rule with in the
document controlling markup of collation rules (is it also specified
somewhere else?), is this actually required to work the way you expect
when external users of a conforming collation algorithm sort them?
That's not a rhetorical question, I don't know the answer.

In any case, this is not a PostgreSQL issue, surely.  We simply call
out to libc (see your manufacturer for more details) or ICU.

Just for interest, I tried your query on a FreeBSD 13 system (its libc
is based on UCA/CLDR too) and the output matched your RHEL8 example
(using COLLATE "en_US.UTF-8").  I tried it on ICU (using COLLATE
"en-x-icu") and it gave a third output, matching neither RHEL7 nor
RHEL8 but instead putting U&'x\FFFF' last, which seems to match what
you expect:

tmunro=> SELECT ENCODE(x::bytea, 'hex')
    FROM (
        SELECT UNNEST(ARRAY['x', 'x/', U&'x\+02FFFF', U&'x\FFFF',
                             U&'x\+01FFFF', 'x '])::text AS x
    ) z
    ORDER BY x COLLATE "en-x-icu";
   encode
------------
 78
 7820
 782f
 78f09fbfbf
 78f0afbfbf
 78efbfbf
(6 rows)

Re: Unicode FFFF Special Codepoint should always collate high.

From

Telford Tendys

Date:

23 June 2021, 03:24:18

On 21-06-22 23:17, Thomas Munro wrote:
> On Tue, Jun 22, 2021 at 9:39 PM Telford Tendys <psql@lnx-bsp.net> wrote:
> > The real character codepoints (e.g. 0x20 space, or 0x2f slash) are sorting
> > after the non-character codepoint 0xffff, which is supposed to always have
> > the highest possible primary weight in all locales, and it is the only
> > codepoint available to serve this purpose. The other 4-byte non-character
> > codepoints also incorrectly sort lower than real characters.
> 
> Not an expert in this subject (and to make things more interesting,
> unicode.org has temporarily fallen off the internet, as mentioned in
> another thread nearby), but definitely curious...  I guess this might
> refer to TR35:
> 
>   U+FFFF: This code point is tailored to have a primary weight higher
> than all other characters. This allows the reliable specification of a
> range, such as “Sch” ≤ X ≤ “Sch\uFFFF”, to include all strings
> starting with "sch" or equivalent.
>   U+FFFE: This code point produces a CE with minimal, unique weights
> on primary and identical levels. For details see the CLDR Collation
> Algorithm above.

Thank you for taking a look at it, you seem to have confirmed that
this is coming from the system itself. Yes, my purpose is to do
prefix searching on strings by specifying a range and taking advantage
of a B-Tree index, exactly as described in the quote above.

Personally, I'm not overly worried about the sort order between the
4 byte and the 3 byte special codepoints, but for consistency you would
hope there is one answer only, and it applies everywhere. Largely defeats
the purpose of having a standard unless it is indeed standardized.

I expect this kind of range searching is exactly what SQL people do all
day every day, while users of most other applications probably won't
notice a change in sort order between major versions of an operating system.

    https://bugzilla.redhat.com/show_bug.cgi?id=1975045

There is a RedHat bugzilla link, let's see where that goes. With IBM owning
Redhat now they might have both the expertise and incentive to get the
UTF-8 subsystem up to a respectable level.

> Considering the squirrelly definition of noncharacters and their
> status as special values for internal use (internal to what?) and not
> for data interchange, and the specification of that rule with in the
> document controlling markup of collation rules (is it also specified
> somewhere else?), is this actually required to work the way you expect
> when external users of a conforming collation algorithm sort them?
> That's not a rhetorical question, I don't know the answer.

Seems obvious to me that if you can't use it for range searching then
it's broken, because that is the primary intended use. As for the definition
of internal vs external data interchange that would come down to who owns
the ends of the data pipe. If you want the other guy to take responsibility
then you better send a clean stream sans special codepoints.

For my application, I have no need to transmit or receive this codepoint.

>    encode
> ------------
>  78
>  7820
>  782f
>  78f09fbfbf
>  78f0afbfbf
>  78efbfbf
> (6 rows)

That is actually what I expect based on the standard, but the way RHEL7
does it also seems close enough. Getting consistency and portability out
of Unicode remains a roulette wheel after decades of work have gone into it.

Re: Unicode FFFF Special Codepoint should always collate high.

From

Thomas Munro

Date:

23 June 2021, 06:18:39

On Wed, Jun 23, 2021 at 3:00 PM Telford Tendys <psql@lnx-bsp.net> wrote:
> Thank you for taking a look at it, you seem to have confirmed that
> this is coming from the system itself. Yes, my purpose is to do
> prefix searching on strings by specifying a range and taking advantage
> of a B-Tree index, exactly as described in the quote above.

Just in case you didn't know, PostgreSQL knows how to convert a prefix
search for 'aar%' into a range search with a related trick in some
limited circumstances:

tmunro=> create table t (name text);
CREATE TABLE
tmunro=> insert into t values ('aardvark'), ('buffalo'), ('cat');
INSERT 0 3
tmunro=> create index on t(name text_pattern_ops);
CREATE INDEX
tmunro=> analyze t;
ANALYZE
tmunro=> explain select * from t where name like 'aar%';
                               QUERY PLAN
-------------------------------------------------------------------------
 Index Only Scan using t_name_idx on t  (cost=0.13..8.15 rows=1 width=7)
   Index Cond: ((name ~>=~ 'aar'::text) AND (name ~<~ 'aas'::text))
   Filter: (name ~~ 'aar%'::text)
(3 rows)

>     https://bugzilla.redhat.com/show_bug.cgi?id=1975045

Seems a little light on references and justifications for the
expectiation.  ICU's results could be useful in the discussion.  It's
interesting that more recent Unicode versions removed the prohibition
on using these code points at all (some earlier version said that if
your string contained them, they weren't Unicode, apparently,
according to my quick read of
https://en.wikipedia.org/wiki/Specials_(Unicode_block) but I didn't
have time to look into any source documents).

Re: Unicode FFFF Special Codepoint should always collate high.

From

Telford Tendys

Date:

23 June 2021, 10:21:38

On 21-06-23 18:18, Thomas Munro wrote:
> Just in case you didn't know, PostgreSQL knows how to convert a prefix
> search for 'aar%' into a range search with a related trick in some
> limited circumstances:

Thanks for the suggestion, I probably should learn those tricks but
for the time being, would prefer to live with a slower search and get it
eventually done the way I want.

> Seems a little light on references and justifications for the
> expectiation.  ICU's results could be useful in the discussion.

I trust those guys, they will figure it out. I strongly predict that
they will keep the behaviour consistent with RHEL 7.

Is there an easy way to make normal Linux glibc utilities (e.g. sort)
use a locale from the ICU library? There's a package availble one Centos-8
here:

    libicu-60.3-2.el8_1.x86_64

Trouble is that only a few applications use it, and I can't find any way
to plug-in / plug-out this functionality. Introducing postgresql details to
the bugzilla ticket will muddy the water and create a aura of diffuse
responsibility. What I've found is generally where there's a lot of words,
people don't read them.

Re: Unicode FFFF Special Codepoint should always collate high.

From

Thomas Munro

Date:

23 June 2021, 22:29:15

On Wed, Jun 23, 2021 at 9:57 PM Telford Tendys <psql@lnx-bsp.net> wrote:
> I trust those guys, they will figure it out. I strongly predict that
> they will keep the behaviour consistent with RHEL 7.

I'd doubt that.  It's well known that glibc 2.28 (what RHEL8 upgraded
to) included changes that affected everybody by changing the sort
order of common symbols like '-' (though every upgrade potentially
contains subtle changes affecting just a few specific languages), but
I consider the recent big change an improvement because it now agrees
more often with other operating systems and libraries that use CLDR.
Even if you are right that FFFF's sort-high rule should be exposed to
users (need references), RHEL7 was also wrong in that case.

> Is there an easy way to make normal Linux glibc utilities (e.g. sort)
> use a locale from the ICU library? There's a package availble one Centos-8
> here:
>
>     libicu-60.3-2.el8_1.x86_64
>
> Trouble is that only a few applications use it, and I can't find any way
> to plug-in / plug-out this functionality. Introducing postgresql details to
> the bugzilla ticket will muddy the water and create a aura of diffuse
> responsibility. What I've found is generally where there's a lot of words,
> people don't read them.

I don't know, but since you know perl, it might be easy to make a
demonstration with https://metacpan.org/pod/Unicode::ICU::Collator.
Looks as simple as $collator->sort(my_list).

Re: Unicode FFFF Special Codepoint should always collate high.

From

Thomas Munro

Date:

28 June 2021, 06:29:11

On Thu, Jun 24, 2021 at 10:29 AM Thomas Munro <thomas.munro@gmail.com> wrote:
> On Wed, Jun 23, 2021 at 9:57 PM Telford Tendys <psql@lnx-bsp.net> wrote:
> > I trust those guys, they will figure it out. I strongly predict that
> > they will keep the behaviour consistent with RHEL 7.
>
> I'd doubt that.  It's well known that glibc 2.28 (what RHEL8 upgraded
> to) included changes that affected everybody by changing the sort
> order of common symbols like '-' (though every upgrade potentially
> contains subtle changes affecting just a few specific languages), but
> I consider the recent big change an improvement because it now agrees
> more often with other operating systems and libraries that use CLDR.
> Even if you are right that FFFF's sort-high rule should be exposed to
> users (need references), RHEL7 was also wrong in that case.

Oh (following along with your bug report)... so glibc 2.28+ is based
on ISO 14651, not UCA/CLDR.  They are related but different[1] (I was
confused about that).  That rule is coming from a CLDR document.  So
one question is whether anything similar is in the ISO document[2].

/me wanders away wondering if an OS that *is* using CLDR is supposed
to be collating \uFFFF the way ICU does

[1] https://unicode.org/reports/tr10/tr10-34.html#Synch_ISO14651
[2] https://standards.iso.org/ittf/PubliclyAvailableStandards/index.html