Re: Unicode FFFF Special Codepoint should always collate high. - Mailing list pgsql-bugs

From: Telford Tendys
Subject: Re: Unicode FFFF Special Codepoint should always collate high.
Msg-id: 20210623032418.GB12063@mail
In response to: Re: Unicode FFFF Special Codepoint should always collate high.  (Thomas Munro <thomas.munro@gmail.com>)
List: pgsql-bugs

On 21-06-22 23:17, Thomas Munro wrote:
> On Tue, Jun 22, 2021 at 9:39 PM Telford Tendys <psql@lnx-bsp.net> wrote:
> > The real character codepoints (e.g. 0x20 space, or 0x2f slash) are sorting
> > after the non-character codepoint 0xffff, which is supposed to always have
> > the highest possible primary weight in all locales, and it is the only
> > codepoint available to serve this purpose. The other 4-byte non-character
> > codepoints also incorrectly sort lower than real characters.
> 
> Not an expert in this subject (and to make things more interesting,
> unicode.org has temporarily fallen off the internet, as mentioned in
> another thread nearby), but definitely curious...  I guess this might
> refer to TR35:
> 
>   U+FFFF: This code point is tailored to have a primary weight higher
> than all other characters. This allows the reliable specification of a
> range, such as “Sch” ≤ X ≤ “Sch\uFFFF”, to include all strings
> starting with "sch" or equivalent.
>   U+FFFE: This code point produces a CE with minimal, unique weights
> on primary and identical levels. For details see the CLDR Collation
> Algorithm above.

Thank you for taking a look at it; you seem to have confirmed that
this is coming from the system itself. Yes, my purpose is to do
prefix searching on strings by specifying a range and taking advantage
of a B-Tree index, exactly as described in the quote above.
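
A minimal sketch of that range technique, in Python rather than SQL. It assumes plain code point ordering (Python's default string comparison), not a locale collation, and a sorted list stands in for the B-Tree index. The names `SENTINEL` and `prefix_range` are made up for illustration. The point is only that the upper bound works if the sentinel sorts above every character that can follow the prefix, which is the property in question here.

```python
import bisect

# U+FFFF is used as an upper-bound sentinel. Under code point ordering it
# sorts above every other BMP character (but below supplementary ones).
SENTINEL = "\uffff"

def prefix_range(sorted_keys, prefix):
    """Return all keys with the given prefix from an already sorted list."""
    lo = bisect.bisect_left(sorted_keys, prefix)              # first key >= prefix
    hi = bisect.bisect_right(sorted_keys, prefix + SENTINEL)  # past last key <= prefix + U+FFFF
    return sorted_keys[lo:hi]

# Stand-in for an indexed text column; sorted() uses code point order.
keys = sorted(["Sc", "Sch", "Sch tz", "Schmidt", "Schule", "Sci", "Tag"])
matches = prefix_range(keys, "Sch")
print(matches)
```

If the sentinel were ordered below a space or a slash, as in the behaviour reported in this thread, a key such as "Sch tz" would fall outside the upper bound and be silently missed.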

Personally, I'm not overly worried about the sort order between the
4-byte and the 3-byte special codepoints, but for consistency you would
hope there is only one answer and that it applies everywhere. It largely
defeats the purpose of having a standard unless it is indeed standardized.

I expect this kind of range searching is exactly what SQL people do all
day every day, while users of most other applications probably won't
notice a change in sort order between major versions of an operating system.

    https://bugzilla.redhat.com/show_bug.cgi?id=1975045

That is the Red Hat Bugzilla entry; let's see where it goes. With IBM owning
Red Hat now, they might have both the expertise and the incentive to get the
UTF-8 subsystem up to a respectable level.


> Considering the squirrelly definition of noncharacters and their
> status as special values for internal use (internal to what?) and not
> for data interchange, and the specification of that rule within the
> document controlling markup of collation rules (is it also specified
> somewhere else?), is this actually required to work the way you expect
> when external users of a conforming collation algorithm sort them?
> That's not a rhetorical question, I don't know the answer.

It seems obvious to me that if you can't use it for range searching then
it's broken, because that is the primary intended use. As for the definition
of internal versus external data interchange, that comes down to who owns
the ends of the data pipe. If you want the other party to take responsibility,
then you had better send a clean stream without special codepoints.

For my application, I have no need to transmit or receive this codepoint.

>    encode
> ------------
>  78
>  7820
>  782f
>  78f09fbfbf
>  78f0afbfbf
>  78efbfbf
> (6 rows)

That is actually what I expect based on the standard, but the way RHEL7
does it also seems close enough. Getting consistency and portability out
of Unicode remains a roulette wheel after decades of work have gone into it.
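
For reference, the hex rows quoted above can be decoded to see which codepoints are involved. This is only a small illustrative sketch; the helper name `describe` is made up. It also compares the quoted row order against plain code point order.

```python
# The six hex-encoded UTF-8 rows from the quoted output, in the order shown.
rows = ["78", "7820", "782f", "78f09fbfbf", "78f0afbfbf", "78efbfbf"]

def describe(hex_row):
    """Decode one hex row as UTF-8 and list its codepoints as U+XXXX."""
    text = bytes.fromhex(hex_row).decode("utf-8")
    return [f"U+{ord(ch):04X}" for ch in text]

decoded = [describe(r) for r in rows]
for hex_row, cps in zip(rows, decoded):
    print(hex_row, cps)

# Compare the quoted order with plain code point order.
texts = [bytes.fromhex(r).decode("utf-8") for r in rows]
same_as_codepoint_order = (texts == sorted(texts))
print(same_as_codepoint_order)
```

The rows are "x" alone, then "x" followed by space, slash, U+1FFFF, U+2FFFF and U+FFFF. In the quoted output U+FFFF comes last, after the two 4-byte noncharacters, so this is not simple code point order, where U+FFFF would come before U+1FFFF.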


