Re: Unicode FFFF Special Codepoint should always collate high. - Mailing list pgsql-bugs
From: Telford Tendys
Subject: Re: Unicode FFFF Special Codepoint should always collate high.
Date:
Msg-id: 20210623032418.GB12063@mail
In response to: Re: Unicode FFFF Special Codepoint should always collate high. (Thomas Munro <thomas.munro@gmail.com>)
List: pgsql-bugs
On 21-06-22 23:17, Thomas Munro wrote:
> On Tue, Jun 22, 2021 at 9:39 PM Telford Tendys <psql@lnx-bsp.net> wrote:
> > The real character codepoints (e.g. 0x20 space, or 0x2f slash) are sorting
> > after the non-character codepoint 0xffff, which is supposed to always have
> > the highest possible primary weight in all locales, and it is the only
> > codepoint available to serve this purpose. The other 4-byte non-character
> > codepoints also incorrectly sort lower than real characters.
>
> Not an expert in this subject (and to make things more interesting,
> unicode.org has temporarily fallen off the internet, as mentioned in
> another thread nearby), but definitely curious... I guess this might
> refer to TR35:
>
>     U+FFFF: This code point is tailored to have a primary weight higher
> than all other characters. This allows the reliable specification of a
> range, such as "Sch" ≤ X ≤ "Sch\uFFFF", to include all strings
> starting with "sch" or equivalent.
>     U+FFFE: This code point produces a CE with minimal, unique weights
> on primary and identical levels. For details see the CLDR Collation
> Algorithm above.

Thank you for taking a look at it; you seem to have confirmed that this
is coming from the system itself. Yes, my purpose is to do prefix
searching on strings by specifying a range and taking advantage of a
B-Tree index, exactly as described in the quote above.

Personally, I'm not overly worried about the sort order between the
4-byte and the 3-byte special codepoints, but for consistency you would
hope there is one answer only, and that it applies everywhere. It
largely defeats the purpose of having a standard unless it is indeed
standardized. I expect this kind of range searching is exactly what SQL
people do all day, every day, while users of most other applications
probably won't notice a change in sort order between major versions of
an operating system.

https://bugzilla.redhat.com/show_bug.cgi?id=1975045

There is a RedHat bugzilla link; let's see where that goes.
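To make the prefix-range trick from the TR35 quote concrete, here is a small sketch in Python. Python's default string comparison is plain codepoint order, which serves here as a stand-in for a collation in which U+FFFF carries the highest primary weight among BMP characters; the prefix and word list are made up for illustration.

```python
# Prefix search as a range query, per the TR35 quote above:
# everything starting with PREFIX lies between PREFIX and PREFIX + U+FFFF.
# (Sketch only: Python compares by raw codepoint, so this holds for BMP
# text; a real collation must tailor U+FFFF's primary weight highest.)
PREFIX = "Sch"
LO = PREFIX                  # lower bound of the range
HI = PREFIX + "\uffff"       # upper bound: prefix plus the U+FFFF sentinel

words = ["Schiff", "Schloss", "Sch", "Scope", "Sandwich", "Schz"]
matches = [w for w in words if LO <= w <= HI]
print(matches)
```

This is the same shape of predicate a `WHERE col >= lo AND col <= hi` range scan would use against a B-Tree index, which only works if the sentinel really does collate above every character that can follow the prefix.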
With IBM owning Redhat now, they might have both the expertise and the
incentive to get the UTF-8 subsystem up to a respectable level.

> Considering the squirrelly definition of noncharacters and their
> status as special values for internal use (internal to what?) and not
> for data interchange, and the specification of that rule within the
> document controlling markup of collation rules (is it also specified
> somewhere else?), is this actually required to work the way you expect
> when external users of a conforming collation algorithm sort them?
> That's not a rhetorical question, I don't know the answer.

It seems obvious to me that if you can't use it for range searching,
then it's broken, because that is the primary intended use. As for the
definition of internal vs external data interchange, that would come
down to who owns the ends of the data pipe. If you want the other guy
to take responsibility, then you had better send a clean stream sans
special codepoints. For my application, I have no need to transmit or
receive this codepoint.

>    encode
> ------------
>  78
>  7820
>  782f
>  78f09fbfbf
>  78f0afbfbf
>  78efbfbf
> (6 rows)

That is actually what I expect based on the standard, but the way RHEL7
does it also seems close enough. Getting consistency and portability out
of Unicode remains a roulette wheel after decades of work have gone
into it.
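The rows in the quoted query output are hex-encoded UTF-8. Decoding them back into codepoints (a sketch, with the hex strings copied verbatim from the output above) shows what each row contains, and why the order looks right: the last row is 'x' followed by U+FFFF, i.e. the U+FFFF string sorts after everything else, including the Plane-1 and Plane-2 noncharacters, as TR35 asks for.

```python
# Decode the hex rows from the quoted psql output back into codepoints.
rows = ["78", "7820", "782f", "78f09fbfbf", "78f0afbfbf", "78efbfbf"]
decoded = {h: bytes.fromhex(h).decode("utf-8") for h in rows}
for h, s in decoded.items():
    # e.g. "78efbfbf -> U+0078 U+FFFF"
    print(h, "->", " ".join(f"U+{ord(c):04X}" for c in s))
```

Note that raw byte order (a memcmp-style "C" collation) would instead put `78efbfbf` (U+FFFF, a 3-byte sequence starting 0xEF) before the 4-byte `78f0...` rows, so the quoted order is a genuine tailoring of U+FFFF highest, not just byte order.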