Thread: Unicode FFFF Special Codepoint should always collate high.
Here is an example. SELECT ENCODE(x::bytea, 'hex') FROM ( SELECT UNNEST(ARRAY['x', 'x/', U&'x\+02FFFF', U&'x\FFFF', U&'x\+01FFFF', 'x '])::text AS x ) z ORDER BY x COLLATE "en_US.utf8"; Which gives the following: encode ------------ 78 78efbfbf 78f09fbfbf 78f0afbfbf 7820 782f (6 rows) The real character codepoints (e.g. 0x20 space, or 0x2f slash) are sorting after the non-character codepoint 0xffff, which is supposed to always have the highest possible primary weight in all locales, and it is the only codepoint available to serve this purpose. The other 4-byte non-character codepoints also incorrectly sort lower than real characters. Tested on Centos 8 with the following package. postgresql12-server-12.7-2PGDG.rhel8.x86_64 The "locale -a" command shows that "en_US.utf8" is in the system, although, for this purpose, all UTF-8 locales should do the same thing. Database default locale is also en_US.utf8 and system /etc/locale.conf has the same LANG="en_US.UTF-8". Testing the same thing with a simple file and using the "sort" command from command line prompt gives similar results. $ od -tx1c /tmp/test.utf8 0000000 78 0a 78 f0 af bf bf 0a 78 f0 9f bf bf 0a 78 2f x \n x 360 257 277 277 \n x 360 237 277 277 \n x / 0000020 0a 78 ef bf bf 0a 78 20 0a \n x 357 277 277 \n x \n 0000031 $ LC_ALL=C sort /tmp/test.utf8 | od -tx1c 0000000 78 0a 78 20 0a 78 2f 0a 78 ef bf bf 0a 78 f0 9f x \n x \n x / \n x 357 277 277 \n x 360 237 0000020 bf bf 0a 78 f0 af bf bf 0a 277 277 \n x 360 257 277 277 \n 0000031 $ LC_ALL=en_US.UTF-8 sort /tmp/test.utf8 | od -tx1c 0000000 78 0a 78 ef bf bf 0a 78 f0 af bf bf 0a 78 f0 9f x \n x 357 277 277 \n x 360 257 277 277 \n x 360 237 0000020 bf bf 0a 78 20 0a 78 2f 0a 277 277 \n x \n x / \n 0000031 You can see that it once again sorts the non-character codepoint lower than real characters. For reference, these are the Centos 8 glibc packages. glibc-2.28-151.el8.x86_64 glibc-langpack-en-2.28-151.el8.x86_64 Tested the same thing on sqlfiddle. http://sqlfiddle.com/#!15/1841e2/2/0 That is using PostgreSQL 9.3 and the system underneath is not made visible. It gives the sort order that I would expect, special codepoints are sorting on the high side of real characters. I then went back to an old Centos 7 machine and tested the same SQL query. The result is like this: encode ------------ 78 7820 782f 78efbfbf 78f09fbfbf 78f0afbfbf (6 rows) That's what I expect the correct answer to be, also a correct match to the sqlfiddle test. For reference these are the centos packages where I got it working properly. postgresql12-server-12.7-1PGDG.rhel7.x86_64 glibc-common-2.17-260.el7_6.6.x86_64 glibc-2.17-260.el7_6.6.x86_64 Same postgres version, but running on an older system. Perhaps this would implicate a bug introduced in the system itself during the transition from Centos 7 to Centos 8. The old glibc did not have separate langpacks and all locales were installed by default. I'm willing to take this up with Redhat if that's useful.
On Tue, Jun 22, 2021 at 9:39 PM Telford Tendys <psql@lnx-bsp.net> wrote: > The real character codepoints (e.g. 0x20 space, or 0x2f slash) are sorting > after the non-character codepoint 0xffff, which is supposed to always have > the highest possible primary weight in all locales, and it is the only > codepoint available to serve this purpose. The other 4-byte non-character > codepoints also incorrectly sort lower than real characters. Not an expert in this subject (and to make things more interesting, unicode.org has temporarily fallen off the internet, as mentioned in another thread nearby), but definitely curious... I guess this might refer to TR35: U+FFFF: This code point is tailored to have a primary weight higher than all other characters. This allows the reliable specification of a range, such as “Sch” ≤ X ≤ “Sch\uFFFF”, to include all strings starting with "sch" or equivalent. U+FFFE: This code point produces a CE with minimal, unique weights on primary and identical levels. For details see the CLDR Collation Algorithm above. Considering the squirrelly definition of noncharacters and their status as special values for internal use (internal to what?) and not for data interchange, and the specification of that rule with in the document controlling markup of collation rules (is it also specified somewhere else?), is this actually required to work the way you expect when external users of a conforming collation algorithm sort them? That's not a rhetorical question, I don't know the answer. In any case, this is not a PostgreSQL issue, surely. We simply call out to libc (see your manufacturer for more details) or ICU. Just for interest, I tried your query on a FreeBSD 13 system (its libc is based on UCA/CLDR too) and the output matched your RHEL8 example (using COLLATE "en_US.UTF-8"). I tried it on ICU (using COLLATE "en-x-icu") and it gave a third output, matching neither RHEL7 nor RHEL8 but instead putting U&'x\FFFF' last, which seems to match what you expect: tmunro=> SELECT ENCODE(x::bytea, 'hex') FROM ( SELECT UNNEST(ARRAY['x', 'x/', U&'x\+02FFFF', U&'x\FFFF', U&'x\+01FFFF', 'x '])::text AS x ) z ORDER BY x COLLATE "en-x-icu"; encode ------------ 78 7820 782f 78f09fbfbf 78f0afbfbf 78efbfbf (6 rows)
On 21-06-22 23:17, Thomas Munro wrote: > On Tue, Jun 22, 2021 at 9:39 PM Telford Tendys <psql@lnx-bsp.net> wrote: > > The real character codepoints (e.g. 0x20 space, or 0x2f slash) are sorting > > after the non-character codepoint 0xffff, which is supposed to always have > > the highest possible primary weight in all locales, and it is the only > > codepoint available to serve this purpose. The other 4-byte non-character > > codepoints also incorrectly sort lower than real characters. > > Not an expert in this subject (and to make things more interesting, > unicode.org has temporarily fallen off the internet, as mentioned in > another thread nearby), but definitely curious... I guess this might > refer to TR35: > > U+FFFF: This code point is tailored to have a primary weight higher > than all other characters. This allows the reliable specification of a > range, such as “Sch” ≤ X ≤ “Sch\uFFFF”, to include all strings > starting with "sch" or equivalent. > U+FFFE: This code point produces a CE with minimal, unique weights > on primary and identical levels. For details see the CLDR Collation > Algorithm above. Thank you for taking a look at it, you seem to have confirmed that this is coming from the system itself. Yes, my purpose is to do prefix searching on strings by specifying a range and taking advantage of a B-Tree index, exactly as described in the quote above. Personally, I'm not overly worried about the sort order between the 4 byte and the 3 byte special codepoints, but for consistency you would hope there is one answer only, and it applies everywhere. Largely defeats the purpose of having a standard unless it is indeed standardized. I expect this kind of range searching is exactly what SQL people do all day every day, while users of most other applications probably won't notice a change in sort order between major versions of an operating system. https://bugzilla.redhat.com/show_bug.cgi?id=1975045 There is a RedHat bugzilla link, let's see where that goes. With IBM owning Redhat now they might have both the expertise and incentive to get the UTF-8 subsystem up to a respectable level. > Considering the squirrelly definition of noncharacters and their > status as special values for internal use (internal to what?) and not > for data interchange, and the specification of that rule with in the > document controlling markup of collation rules (is it also specified > somewhere else?), is this actually required to work the way you expect > when external users of a conforming collation algorithm sort them? > That's not a rhetorical question, I don't know the answer. Seems obvious to me that if you can't use it for range searching then it's broken, because that is the primary intended use. As for the definition of internal vs external data interchange that would come down to who owns the ends of the data pipe. If you want the other guy to take responsibility then you better send a clean stream sans special codepoints. For my application, I have no need to transmit or receive this codepoint. > encode > ------------ > 78 > 7820 > 782f > 78f09fbfbf > 78f0afbfbf > 78efbfbf > (6 rows) That is actually what I expect based on the standard, but the way RHEL7 does it also seems close enough. Getting consistency and portability out of Unicode remains a roulette wheel after decades of work have gone into it.
On Wed, Jun 23, 2021 at 3:00 PM Telford Tendys <psql@lnx-bsp.net> wrote: > Thank you for taking a look at it, you seem to have confirmed that > this is coming from the system itself. Yes, my purpose is to do > prefix searching on strings by specifying a range and taking advantage > of a B-Tree index, exactly as described in the quote above. Just in case you didn't know, PostgreSQL knows how to convert a prefix search for 'aar%' into a range search with a related trick in some limited circumstances: tmunro=> create table t (name text); CREATE TABLE tmunro=> insert into t values ('aardvark'), ('buffalo'), ('cat'); INSERT 0 3 tmunro=> create index on t(name text_pattern_ops); CREATE INDEX tmunro=> analyze t; ANALYZE tmunro=> explain select * from t where name like 'aar%'; QUERY PLAN ------------------------------------------------------------------------- Index Only Scan using t_name_idx on t (cost=0.13..8.15 rows=1 width=7) Index Cond: ((name ~>=~ 'aar'::text) AND (name ~<~ 'aas'::text)) Filter: (name ~~ 'aar%'::text) (3 rows) > https://bugzilla.redhat.com/show_bug.cgi?id=1975045 Seems a little light on references and justifications for the expectiation. ICU's results could be useful in the discussion. It's interesting that more recent Unicode versions removed the prohibition on using these code points at all (some earlier version said that if your string contained them, they weren't Unicode, apparently, according to my quick read of https://en.wikipedia.org/wiki/Specials_(Unicode_block) but I didn't have time to look into any source documents).
On 21-06-23 18:18, Thomas Munro wrote: > Just in case you didn't know, PostgreSQL knows how to convert a prefix > search for 'aar%' into a range search with a related trick in some > limited circumstances: Thanks for the suggestion, I probably should learn those tricks but for the time being, would prefer to live with a slower search and get it eventually done the way I want. > Seems a little light on references and justifications for the > expectiation. ICU's results could be useful in the discussion. I trust those guys, they will figure it out. I strongly predict that they will keep the behaviour consistent with RHEL 7. Is there an easy way to make normal Linux glibc utilities (e.g. sort) use a locale from the ICU library? There's a package availble one Centos-8 here: libicu-60.3-2.el8_1.x86_64 Trouble is that only a few applications use it, and I can't find any way to plug-in / plug-out this functionality. Introducing postgresql details to the bugzilla ticket will muddy the water and create a aura of diffuse responsibility. What I've found is generally where there's a lot of words, people don't read them.
On Wed, Jun 23, 2021 at 9:57 PM Telford Tendys <psql@lnx-bsp.net> wrote: > I trust those guys, they will figure it out. I strongly predict that > they will keep the behaviour consistent with RHEL 7. I'd doubt that. It's well known that glibc 2.28 (what RHEL8 upgraded to) included changes that affected everybody by changing the sort order of common symbols like '-' (though every upgrade potentially contains subtle changes affecting just a few specific languages), but I consider the recent big change an improvement because it now agrees more often with other operating systems and libraries that use CLDR. Even if you are right that FFFF's sort-high rule should be exposed to users (need references), RHEL7 was also wrong in that case. > Is there an easy way to make normal Linux glibc utilities (e.g. sort) > use a locale from the ICU library? There's a package availble one Centos-8 > here: > > libicu-60.3-2.el8_1.x86_64 > > Trouble is that only a few applications use it, and I can't find any way > to plug-in / plug-out this functionality. Introducing postgresql details to > the bugzilla ticket will muddy the water and create a aura of diffuse > responsibility. What I've found is generally where there's a lot of words, > people don't read them. I don't know, but since you know perl, it might be easy to make a demonstration with https://metacpan.org/pod/Unicode::ICU::Collator. Looks as simple as $collator->sort(my_list).
On Thu, Jun 24, 2021 at 10:29 AM Thomas Munro <thomas.munro@gmail.com> wrote: > On Wed, Jun 23, 2021 at 9:57 PM Telford Tendys <psql@lnx-bsp.net> wrote: > > I trust those guys, they will figure it out. I strongly predict that > > they will keep the behaviour consistent with RHEL 7. > > I'd doubt that. It's well known that glibc 2.28 (what RHEL8 upgraded > to) included changes that affected everybody by changing the sort > order of common symbols like '-' (though every upgrade potentially > contains subtle changes affecting just a few specific languages), but > I consider the recent big change an improvement because it now agrees > more often with other operating systems and libraries that use CLDR. > Even if you are right that FFFF's sort-high rule should be exposed to > users (need references), RHEL7 was also wrong in that case. Oh (following along with your bug report)... so glibc 2.28+ is based on ISO 14651, not UCA/CLDR. They are related but different[1] (I was confused about that). That rule is coming from a CLDR document. So one question is whether anything similar is in the ISO document[2]. /me wanders away wondering if an OS that *is* using CLDR is supposed to be collating \uFFFF the way ICU does [1] https://unicode.org/reports/tr10/tr10-34.html#Synch_ISO14651 [2] https://standards.iso.org/ittf/PubliclyAvailableStandards/index.html