Re: Windows UTF-8, non-ICU collation trouble - Mailing list pgsql-hackers
From | Noah Misch |
---|---|
Subject | Re: Windows UTF-8, non-ICU collation trouble |
Date | |
Msg-id | 20191210092919.GA1760025@rfd.leadboat.com Whole thread Raw |
In response to | Re: Windows UTF-8, non-ICU collation trouble (Thomas Munro <thomas.munro@gmail.com>) |
Responses |
Re: Windows UTF-8, non-ICU collation trouble
|
List | pgsql-hackers |
On Tue, Dec 10, 2019 at 03:41:15PM +1300, Thomas Munro wrote: > On Fri, Dec 6, 2019 at 8:33 PM Noah Misch <noah@leadboat.com> wrote: > > On Fri, Dec 06, 2019 at 07:56:08PM +1300, Thomas Munro wrote: > > > On Fri, Dec 6, 2019 at 7:34 PM Noah Misch <noah@leadboat.com> wrote: > > > > We use system UTF-16 collation to implement UTF-8 collation on Windows. The > > > > PostgreSQL security team received a report, from Timothy Kuun, that this > > > > collation does not uphold the "symmetric law" and "transitive law" that we > > > > require for btree operator classes. The attached test program demonstrates > > > > this. http://www.delphigroups.info/2/62/478610.html quotes reports of that > > > > problem going back eighteen years. Most code points are unaffected. Indexing > > > > an affected code point using such a collation can cause btree index scans to not > > > > find a row they should find and can make a UNIQUE or PRIMARY KEY constraint > > > > admit a duplicate. The security team determined that this doesn't qualify as a > > > > security vulnerability, but it's still a bug. > > > > > > Huh. Does this apply in modern times? Since Windows 10, I thought > > > they adopted[1] CLDR data to drive that, the same definitions used (or > > > somewhere in the process of being adopted by) GNU, Illumos, FreeBSD > > > etc. Basically, everyone gave up on trying to own this rats nest of a > > > problem and deferred to the experts. > > > > Based on my test program, it applies to Windows Server 2016. I didn't test > > newer versions. > > I ran a variation of your program on Appveyor's Studio/Server 2019 > image, and the result was the same: it thinks that cmp(s1, s2) == 0, > cmp(s2, s3) == 0, but cmp(s1, s3) == 1, so the operator fails to be > transitive. If that test is captured in self-contained artifacts (a few config files, a public git repository, etc.), could you share them? If not, no need to assemble such artifacts. I probably won't use them, but I'd be curious to browse them if you've already assembled them. > > > If you can still get > > > index-busting behaviour out of modern Windows collations, wouldn't > > > that be a bug that someone can file against SQL Server, Windows etc > > > and get fixed? > > > > Perhaps. I wouldn't have high hopes, given the behavior's long tenure and the > > risk of breaking a different set of applications. > > I found a SQL Server test website[3] and tried to get it to do > something strange, using "Windows" collations (the ones that are > supposed to be compatible with CompareString() AKA strcoll(), much > like our "libc" provider). For Latin1_General_100_CI_AS_SC_UTF8 and > Korean_100_CS_AS it insisted that cmp(s1, s2) == 1, cmp(s2, s3) == 0, > cmp(s1, s3) == 1, while for Korean_90_CS_AS it said -1, 0, -1, all > self-consistent answers, matching neither your results nor the results > of other implementations. This does suggest some set of CompareString* parameters is free from the problem. If that's right, we could offer collations based on that. (I'm not sure it would be worth offering; ICU may be enough.) Thanks for this extensive testing.
pgsql-hackers by date: