Re: Windows UTF-8, non-ICU collation trouble - Mailing list pgsql-hackers

From Noah Misch
Subject Re: Windows UTF-8, non-ICU collation trouble
Date
Msg-id 20191210092919.GA1760025@rfd.leadboat.com
Whole thread Raw
In response to Re: Windows UTF-8, non-ICU collation trouble  (Thomas Munro <thomas.munro@gmail.com>)
Responses Re: Windows UTF-8, non-ICU collation trouble
List pgsql-hackers
On Tue, Dec 10, 2019 at 03:41:15PM +1300, Thomas Munro wrote:
> On Fri, Dec 6, 2019 at 8:33 PM Noah Misch <noah@leadboat.com> wrote:
> > On Fri, Dec 06, 2019 at 07:56:08PM +1300, Thomas Munro wrote:
> > > On Fri, Dec 6, 2019 at 7:34 PM Noah Misch <noah@leadboat.com> wrote:
> > > > We use system UTF-16 collation to implement UTF-8 collation on Windows.  The
> > > > PostgreSQL security team received a report, from Timothy Kuun, that this
> > > > collation does not uphold the "symmetric law" and "transitive law" that we
> > > > require for btree operator classes.  The attached test program demonstrates
> > > > this.  http://www.delphigroups.info/2/62/478610.html quotes reports of that
> > > > problem going back eighteen years.  Most code points are unaffected.  Indexing
> > > > an affected code point using such a collation can cause btree index scans to not
> > > > find a row they should find and can make a UNIQUE or PRIMARY KEY constraint
> > > > admit a duplicate.  The security team determined that this doesn't qualify as a
> > > > security vulnerability, but it's still a bug.
> > >
> > > Huh.  Does this apply in modern times?  Since Windows 10, I thought
> > > they adopted[1] CLDR data to drive that, the same definitions used (or
> > > somewhere in the process of being adopted by) GNU, Illumos, FreeBSD
> > > etc.  Basically, everyone gave up on trying to own this rats nest of a
> > > problem and deferred to the experts.
> >
> > Based on my test program, it applies to Windows Server 2016.  I didn't test
> > newer versions.
> 
> I ran a variation of your program on Appveyor's Studio/Server 2019
> image, and the result was the same: it thinks that cmp(s1, s2) == 0,
> cmp(s2, s3) == 0, but cmp(s1, s3) == 1, so the operator fails to be
> transitive.

If that test is captured in self-contained artifacts (a few config files, a
public git repository, etc.), could you share them?  If not, no need to
assemble such artifacts.  I probably won't use them, but I'd be curious to
browse them if you've already assembled them.

> > > If you can still get
> > > index-busting behaviour out of modern Windows collations, wouldn't
> > > that be a bug that someone can file against SQL Server, Windows etc
> > > and get fixed?
> >
> > Perhaps.  I wouldn't have high hopes, given the behavior's long tenure and the
> > risk of breaking a different set of applications.
> 
> I found a SQL Server test website[3] and tried to get it to do
> something strange, using "Windows" collations (the ones that are
> supposed to be compatible with CompareString() AKA strcoll(), much
> like our "libc" provider).  For Latin1_General_100_CI_AS_SC_UTF8 and
> Korean_100_CS_AS it insisted that cmp(s1, s2) == 1, cmp(s2, s3) == 0,
> cmp(s1, s3) == 1, while for Korean_90_CS_AS it said -1, 0, -1, all
> self-consistent answers, matching neither your results nor the results
> of other implementations.

This does suggest some set of CompareString* parameters is free from the
problem.  If that's right, we could offer collations based on that.  (I'm not
sure it would be worth offering; ICU may be enough.)

Thanks for this extensive testing.



pgsql-hackers by date:

Previous
From: Dilip Kumar
Date:
Subject: Wrong assert in TransactionGroupUpdateXidStatus
Next
From: jiankang liu
Date:
Subject: Start Walreceiver completely before shut down it on standby server.