Re: Missing rows with index scan when collation is not "C" (PostgreSQL 9.5) - Mailing list pgsql-bugs

From Peter Geoghegan
Subject Re: Missing rows with index scan when collation is not "C" (PostgreSQL 9.5)
Date
Msg-id CAM3SWZSzE13i=9pDseTn9XzE21kQ_qHnb7JOkDNUs3akH=jswQ@mail.gmail.com
In response to Re: Missing rows with index scan when collation is not "C" (PostgreSQL 9.5)  (Robert Haas <robertmhaas@gmail.com>)
Responses Re: Missing rows with index scan when collation is not "C" (PostgreSQL 9.5)
List pgsql-bugs
On Tue, Mar 22, 2016 at 3:06 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> Well, if we implement a compatibility GUC that shuts off our
> dependency on strxfrm(), people can go back to having 9.5 be no more
> broken than 9.4 was.  I vote we do that and go home.

I don't have a problem with that idea, but I fear "no more broken than
9.4 was" might be a very low bar for certain systems and collations.
Abbreviated keys may have simply unmasked the problem in some cases.

Consider:

[vagrant@localhost ~]$ LC_COLLATE=en_us sort strings.txt <-- correct
x xx
x xx"
xxx
xxx"
[vagrant@localhost ~]$ LC_COLLATE=de_DE sort strings.txt <-- wrong
xxx
xxx"
x xx
x xx"
[vagrant@localhost ~]$ ./strxfrm-binary de_DE.UTF-8 'xxx' 'x xx'
"xxx" -> 2323230108080801020202 (11 bytes)
"x xx" -> 2323230108080801020202010235 (14 bytes)
strcmp(arg1, arg2) result: -1
strcoll(arg1, arg2) result: 6
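
The check in the transcript above can be approximated without the strxfrm-binary tool. What follows is a minimal Python sketch, not the tool used here. It relies on Python's locale.strcoll() and locale.strxfrm(), which delegate to the C library's collation functions, and it assumes that the strcmp() line in the output compares the two strxfrm() blobs. The helper names (sign, compare_both_ways, report) are made up for illustration.

```python
import locale


def sign(n):
    """Collapse a comparison result to -1, 0, or 1."""
    return (n > 0) - (n < 0)


def compare_both_ways(a, b):
    """Return (strcoll sign, strxfrm sign) for a and b under the current LC_COLLATE."""
    by_strcoll = sign(locale.strcoll(a, b))
    # Transformed keys are meant to be compared with a plain binary comparison.
    key_a, key_b = locale.strxfrm(a), locale.strxfrm(b)
    by_strxfrm = (key_a > key_b) - (key_a < key_b)
    return by_strcoll, by_strxfrm


def report(a, b, locale_name):
    """Print both results for one locale; return True when they agree."""
    # locale_name must be installed on the system, otherwise locale.Error is raised.
    locale.setlocale(locale.LC_COLLATE, locale_name)
    by_strcoll, by_strxfrm = compare_both_ways(a, b)
    verdict = "agree" if by_strcoll == by_strxfrm else "DISAGREE"
    print(f"{locale_name}: strcoll={by_strcoll} strxfrm={by_strxfrm} ({verdict})")
    return by_strcoll == by_strxfrm


if __name__ == "__main__":
    # "C" is always available; substitute e.g. "de_DE.UTF-8" where it is installed.
    report("xxx", "x xx", "C")
```

On a system showing the behaviour above, calling report("xxx", "x xx", "de_DE.UTF-8") would be expected to flag a disagreement, but the result depends on the installed C library and locale data.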

My concern was not merely "academic" (i.e. it was not limited in scope
to things that can't corrupt B-Tree indexes). I'm pretty sure that we
need to start thinking of this as a problem with strcoll() itself, one
that strxfrm() does not have, and for more fundamental reasons:
strcoll() says that the first string in the de_DE sorted list ("xxx")
is *greater* than the third string ("x xx"). That's wrong, and not just
because strxfrm() gives an intuitively correct answer -- it's wrong
specifically because the transitive law has been broken.
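
Whether a comparator obeys the ordering rules a B-Tree depends on can be checked by brute force over a small set of strings. The sketch below is only an illustration under stated assumptions: it uses Python's locale.strcoll() as a stand-in for the C library call, and find_order_violations is a hypothetical helper, not anything that exists in PostgreSQL.

```python
import locale
from itertools import permutations


def sign(n):
    """Collapse a comparison result to -1, 0, or 1."""
    return (n > 0) - (n < 0)


def find_order_violations(strings, cmp=locale.strcoll):
    """Return descriptions of antisymmetry and transitivity violations."""
    problems = []
    # Antisymmetry: cmp(a, b) and cmp(b, a) must have opposite signs.
    for a, b in permutations(strings, 2):
        if sign(cmp(a, b)) != -sign(cmp(b, a)):
            problems.append(f"antisymmetry: {a!r} vs {b!r}")
    # Transitivity: a < b and b < c must imply a < c.
    for a, b, c in permutations(strings, 3):
        if cmp(a, b) < 0 and cmp(b, c) < 0 and not cmp(a, c) < 0:
            problems.append(
                f"transitivity: {a!r} < {b!r} < {c!r} but not {a!r} < {c!r}"
            )
    return problems


if __name__ == "__main__":
    # "C" is always available; substitute e.g. "de_DE.UTF-8" where it is installed.
    locale.setlocale(locale.LC_COLLATE, "C")
    print(find_order_violations(["x xx", 'x xx"', "xxx", 'xxx"']))
```

An empty list means no violation was found among the strings tested; it does not prove the collation is sound in general.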

--
Peter Geoghegan
