Re: Missing rows with index scan when collation is not "C" (PostgreSQL 9.5) - Mailing list pgsql-bugs

From Peter Geoghegan
Subject Re: Missing rows with index scan when collation is not "C" (PostgreSQL 9.5)
Date
Msg-id CAM3SWZTums8PyGAHUpj=3gpt-xiGbbLqUun3mKK8-aTh-z9R2A@mail.gmail.com
Whole thread Raw
In response to Re: Missing rows with index scan when collation is not "C" (PostgreSQL 9.5)  (Peter Geoghegan <pg@heroku.com>)
Responses Re: Missing rows with index scan when collation is not "C" (PostgreSQL 9.5)
Re: Missing rows with index scan when collation is not "C" (PostgreSQL 9.5)
List pgsql-bugs
On Mon, Mar 21, 2016 at 9:04 PM, Peter Geoghegan <pg@heroku.com> wrote:
> Can you look at generating a textual representation of the strxfrm()
> blobs in question, using Robert's tool?:
>
> http://www.postgresql.org/message-id/CA+TgmoaOCyQpo8HK9yr6VTuyknWWvqgo7JeXi2kb=gpNveKR+g@mail.gmail.com

I played with this tool myself, on an affected CentOS 6.7 VM:

[vagrant@localhost ~]$ ldd --version
ldd (GNU libc) 2.12
Copyright (C) 2010 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
Written by Roland McGrath and Ulrich Drepper.

I now think that we have this backwards: This isn't a bug in glibc's
strxfrm(); it's a bug in glibc's strcoll(). Minimal testcase with
modified tool, simplified to use ascii-safe strings:

[vagrant@localhost ~]$ ./a.out de_DE.UTF-8 'xxx' 'x xx'
"xxx" -> 2323230108080801020202 (11 bytes)
"x xx" -> 2323230108080801020202010235 (14 bytes)
strcmp(arg1, arg2) result: -1
strcoll(arg1, arg2) result: 6

If we assume for the sake of argument that this is a strxfrm() bug and
strcoll() is a reliable source of truth, then I find it very curious
that Germany's Austrian neighbors differ on this point about how text
should be collated:

[vagrant@localhost ~]$ ./a.out de_AT.UTF-8 'xxx' 'x xx'
"xxx" -> 2323230108080801020202 (11 bytes)
"x xx" -> 2323230108080801020202010235 (14 bytes)
strcmp(arg1, arg2) result: -1
strcoll(arg1, arg2) result: -1

This surely adds doubt to the idea that strxfrm() in particular is broken.

I find something else inconsistent with the strxfrm() theory: even the
de_DE collation gives strxfrm()/strcoll() self-consistent answers when
we move the rhs argument's space to the far side of its center 'x'
char:

[vagrant@localhost ~]$ ./a.out de_DE.UTF-8 'xxx' 'xx x'
"xxx" -> 2323230108080801020202 (11 bytes)
"xx x" -> 2323230108080801020202010335 (14 bytes)
strcmp(arg1, arg2) result: -1
strcoll(arg1, arg2) result: -1

It seems very unlikely that this is because of a legitimate
consideration that strcoll() makes about how German should be collated
(one that strxfrm() fails to make, say).

This is probably a worse situation for affected Postgres systems,
though, because now they have no scope to turn the faulty part of the
system off. I have a hard time believing that it's a good idea to
trust strcoll() to be wrong in a consistent way that has collatable
type opclasses at least follow "Notes to Operator Class Implementors".
I'd like to hear more opinions on that, though, because it's a tricky
thing to reason about.

--
Peter Geoghegan

pgsql-bugs by date:

Previous
From: Alvaro Herrera
Date:
Subject: Re: problem in sql - sum()
Next
From: Tom Lane
Date:
Subject: Re: BUG #14034: Select for update with inner select doesn't return value after committing by other transaction.