Dealing with collation and strcoll/strxfrm/etc - Mailing list pgsql-hackers

From Stephen Frost
Subject Dealing with collation and strcoll/strxfrm/etc
Msg-id 20160328145704.GP3127@tamriel.snowman.net
In response to Re: Draft release notes for next week's releases  (Tom Lane <tgl@sss.pgh.pa.us>)
Responses Re: Dealing with collation and strcoll/strxfrm/etc
Re: Dealing with collation and strcoll/strxfrm/etc
List pgsql-hackers
All,

Changed the thread name (we're no longer talking about release
notes...).

* Tom Lane (tgl@sss.pgh.pa.us) wrote:
> Oleg Bartunov <obartunov@gmail.com> writes:
> > Should we start thinking about ICU ?
>
> Isn't it still true that ICU fails to meet our minimum requirements?
> That would include (a) working with the full Unicode character range
> (not only UTF16) and (b) working with non-Unicode encodings.  No doubt
> we could deal with (b) by inserting a conversion, but that would take
> a lot of shine off the performance numbers you mention.
>
> I'm also not exactly convinced by your implicit assumption that ICU is
> bug-free.

We have a wiki page about ICU.  I'm not sure that it's current, but if
it isn't and people are interested then perhaps we should update it:

https://wiki.postgresql.org/wiki/Todo:ICU

If we're going to talk about minimum requirements, I'd like to argue
that we require whatever library we use for collation to expose a
version for its collation rules (which glibc currently lacks, as I
understand it...), so that we can avoid the risk of indexes becoming
corrupt when those rules change.  I'm pretty sure that's already bitten
us on at least some RHEL6 -> RHEL7 migrations in some locales, even
forgetting the issues with strcoll vs. strxfrm.
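
To make the versioning idea concrete, here is a minimal sketch in C of
what "knowing what an index was built with" could look like.  Everything
in it is hypothetical: CollationStamp and index_needs_rebuild are names
made up for illustration, not anything in PG, and it assumes the
collation library can hand us some opaque version string at all, which
is exactly the missing piece.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Hypothetical metadata saved alongside an index when it is built. */
typedef struct CollationStamp
{
    const char *provider;   /* illustrative values: "libc", "icu" */
    const char *version;    /* opaque version string from the provider */
} CollationStamp;

/*
 * Returns true when the index was built under a different provider or
 * version than the one currently running, meaning its ordering can no
 * longer be trusted without a rebuild.  The version is compared only
 * for equality; no ordering of versions is assumed.
 */
static bool
index_needs_rebuild(const CollationStamp *built_with,
                    const CollationStamp *running)
{
    assert(built_with != NULL && running != NULL);

    return strcmp(built_with->provider, running->provider) != 0 ||
           strcmp(built_with->version, running->version) != 0;
}
```

The check is deliberately conservative: any difference in the stamp
flags the index, since we can't tell from a version string alone
whether the sort order for a given locale actually changed.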

Regarding key abbreviation and performance, if we are confident that
strcoll and strxfrm are at least independently internally consistent
then we could consider offering an option to choose between them.
We'd need to identify what each index was built with to do so, however,
as they would need to be rebuilt if the choice changes, at least
until/unless they're made to reliably agree.  Even using only one or the
other doesn't address the versioning problem though, which affects all
currently released versions of PG and is just going to continue to be
an issue.
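
For anyone who wants to poke at the strcoll vs. strxfrm question on
their own platform, a rough sketch of a pairwise check follows.  It only
uses standard C library calls; the helper names (sign, xfrm_dup,
collation_paths_agree) are made up for this example, and it says
nothing about whether either function is consistent with itself across
library versions.

```c
#include <assert.h>
#include <locale.h>
#include <stdlib.h>
#include <string.h>

/* Reduce a comparison result to -1, 0, or 1. */
static int
sign(int v)
{
    return (v > 0) - (v < 0);
}

/* strxfrm() s into a freshly malloc'd buffer; returns NULL on failure. */
static char *
xfrm_dup(const char *s)
{
    /* With a zero-length destination, strxfrm() reports the size needed. */
    size_t  need = strxfrm(NULL, s, 0) + 1;
    char   *buf = malloc(need);

    if (buf != NULL)
        strxfrm(buf, s, need);
    return buf;
}

/*
 * Returns 1 if strcoll(a, b) and strcmp() on the strxfrm() blobs order
 * the pair the same way in the current LC_COLLATE locale, 0 if they
 * disagree, and -1 if memory could not be allocated.
 */
static int
collation_paths_agree(const char *a, const char *b)
{
    char   *xa = xfrm_dup(a);
    char   *xb = xfrm_dup(b);
    int     result = -1;

    if (xa != NULL && xb != NULL)
        result = (sign(strcoll(a, b)) == sign(strcmp(xa, xb)));

    free(xa);
    free(xb);
    return result;
}
```

To use it, call setlocale(LC_COLLATE, "") (or name a specific locale)
and run collation_paths_agree() over pairs drawn from real data; a
return of 0 for any pair means an index built via one path can't safely
be searched via the other in that locale.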

Thanks!

Stephen
