Dealing with collation and strcoll/strxfrm/etc - Mailing list pgsql-hackers

From	Stephen Frost
Subject	Dealing with collation and strcoll/strxfrm/etc
Date	March 28, 2016 14:57:14
Msg-id	20160328145704.GP3127@tamriel.snowman.net
In response to	Re: Draft release notes for next week's releases (Tom Lane <tgl@sss.pgh.pa.us>)
Responses	Re: Dealing with collation and strcoll/strxfrm/etc Re: Dealing with collation and strcoll/strxfrm/etc
List	pgsql-hackers

All,

Changed the thread name (we're no longer talking about release
notes...).

* Tom Lane (tgl@sss.pgh.pa.us) wrote:
> Oleg Bartunov <obartunov@gmail.com> writes:
> > Should we start thinking about ICU ?
>
> Isn't it still true that ICU fails to meet our minimum requirements?
> That would include (a) working with the full Unicode character range
> (not only UTF16) and (b) working with non-Unicode encodings.  No doubt
> we could deal with (b) by inserting a conversion, but that would take
> a lot of shine off the performance numbers you mention.
>
> I'm also not exactly convinced by your implicit assumption that ICU is
> bug-free.

We have a wiki page about ICU.  I'm not sure that it's current, but if
it isn't and people are interested then perhaps we should update it:

https://wiki.postgresql.org/wiki/Todo:ICU

If we're going to talk about minimum requirements, I'd like to argue
that we require whatever system we're using to have versioning (which
glibc currently lacks, as I understand it...) to avoid the risk that
indexes will become corrupt when whatever we're using for collation
changes.  I'm pretty sure that's already bitten us on at least some
RHEL6 -> RHEL7 migrations in some locales, even forgetting the issues
with strcoll vs. strxfrm.
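
To make the versioning requirement a bit more concrete, here is a rough
sketch (Python, purely for illustration) of the bookkeeping it implies:
stamp each index with the collation provider and version it was built
under, and flag every index whose stamp no longer matches what the system
currently reports.  The CollationStamp type, the provider labels, the
index names and the version strings are all hypothetical, made up for
this example, and the sketch assumes the provider can report a version
at all.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CollationStamp:
    """Hypothetical record of what an index was built against."""
    provider: str  # illustrative label, e.g. "libc" or "icu"
    version: str   # whatever version string the provider reports


def indexes_needing_rebuild(index_stamps, current):
    """Return the names of indexes whose stamp differs from `current`.

    index_stamps maps an index name to the CollationStamp recorded when
    the index was built; `current` is the stamp of the running system.
    """
    return sorted(
        name for name, stamp in index_stamps.items() if stamp != current
    )


# Example: one index was built before a (made-up) collation upgrade.
recorded = {
    "idx_names": CollationStamp("libc", "old"),
    "idx_titles": CollationStamp("libc", "new"),
}
stale = indexes_needing_rebuild(recorded, CollationStamp("libc", "new"))
```

Here only idx_names would be flagged.  The point is just that the check
is trivial once a version is available; the hard part is getting a
trustworthy version out of the collation provider.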

Regarding key abbreviation and performance, if we are confident that
strcoll and strxfrm are at least independently internally consistent
then we could consider offering an option to choose between them.
We'd need to record which of the two each index was built with, however,
as an index would need to be rebuilt if the choice changes, at least
until/unless the two are made to reliably agree.  Using only one or the
other still doesn't address the versioning problem, though, which
affects all currently released versions of PG and is just going to
continue to be an issue.
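
For anyone who wants to see what "reliably agree" means in practice, here
is a small sketch (again Python, which exposes the C library routines as
locale.strcoll and locale.strxfrm) that compares the ordering the two
give for every pair in a list of strings.  The helper names and the
sample words are mine, and passing this check on a handful of strings
proves nothing about a locale in general; it can only demonstrate a
disagreement, not rule one out.

```python
import locale
from itertools import combinations


def sign(n):
    """Collapse a comparison result to -1, 0 or 1."""
    return (n > 0) - (n < 0)


def find_disagreements(words):
    """Return the pairs that strcoll and strxfrm order differently.

    Uses whatever LC_COLLATE setting is currently active in the process.
    """
    bad = []
    for a, b in combinations(words, 2):
        by_strcoll = sign(locale.strcoll(a, b))
        key_a, key_b = locale.strxfrm(a), locale.strxfrm(b)
        by_strxfrm = (key_a > key_b) - (key_a < key_b)
        if by_strcoll != by_strxfrm:
            bad.append((a, b))
    return bad


# Hypothetical sample input; a real check would use a much larger corpus.
mismatches = find_disagreements(["apple", "Banana", "cherry", "apple pie"])
```

Without a prior locale.setlocale(locale.LC_COLLATE, ...) call this runs
under the "C" collation, so it only becomes interesting after switching
to the locale in question.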

Thanks!

Stephen
