Re: Dealing with collation and strcoll/strxfrm/etc - Mailing list pgsql-hackers

From Oleg Bartunov
Subject Re: Dealing with collation and strcoll/strxfrm/etc
Date
Msg-id CAF4Au4yjYVMDdNtMAAr4e=Ut-JAjocFkNW3DA0KJFNAbx6ky2w@mail.gmail.com
In response to Dealing with collation and strcoll/strxfrm/etc  (Stephen Frost <sfrost@snowman.net>)
List pgsql-hackers


On Mon, Mar 28, 2016 at 5:57 PM, Stephen Frost <sfrost@snowman.net> wrote:
All,

Changed the thread name (we're no longer talking about release
notes...).

* Tom Lane (tgl@sss.pgh.pa.us) wrote:
> Oleg Bartunov <obartunov@gmail.com> writes:
> > Should we start thinking about ICU ?
>
> Isn't it still true that ICU fails to meet our minimum requirements?
> That would include (a) working with the full Unicode character range
> (not only UTF16) and (b) working with non-Unicode encodings.  No doubt
> we could deal with (b) by inserting a conversion, but that would take
> a lot of shine off the performance numbers you mention.
>
> I'm also not exactly convinced by your implicit assumption that ICU is
> bug-free.

We have a wiki page about ICU.  I'm not sure that it's current, but if
it isn't and people are interested then perhaps we should update it:

https://wiki.postgresql.org/wiki/Todo:ICU


Good point, I forgot about this page.

 
If we're going to talk about minimum requirements, I'd like to argue
that we require whatever system we're using to have versioning (which
glibc currently lacks, as I understand it...) to avoid the risk that
indexes will become corrupt when whatever we're using for collation
changes.  I'm pretty sure that's already bitten us on at least some
RHEL6 -> RHEL7 migrations in some locales, even forgetting the issues
with strcoll vs. strxfrm.

Agreed.
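
To make the versioning requirement concrete, here is a minimal sketch of the bookkeeping it implies: record the collation provider's version when an index is built, and flag the index if the running system reports a different one. Everything here is hypothetical and is not an existing PostgreSQL structure: IndexMeta, needs_rebuild and current_collation_version are made-up names, and the libc version string from Python's platform.libc_ver() is only a crude stand-in for a real collation version, which (as noted above) glibc does not expose.

```python
import platform
from dataclasses import dataclass


@dataclass(frozen=True)
class IndexMeta:
    """Hypothetical per-index record; not an actual PostgreSQL catalog."""
    name: str
    collation: str
    collation_version: str  # provider version recorded when the index was built


def current_collation_version() -> str:
    # Assumption: the libc version is used as a proxy for the collation
    # version. May be an empty string on platforms where it is unknown.
    lib, version = platform.libc_ver()
    return f"{lib} {version}".strip()


def needs_rebuild(meta: IndexMeta, running_version: str) -> bool:
    # Any mismatch means the sort order may have changed underneath the
    # index, so it can no longer be trusted without a rebuild.
    return meta.collation_version != running_version


if __name__ == "__main__":
    # Illustrative version strings only.
    meta = IndexMeta("users_name_idx", "en_US.UTF-8", "glibc 2.12")
    print(needs_rebuild(meta, "glibc 2.17"))
    print(needs_rebuild(meta, current_collation_version()))
```

A mismatch does not prove the ordering changed for the data actually stored; it only says the index can no longer be assumed valid.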
 

Regarding key abbreviation and performance, if we are confident that
strcoll and strxfrm are at least independently internally consistent
then we could consider offering an option to choose between them.
We'd need to identify what each index was built with to do so, however,
as they would need to be rebuilt if the choice changes, at least
until/unless they're made to reliably agree.  Even using only one or the
other doesn't address the versioning problem though, which is a problem
for all currently released versions of PG and is just going to continue
to be an issue.

Ideally, we should benchmark all locales on all platforms for all kinds of indexes, but that's a big project.
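
Before benchmarking, one could at least check whether strcoll and strxfrm agree with each other on a given platform and locale. The following is a rough sketch using Python's locale module, which wraps the C library's strcoll/strxfrm. The function name find_disagreements and the sample strings are made up for illustration, and a clean result on a small sample proves nothing about the locale as a whole.

```python
import locale
from itertools import combinations


def sign(n: int) -> int:
    """Collapse a comparison result to -1, 0 or 1."""
    return (n > 0) - (n < 0)


def find_disagreements(strings):
    """Return the pairs that strcoll and strxfrm order differently
    under the current LC_COLLATE setting."""
    disagreements = []
    for a, b in combinations(strings, 2):
        by_strcoll = sign(locale.strcoll(a, b))
        # Comparing the transformed keys should give the same ordering as
        # strcoll, if the C library is internally consistent.
        key_a, key_b = locale.strxfrm(a), locale.strxfrm(b)
        by_strxfrm = (key_a > key_b) - (key_a < key_b)
        if by_strcoll != by_strxfrm:
            disagreements.append((a, b))
    return disagreements


if __name__ == "__main__":
    # "" selects the locale from the environment; substitute the locale
    # under test, provided it is installed on the machine.
    locale.setlocale(locale.LC_COLLATE, "")
    sample = ["a", "B", "abc", "a b", "ab", "A-B", "résumé", "resume"]
    print(find_disagreements(sample))
```

Running this over a large, locale-specific corpus on each platform would be part of the same big project, but it is cheaper than a full index benchmark and would show where the two functions cannot be mixed.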
 

Thanks!

Stephen
