Re: [pgsql-packagers] Palle Girgensohn's ICU patch - Mailing list pgsql-hackers

From Dave Page
Subject Re: [pgsql-packagers] Palle Girgensohn's ICU patch
Date
Msg-id CA+OCxozt5BjdggE9-QnYxCQBROATXQV6ytWES8XPdG77f=5kbw@mail.gmail.com
Whole thread Raw
In response to Re: [pgsql-packagers] Palle Girgensohn's ICU patch  (Jakob Egger <jakob@eggerapps.at>)
Responses Re: [pgsql-packagers] Palle Girgensohn's ICU patch  (Palle Girgensohn <girgen@pingpong.net>)
List pgsql-hackers


On Thu, Nov 27, 2014 at 9:09 AM, Jakob Egger <jakob@eggerapps.at> wrote:
Am 26.11.2014 um 17:46 schrieb Geoff Montee <geoff.montee@gmail.com>:
> This topic reminds me of a thread from a couple months ago:
>
> http://www.postgresql.org/message-id/F8268DB6-B50F-429F-8289-DA8FFA5F22BA@tripadvisor.com
>
> It sounds like adding ICU support to core may also allow for adding
> collation versioning to indexes.

Reading through this thread it becomes clear to me that adding support for ICU is more important than I thought, and the only problem is that no one has yet volunteered for it :)

I've started looking through the PostgreSQL source and Palle's patch to estimate what needs to be done.

MINIMUM TODO
============

* Add support for per-column collations in varstr_comp() in varlena.c. Currently the patch creates a single ICU collator for the default collation and stores it in a static variable. We would need to change this to create collators for each collation and store them in a hash table similar to pg_newlocale_from_collation() / lookup_collation_cache()

* There's a new feature in trunk for faster sorting using SortSupport, so we would also need to also patch bttextfastcmp_locale() in varlena.c

These two changes would allow using ICU for collation. This has two major advantages:
1) Systems with broken strcoll like OS X and FreeBSD can take advantage of ICU to offer proper text sorting
2) You can link with a specific version of ICU to avoid index corruption and duplicate keys caused by changing implementations of the glibc strcoll function


NEXT STEPS: Support for more collations
=======================================

ICU offers a lot more collations than the OS. For example, besides "de_CH" it also offers "de_CH@collation=phonebook". Adding support for these is a bit more involved.

* initdb would need to be extended to also look for collations offered by ICU and add them to the pg_collation catalog.

* A special case for LC_COLLATE must be added to check_locale() in the backend, get_canonical_locale_name() in pg_upgrade, check_locale_name() in initdb to support collations provided by ICU

* pg_perm_setlocale() must get a special case to handle ICU collations

* the local handling code in pgperl must be modified (when using a ICU collation as default collation, we must decide what collation to send to perl)

* convert_string_datum() in selfuncs.c could be patched to use ICU instead of strxfrm. However, as far as I understand, this is not absolutely required as this is only used by the query planner and would in the worst case prevent some optimisation in corner cases

These changes would probably have an even bigger impact, because then people would no longer be limited to the collations supported by the locales installed on their OS.

NEXT STEPS: Collation versioning in indices
===========================================

Since ICU provides reliable versioning of collations, this would allow us to finally prevent index corruption caused by changing implementations of strcoll. I haven't looked at this in detail, but I assume that this would be a small change with potentially big impact.

Ideally, PostgreSQL would detect when the collation is a different version than the one used to create the index, and stop using the index until it is rebuilt.


I'll take a shot at the MINIMUM TODO as outlined above.


We've already included ICU support in our Postgres Plus Advanced Server product. Before you spend too much time on this, give me a few days to see if we can get that change contributed back. The people I need to speak to are OOO for Thanksgiving at the moment though, so it may be a few days.

--
Dave Page
Blog: http://pgsnake.blogspot.com
Twitter: @pgsnake

EnterpriseDB UK: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

pgsql-hackers by date:

Previous
From: Ian Barwick
Date:
Subject: pg_regress and --dbname option / multiple databases
Next
From: Simon Riggs
Date:
Subject: Re: [v9.5] Custom Plan API