Re: [pgsql-packagers] Palle Girgensohn's ICU patch - Mailing list pgsql-hackers

From Jakob Egger
Subject Re: [pgsql-packagers] Palle Girgensohn's ICU patch
Msg-id 72AE2E04-CD4E-4E7A-9303-49DE5354B4B3@eggerapps.at
In response to Re: [pgsql-packagers] Palle Girgensohn's ICU patch  (Geoff Montee <geoff.montee@gmail.com>)
Responses Re: [pgsql-packagers] Palle Girgensohn's ICU patch
Re: [pgsql-packagers] Palle Girgensohn's ICU patch
List pgsql-hackers
On 26.11.2014 at 17:46, Geoff Montee <geoff.montee@gmail.com> wrote:
> This topic reminds me of a thread from a couple months ago:
>
> http://www.postgresql.org/message-id/F8268DB6-B50F-429F-8289-DA8FFA5F22BA@tripadvisor.com
>
> It sounds like adding ICU support to core may also allow for adding
> collation versioning to indexes.

Reading through this thread it becomes clear to me that adding support for ICU is more important than I thought, and
the only problem is that no one has yet volunteered for it :)

I've started looking through the PostgreSQL source and Palle's patch to estimate what needs to be done.

MINIMUM TODO
============

* Add support for per-column collations in varstr_cmp() in varlena.c. Currently the patch creates a single ICU
collator for the default collation and stores it in a static variable. We would need to change this to create collators
for each collation and store them in a hash table similar to pg_newlocale_from_collation() / lookup_collation_cache()

* There's a new feature in trunk for faster sorting using SortSupport, so we would also need to patch
bttextfastcmp_locale() in varlena.c

These two changes would allow using ICU for collation. This has two major advantages:
1) Systems with broken strcoll like OS X and FreeBSD can take advantage of ICU to offer proper text sorting
2) You can link with a specific version of ICU to avoid index corruption and duplicate keys caused by changing
implementations of the glibc strcoll function
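
A rough sketch of what the per-collation cache from the first item might look like. This is not code from the patch or from PostgreSQL: the Collator struct and open_collator() are hypothetical stand-ins for a real ICU collator handle and the call that opens it, and the hand-rolled hash table only illustrates the idea of keying collators by collation OID instead of keeping one in a static variable.

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for an ICU collator handle. */
typedef struct Collator
{
    char locale[64];
} Collator;

/* Counts how often a collator was actually opened (i.e. cache misses). */
static int collators_opened = 0;

/* Hypothetical stand-in for opening an ICU collator for a locale. */
static Collator *
open_collator(const char *locale)
{
    Collator *c = malloc(sizeof(Collator));

    if (c == NULL)
        return NULL;
    snprintf(c->locale, sizeof(c->locale), "%s", locale);
    collators_opened++;
    return c;
}

#define CACHE_BUCKETS 64

typedef struct CacheEntry
{
    unsigned int collid;        /* collation OID, used as the cache key */
    Collator   *collator;
    struct CacheEntry *next;    /* chain for entries in the same bucket */
} CacheEntry;

static CacheEntry *cache[CACHE_BUCKETS];

/* Return the cached collator for collid, opening it on first use. */
static Collator *
lookup_collator(unsigned int collid, const char *locale)
{
    CacheEntry **bucket = &cache[collid % CACHE_BUCKETS];
    CacheEntry *e;

    for (e = *bucket; e != NULL; e = e->next)
    {
        if (e->collid == collid)
            return e->collator;
    }

    e = malloc(sizeof(CacheEntry));
    if (e == NULL)
        return NULL;
    e->collator = open_collator(locale);
    if (e->collator == NULL)
    {
        free(e);
        return NULL;
    }
    e->collid = collid;
    e->next = *bucket;
    *bucket = e;
    return e->collator;
}
```

The comparison function would then call lookup_collator() with the collation of the column being compared, so that repeated comparisons reuse the same collator.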


NEXT STEPS: Support for more collations
=======================================

ICU offers a lot more collations than the OS. For example, besides "de_CH" it also offers "de_CH@collation=phonebook".
Adding support for these is a bit more involved.

* initdb would need to be extended to also look for collations offered by ICU and add them to the pg_collation catalog.

* A special case for LC_COLLATE must be added to check_locale() in the backend, get_canonical_locale_name() in
pg_upgrade, and check_locale_name() in initdb to support collations provided by ICU

* pg_perm_setlocale() must get a special case to handle ICU collations

* the locale handling code in PL/Perl (plperl) must be modified (when using an ICU collation as the default collation,
we must decide what collation to send to Perl)

* convert_string_datum() in selfuncs.c could be patched to use ICU instead of strxfrm. However, as far as I understand,
this is not absolutely required as this is only used by the query planner and would in the worst case prevent some
optimisation in corner cases

These changes would probably have an even bigger impact, because then people would no longer be limited to the
collations supported by the locales installed on their OS.
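
To give an idea of what the special cases for ICU collation names would have to recognise, here is a minimal sketch that splits a name like "de_CH@collation=phonebook" into the base locale and the collation type. The function split_icu_locale() is made up for illustration and uses plain string handling only; real code would more likely ask ICU itself to parse and canonicalise the name.

```c
#include <stdio.h>
#include <string.h>

/*
 * Split an ICU-style locale name into its base locale and the value of the
 * "collation" keyword, if present. Keywords follow '@' and are separated
 * by ';'. Returns 1 if a collation keyword was found, 0 otherwise.
 * Hypothetical helper, not part of PostgreSQL or ICU.
 */
static int
split_icu_locale(const char *name,
                 char *base, size_t base_size,
                 char *collation, size_t coll_size)
{
    static const char key[] = "collation=";
    const size_t key_len = sizeof(key) - 1;
    const char *at = strchr(name, '@');
    size_t base_len = at ? (size_t) (at - name) : strlen(name);
    const char *p;

    if (base_size == 0 || coll_size == 0)
        return 0;

    /* Copy the part before '@', truncating if the buffer is too small. */
    if (base_len >= base_size)
        base_len = base_size - 1;
    memcpy(base, name, base_len);
    base[base_len] = '\0';
    collation[0] = '\0';

    if (at == NULL)
        return 0;

    /* Walk the ';'-separated keyword list looking for "collation=". */
    p = at + 1;
    while (*p != '\0')
    {
        const char *end = strchr(p, ';');
        size_t len = end ? (size_t) (end - p) : strlen(p);

        if (len > key_len && strncmp(p, key, key_len) == 0)
        {
            size_t vlen = len - key_len;

            if (vlen >= coll_size)
                vlen = coll_size - 1;
            memcpy(collation, p + key_len, vlen);
            collation[vlen] = '\0';
            return 1;
        }
        if (end == NULL)
            break;
        p = end + 1;
    }
    return 0;
}
```

A plain OS locale name such as "de_CH" passes through unchanged with no collation type, so the same helper could be used to decide whether a name needs the ICU code path at all.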

NEXT STEPS: Collation versioning in indices
===========================================

Since ICU provides reliable versioning of collations, this would allow us to finally prevent index corruption caused by
changing implementations of strcoll. I haven't looked at this in detail, but I assume that this would be a small change
with potentially big impact.

Ideally, PostgreSQL would detect when the collation is a different version than the one used to create the index, and
stop using the index until it is rebuilt.
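
The check itself could be as simple as the following sketch. Everything here is an assumption for illustration: PostgreSQL has no IndexCollationInfo struct, and the version strings passed in are placeholders for whatever the collation library reports (with ICU, presumably something derived from ucol_getVersion()).

```c
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical per-index metadata recorded when the index is built. */
typedef struct IndexCollationInfo
{
    char collation[64];     /* collation name, e.g. "de_CH" */
    char version[32];       /* collation version at build time */
    bool usable;            /* false once a version mismatch is seen */
} IndexCollationInfo;

/* Called when the index is built or rebuilt: remember the version used. */
static void
record_collation_version(IndexCollationInfo *idx,
                         const char *collation,
                         const char *current_version)
{
    snprintf(idx->collation, sizeof(idx->collation), "%s", collation);
    snprintf(idx->version, sizeof(idx->version), "%s", current_version);
    idx->usable = true;
}

/*
 * Called before using the index: if the collation version changed since
 * the index was built, mark it unusable. It stays unusable until
 * record_collation_version() is called again by a rebuild.
 */
static bool
index_usable(IndexCollationInfo *idx, const char *current_version)
{
    if (strcmp(idx->version, current_version) != 0)
        idx->usable = false;
    return idx->usable;
}
```

Keeping the index marked unusable even if the old version reappears is a deliberate choice in this sketch, since entries may have been added under the other version in the meantime.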


I'll take a shot at the MINIMUM TODO as outlined above.



