Re: [pgsql-packagers] Palle Girgensohn's ICU patch - Mailing list pgsql-hackers
From | Jakob Egger |
---|---|
Subject | Re: [pgsql-packagers] Palle Girgensohn's ICU patch |
Date | |
Msg-id | 72AE2E04-CD4E-4E7A-9303-49DE5354B4B3@eggerapps.at |
In response to | Re: [pgsql-packagers] Palle Girgensohn's ICU patch (Geoff Montee <geoff.montee@gmail.com>) |
List | pgsql-hackers |
On 26.11.2014 at 17:46, Geoff Montee <geoff.montee@gmail.com> wrote:
> This topic reminds me of a thread from a couple months ago:
>
> http://www.postgresql.org/message-id/F8268DB6-B50F-429F-8289-DA8FFA5F22BA@tripadvisor.com
>
> It sounds like adding ICU support to core may also allow for adding
> collation versioning to indexes.

Reading through this thread, it has become clear to me that adding support for ICU is more important than I thought, and the only problem is that no one has yet volunteered for it :)

I've started looking through the PostgreSQL source and Palle's patch to estimate what needs to be done.

MINIMUM TODO
============

* Add support for per-column collations in varstr_cmp() in varlena.c. Currently the patch creates a single ICU collator for the default collation and stores it in a static variable. We would need to change this to create collators for each collation and store them in a hash table, similar to pg_newlocale_from_collation() / lookup_collation_cache()
* There's a new feature in trunk for faster sorting using SortSupport, so we would also need to patch bttextfastcmp_locale() in varlena.c

These two changes would allow using ICU for collation. This has two major advantages:

1) Systems with a broken strcoll, like OS X and FreeBSD, can take advantage of ICU to offer proper text sorting
2) You can link with a specific version of ICU to avoid index corruption and duplicate keys caused by changing implementations of the glibc strcoll function

NEXT STEPS: Support for more collations
=======================================

ICU offers a lot more collations than the OS. For example, besides "de_CH" it also offers "de_CH@collation=phonebook". Adding support for these is a bit more involved.

* initdb would need to be extended to also look for collations offered by ICU and add them to the pg_collation catalog.
* A special case for LC_COLLATE must be added to check_locale() in the backend, get_canonical_locale_name() in pg_upgrade, and check_locale_name() in initdb to support collations provided by ICU
* pg_perm_setlocale() must get a special case to handle ICU collations
* the locale handling code in PL/Perl must be modified (when using an ICU collation as the default collation, we must decide what collation to send to Perl)
* convert_string_datum() in selfuncs.c could be patched to use ICU instead of strxfrm. However, as far as I understand, this is not absolutely required, as this is only used by the query planner and would in the worst case prevent some optimisation in corner cases

These changes would probably have an even bigger impact, because then people would no longer be limited to the collations supported by the locales installed on their OS.

NEXT STEPS: Collation versioning in indices
===========================================

Since ICU provides reliable versioning of collations, this would allow us to finally prevent index corruption caused by changing implementations of strcoll. I haven't looked at this in detail, but I assume that this would be a small change with potentially big impact.

Ideally, PostgreSQL would detect when the collation is a different version than the one used to create the index, and stop using the index until it is rebuilt.

I'll take a shot at the MINIMUM TODO as outlined above.
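
To make the first MINIMUM TODO item a bit more concrete, here is a rough sketch of a per-collation collator cache: one collator is opened per collation name on first use and reused afterwards, instead of a single collator in a static variable. Everything in it is an assumption for illustration only: the names CollatorCache, collator_cache_init(), collator_cache_get() and fake_open() are hypothetical and appear neither in Palle's patch nor in PostgreSQL, and it uses a small linear-scan array rather than a real hash table. With ICU, the opener would presumably wrap ucol_open().

```c
#include <stdlib.h>
#include <string.h>

#define MAX_COLLATORS 64

/* Hypothetical opener callback; with ICU this would wrap ucol_open(). */
typedef void *(*collator_open_fn)(const char *locale);

typedef struct
{
    char *locale;    /* collation name, e.g. "de_CH@collation=phonebook" */
    void *collator;  /* opaque handle returned by the opener */
} CollatorEntry;

typedef struct
{
    CollatorEntry    entries[MAX_COLLATORS];
    size_t           count;
    collator_open_fn open;
    size_t           opens;  /* how many times the opener was called */
} CollatorCache;

static void
collator_cache_init(CollatorCache *cache, collator_open_fn open)
{
    memset(cache, 0, sizeof(*cache));
    cache->open = open;
}

/* Return the cached collator for this locale, opening it on first use. */
static void *
collator_cache_get(CollatorCache *cache, const char *locale)
{
    size_t  i;
    size_t  len;
    char   *key;
    void   *collator;

    /* Linear scan stands in for a hash table lookup. */
    for (i = 0; i < cache->count; i++)
    {
        if (strcmp(cache->entries[i].locale, locale) == 0)
            return cache->entries[i].collator;
    }

    if (cache->count == MAX_COLLATORS)
        return NULL;            /* cache full; a real version would grow */

    /* Copy the key first so a failed allocation cannot leak a collator. */
    len = strlen(locale) + 1;
    key = malloc(len);
    if (key == NULL)
        return NULL;
    memcpy(key, locale, len);

    collator = cache->open(locale);
    cache->opens++;
    if (collator == NULL)
    {
        free(key);
        return NULL;
    }

    cache->entries[cache->count].locale = key;
    cache->entries[cache->count].collator = collator;
    cache->count++;
    return collator;
}

/* Stand-in opener for experimenting without ICU: returns a unique handle. */
static void *
fake_open(const char *locale)
{
    (void) locale;
    return malloc(1);
}
```

Asking for "de_CH" twice returns the same handle and calls the opener only once, while "de_CH@collation=phonebook" gets its own entry. A real implementation would also have to decide when collators are closed and in which memory context the entries live.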