Re: [HACKERS] ICU integration - Mailing list pgsql-hackers

From Thomas Munro
Subject Re: [HACKERS] ICU integration
Date
Msg-id CAEepm=3W4kUoAMYG3XOKPqv-=k1TL5A8zJvF6M93DNaHRoQeyw@mail.gmail.com
Whole thread Raw
In response to Re: [HACKERS] ICU integration  (Peter Geoghegan <pg@heroku.com>)
Responses Re: [HACKERS] ICU integration  (Peter Eisentraut <peter.eisentraut@2ndquadrant.com>)
List pgsql-hackers
On Tue, Jan 10, 2017 at 9:45 AM, Peter Geoghegan <pg@heroku.com> wrote:
> * I think it's worth looking into ucol_nextSortKeyPart(), and using
> that as an alternative to ucol_getSortKey(). It doesn't seem any
> harder, and when I tested it it was clearly faster. (I think that
> ucol_nextSortKeyPart() is more or less intended to be used for
> abbreviated keys.)

+1

I assume (but haven't checked) that ucol_nextSortKeyPart accesses only
the start of the string via the UCharIterator passed in, unless you
have the rare reverse-accent-sort feature enabled (maybe used only in
fr_CA, it looks like it is required to scan the whole string looking
for the last accent).  So I assume that uiter_setUTF8 and
ucol_nextSortKeyPart would usually do a small fixed amount of work,
whereas this patch's icu_to_uchar allocates space and converts the
whole variable length string every time.

On the other hand, I see that this patch's icu_to_uchar can deal with
encodings other than UTF8.  At first glance, the UCharIterator
interface provides a way to make a UCharIterator that iterates over a
string of UChar or a string of UTF8, but I'm not sure how to make it
do character-by-character transcoding from arbitrary encodings via the
C API.  It seems like that should be possible using ucnv_getNextUChar
as the source of transcoded characters, but I'm not sure how to wire
those two things together (in C++ I think you'd subclass
icu::CharacterIterator).

That's about abbreviation, but I note that you can also compare
strings using iterators with ucol_strcollIter, avoiding the need to
allocate and transcode up front.  I have no idea whether that'd pay
off.

+   A change in collation definitions can lead to corrupt indexes and other
+   problems where the database system relies on stored objects having a
+   certain sort order.  Generally, this should be avoided, but it can happen
+   in legitimate circumstances, such as when
+   using <command>pg_upgrade</command> to upgrade to server binaries linked
+   with a newer version of ICU.  When this happens, all objects depending on
+   the collation should be rebuilt, for example,
+   using <command>REINDEX</command>.  When that is done, the collation version
+   can be refreshed using the command <literal>ALTER COLLATION ... REFRESH
+   VERSION</literal>.  This will update the system catalog to record the
+   current collator version and will make the warning go away.  Note that this
+   does not actually check whether all affected objects have been rebuilt

I think this is a pretty reasonable first approach to this problem.
It's a simple way to flag up a problem to the DBA, but leaves all the
responsibility for figuring out how to fix it to the DBA.  I think we
should considering going further in later patches (tracking the
version used at last rebuild per index etc as discussed, so that the
condition is cleared only by rebuilding the affected things).

(REPARTITION anyone?)

As far as I know there are two things moving: ICU code and CLDR data.
Here we can see the CLDR versions being pulled into ICU:

http://bugs.icu-project.org/trac/log/icu/trunk/source/data/locales?rev=39273

Clearly when you upgrade your system from (say) Debian 8 to Debian 9
and the ICU major version changes we expect to have to REINDEX, but
does anyone know if such data changes are ever pulled into the minor
version package upgrades you get from regular apt-get update of (say)
a Debian 8 or CentOS 7 or FreeBSD 11 system?  In other words, do we
expect collversion changes to happen basically any time in response to
regular system updates, or only when you're doing a major upgrade of
some kind, as the above-quoted documentation patch implies?

+               collversion = ntohl(*((uint32 *) versioninfo));

UVersionInfo is an array of four uint8_t.  That doesn't sound like
something that needs to be bit-swizzled... is it really?  Even if it
is arranged differently on different architectures, I'm not sure why
you care since we only ever use it to compare for equality on the same
system.  But aside from that, I don't love this cast to uint32.  I
wonder if we should use u_versionToString to respect boundaries a bit
better?

I have another motivation for wanting to model collation versions as
strings: I have been contemplating a version check for system-provided
collations too, piggy-backing on your work here.  Obviously PostgreSQL
can't directly know anything about system collation versions, but I'm
thinking of a GUC system_collation_version_command which you could
optionally set to a script that knows how to inspect the local
operating system.  For example, a package maintainer might be
interested in writing such a script that knows how to ask the package
system for a locale data version.  A brute force approach that works
on many systems could be 'tar c /usr/share/local/*/LC_COLLATE | md5'.
A string would provide more leeway than a measly int32.  That's
independent of this patch and you might hate the whole idea, but seems
to be the kind of thing you anticipated when you described collversion
as "[p]rovider-specific version of the collation".

-- 
Thomas Munro
http://www.enterprisedb.com



pgsql-hackers by date:

Previous
From: Craig de Stigter
Date:
Subject: [HACKERS] Should `pg_upgrade --check` check relation filenodes are present?
Next
From: Jim Nasby
Date:
Subject: Re: [HACKERS] Time to up bgwriter_lru_maxpages?