Thread: Collation version tracking for macOS

Collation version tracking for macOS

From
Peter Eisentraut
Date:
During development, I have been using the attached patch to simulate 
libc collation versions on macOS.  It just uses the internal major OS 
version number.  I don't know to what the extend the libc locales on 
macOS are maintained or updated at all, so I don't know what practical 
effect this would have.  Again, it's mainly for development.  If there 
is interest from others, I think we could add this, maybe disabled by 
default, or we just keep it in the mailing list archives for interested 
parties.
Attachment

Re: Collation version tracking for macOS

From
Thomas Munro
Date:
On Mon, Feb 14, 2022 at 10:00 PM Peter Eisentraut
<peter.eisentraut@enterprisedb.com> wrote:
> During development, I have been using the attached patch to simulate
> libc collation versions on macOS.  It just uses the internal major OS
> version number.  I don't know to what the extend the libc locales on
> macOS are maintained or updated at all, so I don't know what practical
> effect this would have.  Again, it's mainly for development.  If there
> is interest from others, I think we could add this, maybe disabled by
> default, or we just keep it in the mailing list archives for interested
> parties.

Last time I looked into this it seemed like macOS's strcoll() gave
sensible answers in the traditional single-byte encodings, but didn't
understand UTF-8 at all so you get C/strcmp() order.  In other words
there was effectively nothing to version.  I remember that other old
Unixes used to be like that, and I suspect that they might be using
old pre-UTF-8 FreeBSD code for locales based on a quick peek at [1]
(though FreeBSD itself has since learned to do CLDR-based UTF-8
sorting with a completely new implementation shared with other OSes).
This makes me wonder if Apple is hiding another collation
implementation somewhere up its sleeve -- surely that libc support is
not good enough for the world's shiny globalised macOS/iOS apps?
Maybe UCCompareText() and friends (UnicodeUtilitiesCoreLib) and the
various Obj-C NSString comparison stuff, all of which probably
predates Unixoid macOS (google tells me that UnicodeUtilities.h was
present in macOS 9).  It wouldn't be surprising if it shares nothing
with the modern OS's C runtime stuff that came via NeXT.  Just
mentioning this as a curiosity, because I was trying to figure out how
that could be left non-working without anyone complaining...

[1] https://github.com/apple-open-source-mirror/Libc/tree/master/locale



Re: Collation version tracking for macOS

From
Peter Eisentraut
Date:
On 07.05.22 02:31, Thomas Munro wrote:
>> During development, I have been using the attached patch to simulate
>> libc collation versions on macOS.  It just uses the internal major OS
>> version number.  I don't know to what the extend the libc locales on
>> macOS are maintained or updated at all, so I don't know what practical
>> effect this would have.  Again, it's mainly for development.  If there
>> is interest from others, I think we could add this, maybe disabled by
>> default, or we just keep it in the mailing list archives for interested
>> parties.
> Last time I looked into this it seemed like macOS's strcoll() gave
> sensible answers in the traditional single-byte encodings, but didn't
> understand UTF-8 at all so you get C/strcmp() order.  In other words
> there was effectively nothing to version.

Someone recently told me that collations in macOS have actually changed 
recently and that this is a live problem.  See explanation here:

https://github.com/PostgresApp/PostgresApp/blob/master/docs/documentation/reindex-warning.md?plain=1#L66

So I think we should reconsider this patch, even for PG15.




Re: Collation version tracking for macOS

From
Tom Lane
Date:
Peter Eisentraut <peter.eisentraut@enterprisedb.com> writes:
> On 07.05.22 02:31, Thomas Munro wrote:
>> Last time I looked into this it seemed like macOS's strcoll() gave
>> sensible answers in the traditional single-byte encodings, but didn't
>> understand UTF-8 at all so you get C/strcmp() order.  In other words
>> there was effectively nothing to version.

> Someone recently told me that collations in macOS have actually changed
> recently and that this is a live problem.  See explanation here:

> https://github.com/PostgresApp/PostgresApp/blob/master/docs/documentation/reindex-warning.md?plain=1#L66

> So I think we should reconsider this patch, even for PG15.

According to that document, they changed it in macOS 11, which came out
a year and a half ago.  Given the lack of complaints, it doesn't seem
like this is urgent enough to mandate a post-beta change that would
have lots of downside (namely, false-positive warnings for every other
macOS update).

            regards, tom lane



Re: Collation version tracking for macOS

From
Jeremy Schneider
Date:
On 6/3/22 9:21 AM, Tom Lane wrote:
> 
> According to that document, they changed it in macOS 11, which came out
> a year and a half ago.  Given the lack of complaints, it doesn't seem
> like this is urgent enough to mandate a post-beta change that would
> have lots of downside (namely, false-positive warnings for every other
> macOS update).


Sorry, I'm going to rant for a minute... it is my very strong opinion
that using language like "false positive" here is misguided and dangerous.

If new version of sort order is released, for example when they recently
updated backwards-secondary sorting in french [CLDR-2905] or matching of
v and w in swedish and finnish [CLDR-7088], it is very dangerous to use
language like “false positive” to describe a database where there just
didn't happen to be any rows with accented french characters at the
point in time where PostgreSQL magically changed which version of sort
order it was using from the 2010 french version to the 2020 french version.

No other piece of software that calls itself a database would do what
PostgreSQL is doing: just give users a "warning" after suddenly changing
the sort order algorithm (most users won't even read warnings in their
logs). Oracle, DB2, SQL Server and even MySQL carefully version
collation data, hardcode a pseudo-linguistic collation into the DB (like
PG does for timezones), and if they provide updates to linguistic sort
order (from Unicode CLDR) then they allow the user to explicitly specify
which version of french or german ICU sorting they are want to use.
Different versions are treated as different sort orders; they are not
conflated.

I have personally seen PostgreSQL databases where an update to an old
version of glibc was applied (I'm not even talking 2.28 here) and it
resulted in data loss b/c crash recovery couldn't replay WAL records and
the user had to do a PITR. That's aside from the more common issues of
segfaults or duplicate records that violate unique constraints or wrong
query results like missing data. And it's not just updates - people can
set up a hot standby on a different version and see many of these
problems too.

Collation versioning absolutely must be first class and directly
controlled by users, and it's very dangerous to allow users - at all -
to take an index and then use a different version than what the index
was built with.

Not to mention all the other places in the DB where collation is used...
partitioning, constraints, and any other place where persisted data can
make an assumption about any sort of string comparison.

It feels to me like we're still not really thinking clearly about this
within the PG community, and that the seriousness of this issue is not
fully understood.

-Jeremy Schneider


-- 
http://about.me/jeremy_schneider



Re: Collation version tracking for macOS

From
Thomas Munro
Date:
On Sat, Jun 4, 2022 at 12:17 AM Peter Eisentraut
<peter.eisentraut@enterprisedb.com> wrote:
> On 07.05.22 02:31, Thomas Munro wrote:
> > Last time I looked into this it seemed like macOS's strcoll() gave
> > sensible answers in the traditional single-byte encodings, but didn't
> > understand UTF-8 at all so you get C/strcmp() order.  In other words
> > there was effectively nothing to version.
>
> Someone recently told me that collations in macOS have actually changed
> recently and that this is a live problem.  See explanation here:
>
> https://github.com/PostgresApp/PostgresApp/blob/master/docs/documentation/reindex-warning.md?plain=1#L66

How can I see evidence of this?  I'm comparing Debian, FreeBSD and
macOS 12.4 and when I run "LC_COLLATE=en_US.UTF-8 sort
/usr/share/dict/words" I get upper and lower case mixed together on
the other OSes, but on the Mac the upper case comes first, which is my
usual smoke test for "am I looking at binary sort order?"



Re: Collation version tracking for macOS

From
Thomas Munro
Date:
On Sat, Jun 4, 2022 at 7:13 AM Jeremy Schneider
<schneider@ardentperf.com> wrote:
> No other piece of software that calls itself a database would do what
> PostgreSQL is doing: just give users a "warning" after suddenly changing
> the sort order algorithm (most users won't even read warnings in their
> logs). Oracle, DB2, SQL Server and even MySQL carefully version
> collation data, hardcode a pseudo-linguistic collation into the DB (like
> PG does for timezones), and if they provide updates to linguistic sort
> order (from Unicode CLDR) then they allow the user to explicitly specify
> which version of french or german ICU sorting they are want to use.
> Different versions are treated as different sort orders; they are not
> conflated.

I guess you know this but for the record, there have been discussions
before about supporting multiple versions of ICU concurrently, like
DB2.  For example, one idea was that each ICU version could be a
separate "provider" in PostgreSQL, so you can concurrently use
multiple versions.

One of several places this came up:
https://www.postgresql.org/message-id/CADE5jYJTnYaTNXMFKOK-0p44%2BDm5LMcRcJ5kVi1MVHomb2QTkQ%40mail.gmail.com

> It feels to me like we're still not really thinking clearly about this
> within the PG community, and that the seriousness of this issue is not
> fully understood.

FWIW A couple of us tried quite hard to make smarter warnings, and
that thread and others discussed a lot of those topics, like the
relevance to constraints and so forth.



Re: Collation version tracking for macOS

From
Tom Lane
Date:
Thomas Munro <thomas.munro@gmail.com> writes:
> On Sat, Jun 4, 2022 at 7:13 AM Jeremy Schneider
> <schneider@ardentperf.com> wrote:
>> It feels to me like we're still not really thinking clearly about this
>> within the PG community, and that the seriousness of this issue is not
>> fully understood.

> FWIW A couple of us tried quite hard to make smarter warnings, and
> that thread and others discussed a lot of those topics, like the
> relevance to constraints and so forth.

I think the real problem here is that the underlying software mostly
doesn't take this issue seriously.  Unfortunately, that leads one to
the conclusion that we need to maintain our own collation code and
data (e.g., our own fork of ICU), and that isn't happening.  Unlike
say Oracle, we do not have the manpower; nor do we want to bloat our
code base that much.

Short of maintaining our own fork, ranting about the imperfections
of the situation is a waste of time.

            regards, tom lane



Re: Collation version tracking for macOS

From
Jim Nasby
Date:
On 6/3/22 3:58 PM, Tom Lane wrote

> Thomas Munro <thomas.munro@gmail.com> writes:
>> On Sat, Jun 4, 2022 at 7:13 AM Jeremy Schneider
>> <schneider@ardentperf.com> wrote:
>>> It feels to me like we're still not really thinking clearly about this
>>> within the PG community, and that the seriousness of this issue is not
>>> fully understood.
>> FWIW A couple of us tried quite hard to make smarter warnings, and
>> that thread and others discussed a lot of those topics, like the
>> relevance to constraints and so forth.
> I think the real problem here is that the underlying software mostly
> doesn't take this issue seriously.  Unfortunately, that leads one to
> the conclusion that we need to maintain our own collation code and
> data (e.g., our own fork of ICU), and that isn't happening.  Unlike
> say Oracle, we do not have the manpower; nor do we want to bloat our
> code base that much.
>
> Short of maintaining our own fork, ranting about the imperfections
> of the situation is a waste of time.
The first step to a solution is admitting that the problem exists. 
Ignoring broken backups, segfaults and data corruption as a "rant" 
implies that we simply throw in the towel and tell users to suck it up 
or switch engines. There are other ways to address this short of the 
community doing all the work itself. One simple example would be to 
refuse to start if the collation provider has changed since initdb 
(which we'd need to allow users to override). A more sophisticated 
option would be to provide the machinery for supporting multiple 
collation libraries. Both of those at least ensure that users are aware 
any time there's a problem, which IMO is *enormously* better than 
letting core functionality silently stop working.



Re: Collation version tracking for macOS

From
Tom Lane
Date:
Jim Nasby <nasbyj@amazon.com> writes:
>> I think the real problem here is that the underlying software mostly
>> doesn't take this issue seriously.

> The first step to a solution is admitting that the problem exists. 
> Ignoring broken backups, segfaults and data corruption as a "rant" 
> implies that we simply throw in the towel and tell users to suck it up 
> or switch engines. There are other ways to address this short of the 
> community doing all the work itself. One simple example would be to 
> refuse to start if the collation provider has changed since initdb 
> (which we'd need to allow users to override).

You're conveniently skipping over the hard part, which is to tell
whether the collation provider has changed behavior (which we'd better
do with pretty darn high accuracy, if we're going to refuse to start
on the basis of thinking it has).  Unfortunately, giving a reliable
indication of collation behavioral changes is *exactly* the thing
that the providers aren't taking seriously.

            regards, tom lane



Re: Collation version tracking for macOS

From
Jeremy Schneider
Date:
> On Jun 6, 2022, at 17:10, Jim Nasby <nasbyj@amazon.com> wrote:
> Ignoring broken backups, segfaults and data corruption as a "rant" implies that we simply throw in the towel and
tellusers to suck it up or switch engines. 


Well now, let’s be clear, I was the one who called my email a “rant”.  🙂

And I do apologize for that - it was grumpy and impulsive and Tom isn’t wrong that rants don’t usually help move things
forward.

Thomas - thanks for the link back to one of the threads. I spent some time reading through that and it’s a lot of
material;I haven’t read the whole thread yet. If you have some others that would also be particularly good background,
letme know. I’m doing a chunk of this in my spare time at the moment, but I do want to keep getting more up to speed. I
waspulled into a bunch of various things related to PostgreSQL and ICU and collation and OS’s over the past couple
years,so I learned a lot from on-the-ground experience and I am interested in trying to get a little more involved in
theconversation here. 

Personally, I really do think there should at least be an *option* to tell the DB to fully error rather than just warn
onversion mismatch. Correctness matters to many users, and being able to *trust* string comparisons are correct is
prettydamn fundamental all throughout a database. It really doesn’t get any more basic and the potential for bad things
tohappen is pretty astronomical, if you can’t trust those. I understand the consternation about dealing with upgrades
oflarge & busy databases, but I’m still surprised that the community consensus arrived at the present behavior, and I
havea lot of reading to do, to really understand how that happened and where the dialogue is today. 

Multiple versions of ICU sounds nice for users who need real linguistic collation (like what Oracle and DB2 offer), but
Istill feel like there needs to be a super simple basic “pseudo-linguistic” collation baked in, that’s “good enough”
for99% of users and that is guaranteed to be the same everywhere on every platform and just won’t ever change. I think
glibcneeds to be phased out somehow. At a minimum, not the default for new users… to stop the bleeding. If MySQL wasn’t
GPLthen I’d say to just copy their collations. I’d be reluctant to spend too much time on a POC now though, it feels
likemy idea is the outlier and the general PG hacker consensus would be to reject this idea. (But maybe I’m wrong?) 

Anyway, again, apologies for my pants-on-fire email last week. I hope I can enjoy a few beers someday - or coffee for
thenon-drinkers - with a few other PG collation nerds (which I never set out to be, but it may have befallen me <g>). 

-Jeremy


Sent from my TI-83




Re: Collation version tracking for macOS

From
Thomas Munro
Date:
On Tue, Jun 7, 2022 at 12:10 PM Jim Nasby <nasbyj@amazon.com> wrote:
> On 6/3/22 3:58 PM, Tom Lane wrote
> > Thomas Munro <thomas.munro@gmail.com> writes:
> >> On Sat, Jun 4, 2022 at 7:13 AM Jeremy Schneider
> >> <schneider@ardentperf.com> wrote:
> >>> It feels to me like we're still not really thinking clearly about this
> >>> within the PG community, and that the seriousness of this issue is not
> >>> fully understood.
> >> FWIW A couple of us tried quite hard to make smarter warnings, and
> >> that thread and others discussed a lot of those topics, like the
> >> relevance to constraints and so forth.
> > I think the real problem here is that the underlying software mostly
> > doesn't take this issue seriously.  Unfortunately, that leads one to
> > the conclusion that we need to maintain our own collation code and
> > data (e.g., our own fork of ICU), and that isn't happening.  Unlike
> > say Oracle, we do not have the manpower; nor do we want to bloat our
> > code base that much.
> >
> > Short of maintaining our own fork, ranting about the imperfections
> > of the situation is a waste of time.
> The first step to a solution is admitting that the problem exists.

We've been discussing this topic for years and I don't think anyone
thinks the case is closed...

> Ignoring broken backups, segfaults and data corruption as a "rant"
> implies that we simply throw in the towel and tell users to suck it up
> or switch engines. There are other ways to address this short of the
> community doing all the work itself. One simple example would be to
> refuse to start if the collation provider has changed since initdb
> (which we'd need to allow users to override).

Yeah, it's been discussed, but never proposed.  The problem is that
you need to start up to fix the problem.  Another option is not to use
affected indexes, but that doesn't help with other forms of the
problem (partition constraints, etc).

> A more sophisticated
> option would be to provide the machinery for supporting multiple
> collation libraries.

Earlier I mentioned distinct "providers" but I take that back, that's
too complicated.  Reprising an old idea that comes up each time we
talk about this, this time with some more straw-man detail: what about
teaching our ICU support to understand "libicu18n.so.71:en" to mean
that it should dlopen() that library and use its functions?  Or some
cleverer, shorter notation.  Then it's the user's problem to make sure
the right libraries are installed, and it'll fail if they're not.  For
example, on Debian bookworm right now you can install libicu63,
libicu67, libicu71, though only the "current" -dev package, but which
I'm sure we can cope with.  You're at the mercy of the distro or
add-on package repos to keep a lot of versions around, but that seems
OK.  Maintaining our own fork(s) of ICU would seem like massive
overkill and I don't think anyone has suggested that; the question on
my mind is  whether we could rely on existing packages.  Then you'd be
exposed only to changes that happen within (say) the ICU 63 package's
lifetime... I recall looking into whether that can happen but ... I
don't recall the answer.



Re: Collation version tracking for macOS

From
Rod Taylor
Date:


On Mon, Jun 6, 2022 at 8:25 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
Jim Nasby <nasbyj@amazon.com> writes:
>> I think the real problem here is that the underlying software mostly
>> doesn't take this issue seriously.

> The first step to a solution is admitting that the problem exists.
> Ignoring broken backups, segfaults and data corruption as a "rant"
> implies that we simply throw in the towel and tell users to suck it up
> or switch engines. There are other ways to address this short of the
> community doing all the work itself. One simple example would be to
> refuse to start if the collation provider has changed since initdb
> (which we'd need to allow users to override).

You're conveniently skipping over the hard part, which is to tell
whether the collation provider has changed behavior (which we'd better
do with pretty darn high accuracy, if we're going to refuse to start
on the basis of thinking it has).  Unfortunately, giving a reliable
indication of collation behavioral changes is *exactly* the thing
that the providers aren't taking seriously.

Is this more involved than creating a list of all valid Unicode characters (~144 thousand), sorting them, then running crc32 over the sorted order to create the "version" for the library/collation pair? Far from free but few databases use more than a couple different collations.

--
Rod Taylor

Re: Collation version tracking for macOS

From
Thomas Munro
Date:
On Wed, Jun 8, 2022 at 3:58 AM Rod Taylor <rbt@rbt.ca> wrote:
> Is this more involved than creating a list of all valid Unicode characters (~144 thousand), sorting them, then
runningcrc32 over the sorted order to create the "version" for the library/collation pair? Far from free but few
databasesuse more than a couple different collations. 

Collation rules have multiple levels and all kinds of quirks, so that
won't work.



Re: Collation version tracking for macOS

From
Robert Haas
Date:
On Fri, Jun 3, 2022 at 4:58 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> I think the real problem here is that the underlying software mostly
> doesn't take this issue seriously.  Unfortunately, that leads one to
> the conclusion that we need to maintain our own collation code and
> data (e.g., our own fork of ICU), and that isn't happening.  Unlike
> say Oracle, we do not have the manpower; nor do we want to bloat our
> code base that much.

You don't, but that opinion isn't universally held, or at least not
with the same vigor that you hold it. See e.g.
https://www.postgresql.org/message-id/a4019c5e570d4dbb5e3f816c080fb57c76ab604a.camel%40cybertec.at
and subsequent discussion, for example.

In fact, I'd go so far as to argue that you're basically sticking your
head in the sand here. You wrote:

"Given the lack of complaints, it doesn't seem
like this is urgent enough to mandate a post-beta change that would
have lots of downside (namely, false-positive warnings for every other
macOS update)."

But you wrote that to Peter, who was essentially complaining that we
hadn't done anything, and linked to another source, which was also
complaining about the problem, and then Jeremy Schneider replied to
your email and complained some more.

Complaining about "false positives" doesn't really make sense to me.
It's true that we don't have any false positives right now, but we
also have no true positives. Even a stopped clock is right twice a
day, but not in a useful way. People want to be notified when a
problem might exist, even if sometimes it doesn't actually. The
alternative is having no idea at all that things might be broken,
which is not better.

--
Robert Haas
EDB: http://www.enterprisedb.com



Re: Collation version tracking for macOS

From
Tom Lane
Date:
Thomas Munro <thomas.munro@gmail.com> writes:
> On Wed, Jun 8, 2022 at 3:58 AM Rod Taylor <rbt@rbt.ca> wrote:
>> Is this more involved than creating a list of all valid Unicode characters (~144 thousand), sorting them, then
runningcrc32 over the sorted order to create the "version" for the library/collation pair? Far from free but few
databasesuse more than a couple different collations. 

> Collation rules have multiple levels and all kinds of quirks, so that
> won't work.

Yeah, and it's exactly at the level of quirks that things are likely
to change.  Nobody's going to suddenly start sorting B before A.
They might, say, change their minds about where the digram "cz"
sorts relative to single letters, in languages where special rules
for that are a thing.

The idea of fingerprinting a collation's behavior is interesting,
but I've got doubts about whether we can make a sufficiently thorough
fingerprint.

            regards, tom lane



Re: Collation version tracking for macOS

From
Tom Lane
Date:
Robert Haas <robertmhaas@gmail.com> writes:
> In fact, I'd go so far as to argue that you're basically sticking your
> head in the sand here. You wrote:

No, I quite agree that we have a problem.  What I don't agree is that
issuing a lot of false-positive warnings is a solution.  That will
just condition people to ignore the warnings, and then when their
platform really does change behavior, they're still screwed.  If we
could *accurately* report collation behavioral changes, I'd be all
for that.

Rod's idea upthread is certainly way too simplistic, but could we
build a set of test cases that do detect known changes in collation
behaviors?  We'd be shooting at a moving target; but even if we're
late in noticing that platform X changed the behavior of collation Y,
we could help users who run in the problem afterwards.

            regards, tom lane



Re: Collation version tracking for macOS

From
Peter Geoghegan
Date:
On Tue, Jun 7, 2022 at 12:37 PM Robert Haas <robertmhaas@gmail.com> wrote:
> It's true that we don't have any false positives right now, but we
> also have no true positives. Even a stopped clock is right twice a
> day, but not in a useful way. People want to be notified when a
> problem might exist, even if sometimes it doesn't actually.

Collations by their very nature are unlikely to change all that much.
Obviously they can and do change, but the details are presumably
pretty insignificant to a native speaker. Stands to reason that the
issue (which is fundamentally a problem for natural language experts)
would have been resolved far sooner if there really was a significant
controversy about something that tends to come up often.

It's pretty clear that glibc as a project doesn't take the issue very
seriously, because they see it as a problem of the GUI sorting a table
in a way that seems slightly suboptimal to scholars of a natural
language. Clearly that isn't actually a big deal. But the latent
possibility of wrong answers to queries is a very big deal. Both are
true. It's just a matter of priorities in each case.

I agree that "false positive" is not a valid way of describing a
breaking change in a Postgres collation that happens to not affect one
index in particular, due to the current phase of the moon. It's
probably very likely that most individual indexes that we warn about
will be so-called false positives. I bet Postgres that there are many
near-misses that we never get to hear about already. That's rather
beside the point. The index must be assumed to be corrupt.

-- 
Peter Geoghegan



Re: Collation version tracking for macOS

From
Tom Lane
Date:
Peter Geoghegan <pg@bowt.ie> writes:
> I agree that "false positive" is not a valid way of describing a
> breaking change in a Postgres collation that happens to not affect one
> index in particular, due to the current phase of the moon. It's
> probably very likely that most individual indexes that we warn about
> will be so-called false positives.

This is not the concern that I have.  I agree that if we tell a user
that collation X changed behavior and he'd better reindex his indexes
that use collation X, but none of them actually contain any cases that
changed behavior, that's not a "false positive" --- that's "it's cheaper
to reindex than to try to identify whether there's a problem".  What
I mean by "false positive" is telling every macOS user that they'd better
reindex everything every year, when in point of fact Apple changes those
collations almost never.  We will soon lose those users' attention ---
see fable about boy crying wolf --- and then when Apple actually does
change something, we've got a problem.  So if we give collation-change
warnings, they'd better have some measurable connection to reality.

            regards, tom lane



Re: Collation version tracking for macOS

From
Peter Geoghegan
Date:
On Mon, Jun 6, 2022 at 5:45 PM Thomas Munro <thomas.munro@gmail.com> wrote:
> Earlier I mentioned distinct "providers" but I take that back, that's
> too complicated.  Reprising an old idea that comes up each time we
> talk about this, this time with some more straw-man detail: what about
> teaching our ICU support to understand "libicu18n.so.71:en" to mean
> that it should dlopen() that library and use its functions?  Or some
> cleverer, shorter notation.  Then it's the user's problem to make sure
> the right libraries are installed, and it'll fail if they're not.  For
> example, on Debian bookworm right now you can install libicu63,
> libicu67, libicu71, though only the "current" -dev package, but which
> I'm sure we can cope with.  You're at the mercy of the distro or
> add-on package repos to keep a lot of versions around, but that seems
> OK.

Right. Postgres could link to multiple versions of ICU at the same
time. Right now it doesn't, and right now the ICU C symbol names that
we use are actually versioned (this isn't immediately apparent because
the C preprocessor makes it appear that ICU symbol names are generic).

We could perhaps invent a new indirection that knows about
multiple ICU versions, each of which is an independent collation
provider, or maybe a related collation provider that gets used by
default on REINDEX. ICU is designed for this kind of thing. That
approach more or less puts packagers on the hook for managing
collation stability. But now long term collation stability is at least
feasible -- we at least have a coherent strategy. In the worst case
the community .deb and .rpm repos might continue to support an older
ICU version, or lobby for its continued support by the distro (while
actively discouraging its use in new databases). This isn't the same
thing as forking ICU. It's a compromise between that extreme, and
the current situation.

--
Peter Geoghegan



Re: Collation version tracking for macOS

From
Thomas Munro
Date:
On Wed, Jun 8, 2022 at 7:43 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> The idea of fingerprinting a collation's behavior is interesting,
> but I've got doubts about whether we can make a sufficiently thorough
> fingerprint.

On one of the many threads about this I recall posting a thought
experiment patch that added system_collation_version_command or some
such, so you could train your computer to compute a hash for
/usr/share/locale/XXX/LC_COLLATE (or whatever it's called on your
system), but it all seemed a bit gross to me on various levels.  Most
people don't know or care about collations so they won't set it up, so
to make it useful it'd have to have useful defaults, and it seems like
a bad idea to teach PostgreSQL where all these systems keep their
collation rules.



Re: Collation version tracking for macOS

From
Jeremy Schneider
Date:
On 6/7/22 12:53 PM, Peter Geoghegan wrote:
> 
> Collations by their very nature are unlikely to change all that much.
> Obviously they can and do change, but the details are presumably
> pretty insignificant to a native speaker. 


This idea does seem to persist. It's not as frequent as timezones, but
collation rules reflect local dialects and customs, and there are
changes quite regularly for a variety of reasons. A brief perusal of
CLDR changelogs and CLDR jiras can give some insight here:

https://github.com/unicode-org/cldr


https://unicode-org.atlassian.net/jira/software/c/projects/CLDR/issues/?jql=project%20%3D%20%22CLDR%22%20AND%20text%20~%20%22collation%22%20ORDER%20BY%20created%20DESC

The difference between the unicode consortium and the GNU C Library is
that unicode is maintained by people who are specifically interested in
working with language and internationalization challenges. I've spoken
to a glibc maintainer who directly told me that they dislike working
with the collation code, and try to avoid it. It's not even ISO 14651
anymore with so many custom glibc-specific changes layered on top. I
looked at the first few commits in the glibc source that were
responsible for the big 2.28 changes - there were a serious of quite a
few commits and some were so large they wouldn't even load in the github
API.

Here's one such commit:

https://github.com/bminor/glibc/commit/9479b6d5e08eacce06c6ab60abc9b2f4eb8b71e4

It's reasonable to expect that Red Hat and Debian will keep things
stable on one particular major, and to expect that every new major OS
version will update to the latest collation algorithms and locale data
for glibc.

Another misunderstanding that seems to persist is that this only relates
to exotic locales or that it's only the 2.28 version.

My github repo is out-of-date (I know of more cases that I still need to
publish) but the old data already demonstrates changes to the root/DUCET
collation rules (evident in en_US without any tailoring) for glibc
versions 2.13, 2.21 and 2.26

https://github.com/ardentperf/glibc-unicode-sorting/

If a PosgreSQL user is unlucky enough to have one of those unicode
characters stored in a table, they can get broken indexes even if they
only use the default US english locale, and without touching glibc 2.28
- and all you need is an index on a field where end users can type any
string input.


> It's pretty clear that glibc as a project doesn't take the issue very
> seriously, because they see it as a problem of the GUI sorting a table
> in a way that seems slightly suboptimal to scholars of a natural
> language. 

I disagree that glibc maintainers are doing anything wrong.

While the quality of glibc collations aren't great when compared with
CLDR, I think the glibc maintainers have done versioning exactly right:
they are clear about which patches are allowed to contain collation
updates, and the OS distributions are able to ensure stability on major
OS release. I haven't yet found a Red Hat minor release that changed
glibc collation.

-Jeremy


-- 
http://about.me/jeremy_schneider



Re: Collation version tracking for macOS

From
Peter Geoghegan
Date:
On Tue, Jun 7, 2022 at 1:16 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> This is not the concern that I have.  I agree that if we tell a user
> that collation X changed behavior and he'd better reindex his indexes
> that use collation X, but none of them actually contain any cases that
> changed behavior, that's not a "false positive" --- that's "it's cheaper
> to reindex than to try to identify whether there's a problem".  What
> I mean by "false positive" is telling every macOS user that they'd better
> reindex everything every year, when in point of fact Apple changes those
> collations almost never.

That does seem like a meaningful distinction. I'm sorry if I
misrepresented your position on this.

We're talking about macOS here, which is hardly a paragon of lean
software. I think that it's worth revisiting the assumption that the C
standard library collations are the most useful set of collations, and
we shouldn't presume to know better than the operating system.
Couldn't individual packagers establish their own system for managing
collations across multiple ICU versions, as I outlined up-thread?

I think that it's okay (maybe unavoidable) that we keep "lib C
collations are authoritative" as a generic assumption when Postgres is
built from source. We can still have defacto standards that apply on
all mainstream platforms when users install standard packages for
production databases -- I don't see why we can't do both. Maybe the
best place to solve this problem is at the level of each individual
package ecosystem.

There can be some outsourcing to package managers this way, without
relying on the underlying OS, or lib C collations, or ICU in general.
This scheme wouldn't technically be under our direct control, but
would still be something that we could influence. We could have a back
and forth conversation about what's not working in the field.

-- 
Peter Geoghegan



Re: Collation version tracking for macOS

From
Peter Geoghegan
Date:
On Tue, Jun 7, 2022 at 1:24 PM Jeremy Schneider
<schneider@ardentperf.com> wrote:
> This idea does seem to persist. It's not as frequent as timezones, but
> collation rules reflect local dialects and customs, and there are
> changes quite regularly for a variety of reasons. A brief perusal of
> CLDR changelogs and CLDR jiras can give some insight here:

> Another misunderstanding that seems to persist is that this only relates
> to exotic locales or that it's only the 2.28 version.

I'm not defending the status quo, and I think that I'm better informed
than most about the problems in this area. My point was that it hardly
matters that we don't necessarily see outright corruption. This was
based in part on a misunderstanding of Tom's point, though.

> While the quality of glibc collations aren't great when compared with
> CLDR, I think the glibc maintainers have done versioning exactly right:
> they are clear about which patches are allowed to contain collation
> updates, and the OS distributions are able to ensure stability on major
> OS release. I haven't yet found a Red Hat minor release that changed
> glibc collation.

That might be true, but my impression from interacting with Carlos
O'Donnell is that they pretty much don't take the concern about
stability all that seriously. Which I think is reasonable, given his
position!

The fact that we are this sensitive to glibc collation versioning might
be a wholly unique situation (unlike with ICU, which was built with
that in mind). It might be that every other user of glibc collations
sees this as fairly inconsequential, because they don't have to deal
with persistent state that directly relies on the rules in various
ways that are critically important. Even if glibc theoretically does a
perfect job of versioning, I still think that their priorities are
very much unlike our priorities, and that that should be a relevant
consideration for us.

--
Peter Geoghegan



Re: Collation version tracking for macOS

From
Robert Haas
Date:
On Tue, Jun 7, 2022 at 4:24 PM Jeremy Schneider
<schneider@ardentperf.com> wrote:
> I haven't yet found a Red Hat minor release that changed
> glibc collation.

I feel like this is a thing that happens regularly enough that it's
known to be a gotcha by many of my colleagues here at EDB.

Perhaps that's all pure fiction, but I doubt it. People don't go
around making up stories about things being broken so they can say bad
things about Red Hat. They got told by customers that things are
broken and then go try to figure out how that happened.

-- 
Robert Haas
EDB: http://www.enterprisedb.com



Re: Collation version tracking for macOS

From
Jeremy Schneider
Date:
On 6/7/22 1:51 PM, Peter Geoghegan wrote:
> On Tue, Jun 7, 2022 at 1:24 PM Jeremy Schneider
> <schneider@ardentperf.com> wrote:
>> This idea does seem to persist. It's not as frequent as timezones, but
>> collation rules reflect local dialects and customs, and there are
>> changes quite regularly for a variety of reasons. A brief perusal of
>> CLDR changelogs and CLDR jiras can give some insight here:
> 
>> Another misunderstanding that seems to persist is that this only relates
>> to exotic locales or that it's only the 2.28 version.
> 
> I'm not defending the status quo, and I think that I'm better informed
> than most about the problems in this area. My point was that it hardly
> matters that we don't necessarily see outright corruption. This was
> based in part on a misunderstanding of Tom's point, though.


I think I was guilty of the same misunderstanding - apologies Tom!
Thanks Peter for calling that out explicitly.

For my for my part, gut feeling is that MacOS major releases will be
similar to any other OS major release, which may contain updates to
collation algorithms and locales. ISTM like the same thing PG is looking
for on other OS's to trigger the warning. But it might be good to get an
official reference on MacOS, if someone knows where to find one?  (I don't.)

-Jeremy


-- 
http://about.me/jeremy_schneider



Re: Collation version tracking for macOS

From
Bruce Momjian
Date:
On Tue, Jun  7, 2022 at 03:43:32PM -0400, Tom Lane wrote:
> Thomas Munro <thomas.munro@gmail.com> writes:
> > On Wed, Jun 8, 2022 at 3:58 AM Rod Taylor <rbt@rbt.ca> wrote:
> >> Is this more involved than creating a list of all valid Unicode characters (~144 thousand), sorting them, then
runningcrc32 over the sorted order to create the "version" for the library/collation pair? Far from free but few
databasesuse more than a couple different collations.
 
> 
> > Collation rules have multiple levels and all kinds of quirks, so that
> > won't work.
> 
> Yeah, and it's exactly at the level of quirks that things are likely
> to change.  Nobody's going to suddenly start sorting B before A.
> They might, say, change their minds about where the digram "cz"
> sorts relative to single letters, in languages where special rules
> for that are a thing.
> 
> The idea of fingerprinting a collation's behavior is interesting,
> but I've got doubts about whether we can make a sufficiently thorough
> fingerprint.

Rather than trying to figure out if the collations changed, have we ever
considered checking if index additions and lookups don't match the OS
collation and reporting these errors somehow?

-- 
  Bruce Momjian  <bruce@momjian.us>        https://momjian.us
  EDB                                      https://enterprisedb.com

  Indecision is a decision.  Inaction is an action.  Mark Batterson




Re: Collation version tracking for macOS

From
Peter Geoghegan
Date:
On Tue, Jun 7, 2022 at 2:13 PM Jeremy Schneider
<schneider@ardentperf.com> wrote:
> For my for my part, gut feeling is that MacOS major releases will be
> similar to any other OS major release, which may contain updates to
> collation algorithms and locales. ISTM like the same thing PG is looking
> for on other OS's to trigger the warning. But it might be good to get an
> official reference on MacOS, if someone knows where to find one?  (I don't.)

I just don't think that we should be relying on a huge entity like
Apple or even glibc for this -- they don't share our priorities, and
there is no reason for this to change. The advantage of ICU versioning
is that it is just one library, that can coexist with others,
including other versions of ICU.

Imagine a world in which we support multiple ICU versions (for Debian
packages, say), some of which are getting quite old. Maybe we can
lobby for the platform to continue to support that old version of the
library -- there ought to be options. Lobbying Debian to stick with an
older version of glibc is another matter entirely. That has precisely
zero chance of ever succeeding, for reasons that are quite
understandable.

Half the problem here is to detect breaking changes, but the other
half is to not break anything in the first place. Or to give the user
plenty of opportunity to transition incrementally, without needing to
reindex everything at the same time. Obviously the only way that's
possible is by supporting multiple versions of ICU at the same time,
in the same database. This requires indirection that distinguishes
between "physical and logical" collation versions, where the same
nominal collation can have different implementations across multiple
ICU versions.

The rules for standards like BCP47 (the system that defines the name
of an ICU/CLDR locale) are deliberately very tolerant of what they
accept in order to ensure forwards and backwards compatibility in
environments where there isn't just one ICU/CLDR version [1] (most
environments in the world of distributed or web applications). So you
can expect the BCP47 name of a collation to more or less work on any
ICU version, perhaps with some loss of functionality (this is
unavoidable when you downgrade ICU to a version that doesn't have
whatever CLDR customization you might have relied on). It's very
intentionally a "best effort" approach, because throwing a "locale not
found" error message usually isn't helpful from the point of view of
the end user. Note that this is a broader standard than ICU or CLDR or
even Unicode.

[1] https://www.ietf.org/rfc/rfc6067.txt
-- 
Peter Geoghegan



Re: Collation version tracking for macOS

From
Thomas Munro
Date:
On Wed, Jun 8, 2022 at 8:16 AM Peter Geoghegan <pg@bowt.ie> wrote:
> On Mon, Jun 6, 2022 at 5:45 PM Thomas Munro <thomas.munro@gmail.com> wrote:
> > Earlier I mentioned distinct "providers" but I take that back, that's
> > too complicated.  Reprising an old idea that comes up each time we
> > talk about this, this time with some more straw-man detail: what about
> > teaching our ICU support to understand "libicu18n.so.71:en" to mean
> > that it should dlopen() that library and use its functions?  Or some
> > cleverer, shorter notation.  Then it's the user's problem to make sure
> > the right libraries are installed, and it'll fail if they're not.  For
> > example, on Debian bookworm right now you can install libicu63,
> > libicu67, libicu71, though only the "current" -dev package, but which
> > I'm sure we can cope with.  You're at the mercy of the distro or
> > add-on package repos to keep a lot of versions around, but that seems
> > OK.
>
> Right. Postgres could link to multiple versions of ICU at the same
> time. Right now it doesn't, and right now the ICU C symbol names that
> we use are actually versioned (this isn't immediately apparent because
> the C preprocessor makes it appear that ICU symbol names are generic).

Yeah, it's possible to link against multiple versions in theory and
that might be a way to do it if we were shipping our own N copies of
ICU like DB2 does, but that's hard in practice for shared libraries on
common distros (and vendoring or static linking of such libraries was
said to be against many distros' rules, since it would be a nightmare
if everyone did that, though I don't have a citation for that).  I
suspect it's better to use dlopen() to load them, because (1) I
believe that the major distros only have -dev/-devel packages for the
"current" version, even though they let you install the packages
containing the .so files for multiple versions at the same time so
that binaries linked against older versions keep working and (2) I
think it'd be cool if users were free to find more ICU versions in
add-on package repos and be able to use them to get a version that the
packager of PostgreSQL didn't anticipate.

> We could perhaps invent a new indirection that knows about
> multiple ICU versions, each of which is an independent collation
> provider, or maybe a related collation provider that gets used by
> default on REINDEX. ICU is designed for this kind of thing. That
> approach more or less puts packagers on the hook for managing
> collation stability. But now long term collation stability is at least
> feasible -- we at least have a coherent strategy. In the worst case
> the community .deb and .rpm repos might continue to support an older
> ICU version, or lobby for its continued support by the distro (while
> actively discouraging its use in new databases). This isn't the same
> thing as forking ICU. It's a compromise between that extreme, and
> the current situation.

Yeah, I've flip-flopped a couple of times on the question of whether
ICU63 and ICU67 should be different collation providers, or
individual collations should somehow specify the library they want to
use (admittedly what I showed above with a raw library name is pretty
ugly and some indirection scheme might be nice).  It would be good to
drill into the pros and cons of those two choices.  As for getting
sane defaults, I don't know if this is a good idea, but it's an idea:
perhaps schemas and search paths could be used,  you avoid having to
include ugly version strings in the collation identifiers, and the
search path effectively controls default when you don't want to be
explicit (= most users)?



Re: Collation version tracking for macOS

From
Peter Geoghegan
Date:
On Tue, Jun 7, 2022 at 3:27 PM Thomas Munro <thomas.munro@gmail.com> wrote:
> Yeah, it's possible to link against multiple versions in theory and
> that might be a way to do it if we were shipping our own N copies of
> ICU like DB2 does, but that's hard in practice for shared libraries on
> common distros (and vendoring or static linking of such libraries was
> said to be against many distros' rules, since it would be a nightmare
> if everyone did that, though I don't have a citation for that).

I'm not saying that it's going to be easy, but I can't see why it
should be impossible. I use Debian unstable for most of my work. It
supports multiple versions of LLVM/clang, not just one (though there
is a virtual package with a default version, I believe). What's the
difference, really?

Packaging standards certainly matter, but they're not immutable laws
of the universe. It seems reasonable to suppose that the people that
define these standards would be willing to hear us out -- this is
hardly a trifling matter, or something that only affects a small
minority of *their* users.

We don't need to support a huge number of versions on each OS -- just
enough to make it feasible for everybody to avoid the need to ever
reindex every index on a collatable type (maybe ICU versions that were
the default for the last several major versions of the OS are
available through special packages). We don't necessarily have to have
a hard dependency on every supported version from the point of view of
the package manager. And all of this would ultimately be the
responsibility of each individual packager; they'd need to figure out
how to make it work within the context of the platform that they're
targeting. We'd facilitate that important work, but would defer to
them on the final details. There could be a hands-off approach to the
whole thing, so it wouldn't be a total departure from what we do
today.

> Yeah, I've flip-flopped a couple of times on the question of whether
> ICU63 and ICU67 should be different collation providers, or
> individual collations should somehow specify the library they want to
> use (admittedly what I showed above with a raw library name is pretty
> ugly and some indirection scheme might be nice).  It would be good to
> drill into the pros and cons of those two choices.

I think that there are pretty good technical reasons why each ICU
version is tied to a particular version of CLDR. Implementing CLDR
correctly and efficiently is a rather difficult process, even if we
ignore figuring out what natural language rules make sense. And so
linking to multiple different ICU versions doesn't really seem like
overkill to me. Or if it is then I can easily think of far better
examples of software bloat. Defining "stable behavior for collations"
as "uses exactly the same software artifact over time" is defensive
(compared to always linking to one ICU version that does it all), but
we have plenty that we need to defend against here.

-- 
Peter Geoghegan



Re: Collation version tracking for macOS

From
Thomas Munro
Date:
 On Wed, Jun 8, 2022 at 10:59 AM Peter Geoghegan <pg@bowt.ie> wrote:
> On Tue, Jun 7, 2022 at 3:27 PM Thomas Munro <thomas.munro@gmail.com> wrote:
> > Yeah, it's possible to link against multiple versions in theory and
> > that might be a way to do it if we were shipping our own N copies of
> > ICU like DB2 does, but that's hard in practice for shared libraries on
> > common distros (and vendoring or static linking of such libraries was
> > said to be against many distros' rules, since it would be a nightmare
> > if everyone did that, though I don't have a citation for that).
>
> I'm not saying that it's going to be easy, but I can't see why it
> should be impossible. I use Debian unstable for most of my work. It
> supports multiple versions of LLVM/clang, not just one (though there
> is a virtual package with a default version, I believe). What's the
> difference, really?

The difference is that Debian has libllvm-{11,12,13,14}-dev packages,
but it does *not* have multiple -dev packages for libicu, just a
single libicu-dev which can be used to compile and link against their
chosen current library version.  They do have multiple packages for
the actual .so and allow them to be installed concurrently.
Therefore, you could install N .sos and dlopen() them, but you *can't*
write a program that compiles and links against N versions at the same
time using their packages (despite IBM's work to make that possible,
perhaps for use in their own databases).

> Packaging standards certainly matter, but they're not immutable laws
> of the universe. It seems reasonable to suppose that the people that
> define these standards would be willing to hear us out -- this is
> hardly a trifling matter, or something that only affects a small
> minority of *their* users.

OK, yeah, I'm thinking within the confines of things we can do easily
right now on existing systems as they are currently packaging software
only by changing our code, not "tell Debian to change their packaging
so we can compile and link against N versions".   Supposing Debian
maintainers (and all the others) agreed, there'd still something else
in favour of dlopen():  wouldn't it be nice if the users were not
limited by the versions that the packager of PostgreSQL decided to
link against?  What if someone has a good reason to want to use ICU
versions that are older than Debian currently ships, that are easily
available in add-on repos?

> > Yeah, I've flip-flopped a couple of times on the question of whether
> > ICU63 and ICU67 should be different collation providers, or
> > individual collations should somehow specify the library they want to
> > use (admittedly what I showed above with a raw library name is pretty
> > ugly and some indirection scheme might be nice).  It would be good to
> > drill into the pros and cons of those two choices.
>
> I think that there are pretty good technical reasons why each ICU
> version is tied to a particular version of CLDR. Implementing CLDR
> correctly and efficiently is a rather difficult process, even if we
> ignore figuring out what natural language rules make sense. And so
> linking to multiple different ICU versions doesn't really seem like
> overkill to me. Or if it is then I can easily think of far better
> examples of software bloat. Defining "stable behavior for collations"
> as "uses exactly the same software artifact over time" is defensive
> (compared to always linking to one ICU version that does it all), but
> we have plenty that we need to defend against here.

I think we're not understanding each other here: I was talking about
the technical choice of whether we'd model the multiple library
versions in our catalogues as different "collprovider" values, or
somehow encode them into the "collcollate" string, or something else.
I'm with you, I'm already sold on the mult-library concept (and have
been in several previous cycles of this recurring discussion), which
is why I'm trying to move to discussing nuts and bolts and packaging
and linking realities that apparently stopped any prototype from
appearing last time around.



Re: Collation version tracking for macOS

From
Peter Geoghegan
Date:
On Tue, Jun 7, 2022 at 4:29 PM Thomas Munro <thomas.munro@gmail.com> wrote:
> The difference is that Debian has libllvm-{11,12,13,14}-dev packages,
> but it does *not* have multiple -dev packages for libicu, just a
> single libicu-dev which can be used to compile and link against their
> chosen current library version.  They do have multiple packages for
> the actual .so and allow them to be installed concurrently.
> Therefore, you could install N .sos and dlopen() them, but you *can't*
> write a program that compiles and links against N versions at the same
> time using their packages (despite IBM's work to make that possible,
> perhaps for use in their own databases).

I know that glibc has various facilities for versioning dynamic
libraries, which includes ways to control symbol visibility. It's
possible that IBM's work on ICU versioning didn't just build on a
generic facility like that because that approach wasn't sufficiently
portable, particularly with platforms like AIX. It's also possible
that we won't have any of these same requirements, and can feasibly
link against multiple ICU versions some other way, and ultimately
achieve the same result -- multiple versions of ICU that can be used
by Postgres at the same time, with long term stable collations across
major OS and Postgres versions.

I now understand that you agree with me on this basic and important
point. Must have been a miscommunication.

> > Packaging standards certainly matter, but they're not immutable laws
> > of the universe. It seems reasonable to suppose that the people that
> > define these standards would be willing to hear us out -- this is
> > hardly a trifling matter, or something that only affects a small
> > minority of *their* users.
>
> OK, yeah, I'm thinking within the confines of things we can do easily
> right now on existing systems as they are currently packaging software
> only by changing our code, not "tell Debian to change their packaging
> so we can compile and link against N versions".

There are lots of specifics here, and I'm certainly not an expert on
packaging. IMV our approach doesn't necessarily need to use the same
original canonical package, though. It just needs to provide a
reasonably smooth experience for users that actually need to keep
their old collations working on upgrade. Either way, the process needs
to be something where all parties understand the concerns of each
other.

Of course Debian doesn't support linking against multiple versions of
ICU right now; why would they? Is there any reason to think that even
one person ever asked about it? Our interest in doing that will
probably be totally unique from their point of view. Can we just ask
somebody about it that has a deep understanding of these things?

> Supposing Debian
> maintainers (and all the others) agreed, there'd still something else
> in favour of dlopen():  wouldn't it be nice if the users were not
> limited by the versions that the packager of PostgreSQL decided to
> link against?  What if someone has a good reason to want to use ICU
> versions that are older than Debian currently ships, that are easily
> available in add-on repos?

I don't consider the ability to support many versions of ICU for the
sake of ICU features to be much of an advantage. I mostly just care
about the simple, common case where a user upgrades and doesn't want
to REINDEX immediately. You may well be right about dlopen(); I just
don't know right now.

> I think we're not understanding each other here: I was talking about
> the technical choice of whether we'd model the multiple library
> versions in our catalogues as different "collprovider" values, or
> somehow encode them into the "collcollate" string, or something else.

ISTM that there are two mostly-distinct questions here:

1. How do we link to multiple versions of ICU at the same time, in a
way that is going to work smoothly on mainstream platforms?

2. What semantics around collations do we want for Postgres once we
gain the ability to use multiple versions of ICU at the same time? For
example, do we want to generalize the definition of a collation, so
that it's associated with one particular ICU version and collation for
the purposes of on-disk compatibility, but isn't necessarily tied to
the same ICU version in other contexts, such as on a dump and restore?

-- 
Peter Geoghegan



Re: Collation version tracking for macOS

From
Thomas Munro
Date:
On Wed, Jun 8, 2022 at 12:23 PM Peter Geoghegan <pg@bowt.ie> wrote:
> ISTM that there are two mostly-distinct questions here:
>
> 1. How do we link to multiple versions of ICU at the same time, in a
> way that is going to work smoothly on mainstream platforms?
>
> 2. What semantics around collations do we want for Postgres once we
> gain the ability to use multiple versions of ICU at the same time? For
> example, do we want to generalize the definition of a collation, so
> that it's associated with one particular ICU version and collation for
> the purposes of on-disk compatibility, but isn't necessarily tied to
> the same ICU version in other contexts, such as on a dump and restore?

Yeah.  Well I couldn't resist doing some (very!) experimental hacking.
See attached.  The idea of putting a raw library name in there is just
a straw-man, and I already found a major problem with it: I also need
to get my hands on u_strToLower and friends for formatting.c, but
those functions are in a different library that needs to be dlopen'd
separately, so we need *two* names.  That's not done in the attached
patch, but at least this demonstrates some of the mechanics of a
dlopen() based solution that can do the collating part...  of course
there are all kinds of problems apparent (security of loading
arbitrary libraries, API stability, interaction with the "default" ICU
that our binary is linked against, creation of initial set of
collations in initdb, naming, upgrades, ...).

Demo:

$ sudo apt-get install libicu63 libicu67

postgres=# create schema icu63;
CREATE SCHEMA
postgres=# create schema icu67;
CREATE SCHEMA
postgres=# create collation icu63."en-US-x-icu" (provider = icu,
locale = 'libicui18n.so.63:en-US');
CREATE COLLATION
postgres=# create collation icu67."en-US-x-icu" (provider = icu,
locale = 'libicui18n.so.67:en-US');
CREATE COLLATION
postgres=# select collname, collnamespace::regnamespace,
colliculocale, collversion
             from pg_collation
            where collname = 'en-US-x-icu';
  collname   | collnamespace |     colliculocale      | collversion
-------------+---------------+------------------------+-------------
 en-US-x-icu | pg_catalog    | en-US                  | 153.14
 en-US-x-icu | icu63         | libicui18n.so.63:en-US | 153.88
 en-US-x-icu | icu67         | libicui18n.so.67:en-US | 153.14
(3 rows)
postgres=# select relname from pg_class order by relname collate
icu63."en-US-x-icu" limit 2;
          relname
---------------------------
 _pg_foreign_data_wrappers
 _pg_foreign_servers
(2 rows)

Attachment

Re: Collation version tracking for macOS

From
Robert Haas
Date:
On Tue, Jun 7, 2022 at 3:53 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> No, I quite agree that we have a problem.  What I don't agree is that
> issuing a lot of false-positive warnings is a solution.  That will
> just condition people to ignore the warnings, and then when their
> platform really does change behavior, they're still screwed.  If we
> could *accurately* report collation behavioral changes, I'd be all
> for that.

I mean, how many false-positive warnings do you think we'll get?

I would argue that if we put out something that's wrong half the time
-- it tells you about all the real problems and an equal number of
imaginary ones -- we'd be way ahead of where we are right now.  If on
the other hand we put out something that's wrong 99% of the time -- it
tells you about all the real problems and ninety-nine times as many
imaginary ones -- that's worse than useless.

There can be some weasel wording in the language e.g. "WARNING:  glibc
has been updated, collation definitions may have changed". It's worth
keeping in mind that the user doesn't necessarily have another source
of information that is more accurate than what we're providing. If
they REINDEX somewhat more often than is really necessary, that may be
painful, but it can still be a lot better than having queries return
wrong answers. If it's not, nobody's forcing them to issue that
REINDEX command.

-- 
Robert Haas
EDB: http://www.enterprisedb.com



Re: Collation version tracking for macOS

From
Robert Haas
Date:
On Tue, Jun 7, 2022 at 4:10 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> I mean by "false positive" is telling every macOS user that they'd better
> reindex everything every year, when in point of fact Apple changes those
> collations almost never.

Do we actually know that to be true? Given how fast things seem to be
getting added to Unicode, it wouldn't surprise me at all if they're
updating their Unicode tables for new characters with some regularity,
if nothing else, and that's a breaking change for us.

-- 
Robert Haas
EDB: http://www.enterprisedb.com



Re: Collation version tracking for macOS

From
Tom Lane
Date:
Robert Haas <robertmhaas@gmail.com> writes:
> On Tue, Jun 7, 2022 at 3:53 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> No, I quite agree that we have a problem.  What I don't agree is that
>> issuing a lot of false-positive warnings is a solution.

> I mean, how many false-positive warnings do you think we'll get?

The proposed patch would result in a warning about every collation-
sensitive index during every macOS major version upgrade, ie about
once a year for most people.  Seeing that Apple only actually touch
their POSIX collations once a decade or so, that's way too far over
on the crying-wolf end of the scale for me.  We need something that
has at least *some* connection to actual changes.

            regards, tom lane



Re: Collation version tracking for macOS

From
Tom Lane
Date:
Robert Haas <robertmhaas@gmail.com> writes:
> On Tue, Jun 7, 2022 at 4:10 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> I mean by "false positive" is telling every macOS user that they'd better
>> reindex everything every year, when in point of fact Apple changes those
>> collations almost never.

> Do we actually know that to be true? Given how fast things seem to be
> getting added to Unicode, it wouldn't surprise me at all if they're
> updating their Unicode tables for new characters with some regularity,
> if nothing else, and that's a breaking change for us.

Their POSIX collations seem to be legacy code that's entirely unrelated to
any modern collation support; in particular the "UTF8" ones are that in
name only.  I'm sure that Apple are indeed updating the UTF8 data behind
their proprietary i18n APIs, but the libc APIs are mostly getting benign
neglect.

Maybe the report that started this thread indicates that this is changing,
but I'll believe that when I see it.

            regards, tom lane



Re: Collation version tracking for macOS

From
"Daniel Verite"
Date:
    Tom Lane wrote:

> Yeah, and it's exactly at the level of quirks that things are likely
> to change.  Nobody's going to suddenly start sorting B before A.
> They might, say, change their minds about where the digram "cz"
> sorts relative to single letters, in languages where special rules
> for that are a thing.

Independently of these rules, all Unicode collations change frequently
because each release of Unicode adds new characters. Any string
that contains a code point that was previously unassigned is going
to be sorted differently by all collations when that code point gets
assigned to a character.
Therefore the versions of all collations need to be bumped at every
Unicode release. This is what ICU does.

If the libc in macOS doesn't follow Unicode, that's not relevant
to macOS, but let's assume an OS that tries to be up-to-date.
If major OS upgrades happen every year or less frequently,
each OS upgrade is likely to imply an upgrade of all the collations,
since the interval between Unicode releases tends to be a year
or less:
https://www.unicode.org/history/publicationdates.html



Best regards,
--
Daniel Vérité
https://postgresql.verite.pro/
Twitter: @DanielVerite



Re: Collation version tracking for macOS

From
Peter Geoghegan
Date:
On Wed, Jun 8, 2022 at 10:51 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Their POSIX collations seem to be legacy code that's entirely unrelated to
> any modern collation support; in particular the "UTF8" ones are that in
> name only.  I'm sure that Apple are indeed updating the UTF8 data behind
> their proprietary i18n APIs, but the libc APIs are mostly getting benign
> neglect.

I find that easy to believe. It's consistent with the overall picture
of Apple not caring about the POSIX collations beyond the basic
requirement for compatibility. ISTM that their totally inefficient
approach to implementing strxfrm() is another example of the same
thing. (The Apple strxfrm() produces huge low entropy binary strings,
unlike the glibc version, which is pretty well optimized.)

-- 
Peter Geoghegan



Re: Collation version tracking for macOS

From
Tom Lane
Date:
"Daniel Verite" <daniel@manitou-mail.org> writes:
> Independently of these rules, all Unicode collations change frequently
> because each release of Unicode adds new characters. Any string
> that contains a code point that was previously unassigned is going
> to be sorted differently by all collations when that code point gets
> assigned to a character.
> Therefore the versions of all collations need to be bumped at every
> Unicode release. This is what ICU does.

I'm very skeptical of this process as being a reason to push users
to reindex everything in sight.  If U+NNNN was not a thing last year,
there's no reason to expect that it appears in anyone's existing data,
and therefore the fact that it sorts differently this year is a poor
excuse for sounding time-to-reindex alarm bells.

I'm quite concerned that we are going to be training users to ignore
collation-change warnings.  They have got to be a lot better targeted
than this, or we're just wasting everyone's time, including ours.

            regards, tom lane



Re: Collation version tracking for macOS

From
Robert Haas
Date:
On Wed, Jun 8, 2022 at 4:02 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> I'm very skeptical of this process as being a reason to push users
> to reindex everything in sight.  If U+NNNN was not a thing last year,
> there's no reason to expect that it appears in anyone's existing data,
> and therefore the fact that it sorts differently this year is a poor
> excuse for sounding time-to-reindex alarm bells.

That seems completely wrong to me. It's not like a new character shows
up and people wait to start using it until it makes its way into
everyone's collation data. That is emphatically not what happens, I
would say. What happens is that people upgrade their libc packages at
one times and their postgres packages at another time, and it's
unlikely that they have any idea which order they do or did those
things. Meanwhile, people start using all the latest emojis. The idea
that the average PostgreSQL user has any idea whether a certain emoji
shows up in the data set for the first time before or after they
install the libc version that knows about it seems absurd. We don't
even know how to figure out which emojis the installed libc supports
-- if we did, we could reject data that we don't know how to sort
properly instead of ending up with corrupted indexes later. The user
has no more ability to figure it out than we do, and even if they did,
they probably wouldn't want to compare their stream of input data to
their collate definitions using some process external to the database.

-- 
Robert Haas
EDB: http://www.enterprisedb.com



Re: Collation version tracking for macOS

From
Mark Dilger
Date:

> On Jun 7, 2022, at 1:10 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>
> This is not the concern that I have.  I agree that if we tell a user
> that collation X changed behavior and he'd better reindex his indexes
> that use collation X, but none of them actually contain any cases that
> changed behavior, that's not a "false positive" --- that's "it's cheaper
> to reindex than to try to identify whether there's a problem".

I don't see this problem as limited to indexes, though I do understand why that might be the most common place for the
problemto manifest itself. 

As a simple example, text[] constructed using array_agg over sorted data can be corrupted by a collation change, and
reindexwon't fix it. 

If we extend the table-AM interface to allow query quals to be pushed down to the table-AM, we might develop table-AMs
thatcare about sort order, too. 

—
Mark Dilger
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company






Re: Collation version tracking for macOS

From
Thomas Munro
Date:
On Thu, Jun 9, 2022 at 5:42 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> I'm sure that Apple are indeed updating the UTF8 data behind
> their proprietary i18n APIs, but the libc APIs are mostly getting benign
> neglect.

As for how exactly they might be doing that, I don't know, but a bit
of light googling tells me that a private, headerless,
please-don't-call-me-directly copy of ICU arrived back in macOS
10.3[1].  I don't see it on my 12.4 system, but I also know that 12.x
started hiding system libraries completely (the linker is magic and
pulls libraries from some parallel dimension, there is no
/usr/lib/libSystem.B.dylib file on disk, and yet otool -L
<your_favourite_executable> references it).

It's a lovely client machine, but I don't know if anyone really runs
meaningful database server stuff on macOS.  I think if I did I'd be
very keen to use ICU for everything directly, rather than trying to
unpick any of that and talk to Apple's API...  I think the
how-to-support-multiple-ICUs subrant/subthread is a much more
interesting topic.  I have no idea if the dlopen() concept I mentioned
is the right way forward, but FWIW the experimental patch I posted
seems to work just fine on a Mac, using multiple ICU libraries
installed by MacPorts, which might be useful to developers
contemplating that stuff.

[1] https://lists.apple.com/archives/xcode-users/2005/Jun/msg00633.html



Re: Collation version tracking for macOS

From
Jeremy Schneider
Date:
New emoji are getting added with some frequency, it’s a thing lately…

New Unicode chars use existing but unassigned code points. All code points are able to be encoded, claimed or
unclaimed.

Someone on old glibc or ICU can still store the new characters. As long as there’s an input field. You wouldn’t believe
somestuff I’ve seen people enter in the “name” field for web apps… 🙄 It’ll get some undefined or default sort behavior
forunrecognized or unassigned code points. 

When the libs are updated, those new chars begin to sort correctly, which is a change and breaks indexes (and
potentiallyother stuff). 

-Jeremy

Sent from my TI-83

> On Jun 8, 2022, at 16:34, Thomas Munro <thomas.munro@gmail.com> wrote:
> On Thu, Jun 9, 2022 at 5:42 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> I'm sure that Apple are indeed updating the UTF8 data behind
>> their proprietary i18n APIs, but the libc APIs are mostly getting benign
>> neglect.
>
> As for how exactly they might be doing that, I don't know, but a bit
> of light googling tells me that a private, headerless,
> please-don't-call-me-directly copy of ICU arrived back in macOS
> 10.3[1].  I don't see it on my 12.4 system, but I also know that 12.x
> started hiding system libraries completely (the linker is magic and
> pulls libraries from some parallel dimension, there is no
> /usr/lib/libSystem.B.dylib file on disk, and yet otool -L
> <your_favourite_executable> references it).
>
> It's a lovely client machine, but I don't know if anyone really runs
> meaningful database server stuff on macOS.  I think if I did I'd be
> very keen to use ICU for everything directly, rather than trying to
> unpick any of that and talk to Apple's API...  I think the
> how-to-support-multiple-ICUs subrant/subthread is a much more
> interesting topic.  I have no idea if the dlopen() concept I mentioned
> is the right way forward, but FWIW the experimental patch I posted
> seems to work just fine on a Mac, using multiple ICU libraries
> installed by MacPorts, which might be useful to developers
> contemplating that stuff.
>
> [1] https://lists.apple.com/archives/xcode-users/2005/Jun/msg00633.html



Re: Collation version tracking for macOS

From
Jeremy Schneider
Date:
> On Jun 8, 2022, at 03:19, Thomas Munro <thomas.munro@gmail.com> wrote:
>
> On Wed, Jun 8, 2022 at 12:23 PM Peter Geoghegan <pg@bowt.ie> wrote:
>> ISTM that there are two mostly-distinct questions here:
>>
>> 1. How do we link to multiple versions of ICU at the same time, in a
>> way that is going to work smoothly on mainstream platforms?
>>
> Yeah.  Well I couldn't resist doing some (very!) experimental hacking.
> See attached.


Even if PG supports two versions of ICU, how does someone actually go about removing every dependency on the old
versionand replacing it with the new? Can it be done without downtime? Can it be done without modifying a running
application?Avoiding “collate” clauses on SQL statements requires working behind the scenes with defaults and indexes
andpartitions and constraints and everything else. I’m having a hard time coming up with a way this would be possible
inpractice, with all the places collations can show up. 

Is the idea of “alter database” to change the default collation even realistic?

I’m having a bit of trouble picturing what the end game is here

-Jeremy


Sent from my TI-83




Re: Collation version tracking for macOS

From
Peter Geoghegan
Date:
On Wed, Jun 8, 2022 at 10:24 PM Jeremy Schneider
<schneider@ardentperf.com> wrote:
> Even if PG supports two versions of ICU, how does someone actually go about removing every dependency on the old
versionand replacing it with the new?
 

They simply REINDEX, without changing anything. The details are still
fuzzy, but at least that's what I was thinking of.

This should be possible by generalizing the definition of a collation
to recognize that different ICU versions can support the same
collation. Of course we'd also have to remember which actual ICU
version and specific "physical collation" was currently in use by each
index. We'd also probably have to have some policy about which ICU
version was the latest (or some suitably generalized version of that
that applies to collation providers more generally).

> Can it be done without downtime? Can it be done without modifying a running application?

Clearly the only way that we can ever transition to a new "physical
collation" is by reindexing using a newer ICU version. And clearly
there is going to be a need to fully deprecate any legacy version of
ICU on a long enough timeline. There is just no getting around that.

The advantage of an approach along the lines that I've laid out is
that everything can be done incrementally, possibly some time after an
initial OS or Posgres upgrade, once everything has settled. Much much
later, even. If the same new ICU version isn't available in your
original/old environment (which is likely), you can avoid reindexing,
and so reserve the option of backing out of a complex upgrade until
very late in the process. You're going to have to do it eventually,
but it can probably just be an afterthought.

-- 
Peter Geoghegan



Re: Collation version tracking for macOS

From
Peter Geoghegan
Date:
On Wed, Jun 8, 2022 at 10:39 PM Peter Geoghegan <pg@bowt.ie> wrote:
> They simply REINDEX, without changing anything. The details are still
> fuzzy, but at least that's what I was thinking of.

As I said before, BCP47 format tags are incredibly forgiving by
design. So it should be reasonable to assume that anything that has
worked in an earlier version of ICU will continue to work in a way
that's at least as useful in a future version. See:

https://www.postgresql.org/message-id/CAH2-Wz=ZrA5Yf55pKtdJb2pYCVN=2dh__VGR9arQqOHMqWgQPg@mail.gmail.com

That's not strictly guaranteed, because sometimes countries cease to
exist, and their ISO country codes eventually go away too. But that
still tends to fail gracefully. It's mostly only relevant for things
that are part of a locale, which is a broader concept than just
collation. An application that did this and relied on ICU for
localization might then find that the currency sign changed, but I'm
not aware of any impact on locales. You can ask for total nonsense
and mostly get reasonable behaviors, like Japanese as spoken in
Iceland. Even some totally made up (or misspelled) country is
accepted without complaint.

--
Peter Geoghegan



Re: Collation version tracking for macOS

From
Jeremy Schneider
Date:
> On Jun 8, 2022, at 22:40, Peter Geoghegan <pg@bowt.ie> wrote:
>
> On Wed, Jun 8, 2022 at 10:24 PM Jeremy Schneider
> <schneider@ardentperf.com> wrote:
>> Even if PG supports two versions of ICU, how does someone actually go about removing every dependency on the old
versionand replacing it with the new? 
>
> They simply REINDEX, without changing anything. The details are still
> fuzzy, but at least that's what I was thinking of.
>
>> Can it be done without downtime? Can it be done without modifying a running application?
>
> Clearly the only way that we can ever transition to a new "physical
> collation" is by reindexing using a newer ICU version. And clearly
> there is going to be a need to fully deprecate any legacy version of
> ICU on a long enough timeline. There is just no getting around that.


I’m probably just going to end up rehashing the old threads I haven’t read yet…

One challenge with this approach is you have things like sort-merge joins that require the same collation across
multipleobjects. So I think you’d need to keep all the old indexes around until you have new indexes available for all
objectsin a database, and somehow the planner would need to be smart enough to dynamically figure out old vs new
versionson a query-by-query basis. May need an atomic database-wide cutover; running a DB with internally mixed
collationversions doesn’t seem like a small challenge. It would require enough disk space for two copies of all
indexes,and queries would change which indexes they use in a way that wouldn’t be immediately obvious to users or app
dev.Suddenly switching to or from a differently-bloated index could result in confusing and sudden performance changes. 

Also there would still need to be a plan to address all the other non-index objects where collation is used, as has
beenmentioned before. 

And given the current architecture, that final “alter database update default collation” command still seems awful
risky,bug-prone and difficult to get correct. At least it seems that way to me. 

At a minimum, this is a very big project and it seems to me like it may be wise to get more end-to-end fleshing out of
theplans before committing incremental pieces in core (which could end up being misguided if the plan doesn’t work as
wellas assumed). Definitely doesn’t seem to me like anything that will happen in a year or two. 

And my opinion is that the problems caused by depending on OS libraries for collation need to be addressed on a shorter
timelinethan what’s realistic for inventing a new way for a relational database to offer transparent or online upgrades
oflinguistic collation versions. 

Also I still think folks are overcomplicating this by focusing on linguistic collation as the solution. Like 1% of
usersactually need or care about having the latest technically correct local-language-based sorting, at a database
level.MySQL did the right thing here by doing what every other RDBMS did, and just making a simple “good-enough”
collationhardcoded in the DB, same across all platforms, that never changes. 

The 1% of users who need true linguistic collation can probably deal with the trade-off of dump-and-load upgrades for
theirICU indexes and databases for a few more years. 

-Jeremy


Sent from my TI-83




Re: Collation version tracking for macOS

From
Peter Geoghegan
Date:
On Thu, Jun 9, 2022 at 10:54 AM Jeremy Schneider
<schneider@ardentperf.com> wrote:
> I’m probably just going to end up rehashing the old threads I haven’t read yet…
>
> One challenge with this approach is you have things like sort-merge joins that require the same collation across
multipleobjects. So I think you’d need to keep all the old indexes around until you have new indexes available for all
objectsin a database, and somehow the planner would need to be smart enough to dynamically figure out old vs new
versionson a query-by-query basis. 

I don't think that it would be fundamentally difficult to have the
planner deal with collations at the level required to avoid incorrect
query plans.

I'm not suggesting that this is an easy project, or that the end
result would be totally free of caveats, such as the issue with merge
joins. I am only suggesting that something like this seems doable.
There aren't that many distinct high level approaches that could
possibly decouple upgrading Postgres/the OS from reindexing. This is
one.

> And my opinion is that the problems caused by depending on OS libraries for collation need to be addressed on a
shortertimeline than what’s realistic for inventing a new way for a relational database to offer transparent or online
upgradesof linguistic collation versions. 

But what does that really mean? You can use ICU collations as the
default for the entire cluster now. Where do we still fall short? Do
you mean that there is still a question of actively encouraging using
ICU collations?

I don't understand what you're arguing for. Literally everybody agrees
that the current status quo is not good. That much seems settled to
me.

> Also I still think folks are overcomplicating this by focusing on linguistic collation as the solution.

I don't think that's true; I think that everybody understands that
being on the latest linguistic collation is only very rarely a
compelling feature. The whole way that BCP47 tags are so forgiving is
entirely consistent with that view of things.

But what difference does it make? As long as you accept that any
collation *might* need to be updated, or the default ICU version might
change on OS upgrade, then you have to have some strategy for dealing
with the transition. Not being on a very old obsolete version of ICU
will eventually become a "compelling feature" in its own right.

I believe that EDB adopted ICU many years ago, and stuck with one
vendored version for quite a few years. And eventually being on a very
old version of ICU became a real problem.

--
Peter Geoghegan



Re: Collation version tracking for macOS

From
Peter Geoghegan
Date:
On Thu, Jun 9, 2022 at 10:54 AM Jeremy Schneider
<schneider@ardentperf.com> wrote:
> MySQL did the right thing here by doing what every other RDBMS did, and just making a simple “good-enough” collation
hardcodedin the DB, same across all platforms, that never changes. 

That's not true. Both SQL Server and DB2 have some notion of
collations that are versioned.

Oracle may not, but then Oracle also handles collations by indexing
strxfrm() blobs, with all of the obvious downsides that that entails
(far larger indexes, issues with index-only scans). That seems like an
excellent example of what not to do.

--
Peter Geoghegan



Re: Collation version tracking for macOS

From
"Finnerty, Jim"
Date:
Specifying the library name before the language-country code with a new separator  (":") as you suggested below has
somebenefits. Did you consider making the collation version just another collation attribute, such as colStrength,
colCaseLevel,etc.?  
 
For example, an alternate syntax might be:  

    create collation icu63."en-US-x-icu" (provider = icu, locale = 'en-US@colVersion=63');

Was the concern that ICU might redefine a new collation property with the same name in a different and incompatible way
(wemight work with the ICU developers to agree on what it should be), or that a version is just not the same kind of
collationproperty as the other collation properties?
 

(in the example above, I'm assuming that for provider = icu, we could translate '63' into  'libicui18n.so.63'
automatically.)


On 6/8/22, 6:22 AM, "Thomas Munro" <thomas.munro@gmail.com> wrote:

<snip>
    postgres=# create collation icu63."en-US-x-icu" (provider = icu,
    locale = 'libicui18n.so.63:en-US');
    CREATE COLLATION
<snip>


Re: Collation version tracking for macOS

From
Peter Geoghegan
Date:
On Thu, Jun 9, 2022 at 2:20 PM Finnerty, Jim <jfinnert@amazon.com> wrote:
> Specifying the library name before the language-country code with a new separator  (":") as you suggested below has
somebenefits. Did you consider making the collation version just another collation attribute, such as colStrength,
colCaseLevel,etc.?
 
> For example, an alternate syntax might be:
>
>     create collation icu63."en-US-x-icu" (provider = icu, locale = 'en-US@colVersion=63');

Why would a user want to specify an ICU version in DDL? Wouldn't that
break in the event of a dump and reload of the database, for example?
It also strikes me as being inconsistent with the general philosophy
for ICU and the broader BCP45 IETF standard, which is "interpret the
locale string to the best of our ability, never throw an error".

Your proposed syntax already "works" today! You just need to create a
schema called icu63 -- then the command executes successfully (for
certain values of successfully).

I'm not arguing against the need for something like this. I'm just
pointing out that there are good reasons to imagine that it would
largely be an implementation detail, perhaps only used to
unambiguously identify which specific ICU version and locale string
relate to which on-disk relfilenode structure currently.

-- 
Peter Geoghegan



Re: Collation version tracking for macOS

From
Thomas Munro
Date:
On Fri, Jun 10, 2022 at 9:20 AM Finnerty, Jim <jfinnert@amazon.com> wrote:
> Specifying the library name before the language-country code with a new separator  (":") as you suggested below has
somebenefits. 

One of the reasons for putting some representation of desired library
into the colliculocale column (rather than, say, adding a new column
pg_collation) is that I think we'd also want to be able to put that
into daticulocale (for the database default collation, when using
ICU).  But really I just did that because it was easy... perhaps, both
pg_collation and pg_database could gain a new column, and that would
be a little more pleasing from a schema design point of view (1NF
atomicity, and it's a sort of foreign key, or at least it would be if
there were another catalog to list library versions...)?

> Did you consider making the collation version just another collation attribute, such as colStrength, colCaseLevel,
etc.?
> For example, an alternate syntax might be:
>
>     create collation icu63."en-US-x-icu" (provider = icu, locale = 'en-US@colVersion=63');

Hmm, I hadn't considered that.  (I wouldn't call it "col" version BTW,
it's a library version, and we don't want to overload our terminology
for collation version.  We'd still be on the look out for collversion
changes coming from a single library's minor version changing, for
example an apt-get upgrade can replace the .63 files, which on most
systems are symlinks to .63.1, .63.2 etc. ☠️)

> Was the concern that ICU might redefine a new collation property with the same name in a different and incompatible
way(we might work with the ICU developers to agree on what it should be), or that a version is just not the same kind
ofcollation property as the other collation properties? 

Well my first impression is that we don't really own that namespace,
and since we're using this to decide which library to route calls to,
it seems nicer to put it at a "higher level" than those properties.
So I'd prefer something like "63:en-US", or 63 in a new column.

> (in the example above, I'm assuming that for provider = icu, we could translate '63' into  'libicui18n.so.63'
automatically.)

Yeah.  My patch that jams a library name in there was just the fastest
way I could think of to get something off the ground to test whether I
could route calls to different libraries (yes!), though at one moment
I thought it wasn't terrible.  But aside from any aesthetic complaints
about that way of doing it, it turns out not to be enough: we need to
dlopen() two different libraries, because we also need some ctype-ish
functions from this guy:

$ nm -D -C /usr/lib/x86_64-linux-gnu/libicuuc.so.63.1 | grep u_strToUpper
00000000000d22c0 T u_strToUpper_63

I guess we probably want to just put "63" somewhere in pg_collation,
as you say.  But then, teaching PostgreSQL how to expand that to a
name that is platform/packaging dependent seems bad.  The variations
would probably be minor; on a Mac it's .dylib, on AIX it may be .a,
and the .63 convention may not be universal, I dunno, but some systems
might need absolute paths (depending on ld.so.conf etc), but that's
all stuff that I think an administrator should care about, not us.

Perhaps there could be a new catalog table just for that.  So far I
have imagined there would still be one special ICU library linked at
build time, which doesn't need to be dlopen'd, and works automatically
without administrators having to declare it.  So a system that has one
linked-in library version 67, and then has two extras that have been
added by an administrator running some new DDL commands might have:

postgres=# select * from pg_icu_library order by version;
 version |    libicuuc    |    libicui18n
---------+----------------+------------------
      58 | libicuuc.so.58 | libicui18n.so.58
      63 | libicuuc.so.63 | libicui18n.so.63
      67 |                |
(3 rows)

Suppose you pg_upgrade to something that is linked against 71.
Perhaps you'd need to tell it how to dlopen 67 before you can open any
collations with that library, but once you've done that your
collation-dependent partition constraints etc should all hold.  I
dunno, lots of problems to figure out here, including quite broad ones
about various migration problems.  I haven't understood what Peter G
is suggesting about how upgrades might work, so I'll go and try to do
that...



Re: Collation version tracking for macOS

From
Peter Geoghegan
Date:
On Thu, Jun 9, 2022 at 4:23 PM Thomas Munro <thomas.munro@gmail.com> wrote:
> Suppose you pg_upgrade to something that is linked against 71.
> Perhaps you'd need to tell it how to dlopen 67 before you can open any
> collations with that library, but once you've done that your
> collation-dependent partition constraints etc should all hold.  I
> dunno, lots of problems to figure out here, including quite broad ones
> about various migration problems.  I haven't understood what Peter G
> is suggesting about how upgrades might work, so I'll go and try to do
> that...

I'm mostly just arguing for the idea that we should treat ICU versions
as essentially interchangeable in terms of their high-level
capabilities around collations and languages/scripts/whatever provided
for by the underlying CLDR version -- tools like pg_dump shouldn't
need to care about ICU versions per se. *ICU itself* should be
versioned, rather than having multiple independent ICU collation
providers. This should work as well as anything like this can ever be
expected to work -- because internationalization is just hard.

These remarks need to be interpreted in the context of how
internationalization is *supposed* to work under standards like BCP47
(again, this is a broad RFC about internationalization, not really an
ICU thing). Natural languages are inherently squishy, messy things.
The "default ICU collations" that initdb puts in pg_collation are not
really special to ICU -- we generate them through a quasi-arbitrary
process that iterates through top-level locales, which results in a
list that is a bit like what you get with libc collations. If you
pg_upgrade, you might have leftover "default ICU collations" that
wouldn't have been the default on a new initdb. It's inherently pretty
chaotic (because humans aren't as predictable as computers), which is
why BCP47 itself is so forgiving -- it literally has to be. Plus there
really isn't much downside to being so lax; as Jeremy pretty much said
already, the important thing is generally to have roughly the right
idea -- which this fuzzy approach mostly manages to do.

Let's not fight that. Let's leave the natural language stuff to the
experts, by versioning a single collation provider (like ICU), and
generalizing the definition of a collation along the same lines --
something that can be implemented using any available version of ICU
(with a preference for the latest on REINDEX, perhaps). It might turn
out that an older version does a slightly better job than a newer
version (regressions cannot be ruled out), but ultimately that's not
our problem. It can't be -- we're not the unicode consortium.

It's theoretically up to the user to make sure they're happy with any
behavioral changes under this scheme, perhaps by testing. They won't
actually test very often, of course, but that shouldn't matter in
practice. This is already what we advise for users that use advanced
tailorings of custom ICU collations, such as a custom collation for
"natural sorting", often used for things like alphanumeric invoice
numbers. That might break if you downgrade ICU version, and maybe even
if you upgrade ICU version.

--
Peter Geoghegan



Re: Collation version tracking for macOS

From
Thomas Munro
Date:
On Fri, Jun 10, 2022 at 10:29 AM Peter Geoghegan <pg@bowt.ie> wrote:
> On Thu, Jun 9, 2022 at 2:20 PM Finnerty, Jim <jfinnert@amazon.com> wrote:
> > For example, an alternate syntax might be:
> >
> >     create collation icu63."en-US-x-icu" (provider = icu, locale = 'en-US@colVersion=63');
>
> Why would a user want to specify an ICU version in DDL? Wouldn't that
> break in the event of a dump and reload of the database, for example?
> It also strikes me as being inconsistent with the general philosophy
> for ICU and the broader BCP45 IETF standard, which is "interpret the
> locale string to the best of our ability, never throw an error".
>
> Your proposed syntax already "works" today! You just need to create a
> schema called icu63 -- then the command executes successfully (for
> certain values of successfully).

Jim was proposing the @colVersion=63 part, but the schema part came
from my example upthread.  That was from a real transcript, and I
included that  because the way I've been thinking of this so far has
distinct collation OIDs for the "same" collation from different ICU
libraries, and yet I want them to have the same collname.  That is, I
don't want (say) "en-US-x-icu63" and "en-US-x-icu71"... I thought it'd
be nice to keep using "en-US-x-icu" as we do today, so if there are
two of them they'd *have* to be in different schemas.  That has the
nice property that you can use the search_path to avoid mentioning it.
But I'm not at all wedded to that idea, or any other ideas in this
thread, just trying stuff out...

However, since you mentioned that a simple REINDEX would get you from
one library version to another, I think we're making some completely
different assumptions somewhere along the line, and I don't get your
idea yet.  It sounds like you don't want two different collation OIDs
in that case?

The (vastly too) simplistic way I was thinking of it, if you have a
column with an ICU 63 collation, to switch to ICU 67 you first do some
DDL to add ICU 67 to your system and import 67's collations (creating
new collation OIDs), and then eg ALTER TABLE foo ALTER COLUMN bar TYPE
text COLLATE icu67."en-US-x-icu", which will rebuild your indexes.
That's a big job, and doesn't address how you switch the database
default collation.  None of that is very satisfying, much more thought
needed, but it falls out of the decision to have distinct
icu63."en-US-x-icu" and icu67."en-US-x-icu".  You seem to have some
other idea in mind where the system only knows about one
"en-US-x-icu", but somehow, somewhere else (where?), keeps track of
which indexes were built with ICU 63 and which with ICU 67, which I
don't yet grok.  Or did I misunderstand?



Re: Collation version tracking for macOS

From
Peter Geoghegan
Date:
On Thu, Jun 9, 2022 at 5:18 PM Thomas Munro <thomas.munro@gmail.com> wrote:
> However, since you mentioned that a simple REINDEX would get you from
> one library version to another, I think we're making some completely
> different assumptions somewhere along the line, and I don't get your
> idea yet.  It sounds like you don't want two different collation OIDs
> in that case?

Not completely sure about the REINDEX behavior, but it's at least an
example of the kind of thing that could be enabled. I'm proposing that
pg_collation-wise collations have the most abstract possible
definitions -- "logical collations", which are decoupled from
"physical collations" that actually describe a particular ICU collator
associated with a particular ICU version (all the information that
keeps how the on-disk structure is organized for a given relfilenode
straight). In other words, the definition of a collation is the user's
own definition. To the user, it's pretty close to (maybe even exactly)
a BCP47 string, now and forever.

You can make arguments against the REINDEX behavior. And maybe those
arguments will turn out to be good arguments. Assuming that they are,
then the solution may just be to have a special option that will make
the REINDEX use the most recent library.

The important point is to make the abstraction as high level as
possible from the point of view of users.

> You seem to have some
> other idea in mind where the system only knows about one
> "en-US-x-icu", but somehow, somewhere else (where?), keeps track of
> which indexes were built with ICU 63 and which with ICU 67, which I
> don't yet grok.  Or did I misunderstand?

That's what I meant, yes -- you got it right.

Another way to put it would be to go as far as we can in the direction
of decoupling the concerns that we have as database people from the
concerns of natural language experts. Let's not step on their toes,
and let's avoid having our toes trampled on.

-- 
Peter Geoghegan



Re: Collation version tracking for macOS

From
Tobias Bussmann
Date:
Thanks for picking this up!

> How can I see evidence of this?  I'm comparing Debian, FreeBSD and
> macOS 12.4 and when I run "LC_COLLATE=en_US.UTF-8 sort
> /usr/share/dict/words" I get upper and lower case mixed together on
> the other OSes, but on the Mac the upper case comes first, which is my
> usual smoke test for "am I looking at binary sort order?"

Perhaps I can shed some light on this matter:

Apple's libc collations have always been a bit special in that concern, even for the non-UTF8 ones. Rooted in ancient
FreeBSDthey "try to keep collating table backward compatible with ASCII" thus upper and lower cases characters are
separated(There are exceptions like 'cs_CZ.ISO8859-2'). The latest public sources I can find are in adv_cmds-119 [1]
whichbelongs to OSX 10.5 [2] - these correspond to the ones used in FreeBSD till v10 [3], whereby the timestamps rather
pointits origin around FreeBSD 5. Further, there are only very few locales actually present on macOS (36 - none of it
supportingUnicode) and these have not changed for a very long time (I verified that from OS X 10.6.8 till macOS 12.4
[4],exception is a 'de_DE-A.ISO8859-1' present only in macOS 10.15). 

What they do instead is symlinking [5] missing collations to similar ones even across encodings, often resulting in
la_LN.US-ASCII('la_LN' seem to stand for a Latin meta language) being used which is exactly byte order [6]. These
symlinkshave not changed [7] from OS X 10.6.8 till macOS 10.15.7. But in macOS 11 many of these symlinks changed their
target.So did the popular 'en_US.UTF-8' from 'la_LN.US-ASCII' to 'la_LN.ISO8859-1' or 'de_DE.UTF-8' from
'la_LN.US-ASCII'to 'de_DE.ISO8859-1'. In effect, about half of the UTF-8 collations change from no collation to
partial/brokencollation support. macOS 12 again shows no changes - tests for macOS 13 are outstanding. 

# tl:dr;

With your smoke test "sort /usr/share/dict/words" on a modern macOS you won't see a difference between "C" and
"en_US.UTF-8"but with "( echo '5£'; echo '£5' ) | LC_COLLATE=en_US.UTF-8 sort" you can produce a difference against "(
echo'5£'; echo '£5' ) | LC_COLLATE=C sort". Or test with "diff -q <(LC_COLLATE=C sort /usr/share/dict/words)
<(LC_COLLATE=es_ES.UTF-8sort /usr/share/dict/words)" 

The upside is that we don't have to cope with the new characters added in every version of Unicode (although I have not
examinedLC_CTYPE yet). 

best regards
Tobias

[1]: https://github.com/apple-oss-distributions/adv_cmds/tree/adv_cmds-119/usr-share-locale.tproj/colldef
[2]: https://opensource.apple.com/releases/
[3]: https://github.com/freebsd/freebsd-src/tree/stable/10/share/colldef
[4]: find /usr/share/locale/*/LC_COLLATE -type f -exec md5 {} \;
[5]: https://github.com/apple-oss-distributions/adv_cmds/blob/adv_cmds-119/usr-share-locale.tproj/colldef/BSDmakefile
[6]:
https://github.com/apple-oss-distributions/adv_cmds/blob/adv_cmds-119/usr-share-locale.tproj/colldef/la_LN.US-ASCII.src
[7]: find /usr/share/locale/*/LC_COLLATE -type l -exec stat -f "%N%SY" {} \;


Re: Collation version tracking for macOS

From
Thomas Munro
Date:
On Fri, Jun 10, 2022 at 12:32 PM Peter Geoghegan <pg@bowt.ie> wrote:
> On Thu, Jun 9, 2022 at 5:18 PM Thomas Munro <thomas.munro@gmail.com> wrote:
> > You seem to have some
> > other idea in mind where the system only knows about one
> > "en-US-x-icu", but somehow, somewhere else (where?), keeps track of
> > which indexes were built with ICU 63 and which with ICU 67, which I
> > don't yet grok.  Or did I misunderstand?
>
> That's what I meant, yes -- you got it right.

OK, I see now.

I think if you design a system to record the library that each index
(and constraint, ...) was built with, it'd surely finish up being at
least conceptually something like the system Julien and I built and
then reverted in ec483147.  Except that it'd be a stronger form of
that, because instead of just squawking when the version is not the
latest/current version, it'd keep working but route collations to the
older library for indexes that haven't been rebuilt yet.

That sounds nice, but introduces subtle problems for the planner.  For
example, pathkeys that look compatible might not be, when
merge-joining an ICU 63 index scan against an ICU 67 index scan.  You
could teach it about that, whereas with my distinct OID concept they
would already be considered non-matching automatically.



Re: Collation version tracking for macOS

From
Peter Geoghegan
Date:
On Thu, Jun 9, 2022 at 5:59 PM Thomas Munro <thomas.munro@gmail.com> wrote:
> That sounds nice, but introduces subtle problems for the planner.  For
> example, pathkeys that look compatible might not be, when
> merge-joining an ICU 63 index scan against an ICU 67 index scan.  You
> could teach it about that, whereas with my distinct OID concept they
> would already be considered non-matching automatically.

Right -- my proposal is likely to be more difficult to implement.
Seems like it might be worth going to the trouble of teaching the
planner about this difference, though.

That exact issue seems like the true underlying problem to me: we have
two sets of behaviors for a given collation, that are equivalent for
some purposes (the user thinks of them as totally interchangeable),
but not for other purposes (we can't expect old indexes to continue to
work with a new physical collation for their logical collation). So
directly tackling that seems natural to me.

-- 
Peter Geoghegan



Re: Collation version tracking for macOS

From
Thomas Munro
Date:
On Fri, Jun 10, 2022 at 1:06 PM Peter Geoghegan <pg@bowt.ie> wrote:
> On Thu, Jun 9, 2022 at 5:59 PM Thomas Munro <thomas.munro@gmail.com> wrote:
> > That sounds nice, but introduces subtle problems for the planner.  For
> > example, pathkeys that look compatible might not be, when
> > merge-joining an ICU 63 index scan against an ICU 67 index scan.  You
> > could teach it about that, whereas with my distinct OID concept they
> > would already be considered non-matching automatically.
>
> Right -- my proposal is likely to be more difficult to implement.
> Seems like it might be worth going to the trouble of teaching the
> planner about this difference, though.

Well I can report that the system from ec483147 was hellishly
complicated, and not universally loved.  Which isn't to say that there
isn't a simple and loveable way to do it, waiting to be discovered,
and I do think we could fix most of the problems with that work.  It's
just that I was rather thinking of this new line of attack as being a
way to avoid the complications of identifying dependencies on moving
things through complicated analysis of object graphs and AST, by
instead attaching those slippery external things to the floor with a
nail gun.  That is, treating ICU 63 and ICU 67's collations as
completely unrelated.  I understand that that's not ideal from an
end-user perspective, but maybe it's more realistically and robustly
and simply implementable.  Hmm.



Re: Collation version tracking for macOS

From
Tobias Bussmann
Date:
Am 08.06.2022 um 16:16 schrieb Tom Lane <tgl@sss.pgh.pa.us>:
> The proposed patch would result in a warning about every collation-
> sensitive index during every macOS major version upgrade, ie about
> once a year for most people.
> We need something that has at least *some* connection to actual changes.

In Postgres.app we introduced default collation versioning and warnings about possible mismatches from outside the
actualserver. When the user runs initdb with the GUI wrapper, the OS version and a checksum of the LC_COLLATE file of
theused default collation is stored as meta-data. This allows to display a reindex warning on startup if the hash
changesor we hardcode a known incompatible OS change. 

Having collversion support on macOS within postgres would leverage the existing infrastructure for version change
warningsand enables support for multiple collations. But I agree, we need something more specific than the major OS
versionhere. Lacking any collation version information from the provider, a checksum on the binary LC_COLLATE file is
thebest I can come up with.  

Best regards,
Tobias


Re: Collation version tracking for macOS

From
Peter Geoghegan
Date:
On Thu, Jun 9, 2022 at 6:23 PM Thomas Munro <thomas.munro@gmail.com> wrote:
> Well I can report that the system from ec483147 was hellishly
> complicated, and not universally loved.  Which isn't to say that there
> isn't a simple and loveable way to do it, waiting to be discovered,
> and I do think we could fix most of the problems with that work.

I admit that I don't have much idea of how difficult it would be to
make it all work. I'm definitely not claiming that it's easy.

> I understand that that's not ideal from an
> end-user perspective, but maybe it's more realistically and robustly
> and simply implementable.  Hmm.

That may be a decisive reason to go with your proposal. I really don't know.

-- 
Peter Geoghegan



Re: Collation version tracking for macOS

From
Thomas Munro
Date:
On Fri, Jun 10, 2022 at 1:48 PM Peter Geoghegan <pg@bowt.ie> wrote:
> On Thu, Jun 9, 2022 at 6:23 PM Thomas Munro <thomas.munro@gmail.com> wrote:
> > Well I can report that the system from ec483147 was hellishly
> > complicated, and not universally loved.  Which isn't to say that there
> > isn't a simple and loveable way to do it, waiting to be discovered,
> > and I do think we could fix most of the problems with that work.
>
> I admit that I don't have much idea of how difficult it would be to
> make it all work. I'm definitely not claiming that it's easy.

Hrrm... perhaps my memory of ec483147 is confusing me.  I think I'm
starting to come around to your idea a bit more now.  Let me sketch
out some more details here and see where this goes.

I *was* thinking that you'd have to find all references to collations
through static analysis, as we did in that version tracking project.
But perhaps for this you only need to record one ICU library version
for the whole index at build time, without any analysis at all, and it
would be used for any and all ICU collations that are reached while
evaluating anything to do with that index (index navigation, but also
eg WHERE clause for partial index, etc).  That would change to the
"current" value when you REINDEX.

Perhaps that could be modeled with a pg_depend row pointing to a
pg_icu_library row, which you'd probably need anyway, to prevent a
registered ICU library that is needed for a live index from being
dropped.  (That's assuming that the pg_icu_library catalogue concept
has legs...  well if we're going with dlopen(), we'll need *somewhere*
to store the shared object paths.  Perhaps it's not a given that we
really want paths in a table... I guess it might prevent certain
cross-OS streaming rep scenarios, but mostly that'd be solvable with
symlinks...)

One problem is that to drop an old pg_icu_library row, you'd have to
go and REINDEX everything, even indexes that don't really use
collations!  If you want to prove that an index doesn't use
collations, you're back in ec483147 territory.  Perhaps we don't care
about that and we're happy to let useless dependencies on
pg_icu_library rows accumulate, or to require useless work to be able
to drop them.

I'm not sure how we'd know what the "current" library version is.  The
highest numbered one currently in that pg_icu_library catalogue I
sketched?  So if I do whatever new DDL we invent to tell the system
about a new ICU library, and it's got a higher number than any others,
new indexes start using it but old ones keep using whatever they're
using.  Maybe with some way for users to override it, so users who
really want to use an older one when creating a new index can say so.

I suppose it would be the same for constraints.  For those,
considering that they need to be rechecked, the only way to change ICU
version would be to drop the constraint and recreate it.  Same goes
for range partitioned tables, right?  It'd keep using the old ICU
library until you drop the p table and create a new one, at which
point you're using the new current ICU library and it'll recheck all
your partitions against the constraints when you add them.  (Those
constraints are much simpler constants, so for those we could prove no
use of ICU without the general ec483147 beast.)

I think these things would have to survive pg_upgrade, but would be
lost on dump/restore.

There's still the pathkey problem to solve, and maybe some more
problems like that hiding somewhere.

I'm not sold on any particular plan, but working through some examples
helped me see your idea better...  I may try to code that up in a
minimal way so we can kick the tyres...



Re: Collation version tracking for macOS

From
Thomas Munro
Date:
On Fri, Jun 10, 2022 at 12:48 PM Tobias Bussmann <t.bussmann@gmx.net> wrote:
> Perhaps I can shed some light on this matter:

Hi Tobias,

Oh, thanks for your answers.  Definitely a few bits of interesting
archeology I was not aware of.

> Apple's libc collations have always been a bit special in that concern, even for the non-UTF8 ones. Rooted in ancient
FreeBSDthey "try to keep collating table backward compatible with ASCII" thus upper and lower cases characters are
separated(There are exceptions like 'cs_CZ.ISO8859-2'). 

Wow.  I see that I can sort the English dictionary the way most people
expect by pretending it's Czech.  What a mess!

> With your smoke test "sort /usr/share/dict/words" on a modern macOS you won't see a difference between "C" and
"en_US.UTF-8"but with "( echo '5£'; echo '£5' ) | LC_COLLATE=en_US.UTF-8 sort" you can produce a difference against "(
echo'5£'; echo '£5' ) | LC_COLLATE=C sort". Or test with "diff -q <(LC_COLLATE=C sort /usr/share/dict/words)
<(LC_COLLATE=es_ES.UTF-8sort /usr/share/dict/words)" 

I see, so it does *something*, just not what anybody wants.



Re: Collation version tracking for macOS

From
Thomas Munro
Date:
On Fri, Jun 10, 2022 at 4:30 PM Thomas Munro <thomas.munro@gmail.com> wrote:
> I'm not sold on any particular plan, but working through some examples
> helped me see your idea better...  I may try to code that up in a
> minimal way so we can kick the tyres...

I did a bit of hacking on that idea.  The goal was to stamp each index
with an ICU major version (not sure where, not done in the attached),
and if that doesn't match the library we're linked against, we'd try
to dlopen() libraries via symlinks with known name formats under
PGDATA/pg_icu_lib, which an administrator would have to create.  That
seemed a bit simpler than dealing with new catalogs for now...

See attached unfinished patch, which implements some of that.  It has
a single collation for en-US-x-icu, and routes calls to different
libraries depending on dynamic scope (which in cold hard reality
translates into a nasty global variable "current_icu_library").  The
idea was that it would normally point to the library we're linked
against, but whenever computing anything related to an index stamped
with ICU 63, we'd do pg_icu_activate_major_version(63), and afterwards
undo that.  Performance concerns aside, that now seems a bit too ugly
and fragile to me, and I gave up.  How could we convince ourselves
that we'd set the active ICU library correctly in all the required
dynamic scopes, but not leaked it into any other scopes?  Does that
even make sense?  But if not done like that, how else could we do it?

Better ideas/code welcome.

Executive summary of experiments so far: the "distinct collations"
concept is quite simple and robust, but exposes all the versions to
users and probably makes it really hard to upgrade (details not worked
out), while the "time travelling collations" concept is nice for users
but hard to pin down and prove correctness for since it seems to
require dynamic scoping/global state changes affecting code in far
away places.

Attachment

Re: Collation version tracking for macOS

From
Peter Geoghegan
Date:
On Fri, Jun 10, 2022 at 6:48 PM Thomas Munro <thomas.munro@gmail.com> wrote:
> Executive summary of experiments so far: the "distinct collations"
> concept is quite simple and robust, but exposes all the versions to
> users and probably makes it really hard to upgrade (details not worked
> out), while the "time travelling collations" concept is nice for users
> but hard to pin down and prove correctness for since it seems to
> require dynamic scoping/global state changes affecting code in far
> away places.

It didn't really occur to me until now that the pg_dump problems that
come with the approach you outlined ("distinct collations") are likely
to be total blockers, and not just disadvantages. It's not just ICU
that prefers approximately correct behavior over throwing an "unknown
collation" error -- the same could be said for pg_dump itself. After
all, pg_dump doesn't care about collation versions -- except when run
in binary mode, for pg_upgrade, where it must satisfy the requirements
of pg_upgrade.

Even today we could be restoring to a server with an older ICU
version, where in general we might not get exactly the behavior the
user expects (though usually only when they've decided to use advanced
features like custom tailorings). So pg_dump already deliberately
disregards the ICU version, for essentially the same reasons that I
argued were good reasons upthread (when talking about a
multi-ICU-version Postgres via "time travelling collations").

Some more thoughts on "time travelling collations":

Doing a version switch in one atomic operation (like a special
REINDEX) isn't going to be practical. We need to be prepared for cases
where a database has a mix of indexes with old and new physical
collations. We certainly cannot allow queries to give wrong
answers...but I tend to doubt that (say) making merge joins work with
two indexes with different physical collations (though matching
logical collations) actually makes much sense. Maybe we can cut scope
in a pragmatic way instead.

The special REINDEX (or whatever) won't work as an atomic
operation...but that doesn't mean that the system as a whole will have
a mix of old and new physical collations forever, or even for very
long. So while everything still has to work correctly, mediocre
performance with certain kinds of plan shapes might be okay.

As you kind of said yourself yesterday, "time travelling collations"
could naturally have an invariant that worked at the index/constraint
level (not the collation level): any given index needs to use only one
underlying ICU version at the same time, no matter what. The invariant
could perhaps be leveraged in the planner -- perhaps we start out with
a working assumption that *every* index is on the newer ICU version
(when at least one index is on the lastest and "now current" version),
and work backwards by excluding indexes that still have old physical
collations. Defining the problem as a problem with old
indexes/constraints only seems like it might make things a lot easier.

-- 
Peter Geoghegan



Re: Collation version tracking for macOS

From
Peter Geoghegan
Date:
On Thu, Jun 9, 2022 at 9:31 PM Thomas Munro <thomas.munro@gmail.com> wrote:
> Perhaps that could be modeled with a pg_depend row pointing to a
> pg_icu_library row, which you'd probably need anyway, to prevent a
> registered ICU library that is needed for a live index from being
> dropped.  (That's assuming that the pg_icu_library catalogue concept
> has legs...  well if we're going with dlopen(), we'll need *somewhere*
> to store the shared object paths.  Perhaps it's not a given that we
> really want paths in a table... I guess it might prevent certain
> cross-OS streaming rep scenarios, but mostly that'd be solvable with
> symlinks...)

Do we even need to store a version for indexes most of the time if
we're versioning ICU itself, as part of the "time travelling
collations" design? For that matter, do we even need to version
collations directly anymore?

I'm pretty sure that the value of pg_collation.collversion is always
the same in practice, or has a lot of redundancy. Because mostly it's
just an ICU version. This is what I see on my system, at least:

pg@regression:5432 [53302]=# select count(*), collversion from
pg_collation where collprovider = 'icu' group by 2;
 count │ collversion
───────┼─────────────
   329 │ 153.112.41
   471 │ 153.112
(2 rows)

(Not sure why there are two different distinct collversion values
offhand, but generally looks like collversion isn't terribly
meaningful at the level of individual pg_collation entries.)

If indexes and constraints with old physical collations are defined as
being the exception to the general rule (the rule meaning "every index
uses the current ICU version for the database as a whole"), and if
those indexes/constraints are enumerated and stored (in a new system
catalog) when a switchover of the database's ICU version is first
initialized, then there might not be any meaningful dependency to
speak of. Not for indexes, at least.

The *database as a whole* is dependent on the current version of ICU
-- it's not any one index. Very occasionally the database will also be
dependent on a single older ICU version that we're still transitioning
away from. There is a "switch-a-roo" going on, but not really at the
level of indexes -- it's a very specialized thing, that works at the
level of the whole database, and involves exactly 2 ICU versions. You
should probably be able to back out of it once it begins, but mostly
it's an inflexible process that just does what we need it to do.

Does something like that seem sensible to you?

--
Peter Geoghegan



Re: Collation version tracking for macOS

From
Thomas Munro
Date:
On Sat, Jun 11, 2022 at 2:29 PM Peter Geoghegan <pg@bowt.ie> wrote:
> The special REINDEX (or whatever) won't work as an atomic
> operation...but that doesn't mean that the system as a whole will have
> a mix of old and new physical collations forever, or even for very
> long. So while everything still has to work correctly, mediocre
> performance with certain kinds of plan shapes might be okay.

Yeah.  And if you don't like the negative effects of a system in
transition, you could also create new otherwise identical indexes, and
then only drop the old ones once that's done, and add enough brains to
keep everything working nicely on the old indexes until enough of the
new indexes exist.  Or something.  I was thinking that could also be
true for the "distinct collations" concept, only with different
problems to solve...

> As you kind of said yourself yesterday, "time travelling collations"
> could naturally have an invariant that worked at the index/constraint
> level (not the collation level): any given index needs to use only one
> underlying ICU version at the same time, no matter what. The invariant
> could perhaps be leveraged in the planner -- perhaps we start out with
> a working assumption that *every* index is on the newer ICU version
> (when at least one index is on the lastest and "now current" version),
> and work backwards by excluding indexes that still have old physical
> collations. Defining the problem as a problem with old
> indexes/constraints only seems like it might make things a lot easier.

Yes, that (posited) invariant was an enabling realisation for the
(unfinished, but IMHO useful to grok) v2 patch.  The disabling
realisation that stopped me from finishing it was that I doubt my
ability to find all the right places to wrap with
"pg_icu_activate_major_version(X)" and
"pg_icu_activate_major_version(-1)", and thus the whole approach.  Do
you know where to put all the ICU version switching regions, and how
to verify that they cover exactly all the right code, but don't leak
into any of the wrong code, and do you know where to store/retrieve X?
 It feels... flimsy to me, but how else could you make collations
behave differently when being called on behalf of some index rather
than some other thing, given no other context?  Explicitly passing a
pg_icu_library all over the place also sounds non-fun.

I'm also suspicious that there are more subtle hazards like pathkeys
lurking in the shadows.  We go to great effort to recognise matching
and non-matching collations by OID alone, which is why my first
attempt was "distinct [OIDs]", so that'd keep working.

I wondered if DB2's support looked more like "time travel" or
"distinct".  Based only on a quick glance at their manual[1], it looks
a bit like they have "I don't care" collations which are subject to
weirdness on upgrade when they change underneath your feet, and then
"distinct" collations which have an explicit prefix to pin down the
version (indirectly via CLDR version) and route to a specific library
(N copies of ICU that ship with it), with a note recommending the
latter for indexes and constraints.  So I'd guess you'll stay on the
old versions forever until you explicitly migrate data to a new
collation.

[1] https://www.ibm.com/docs/en/db2/11.1?topic=support-locale-names-sql-xquery



Re: Collation version tracking for macOS

From
Thomas Munro
Date:
On Sat, Jun 11, 2022 at 3:36 PM Peter Geoghegan <pg@bowt.ie> wrote:
> Do we even need to store a version for indexes most of the time if
> we're versioning ICU itself, as part of the "time travelling
> collations" design? For that matter, do we even need to version
> collations directly anymore?

They're still useful for non-ICU collations (for example FreeBSD and
Windows can tell you about version changes based on open standards),
and they're *maybe* still useful for ICU, considering that there
are minor version upgrades, though I hope that would never actually
detect a change if we built a multi-version system like what we are
discussing here.  Certainly they don't make sense in the current
catalog layout with TT collations, though, there's only one attribute
to cover N libraries (though the reverted version tracking thing would
handle it just fine, because that moved it into a per-index location).

I mention minor upgrade as a topic to poke at because the popular
Linux distros only allow major ICU versions to be installed
concurrently, but minor versions are also released from time to time
and replace the libraries (well, the .68 library is a symlink to
.68.1, and then changes to .68.2, following typical conventions, but
the packages don't let you have .68.1 and .68.2 at the same time).  To
pick a random example, ICU upgraded 68.1 -> 68.2 at one point, which a
bit of googling tells me included CLDR 38 -> CLDR 38.1.  It looks like
they tweaked a few super minor things.  Could such a change affect the
values that ucol_getVersion() reports?  This came up in the last round
of this stuff with Doole[1], but we didn't dig further and I still
don't know what to think about it.

[1] https://www.postgresql.org/message-id/CADE5jYJTnYaTNXMFKOK-0p44%2BDm5LMcRcJ5kVi1MVHomb2QTkQ%40mail.gmail.com



Re: Collation version tracking for macOS

From
Peter Geoghegan
Date:
On Fri, Jun 10, 2022 at 8:47 PM Thomas Munro <thomas.munro@gmail.com> wrote:
> I'm also suspicious that there are more subtle hazards like pathkeys
> lurking in the shadows.  We go to great effort to recognise matching
> and non-matching collations by OID alone, which is why my first
> attempt was "distinct [OIDs]", so that'd keep working.

It's definitely possible that we won't be able to find a workable
solution that deals with "time travel collations" sensibly from the
planner's perspective. It's certainly not a neat adjunct to what we
have. I think that it *might* be possible to find a way to make it
work that is suboptimal, but works. Without being overly clever.

The DB2 docs say "use the CLDR version prefix to avoid unexpected
changes in behavior when upgrading to future releases of the Db2
database". But if you don't do that, and get a change in behavior,
then surely any related indexes must have been rebuilt too. The
interesting part may be what that upgrade looks like in detail.

-- 
Peter Geoghegan



Re: Collation version tracking for macOS

From
Peter Geoghegan
Date:
On Fri, Jun 10, 2022 at 9:08 PM Thomas Munro <thomas.munro@gmail.com> wrote:
> They're still useful for non-ICU collations (for example FreeBSD and
> Windows can tell you about version changes based on open standards),
> and they're *maybe* still useful for ICU, considering that there
> are minor version upgrades, though I hope that would never actually
> detect a change if we built a multi-version system like what we are
> discussing here.

Right. I was mostly just asking this as a rhetorical question.

What about "time travel collations", but without the time travel part?
That is, what about supporting multiple ICU versions per cluster, but
not per database? So you could upgrade the OS and Postgres, using
standard packages that typically just use the latest ICU version --
typically, but not always. If you happen to have been on an older
version of ICU on upgrade, then that version of ICU will still work at
the level of a whole database -- your database. Maybe you can create
new databases with old and new ICU versions if you want to.

That obviously runs into the problem of needing to eventually do a
dump and reload -- but I suppose that "eventually" could be a very
long time. At least the OS package doesn't declare one version of ICU
the blessed version, now and forever, effectively vendoring ICU in a
backdoor fashion. At least old databases have significant runway,
while at the same time new databases that want to use the same
standard Postgres package aren't forced to use the same ancient ICU
version.

-- 
Peter Geoghegan



Re: Collation version tracking for macOS

From
Thomas Munro
Date:
On Sat, Jun 11, 2022 at 4:21 PM Peter Geoghegan <pg@bowt.ie> wrote:
> What about "time travel collations", but without the time travel part?
> That is, what about supporting multiple ICU versions per cluster, but
> not per database? So you could upgrade the OS and Postgres, using
> standard packages that typically just use the latest ICU version --
> typically, but not always. If you happen to have been on an older
> version of ICU on upgrade, then that version of ICU will still work at
> the level of a whole database -- your database. Maybe you can create
> new databases with old and new ICU versions if you want to.
>
> That obviously runs into the problem of needing to eventually do a
> dump and reload -- but I suppose that "eventually" could be a very
> long time. At least the OS package doesn't declare one version of ICU
> the blessed version, now and forever, effectively vendoring ICU in a
> backdoor fashion. At least old databases have significant runway,
> while at the same time new databases that want to use the same
> standard Postgres package aren't forced to use the same ancient ICU
> version.

Hmm.  I think that's effectively what you'd get using my "distinct
collation" patch (v1, or this much better v3, attached), if you put
version prefixes in colliculocale, and updated them in the template
database after an OS upgrade to affect new databases.  I realise you
probably mean something a little more automatic...

I think "pinned forever" ICU versions would be useful, because I think
there are very few expert users who want fine generalised control over
version changes, but almost all other users don't care at all about
any of this stuff -- as long as their indexes keep indexing and their
constraints keep constraining.  So I think you can make a lot of
people happy by ignoring the complexities of upgrades and providing a
way to nail the version down for the lifetime of the database.  Also,
it's not *impossible* to move to a later ICU, it's just a bit tricky;
the key point is that it's under your control if you want to do that,
independently of an OS upgrade, as you said.

Based on my reading of that DB2 manual page, I reckon my v3 "distinct
collation" patch is about as good as what they have.  If you don't
choose to use prefixes then later OS upgrades (ie upgrades that change
the version of ICU that PostgreSQL is linked against) might corrupt
your indexes and constraints -- I think that's what they're saying --
though at least we'll try to warn about that with our weak warning
system.  If you do choose to use prefixes you'll be stuck on that ICU
version forever, even across updates that cause PostgreSQL to be
linked to future releases of ICU, unless you're prepared to do a whole
bunch of careful upgrading work (perhaps with some future tooling to
help with that).

Attached is a much more fleshed out version of the "distinct
collation" patch (to be clear: v3 descends from v1, while v2 was at
attempt at the timelord approach).  Main changes:

1.  I now also route strTo{Upper,Lower,Title} to the right version of
ICU.  That requires dlopen'ing a second library.

2.  You create distinct collations with optional ICU major version
prefixes, along the lines of what Jim was suggesting.  For example
(and I'm not actually proposing -x-icu67 suffixes, just avoiding a
collision in my example):

postgres=# create collation "en-x-icu67" (provider = icu , locale = '67:en');
CREATE COLLATION
postgres=# create collation "en-x-icu63" (provider = icu , locale = '63:en');
CREATE COLLATION

3.  For versions other than the one we are linked against, it tries to
open libraries with typical names inside $libdir.  An administrator
could drop symlinks in there like so:

$ ln -s /usr/lib/x86_64-linux-gnu/libicui18n.so.63 ~/install/lib/postgresql/
$ ln -s /usr/lib/x86_64-linux-gnu/libicuuc.so.63 ~/install/lib/postgresql/

What I like about this $libdir scheme is that I imagine that we could
ask our friends in the packaging teams to create packages for that.
Users would then think of them in much the same way as extensions.
You'd just type:

$ sudo apt-get install postgresql-16-icu71

Happy to keep trying to figure out the competing and rather more
ambitious TT version too (that I sketched some bits of in v2), but I'm
a whole lot fuzzier on how that can work and kinda stuck on the
problems I raised.

Attachment

Re: Collation version tracking for macOS

From
Thomas Munro
Date:
On Sun, Jun 12, 2022 at 11:59 AM Thomas Munro <thomas.munro@gmail.com> wrote:
>
> On Sat, Jun 11, 2022 at 4:21 PM Peter Geoghegan <pg@bowt.ie> wrote:
> > What about "time travel collations", but without the time travel part?
> > That is, what about supporting multiple ICU versions per cluster, but
> > not per database? So you could upgrade the OS and Postgres, using
> > standard packages that typically just use the latest ICU version --
> > typically, but not always. If you happen to have been on an older
> > version of ICU on upgrade, then that version of ICU will still work at
> > the level of a whole database -- your database. Maybe you can create
> > new databases with old and new ICU versions if you want to.
> >
> > That obviously runs into the problem of needing to eventually do a
> > dump and reload -- but I suppose that "eventually" could be a very
> > long time. At least the OS package doesn't declare one version of ICU
> > the blessed version, now and forever, effectively vendoring ICU in a
> > backdoor fashion. At least old databases have significant runway,
> > while at the same time new databases that want to use the same
> > standard Postgres package aren't forced to use the same ancient ICU
> > version.
>
> Hmm.  I think that's effectively what you'd get using my "distinct
> collation" patch (v1, or this much better v3, attached), if you put
> version prefixes in colliculocale, and updated them in the template
> database after an OS upgrade to affect new databases.  I realise you
> probably mean something a little more automatic...

Thinking some more about what you said above: really, most people only
care about the default collation.  I'm not yet sure what I think
initdb should put into pg_collation when importing the initial set of
collation objects in the "distinct" world (perhaps an un-prefixed and
a prefixed variant of each, with names ending -x-icu and -x-icu63?),
but as for the default collation, I should point out that the
"distinct" patch already gives you a nailed-to-the-ground database
approximately as you described above if you just do something like
this:

postgres=# create database db2 locale_provider = icu icu_locale =
'67:en' template = template0 ...;

Small bugfix attached (v3 was accidentally calling uiter_setUTF8() and
u_errorName() directly in a couple of places).

Attachment

Re: Collation version tracking for macOS

From
Thomas Munro
Date:
Hey Jeremy,

On Tue, Jun 7, 2022 at 12:42 PM Jeremy Schneider
<schneider@ardentperf.com> wrote:
> Thomas - thanks for the link back to one of the threads. I spent some time reading through that and it’s a lot of
material;I haven’t read the whole thread yet. If you have some others that would also be particularly good background,
letme know. I’m doing a chunk of this in my spare time at the moment, but I do want to keep getting more up to speed. I
waspulled into a bunch of various things related to PostgreSQL and ICU and collation and OS’s over the past couple
years,so I learned a lot from on-the-ground experience and I am interested in trying to get a little more involved in
theconversation here. 

There were more threads, but they mostly say the same things, hence my
current attempt to move from bloviation to trying out the ideas with
actual code :-D

> Personally, I really do think there should at least be an *option* to tell the DB to fully error rather than just
warnon version mismatch. Correctness matters to many users, and being able to *trust* string comparisons are correct is
prettydamn fundamental all throughout a database. It really doesn’t get any more basic and the potential for bad things
tohappen is pretty astronomical, if you can’t trust those. I understand the consternation about dealing with upgrades
oflarge & busy databases, but I’m still surprised that the community consensus arrived at the present behavior, and I
havea lot of reading to do, to really understand how that happened and where the dialogue is today. 

Given that the only thing you could do about it is REINDEX, and yet we
don't even know which indexes needed to be REINDEXed (the problem
Julien and I tried to address, but so far without success), it seemed
highly premature to convert the warning to an error.

I don't think the community consensus is that we have arrived
somewhere, it's more like we're in transit, possibly without a map.
For example 15 gains ICU support for the default collation (= how most
people consume collations), which changes things, and will surely lead
to more people thinking about this problem space.

> Multiple versions of ICU sounds nice for users who need real linguistic collation (like what Oracle and DB2 offer),
butI still feel like there needs to be a super simple basic “pseudo-linguistic” collation baked in, that’s “good
enough”for 99% of users and that is guaranteed to be the same everywhere on every platform and just won’t ever change.
Ithink glibc needs to be phased out somehow. At a minimum, not the default for new users… to stop the bleeding. If
MySQLwasn’t GPL then I’d say to just copy their collations. I’d be reluctant to spend too much time on a POC now
though,it feels like my idea is the outlier and the general PG hacker consensus would be to reject this idea. (But
maybeI’m wrong?) 

Hmm.  Well I personally wouldn't try to write a collator any more
willingly than I'd try to write a new cryptographic algorithm, just
not my bag.  We don't want to handle complaints about our sort order
(we already bat away complaints about glibc's, and I heard an account
from an OS vendor about the non-stop contradictory crank complaints
about sort order they used to receive before they gave up and adopted
ICU).  ICU really is quite authoritative here.

If you mean that you don't even want to have to specify a language
like "en", then note that you don't have to: ICU has a "root" collator
which you can request with an empty string (all other collators apply
cultural tweaks on top of that).  Like everything else, the root
collator has changed over time, though.

With my "distinct" experimental patch (v4), you could set your
database default collation to a specific ICU major version's root
collator like so:

create database ... locale_provider = icu icu_locale = '71:' template
= template0

That'll keep working, even across pg_upgrades to some version of
PostgreSQL far in the future that is linked against ICU 100, by
dlopen'ing the .71 libraries, for as long as you can get your hands on
a libicu71 package or otherwise build your own, and it'll error out if
it can't open that library, which may be the hard error you were
looking for.  If there's an API change in ICU we'll have to make some
changes, but that's already true.

Review/testing/flames/rants/better ideas welcome.

> Anyway, again, apologies for my pants-on-fire email last week. I hope I can enjoy a few beers someday - or coffee for
thenon-drinkers - with a few other PG collation nerds (which I never set out to be, but it may have befallen me <g>). 

+1



Re: Collation version tracking for macOS

From
Thomas Munro
Date:
On Thu, Jun 9, 2022 at 11:33 AM Thomas Munro <thomas.munro@gmail.com> wrote:
> On Thu, Jun 9, 2022 at 5:42 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> > I'm sure that Apple are indeed updating the UTF8 data behind
> > their proprietary i18n APIs, but the libc APIs are mostly getting benign
> > neglect.
>
> As for how exactly they might be doing that, I don't know, but a bit
> of light googling tells me that a private, headerless,
> please-don't-call-me-directly copy of ICU arrived back in macOS
> 10.3[1].  I don't see it on my 12.4 system, but I also know that 12.x
> started hiding system libraries completely (the linker is magic and
> pulls libraries from some parallel dimension, there is no
> /usr/lib/libSystem.B.dylib file on disk, and yet otool -L
> <your_favourite_executable> references it).

The other thread about a macOS linking problem nerd-sniped me back
into here to find out how to see breadcrumbs between hidden libraries
on this super weird UNIX™ and confirm that they are indeed still
shipping a private ICU for use by their Core Foundation stuff that's
used by fancy ObjC/Swift/... etc GUI apps.  The following command was
an interesting discovery for me because otool -L can't see any of the
new kind of ghost libraries:

% dyld_info -dependents
/System/Library/Frameworks/Foundation.framework/Versions/C/Foundation

Though I can't get my hands on the hidden ICU library itself to
disassemble (without installing weird extra tools, apparently [1]),
that at least revealed its name, which I could then dlopen out of
curiosity.  It seems they jammed all the ICU sub-libraries into one,
and configured it with --disable-renaming so it doesn't have major
version suffixes on symbol names.

It'd clearly be a terrible idea for us to try to use any of that, and
Mac users should be very happy with the new support for ICU as DB
default.

[1] https://lapcatsoftware.com/articles/bigsur.html



Re: Collation version tracking for macOS

From
Peter Geoghegan
Date:
On Mon, Jun 13, 2022 at 5:41 PM Thomas Munro <thomas.munro@gmail.com> wrote:
> It'd clearly be a terrible idea for us to try to use any of that, and
> Mac users should be very happy with the new support for ICU as DB
> default.

This suggests something that I already suspected: nobody particularly
expects the system lib C to be authoritative for the OS as a whole, in
the way that Postgres supposes. At least in the case of Mac OS, which
is after all purely a desktop operating system.

-- 
Peter Geoghegan



Re: Collation version tracking for macOS

From
Peter Eisentraut
Date:
On 11.06.22 05:35, Peter Geoghegan wrote:
> Do we even need to store a version for indexes most of the time if
> we're versioning ICU itself, as part of the "time travelling
> collations" design? For that matter, do we even need to version
> collations directly anymore?

Conversely, why are we looking at the ICU version instead of the 
collation version.  If we have recorded the collation as being version 
1234, we need to look through the available ICU versions (assuming we 
can load multiple ones somehow) and pick the one that provides 1234.  It 
doesn't matter whether it's the same ICU version that the collation was 
originally created with, as long as the collation version stays the same.



Re: Collation version tracking for macOS

From
Jeremy Schneider
Date:
> On Jun 14, 2022, at 14:10, Peter Eisentraut <peter.eisentraut@enterprisedb.com> wrote:
> 
> Conversely, why are we looking at the ICU version instead of the collation version.  If we have recorded the
collationas being version 1234, we need to look through the available ICU versions (assuming we can load multiple ones
somehow)and pick the one that provides 1234.  It doesn't matter whether it's the same ICU version that the collation
wasoriginally created with, as long as the collation version stays the same. 

Does Unicode CDLR provide (or even track) versioning of collation or other i18n functionality for individual locale
settings?I’m thinking it might not even have that concept in the original source repo/data, but I might be remembering
wrong.

It would require not only watching for changes in the per-locale tailoring rules but also being cognizant of changes in
root/DUCETbehavior and understanding the impact of changes there. 

(Common mistake I’ve seen folks make when comparing OS glibc versions is only looking at locale data, not realizing
therehave been changes to root behavior that didn’t involve any changes to local data files) 

-Jeremy


Re: Collation version tracking for macOS

From
Thomas Munro
Date:
On Wed, Jun 15, 2022 at 7:10 AM Jeremy Schneider
<schneider@ardentperf.com> wrote:
> > On Jun 14, 2022, at 14:10, Peter Eisentraut <peter.eisentraut@enterprisedb.com> wrote:
> > Conversely, why are we looking at the ICU version instead of the collation version.  If we have recorded the
collationas being version 1234, we need to look through the available ICU versions (assuming we can load multiple ones
somehow)and pick the one that provides 1234.  It doesn't matter whether it's the same ICU version that the collation
wasoriginally created with, as long as the collation version stays the same. 

One difference would be the effect if ICU ever ships a minor library
version update that changes the reported collversion.

1.  With the code I proposed in my v4 patch, our version mismatch
warnings would kick in, but otherwise everything would continue to
work (and corrupt indexes, if they really moved anything around).
2.  With a system that (somehow) opens all available libraries and
looks for match, it would fail to find one.  That is assuming that you
are using the typical major-versioned packages we can see in software
distributions like Debian.

I don't know if minor version changes actually do that, though have
wondered out loud a few times in these threads.  I might go and poke
at some ancient packages to see if that's happened before.  To defend
against that, we could instead do major + minor versioning, but so far
I worried about major only because that's they way they ship 'em in
Debian and (AFAICS) RHEL etc, so if you can't easily install 68.0 and
68.1 at the same time.  On the other hand, you could always "pin" (or
similar concepts) the libicu68 package to a specific minor release, to
fix the problem (whether you failed like 1 or like 2 above).

> (Common mistake I’ve seen folks make when comparing OS glibc versions is only looking at locale data, not realizing
therehave been changes to root behavior that didn’t involve any changes to local data files) 

Yeah, I've wondered idly before if libc projects and ICU couldn't just
offer a way to ask for versions explicitly, and ship historical data.
With some system of symlinks to make it all work with defaults for
those who don't care, a libc could have
/usr/share/locale/en_US@CLDR34.UTF-8 etc so you could
setlocale(LC_COLLATE, "en_US@CLDR34"), or something.  I suppose they
don't want to promise to be able to interpret the old data in future
releases, and, as you say, sometimes the changes are in C code, due to
bugs or algorithm changes, not the data.



Re: Collation version tracking for macOS

From
Jeremy Schneider
Date:



On Jun 14, 2022, at 19:06, Thomas Munro <thomas.munro@gmail.com> wrote:

One difference would be the effect if ICU ever ships a minor library
version update that changes the reported collversion.

If I’m reading it correctly, ICU would not change collation in major versions, as an explicit matter of policy around DUCET stability and versioning.



With some system of symlinks to make it all work with defaults for
those who don't care, a libc could have
/usr/share/locale/en_US@CLDR34.UTF-8 etc so you could
setlocale(LC_COLLATE, "en_US@CLDR34"), or something.  I suppose they
don't want to promise to be able to interpret the old data in future
releases, and, as you say, sometimes the changes are in C code, due to
bugs or algorithm changes, not the data.

If I understand correctly, files in /usr/share/locale aren’t enough because those only have the tailoring rules, and core algorithm and data (before applying locale-specific tweaks) also change between versions. I’m pretty sure glibc works similar to UCA in this regard (albeit based on ISO 14651 and not CDLR), and the Unicode link above is a good illustration of default collation rules that underly the locale-specific tweaks.

-Jeremy

Sent from my TI-83

Re: Collation version tracking for macOS

From
Peter Eisentraut
Date:
On 14.06.22 21:10, Jeremy Schneider wrote:
> Does Unicode CDLR provide (or even track) versioning of collation or other i18n functionality for individual locale
settings?

Yes.  You can see that in PostgreSQL as various pre-seeded ICU 
collations having different versions.



Re: Collation version tracking for macOS

From
Thomas Munro
Date:
Hi,

Here is a rebase of this experimental patch.  I think the basic
mechanics are promising, but we haven't agreed on a UX.  I hope we can
figure this out.

Restating the choice made in this branch of the experiment:  Here I
try to be just like DB2 (if I understood its manual correctly).
In DB2, you can use names like "en_US" if you don't care about
changes, and names like "CLDR181_en_US" if you do.  It's the user's
choice to use the second kind to avoid "unexpected effects on
applications or database objects" after upgrades.  Translated to
PostgreSQL concepts, you can use a database default ICU locale like
"en-US" if you don't care and "67:en-US" if you do, and for COLLATION
objects it's the same.  The convention I tried in this patch is that
you use either "en-US-x-icu" (which points to "en-US") or
"en-US-x-icu67" (which points to "67:en-US") depending on whether you
care about this problem.

I recognise that this is a bit cheesy, it's all the user's problem to
deal with or ignore.

An alternative mentioned by Peter E was that the locale names
shouldn't carry the prefix, but somehow we should have a list of ICU
versions to search for a matching datcollversion/collversion.  How
would that look?  Perhaps a GUC, icu_library_versions = '63, 67, 71'?
There is a currently natural and smallish range of supported versions,
probably something like 54 ... U_ICU_VERSION_MAJOR_NUM, but it seems a
bit weird to try to dlopen ~25 libraries or whatever it might be...
Do you think we should try to code this up?

I haven't tried it, but the main usability problem I predict with that
idea is this:  It can cope with a scenario where you created a
database with ICU 63 and started using a default of "en" and maybe
some explicit fr-x-icu or whatever, and then you upgrade to a new
postgres binary using ICU 71, and, as long as you still have ICU 63
installed it'll just magicaly keep using 63, now via dlopen().  But it
doesn't provide a way for me to create a new database that uses 63 on
purpose when I know what I'm doing.  There are various reasons I might
want to do that.

Maybe the ideas could be combined?  Perhaps "en" means "create using
binary's linked ICU, open using search-by-collversion", while "67:en"
explicitly says which to use?

Changes since last version:

 * Now it just uses the default dlopen() search path, unless you set
icu_library_path.  Is that a security problem?  It's pretty
convenient, because it means you can just "apt-get install libicu63"
(or local equivalent) and that's all, now 63 is available.

 * To try the idea out, I made it automatically create "*-x-icu67"
alongside the regular "-x-icu" collation objects at initdb time.

Attachment

Re: Collation version tracking for macOS

From
Thomas Munro
Date:
On Sat, Oct 22, 2022 at 10:24 AM Thomas Munro <thomas.munro@gmail.com> wrote:
> ... But it
> doesn't provide a way for me to create a new database that uses 63 on
> purpose when I know what I'm doing.  There are various reasons I might
> want to do that.

Thinking some more about this, I guess that could be addressed by
having an explicit way to request either the library version or
collversion-style version when creating a database or collation, but
not actually storing it in daticulocale/colliculocale.  That could be
done either as part of the string that is trimmed off before storing
it (so it's only used briefly during creation to find a non-default
library)... Perhaps that'd look like initdb --icu-locale "67:en" (ICU
library version) or "154.14:en" (individual collation version) or some
new syntax in a few places.  Thereafter, it would always be looked up
by searching for the right library by [dat]collversion as Peter E
suggested.

Let me try harder to vocalise some more thoughts that have stopped me
from trying to code the search-by-collversion design so far:

Suppose your pgdata encounters a PostgreSQL linked against a later ICU
library, most likely after an OS upgrade or migratoin, a pg_upgrade,
or via streaming replication.  You might get a new error "can't find
ICU collation 'en' with version '153.14'; HINT: install missing ICU
library version", and somehow you'll have to work out which one might
contain 'en' v153.14 and install it with apt-get etc.  Then it'll
magically work: your postgres linked against (say) 71 will happily
work with the dlopen'd 67.  This is enough if you want to stay on 67
until the heat death of the universe.  So far so good.

Problem 1:  Suppose you're ready to start using (say) v72.  I guess
you'd use the REFRESH command, which would open the main linked ICU's
collversion and stamp that into the catalogue, at which point new
sessions would start using that, and then you'd have to rebuild all
your indexes (with no help from PG to tell you how to find everything
that needs to be rebuilt, as belaboured in previous reverted work).
Aside from the possibility of getting the rebuilding job wrong (as
belaboured elsewhere), it's not great, because there is still a
transitional period where you can be using the wrong version for your
data.  So this requires some careful planning and understanding from
the administrator.

I admit that the upgrade story is a tiny bit better than the v5
DB2-style patch, which starts using the new version immediately if you
didn't use a prefix (and logs the usual warnings about collversion
mismatch) instead of waiting for you to run REFRESH.  But both of them
have a phase where they might use the wrong library to access an
index.  That's dissatisfying, and leads me to prefer the simple
DB2-style solution that at least admits up front that it's not very
clever.  The DB2-style patch could be improved a bit here with the
addition of one more GUC: default_icu_library, so the administrator,
rather than the packager, remains in control of which version we use
for non-prefixed iculocale values (likely to be what almost everyone
is interested in), defaulting to what the packager linked against.
I've added that to the patch for illustration (though obviously the
error messages produced by collversion mismatch could use some
adjustment, ie to clarify that the warning might be cleared by
installing and selecting a different library version).

Problem 2:  If ICU 67 ever decides to report a different version for a
given collation (would it ever do that?  I don't expect so, but ...),
we'd be unable to open the collation with the search-by-collversion
design, and potentially the database.  What is a user supposed to do
then?  Presumably our error/hint for that would be "please insert the
correct ICU library into drive A", but now there is no correct
library; if you can even diagnose what's happened, I guess you might
downgrade the ICU library using package tools or whatever if possible,
but otherwise you'd be stuck, if you just can't get the right library.
Is this a problem?  Would you want to be able to say "I don't care,
computer, please just press on"?  So I think we need a way to turn off
the search-by-collversion thing.  How should it look?

I'd love to hear others' thoughts on how we can turn this into a
workable solution.  Hopefully while staying simple...

Attachment

Re: Collation version tracking for macOS

From
Peter Eisentraut
Date:
On 22.10.22 03:22, Thomas Munro wrote:
> Suppose your pgdata encounters a PostgreSQL linked against a later ICU
> library, most likely after an OS upgrade or migratoin, a pg_upgrade,
> or via streaming replication.  You might get a new error "can't find
> ICU collation 'en' with version '153.14'; HINT: install missing ICU
> library version", and somehow you'll have to work out which one might
> contain 'en' v153.14 and install it with apt-get etc.  Then it'll
> magically work: your postgres linked against (say) 71 will happily
> work with the dlopen'd 67.  This is enough if you want to stay on 67
> until the heat death of the universe.  So far so good.

What I'm wondering is where those ICU installations are going to come 
from.  In order for this project to be viable, we would need to convince 
some combination of ICU maintainers, OS packagers, and PGDG packagers to 
provide and maintain five year's worth of ICU packages (yearly releases 
AFAICT).  Is that something we are willing to get into?

(Even to test this I need to figure out where to get another ICU 
installation from.  I'll try how easy manual installations are.)




Re: Collation version tracking for macOS

From
Thomas Munro
Date:
On Tue, Nov 1, 2022 at 11:33 PM Peter Eisentraut
<peter.eisentraut@enterprisedb.com> wrote:
> What I'm wondering is where those ICU installations are going to come
> from.  In order for this project to be viable, we would need to convince
> some combination of ICU maintainers, OS packagers, and PGDG packagers to
> provide and maintain five year's worth of ICU packages (yearly releases
> AFAICT).  Is that something we are willing to get into?

I hacked on this on a Debian machine that has a couple of these
installed and they work fine, but now I realise that might have to do
with the major upgrade history of the machine.  So yeah... probably.
:-/  Not being involved in packaging I have no idea how plausible such
a backports (erm, forwardports?) repo would be, and I have even less
idea for other distros.



Re: Collation version tracking for macOS

From
Thomas Munro
Date:
On Wed, Nov 2, 2022 at 1:42 AM Thomas Munro <thomas.munro@gmail.com> wrote:
> On Tue, Nov 1, 2022 at 11:33 PM Peter Eisentraut
> <peter.eisentraut@enterprisedb.com> wrote:
> > What I'm wondering is where those ICU installations are going to come
> > from.  In order for this project to be viable, we would need to convince
> > some combination of ICU maintainers, OS packagers, and PGDG packagers to
> > provide and maintain five year's worth of ICU packages (yearly releases
> > AFAICT).  Is that something we are willing to get into?
>
> I hacked on this on a Debian machine that has a couple of these
> installed and they work fine, but now I realise that might have to do
> with the major upgrade history of the machine.  So yeah... probably.
> :-/  Not being involved in packaging I have no idea how plausible such
> a backports (erm, forwardports?) repo would be, and I have even less
> idea for other distros.

After sleeping on it, I don't really agree that the project is not
viable even if it requires hoop-jumping to set up right now.  It's a
chicken-and-egg problem, and the first step is to make it possible to
do it at all, thereby creating the demand for convenient packages.  I
think we have several topics here:

1.  Technical problems relating to dlopen'ing.  Does it work?  Is the
default dlopen() secure enough?  Is it building sensible library
names, even on the freaky-library OSes (Windows, macOS, AIX)?  Is it
enough to have that GUC for non-default path, should it be a search
path, should it share the existing dynamic_library_path?  Are the
indirect function calls fast enough?  Is the way it handles API
stability sound?  Can we drop some unfinished complexity by dropping
pre-53 ICU?  Does it use too much memory?
2.  User experience problems relating to upgrade paths and user
interface.  Is it enough to start with the basic DB2-style approach
that I've prototyped here?  How should we refer to library versions?
Is your search-for-the-collversion idea better?  My gut feeling is
that the early version should be about giving people options, and not
trying to be too clever/automatic with questionable semantics, and
later improvements could follow, for example if we have another go at
the per-object version tracking.
3.  Library availability.  This is a problem for downstream
communities to solve.  For example, the people who build Windows
installers might want to start bundling the ICU versions from their
earlier releases, the people involved with each Linux/BSD distro would
hopefully figure out a good way to publish the packages from older OS
releases in one repo, and the people running managed systems probably
do their own packaging anyway, they'll figure it out.  I realise that
you are involved in packaging and I am not, so we probably have
different perspectives: I get to say "and here, magic happens!" :-)

FWIW at least 57, 63 and 67 (corresponding to deb9, 10, 11) from
http://ftp.debian.org/debian/pool/main/i/icu/ can be installed with
dpkg -i on my Debian 11 machine.  52 (deb8) too, probably, but it has
dependencies I didn't look into.  71 and 72 are newer than the -dev
version (what we link against), so I didn't try installing but the
patch as posted wouldn't let me open them: the idea here is to allow
only older stuff to be dlopen'd, so if a breaking API change comes
down the pipe we'll be able to deal with it.  Not being a packaging
guy, I don't how how stupid it would be to build a package repo that
literally just exposes these via an index and that's all, or whether
it's better to rebuild the ICU versions from source against modern
C/C++ runtimes etc.



Re: Collation version tracking for macOS

From
Peter Eisentraut
Date:
On 02.11.22 00:57, Thomas Munro wrote:
> 3.  Library availability.  This is a problem for downstream
> communities to solve.  For example, the people who build Windows
> installers might want to start bundling the ICU versions from their
> earlier releases, the people involved with each Linux/BSD distro would
> hopefully figure out a good way to publish the packages from older OS
> releases in one repo, and the people running managed systems probably
> do their own packaging anyway, they'll figure it out.  I realise that
> you are involved in packaging and I am not, so we probably have
> different perspectives: I get to say "and here, magic happens!" :-)

I made a Homebrew repository for ICU versions 50 through 72: 
https://github.com/petere/homebrew-icu

All of these packages build and pass their self-tests on my machine.  So 
from that experience, I think maintaining a repository of ICU versions, 
and being able to install more than one for testing this feature, is 
feasible.

Now I have started building PostgreSQL against these, to get some 
baseline of what is supported and actually works.  The results are a bit 
mixed so far, more to come later.

The installation instructions currently say that the minimum required 
version of ICU is 4.2.  That was the one that shipped with RHEL 6.  I 
think we have de-supported RHEL 6 and could increase that.  The version 
in RHEL 7 is 50.

(My repository happens to start at 50 because the new versioning system 
started at 49, but 49 doesn't appear to be tagged at the icu github site.)

Note: Recent versions of libxml2 link against icu.  This isn't a 
problem, thanks to the symbol versioning, but if you get libxml2 via 
pkg-config, you might get LDFLAGS from not the icu version you wanted.




Re: Collation version tracking for macOS

From
Thomas Munro
Date:
On Tue, Nov 8, 2022 at 1:22 AM Peter Eisentraut
<peter.eisentraut@enterprisedb.com> wrote:
> I made a Homebrew repository for ICU versions 50 through 72:
> https://github.com/petere/homebrew-icu

Nice!

> All of these packages build and pass their self-tests on my machine.  So
> from that experience, I think maintaining a repository of ICU versions,
> and being able to install more than one for testing this feature, is
> feasible.

I wonder what the situation with CVEs is in older releases.  I heard a
rumour that upstream might only patch current + previous, leaving it
up to distros to back-patch to whatever they need to support, but I
haven't tried to track down cold hard evidence of this or think about
what it means for this project...



Re: Collation version tracking for macOS

From
Peter Eisentraut
Date:
On 22.10.22 03:22, Thomas Munro wrote:
> I'd love to hear others' thoughts on how we can turn this into a
> workable solution.  Hopefully while staying simple...

I played with this patch a bit.  It looks like a reasonable approach.

Attached is a small patch to get the dynamic libicu* lookup working with 
the library naming on macOS.

Instead of packing the ICU version into the locale field ('63:en'), I 
would make it a separate field in pg_collation and a separate argument 
in CREATE COLLATION.

At this point, perhaps it would be good to start building some tests to 
demonstrate various upgrade scenarios and to ensure portability.


Attachment

Re: Collation version tracking for macOS

From
Jeff Davis
Date:
I looked at v6.

  * We'll need some clearer instructions on how to build/install extra
ICU versions that might not be provided by the distribution packaging.
For instance, I got a cryptic error until I used --enable-rpath, which
might not be obvious to all users.
  * Can we have a better error when the library was built with --
disable-renaming? We can just search for the plain (no suffix) symbol.
  * We should use dlerror() instead of %m to report dlopen() errors.
  * It seems like the collation version is just there to issue WARNINGs
when a user is using the non-versioned locale syntax and the library
changes underneath them (or if there is collation version change within
a single ICU major version)?
  * How are you testing this?
  * In my tests (sort, hacked so abbreviate is always false), I see a
~3% regression for ICU+UTF8. That's fine with me. I assume it's due to
the indirect function call, but that's not obvious to me from the
profile. If it's a major problem we could have a special case of
varstrfastcmp_locale() that works on the compile-time ICU version.

I realize your patch is experimental, but when there is a better
consensus on the approach, we should consider adding declarative syntax
such as:

   CREATE COLLATION (or LOCALE?) PROVIDER icu67
     TYPE icu VERSION '67' AS '/path/to/icui18n.so.67';

It will offer more opportunities to catch errors early and offer better
error messages. It would also enable it to function if the library is
built with --disable-renaming (though we'd have to trust the user).

On Sat, 2022-10-22 at 14:22 +1300, Thomas Munro wrote:
> Problem 1:  Suppose you're ready to start using (say) v72.  I guess
> you'd use the REFRESH command, which would open the main linked ICU's
> collversion and stamp that into the catalogue, at which point new
> sessions would start using that, and then you'd have to rebuild all
> your indexes (with no help from PG to tell you how to find everything
> that needs to be rebuilt, as belaboured in previous reverted work).
> Aside from the possibility of getting the rebuilding job wrong (as
> belaboured elsewhere), it's not great, because there is still a
> transitional period where you can be using the wrong version for your
> data.  So this requires some careful planning and understanding from
> the administrator.

How is this related to the search-by-collversion design? It seems like
it's hard no matter what.


--
Jeff Davis
PostgreSQL Contributor Team - AWS





Re: Collation version tracking for macOS

From
Thomas Munro
Date:
Replying to Peter and Jeff in one email.

On Sat, Nov 12, 2022 at 3:57 AM Peter Eisentraut
<peter.eisentraut@enterprisedb.com> wrote:
> On 22.10.22 03:22, Thomas Munro wrote:
> > I'd love to hear others' thoughts on how we can turn this into a
> > workable solution.  Hopefully while staying simple...
>
> I played with this patch a bit.  It looks like a reasonable approach.

Great news.

> Attached is a small patch to get the dynamic libicu* lookup working with
> the library naming on macOS.

Thanks, squashed.

> Instead of packing the ICU version into the locale field ('63:en'), I
> would make it a separate field in pg_collation and a separate argument
> in CREATE COLLATION.

I haven't tried this yet, as I focused on coming up with a way of testing in
this iteration.  I can try this next.  I'm imagining that we'd have
pg_collation.collicuversion and pg_database.daticuversion, and they'd default
to 0 for "use the GUC", and perhaps you'd even be able to ALTER them.  Perhaps
we wouldn't even need the GUC then...  0 could mean "the linked version", and
if you don't like it, you ALTER it.  Thinking about this.

> At this point, perhaps it would be good to start building some tests to
> demonstrate various upgrade scenarios and to ensure portability.

OK, here's what I came up with.  You enable it in PG_TEST_EXTRA, and
tell it about an alternative ICU version you have in the standard library
search path that is not the same as the main/linked one:

$ meson configure -DPG_TEST_EXTRA="icu=63"
$ meson test icu/020_multiversion

Another change from your feedback:  you mentioned that RHEL7 shipped with ICU
50, so I removed my suggestion of dropping some extra code we carry for
versions before 54 and set the minimum acceptable version to 50.  It probably
works further back than that, but that's a decent range, I think.

On Tue, Nov 15, 2022 at 1:55 PM Jeff Davis <pgsql@j-davis.com> wrote:
> I looked at v6.

Thanks for jumping in and testing!

>   * We'll need some clearer instructions on how to build/install extra
> ICU versions that might not be provided by the distribution packaging.
> For instance, I got a cryptic error until I used --enable-rpath, which
> might not be obvious to all users.

Suggestions welcome.  No docs at all yet...

>   * Can we have a better error when the library was built with --
> disable-renaming? We can just search for the plain (no suffix) symbol.

I threw out that symbol probing logic, and wrote something simpler that should
now also work with --disable-renaming (though not tested).  Now it does a
cross-check with the library's self-reported major version, just to make
sure there wasn't a badly named library file, which may be more likely
with --disable-renaming.

>   * We should use dlerror() instead of %m to report dlopen() errors.

Fixed.

>   * It seems like the collation version is just there to issue WARNINGs
> when a user is using the non-versioned locale syntax and the library
> changes underneath them (or if there is collation version change within
> a single ICU major version)?

Correct.

I have now updated the warning messages you get when they don't match, to
provide a hint about what to do about it.  I am sure they need some more
word-smithing, though.

>   * How are you testing this?

Ad hoc noodling before now, but see attached.

> I realize your patch is experimental, but when there is a better
> consensus on the approach, we should consider adding declarative syntax
> such as:
>
>    CREATE COLLATION (or LOCALE?) PROVIDER icu67
>      TYPE icu VERSION '67' AS '/path/to/icui18n.so.67';
>
> It will offer more opportunities to catch errors early and offer better
> error messages. It would also enable it to function if the library is
> built with --disable-renaming (though we'd have to trust the user).

Earlier in this and other threads, we wondered if each ICU major version should
be a separate provider, which is what you're showing there, or should be an
independent property of an individual COLLATION, which is what v6 did with
'63:en' and what Peter suggested I make more formal with CREATE COLLATION foo
(..., ICU_VERSION=63).  I actually started out thinking we'd have multiple
providers, but I couldn't really think of any advantage, and I think it makes
some upgrade scenarios more painful.  Can you elaborate on why you'd want
that model?

> On Sat, 2022-10-22 at 14:22 +1300, Thomas Munro wrote:
> > Problem 1:  Suppose you're ready to start using (say) v72.  I guess
> > you'd use the REFRESH command, which would open the main linked ICU's
> > collversion and stamp that into the catalogue, at which point new
> > sessions would start using that, and then you'd have to rebuild all
> > your indexes (with no help from PG to tell you how to find everything
> > that needs to be rebuilt, as belaboured in previous reverted work).
> > Aside from the possibility of getting the rebuilding job wrong (as
> > belaboured elsewhere), it's not great, because there is still a
> > transitional period where you can be using the wrong version for your
> > data.  So this requires some careful planning and understanding from
> > the administrator.
>
> How is this related to the search-by-collversion design? It seems like
> it's hard no matter what.

Yeah.  I just don't like the way it *appears* to be doing something clever, but
it doesn't solve any fundamental problem at all because the collversion
information is under human control and so it's really doing something stupid.
Hence desire to build something that at least admits that it's primitive and
just gives you some controls, in a first version.  We could always reconsider
that in later work though, maybe even an optional policy or something?

Attachment

Re: Collation version tracking for macOS

From
Thomas Munro
Date:
On Sat, Nov 19, 2022 at 7:38 AM Thomas Munro <thomas.munro@gmail.com> wrote:
> On Tue, Nov 15, 2022 at 1:55 PM Jeff Davis <pgsql@j-davis.com> wrote:
> > I realize your patch is experimental, but when there is a better
> > consensus on the approach, we should consider adding declarative syntax
> > such as:
> >
> >    CREATE COLLATION (or LOCALE?) PROVIDER icu67
> >      TYPE icu VERSION '67' AS '/path/to/icui18n.so.67';
> >
> > It will offer more opportunities to catch errors early and offer better
> > error messages. It would also enable it to function if the library is
> > built with --disable-renaming (though we'd have to trust the user).
>
> Earlier in this and other threads, we wondered if each ICU major version should
> be a separate provider, which is what you're showing there, or should be an
> independent property of an individual COLLATION, which is what v6 did with
> '63:en' and what Peter suggested I make more formal with CREATE COLLATION foo
> (..., ICU_VERSION=63).  I actually started out thinking we'd have multiple
> providers, but I couldn't really think of any advantage, and I think it makes
> some upgrade scenarios more painful.  Can you elaborate on why you'd want
> that model?

Hmm, thinking some more about this... I said the above thinking that
you couldn't change a provider after creating a database/collation.
But what if you could?

1.  CREATE DATABASE x LOCALE_PROVIDER=icu ...;
2.  Some time later after an upgrade, my postgres binary is linked
against a new ICU version and I start seeing warnings.
3.  ALTER DATABASE x LOCALE_PROVIDER=icu63;

I suppose you shouldn't be allowed to change libc -> icu, but you
could change icu - > icuXXX, or I guess icuXXX -> icuXXX.

What if you didn't have to manually manage the set of available
providers with DDL like you showed, but we just automatically
supported "icu" (= the linked ICU, whatever it might be), and icu50 up
to icuXXX where XXX is the linked ICU's version?  We can encode those
values + libc as an int, to replace the existing char the represents
providers in catalogues.

That's basically just a different way of encoding the same information
that Peter was suggesting I put in a new catalogue attribute.  How do
you like that bikeshed colour?



Re: Collation version tracking for macOS

From
Jeff Davis
Date:
On Sat, 2022-10-22 at 14:22 +1300, Thomas Munro wrote:
> Problem 2:  If ICU 67 ever decides to report a different version for
> a
> given collation (would it ever do that?  I don't expect so, but ...),
> we'd be unable to open the collation with the search-by-collversion
> design, and potentially the database.  What is a user supposed to do
> then?  Presumably our error/hint for that would be "please insert the
> correct ICU library into drive A", but now there is no correct
> library

Let's say that Postgres is compiled against version 67.X, and the
sysadmin upgrades the ICU package to 67.Y, which reports a different
collation version for some locale.

Your current patch makes this impossible for the administrator to fix,
because there's no way to have two different libraries loaded with the
same major version number, so it will always pick the compiled-in ICU.
The user will be forced to accept the new version of the collation, see
WARNINGs in their logs, and possibly corrupt their indexes.

Search-by-collversion would still be frustrating for the admin, but at
least it would be possible to fix by compiling their own 67.X and
asking Postgres to search that library, too. We could make it slightly
more friendly by having an error that reports the libraries searched
and the collation versions found, if none of the versions match. We can
have a GUC that controls whether a failure to find the right version is
a WARNING or an ERROR.

On Sat, 2022-11-19 at 07:38 +1300, Thomas Munro wrote:
> >   * We'll need some clearer instructions on how to build/install
> > extra
> > ICU versions that might not be provided by the distribution
> > packaging.
> > For instance, I got a cryptic error until I used --enable-rpath,
> > which
> > might not be obvious to all users.
>
> Suggestions welcome.  No docs at all yet...

I tried to write up some docs. It's hard to explain why we are exposing
to the user the collation version and the library version in these
different ways, and what effects they have.

The current patch feels like it hasn't decided whether the collation
version is ucol_getVersion() (collversion) or u_getVersion() (library
version). The collversion is more prominent in the UI (with its own
syntax), yet it's just a cross-check for whether to issue a WARNING or
not; while the library version is hidden in the locale field and it
actually decides which symbol is called.

>
>
> Yeah.  I just don't like the way it *appears* to be doing something
> clever, but
> it doesn't solve any fundamental problem at all because the
> collversion
> information is under human control and so it's really doing something
> stupid.

I assume by "human control" you mean "ALTER COLLATION ... REFRESH
VERSION". I agree that relying on the admin's declaration is dubious,
especially when we provide no good advice on how to actually do that
safely.

But I don't see what using the library version instead buys us here,
except that library version is part of the LOCALE, and there's no ALTER
command for that. You could just as easily deprecate/eliminate the
ALTER COLLATION REFRESH VERSION, and then say that the collversion is
out of human control, too.

By introducing multiple libraries, I think we need to change that
syntax anyway, to be something like:

   ALTER COLLATION ... SET VERSION TO '...'

or even:

   ALTER COLLATION ... FORCE VERSION TO '...'

> Hence desire to build something that at least admits that it's
> primitive and
> just gives you some controls, in a first version.

Using either the library version or the collation version seems
reasonably simple to me. But from a documentation and usability
standpoint, the way they are currently mixed seems confusing.



--
Jeff Davis
PostgreSQL Contributor Team - AWS





Re: Collation version tracking for macOS

From
Thomas Munro
Date:
On Tue, Nov 22, 2022 at 7:34 PM Jeff Davis <pgsql@j-davis.com> wrote:
> On Sat, 2022-10-22 at 14:22 +1300, Thomas Munro wrote:
> > Problem 2:  If ICU 67 ever decides to report a different version for
> > a
> > given collation (would it ever do that?  I don't expect so, but ...),
> > we'd be unable to open the collation with the search-by-collversion
> > design, and potentially the database.  What is a user supposed to do
> > then?  Presumably our error/hint for that would be "please insert the
> > correct ICU library into drive A", but now there is no correct
> > library
>
> Let's say that Postgres is compiled against version 67.X, and the
> sysadmin upgrades the ICU package to 67.Y, which reports a different
> collation version for some locale.
>
> Your current patch makes this impossible for the administrator to fix,
> because there's no way to have two different libraries loaded with the
> same major version number, so it will always pick the compiled-in ICU.
> The user will be forced to accept the new version of the collation, see
> WARNINGs in their logs, and possibly corrupt their indexes.

They could probably also 'pin' the older minor version package using
their package manager (= downgrade) until they're ready to upgrade and
use REFRESH VERSION to certify that they've rebuilt everything
relevant or are OK with risks.  Not pretty I admit, but I think the
end result is about the same for search-for-collversion, because I
imagine that (1) the default behaviour on failure to search would
likely be to use the linked library instead and WARN about
[dat]collversion mismatch, so far the same, and (2) the set of people
who would really be prepared to compile their own copy of 67.X instead
of downgrading or REFRESHing (with or without rebuilding) is
vanishingly small.

Two questions I wondered about:

1.  *Do* they change ucol_getVersion() values in minor releases?  I
tried to find a written policy on that.
https://icu.unicode.org/processes is not encouraging: it gives the
example of a "third digit in an official release number" [changing]
because a CLDR change was incorporated.  Hrmph.  But that's clearly
not even the modern ICU versioning system (it made a change a bit like
ours in 49, making the first number only major, so maybe that "third"
number is now the second number, AKA minor version), and also that's a
CLDR minor version change; is CLDR minor even in the recipe for
ucol_getVersion()?  Even without data changes, I guess that bug fixes
could apply to the UCA logic, and I assume that UCA logic is included
in it.  Hmm.

A non-hypothetical example of a CLDR change within an ICU major
version that I've been able to find is:

https://cldr.unicode.org/index/downloads/cldr-38

Here we see that CLDR had a minor version bump 38 -> 38.1, "a very
small number of incremental additions to version 38 to address the
specific bugs listed in Δ38.1", and was included in ICU 68.2.  Being a
minor ICU release 68.1 -> 68.2, perhaps you could finish up running
that just with a regular upgrade on typical distros (not a major OS
upgrade), and since PostgreSQL would normally be linked against eg
.68, not .68.1, it'd start using it at the next cluster start when
that symlink is updated to point to .68.2.  As it happens, if you
follow the documentation links to see what actually changed in that
particular pair of CLDR+ICU minor releases, it's timezones and locale
stuff other than collations, so wouldn't affect us.  Can we find a
chapter and verse that says that ICU would only ever move to a new
CLDR in a minor release, and CLDR would never change order of
pre-existing code points in a minor release?

It might be interesting to see if
https://github.com/unicode-org/icu/tree/release-68-1 and
https://github.com/unicode-org/icu/tree/release-68-2 report a
different ucol_getVersion() for any locale, but not conclusive if it
doesn't; it might be because something in the version pipeline knew
that particular CLDR change didn't affect collators...

This speculation feels pretty useless.  Maybe we should go and read
the code or ask an ICU expert, but I'm not against making it
theoretically possible to access two different minor versions at once,
just to cover all the bases for future-proofing.

2.  Would package managers ever allow two minor versions to be
installed at once?  I highly doubt it; they're probably more
interested in ABI stability so that dependent packages work when
bugfixes are shipped, and that's certainly nailed down at the major
version level.  It'd probably be a case of having to compile it
yourself, which seems unlikely to me in the real world.  That's why I
left minor version out of earlier patches, but I'm OK with changing
that.

As for how, I think that depends on our modelling decision (see below).

> Search-by-collversion would still be frustrating for the admin, but at
> least it would be possible to fix by compiling their own 67.X and
> asking Postgres to search that library, too. We could make it slightly
> more friendly by having an error that reports the libraries searched
> and the collation versions found, if none of the versions match. We can
> have a GUC that controls whether a failure to find the right version is
> a WARNING or an ERROR.

Good ideas.

> I tried to write up some docs. It's hard to explain why we are exposing
> to the user the collation version and the library version in these
> different ways, and what effects they have.

Always a good test: see how crazy it sounds when translated to user speak.

> The current patch feels like it hasn't decided whether the collation
> version is ucol_getVersion() (collversion) or u_getVersion() (library
> version). The collversion is more prominent in the UI (with its own
> syntax), yet it's just a cross-check for whether to issue a WARNING or
> not; while the library version is hidden in the locale field and it
> actually decides which symbol is called.

Yeah.  I agree that it sucks to have two kinds of versions flying
around in the user's mind.

> > Yeah.  I just don't like the way it *appears* to be doing something
> > clever, but
> > it doesn't solve any fundamental problem at all because the
> > collversion
> > information is under human control and so it's really doing something
> > stupid.
>
> I assume by "human control" you mean "ALTER COLLATION ... REFRESH
> VERSION". I agree that relying on the admin's declaration is dubious,
> especially when we provide no good advice on how to actually do that
> safely.
>
> But I don't see what using the library version instead buys us here,
> except that library version is part of the LOCALE, and there's no ALTER
> command for that. You could just as easily deprecate/eliminate the
> ALTER COLLATION REFRESH VERSION, and then say that the collversion is
> out of human control, too.
>
> By introducing multiple libraries, I think we need to change that
> syntax anyway, to be something like:
>
>    ALTER COLLATION ... SET VERSION TO '...'
>
> or even:
>
>    ALTER COLLATION ... FORCE VERSION TO '...'

OK.  Time for a new list of the various models we've discussed so far:

1.  search-by-collversion:  We introduce no new "library version"
concept to COLLATION and DATABASE object and little or no new syntax.
Whenever opening a collation or database, the system will search some
candidate list of ICU libraries to try to find the one that agrees
with [dat]collversion.  When creating a new collation or database, the
system will select one (probably the linked one unless you override
somehow) and record ucol_getVersion() in [dat]collversion.  When
searching, it might fail to find a suitable library and ereport; to
fix that, it is the admin's job to somehow expand the set of candidate
libraries.  In such a failure case, perhaps it would fall back to
using some default library version (probably the one that is linked,
overridable by GUC?), with a WARNING (unless you turned on ERRORs),
and if you want to shut it up without supplying the right candidate
library, you can still fall back to the REFRESH VERSION hammer (or
maybe that should indeed called FORCE to make it clearer that it's not
a harmless operation where the system holds your hand, you're actually
certifying that you have rebuilt indexes and you know what you're
doing).

The set of candidate versions could perhaps be provided with
extra_icu_library_versions=63,71 OR =63.1,63.2 strings, at least on
Unix systems following the traditional symlink conventions.
Remembering that a typical Unixoid system should have libraries and
symlinks like:

  libicui18n.a
  libicui18n.so -> libicui18n.so.71.1
  libicui18n.so.63 -> libicui18n.so.63.1
  libicui18n.so.63.1
  libicui18n.so.67 -> libicui18n.so.67.1
  libicui18n.so.67.1
  libicui18n.so.71 -> libicui18n.so.71.1
  libicui18n.so.71.1

The reason I prefer major[.minor] strings over whole library names is
that we need to dlopen two of them so it's a little easier to build
them from those parts than have to supply both names.  The reason I
prefer to keep allowing major-only versions to be listed is that it's
good to have the option to just follow minor upgrades automatically.
Or I guess you could make something that can automatically search a
whole directory (which directory?) to find all the suitably named
libraries so you don't ever have to mention versions manually (if you
want "apt-get install libicu72" to be enough with no GUC change
needed) -- is that too weird?

Perhaps we could write functions that can show you the available
versions to demystify the searching mechanism slightly and show how
various numbers relate, something like (warning: I made up numbers for
illustration, they are wrong!):

  SELECT * FROM pg_available_icu_libraries()

  icu_version unicode_version uca_version  cldr_version
  67.1        14.0            3.1          38.0
  71.1        15.0            4.0          42.0

  SELECT * FROM pg_available_icu_collation_versions('en')

  icu_version collation_version
  67.1        142.42
  71.1        153.112

2.  lib-version-in-providers: We introduce a separate provider value
for each ICU version, for example ICU63, plus an unversioned ICU like
today.  The collversion column is used only for warnings.  Warnings
are expected when you used the unversioned ICU provider and upgrade to
a binary linked to a later library.  You can clear the warnings by
doing ALTER COLLATION/DATABASE SET [LOCALE_]PROVIDER = ICU63, or with
the REFRESH VERSION hammer.

Not sure how you fit minor versions into that, if we want to support
those.  Maybe ICU means "whatever is linked", ICU63 means "whatever
libicui18n.so.63 points to" and ICU63_1 means libicu18n.so.63.1,
something like that, so the user can choose from three levels of
specificity.

3.  lib-version-in-attributes: We introduce daticuversion (alongside
datcollversion) and collicuversion (alongside collversion).  Similar
to the above, but it's a separate property and the provider is always
ICU.  New syntax for CREATE/ALTER COLLATION/DATABASE to set and change
ICU_VERSION.

4.  lib-version-in-locale:  "63:en" from earlier versions.  That was
mostly a strawman proposal to avoid getting bogged down in
syntax/catalogue/model change discussions while trying to prove that
dlopen would even work.  It doesn't sound like anyone really likes
this.

5.  lib-version-in-collversion:  We didn't explicitly discuss this
before, but you hinted at it: we could just use u_getVersion() in
[dat]collversion.  I haven't analysed this much but I don't think it
has a very nice upgrade path from PG15, and it forces you to decide
whether to store just the major version and not even notice when the
(unstored) minor version changes, or store major.minor and
complain/break down when routine minor upgrades happen.  It is a
logical possibility though, once you decide you only want one kind of
version in the system.

I'm willing to update the patch to try one of these out so we can kick
the tyres some more, but I'll wait to see if we can get some consensus
on the way forward.  Despite my initial reactions, I'm willing to try
out the search-by-collversion concept if others are keen on it.  The
example I worked through in the first paragraph of this email helped
me warm to it a little, and with the observability functions I showed
you might have a chance of figuring out what's going on in some edge
cases.  Any other ideas, or votes for these ideas?



Re: Collation version tracking for macOS

From
Jeff Davis
Date:
On Wed, 2022-11-23 at 18:08 +1300, Thomas Munro wrote:

> (1) the default behaviour on failure to search would
> likely be to use the linked library instead and WARN about
> [dat]collversion mismatch, so far the same, and 

Agreed.

> (2) the set of people
> who would really be prepared to compile their own copy of 67.X
> instead
> of downgrading or REFRESHing (with or without rebuilding) is
> vanishingly small.

The set of people prepared to do so is probably small. But the set of
people who will do it (prepared or not) when a problem comes up is
significantly larger ;-)

> 1.  *Do* they change ucol_getVersion() values in minor releases?  I
> tried to find a written policy on that.

It seems like a valid concern. The mere existence of a collation
version separate from the library major version seems to suggest that
it's possible. Perhaps they avoid it in most cases; but absent a
specific policy against it, the separate collation version seems to
allow them the freedom to do so.

> This speculation feels pretty useless.  Maybe we should go and read
> the code or ask an ICU expert, but I'm not against making it
> theoretically possible to access two different minor versions at
> once,
> just to cover all the bases for future-proofing.

I don't think this should be an overriding concern that drives the
whole design. It is a nudge in favor of search-by-collversion.

> 2.  Would package managers ever allow two minor versions to be
> installed at once?  I highly doubt it; 

Agreed.

I'm sure this has been discussed, but which distros even support
multiple major versions of ICU?

>
> 1.  search-by-collversion:  We introduce no new "library version"
> concept to COLLATION and DATABASE object and little or no new syntax.
> Whenever opening a collation or database, the system will search some
> candidate list of ICU libraries to try to find the one that agrees
> with [dat]collversion.

[...]

> The reason I prefer major[.minor] strings over whole library names is
> that we need to dlopen two of them so it's a little easier to build
> them from those parts than have to supply both names.

It also makes it easier to know which version suffixes to look for.

>   The reason I
> prefer to keep allowing major-only versions to be listed is that it's
> good to have the option to just follow minor upgrades automatically.

Makes sense.

> Or I guess you could make something that can automatically search a
> whole directory (which directory?) to find all the suitably named
> libraries so you don't ever have to mention versions manually (if you
> want "apt-get install libicu72" to be enough with no GUC change
> needed) -- is that too weird?

That seems to go a little too far.

>   SELECT * FROM pg_available_icu_libraries()
>   SELECT * FROM pg_available_icu_collation_versions('en')

+1

> 2.  lib-version-in-providers: We introduce a separate provider value
> for each ICU version, for example ICU63, plus an unversioned ICU like
> today.

I expressed interest in this approach before, but when you allowed ICU
compiled with --disable-renaming, that mitigated my concerns about when
to throw that error.

> 3.  lib-version-in-attributes: We introduce daticuversion (alongside
> datcollversion) and collicuversion (alongside collversion).

I think this is the best among 2-4.

> 4.  lib-version-in-locale:  "63:en" from earlier versions.  That was
> mostly a strawman proposal to avoid getting bogged down in
> syntax/catalogue/model change discussions while trying to prove that
> dlopen would even work.  It doesn't sound like anyone really likes
> this.

I don't see any advantage of this over 3.

> 5.  lib-version-in-collversion:  We didn't explicitly discuss this
> before, but you hinted at it: we could just use u_getVersion() in
> [dat]collversion.

The advantage here is that it's very easy to tell the admin what
library the collation is looking for, but the disadvantages you point
out seem a lot worse: migration problems from v15, and the minor
version question.



I'd vote for 1 on the grounds that it's easier to document and
understand a single collation version, which comes straight from
ucol_getVersion(). This approach makes it a separate problem to find
the collation version among whatever libraries the admin can provide;
but adding some observability into the search should mitigate any
confusion.

Can you go over the advantages of approaches 2-4 again? Is it just a
concern about burdening the admin with finding the right ICU library
version for a given collation version? That's a valid concern, but I
don't think that should be an overriding design point. It seems more
important to model the collation versions properly.


--
Jeff Davis
PostgreSQL Contributor Team - AWS





Re: Collation version tracking for macOS

From
Thomas Munro
Date:
On Thu, Nov 24, 2022 at 3:07 PM Jeff Davis <pgsql@j-davis.com> wrote:
> I'm sure this has been discussed, but which distros even support
> multiple major versions of ICU?

For Debian and friends, you can install any number of libicuNN
packages (if you can find them eg from previous release repos), but
there's only one libicu-dev.  That means that one specific major
version is blessed by each Debian release and has its headers and
static libraries for you to use as a developer, but you can still
install the dynamic libraries from older releases at the same time to
satisfy the dependencies of packages or programs that were built on an
earlier OS release.  They don't declare conflicts on each other and
they contain non-conflicting filenames.  That's similar to the way
standard libraries and various other things are treated, for backward
compatibility.

For RHEL and friends, I'm pretty sure it's the same concept, but I
don't use those and haven't seen it with my own eyes.

I don't know for other Linux distros/families, but I expect the above
two cover a huge percentage of our users and I expect others to have
made similar choices.

For the BSDs, which tend to have a single binary package with both
headers and libraries owing to their origins as source-based
distributions (ports), the above way of thinking doesn't work; I
couldn't develop this on my usual FreeBSD battlestation without
building ICU myself (problem being that there's only one "pkg install
icu") and I hope to talk to someone who knows what to do about that
eventually.  I want this to work there easily for end users.

macOS and Windows have so many different ways of installing things
that there isn't a single answer there; supposedly open source is like
a bazaar and closed source like a cathedral, but as far as package
management goes, it looks more like rubble to me.



Re: Collation version tracking for macOS

From
Thomas Munro
Date:
On Thu, Nov 24, 2022 at 3:07 PM Jeff Davis <pgsql@j-davis.com> wrote:
> I'd vote for 1 on the grounds that it's easier to document and
> understand a single collation version, which comes straight from
> ucol_getVersion(). This approach makes it a separate problem to find
> the collation version among whatever libraries the admin can provide;
> but adding some observability into the search should mitigate any
> confusion.

OK, it sounds like I should code that up next.

> Can you go over the advantages of approaches 2-4 again? Is it just a
> concern about burdening the admin with finding the right ICU library
> version for a given collation version? That's a valid concern, but I
> don't think that should be an overriding design point. It seems more
> important to model the collation versions properly.

Yes, that's a good summary.  The user has a problem, and the solution
is to find some version of ICU and install it, so the problem space
necessarily involves the other kind of version.  My idea was that we
should therefore make that part of the model.  But the observability
support does indeed make it a bit clearer what's going on.



Re: Collation version tracking for macOS

From
Thomas Munro
Date:
On Thu, Nov 24, 2022 at 5:48 PM Thomas Munro <thomas.munro@gmail.com> wrote:
> On Thu, Nov 24, 2022 at 3:07 PM Jeff Davis <pgsql@j-davis.com> wrote:
> > I'd vote for 1 on the grounds that it's easier to document and
> > understand a single collation version, which comes straight from
> > ucol_getVersion(). This approach makes it a separate problem to find
> > the collation version among whatever libraries the admin can provide;
> > but adding some observability into the search should mitigate any
> > confusion.
>
> OK, it sounds like I should code that up next.

Here's the first iteration.  The version rosetta stone functions look like this:

postgres=# select * from pg_icu_library_versions();
 icu_version | unicode_version | cldr_version
-------------+-----------------+--------------
 67.1        | 13.0            | 37.0
 63.1        | 11.0            | 34.0
 57.1        | 8.0             | 29.0
(3 rows)

postgres=# select * from pg_icu_collation_versions('zh');
 icu_version | uca_version | collator_version
-------------+-------------+------------------
 67.1        | 13.0        | 153.14.37
 63.1        | 11.0        | 153.88.34
 57.1        | 8.0         | 153.64.29
(3 rows)

It's no longer necessary to put anything in PG_TEST_EXTRA to run
"meson test irc/020_multiversion" usefully.  It will find extra ICU
versions all by itself in your system library search path and SKIP if
it doesn't find a second major version.  I have tried to cover the
main scenarios that I expect users to encounter in the update TAP
tests, with commentary that I hope will be helpful to assess the
usability of this thing.

Other changes:

* now using RTLD_LOCAL instead of RTLD_GLOBAL (I guess the latter
might cause trouble for someone using --disable-renaming, but I
haven't tested that and am not an expert on linker/loader arcana)
* fixed library names on Windows (based on reading the manual, but I
haven't tested that)
* fixed failure on non-ICU builds (the reason CI was failing in v7,
some misplaced #ifdefs)
* various cleanup
* I've attached a throwaway patch to install a second ICU version on
Debian/amd64 on CI, since otherwise the new test would SKIP on all
systems

This is just a first cut, but enough to try out and see if we like it,
what needs to be improved, what edge cases we haven't thought about
etc.  Let me know what you think.

Attachment

Re: Collation version tracking for macOS

From
Thomas Munro
Date:
On Sat, Nov 26, 2022 at 6:27 PM Thomas Munro <thomas.munro@gmail.com> wrote:
> This is just a first cut, but enough to try out and see if we like it,
> what needs to be improved, what edge cases we haven't thought about
> etc.  Let me know what you think.

BTW one problem to highlight (mentioned but buried in the test
comments), is that REFRESH VERSION doesn't affect other sessions or
even the current session.  You have to log out and back in again to
pick up the new version.  Obviously that's not good enough, but fixing
that involves making it transactional, I think.  If you abort, we have
to go back to using the old version, if you commit you keep the new
version and we might also consider telling other backends to start
using the new version -- or something like that.  I think that's just
a Small Matter of Programming, but a little bit finickity and I need
to take a break for a bit and go work on bugs elsewhere, hence v8
didn't address that yet.



Re: Collation version tracking for macOS

From
Robert Haas
Date:
On Wed, Nov 23, 2022 at 12:09 AM Thomas Munro <thomas.munro@gmail.com> wrote:
> OK.  Time for a new list of the various models we've discussed so far:
>
> 1.  search-by-collversion:  We introduce no new "library version"
> concept to COLLATION and DATABASE object and little or no new syntax.
>
> 2.  lib-version-in-providers: We introduce a separate provider value
> for each ICU version, for example ICU63, plus an unversioned ICU like
> today.
>
> 3.  lib-version-in-attributes: We introduce daticuversion (alongside
> datcollversion) and collicuversion (alongside collversion).  Similar
> to the above, but it's a separate property and the provider is always
> ICU.  New syntax for CREATE/ALTER COLLATION/DATABASE to set and change
> ICU_VERSION.
>
> 4.  lib-version-in-locale:  "63:en" from earlier versions.  That was
> mostly a strawman proposal to avoid getting bogged down in
> syntax/catalogue/model change discussions while trying to prove that
> dlopen would even work.  It doesn't sound like anyone really likes
> this.
>
> 5.  lib-version-in-collversion:  We didn't explicitly discuss this
> before, but you hinted at it: we could just use u_getVersion() in
> [dat]collversion.

I'd like to vote against #3 at least in the form that's described
here. If we had three more libraries providing collations, it's likely
that they would need versioning, too. So if we add an explicit notion
of provider version, then it ought not to be specific to libicu.

I think it's OK to decide that different library versions are
different providers (your option #2), or that they are the same
provider but give rise to different collations (your option #4), or
that there can be multiple version of each collation which are
distinguished by some additional provider version field (your #3 made
more generic).

I don't really understand #1 or #5 well enough to have an educated
opinion, but I do think that #1 seems a bit magical. It hopes that the
combination of a collation name and a datcollversion will be
sufficient to find exactly one matcing collation in a list of provided
libraries. The advantage of that, as I understand it, is that if you
do something to your system that causes the number of matches to go
from one to zero, you can just throw another library on the pile and
get the number back up to one. Woohoo! But there's a part of me that
worries: what if the number goes up to two, and they're not all the
same? Probably that's something that shouldn't happen, but if it does
then I think there's kind of no way to fix it. With the other options,
if there's some way to jigger the catalog state to match what you want
to happen, you can always repair the situation somehow, because the
library to be used for each collation is explicitly specified in some
way, and you just have to get it to match what you want to have
happen.

I don't know too much about this, though, so I might have it all wrong.

-- 
Robert Haas
EDB: http://www.enterprisedb.com



Re: Collation version tracking for macOS

From
Jeff Davis
Date:
On Sat, 2022-11-26 at 18:27 +1300, Thomas Munro wrote:
> Here's the first iteration.

I will send a full review shortly, but I encountered an ICU bug along
the way, which caused me some confusion for a bit. I'll skip past the
various levels of confusion I had (burned a couple hours), and get
right to the repro:

Install the latest release of all major versions 50-69, and compile
postgres against 70. You'll get:

=# select * from pg_icu_collation_versions('en_US') order by
icu_version;
 icu_version | uca_version | collator_version
-------------+-------------+------------------
 50.2        | 6.2         | 58.0.6.50
 51.3        | 6.2         | 58.0.6.50
 52.2        | 6.2         | 58.0.6.50
 53.2        | 6.3         | 137.51
 54.2        | 7.0         | 137.56
 55.2        | 7.0         | 153.56
 56.2        | 8.0         | 153.64
 57.2        | 8.0         | 153.64
 58.3        | 9.0         | 153.72
 59.2        | 9.0         | 153.72
 60.3        | 10.0        | 153.80
 61.2        | 10.0        | 153.80
 62.2        | 11.0        | 153.88
 63.2        | 11.0        | 153.88
 64.2        | 12.1        | 153.97
 65.1        | 12.1        | 153.97
 66.1        | 13.0        | 153.14
 67.1        | 13.0        | 153.14
 68.2        | 13.0        | 153.14
 69.1        | 13.0        | 153.14
 70.1        | 14.0        | 153.112
(21 rows)

This is good information, because it tells us that major library
versions change more often than collation versions, empirically-
speaking.

But did you notice that the version went backwards from 65.1 -> 66.1?
Well, actually, it didn't. The version of that collation in 66.1 went
from 153.97 -> 153.104. But there's a bug in versionToString() that
does the decimal output incorrectly when there's a '0' digit between
the hundreds and the ones place. I'll see about reporting that, but I
thought I'd mention it here because it could have consequences, as we
are storing the strings :-(

The bug is still present in 70.1, but it's masked because it went to
.112.

Incidentally, this answers our other question about whether the
collation version can change in a minor version update. Perhaps not,
but if they fix this bug and backport it, then the version *string*
will change in a minor update. Ugh.

Regards,
    Jeff Davis




Re: Collation version tracking for macOS

From
Robert Haas
Date:
On Mon, Nov 28, 2022 at 9:55 PM Jeff Davis <pgsql@j-davis.com> wrote:
> But did you notice that the version went backwards from 65.1 -> 66.1?
> Well, actually, it didn't. The version of that collation in 66.1 went
> from 153.97 -> 153.104. But there's a bug in versionToString() that
> does the decimal output incorrectly when there's a '0' digit between
> the hundreds and the ones place. I'll see about reporting that, but I
> thought I'd mention it here because it could have consequences, as we
> are storing the strings :-(
>
> The bug is still present in 70.1, but it's masked because it went to
> .112.
>
> Incidentally, this answers our other question about whether the
> collation version can change in a minor version update. Perhaps not,
> but if they fix this bug and backport it, then the version *string*
> will change in a minor update. Ugh.

That is ... astonishingly bad.

-- 
Robert Haas
EDB: http://www.enterprisedb.com



Re: Collation version tracking for macOS

From
Jeff Davis
Date:
On Mon, 2022-11-28 at 21:57 -0500, Robert Haas wrote:
> That is ... astonishingly bad.

https://unicode-org.atlassian.net/browse/CLDR-16175


--
Jeff Davis
PostgreSQL Contributor Team - AWS





Re: Collation version tracking for macOS

From
Thomas Munro
Date:
On Tue, Nov 29, 2022 at 3:55 PM Jeff Davis <pgsql@j-davis.com> wrote:
> =# select * from pg_icu_collation_versions('en_US') order by
> icu_version;
>  icu_version | uca_version | collator_version
> -------------+-------------+------------------
>  50.2        | 6.2         | 58.0.6.50
>  51.3        | 6.2         | 58.0.6.50
>  52.2        | 6.2         | 58.0.6.50
>  53.2        | 6.3         | 137.51
>  54.2        | 7.0         | 137.56
>  55.2        | 7.0         | 153.56
>  56.2        | 8.0         | 153.64
>  57.2        | 8.0         | 153.64
>  58.3        | 9.0         | 153.72
>  59.2        | 9.0         | 153.72
>  60.3        | 10.0        | 153.80
>  61.2        | 10.0        | 153.80
>  62.2        | 11.0        | 153.88
>  63.2        | 11.0        | 153.88
>  64.2        | 12.1        | 153.97
>  65.1        | 12.1        | 153.97
>  66.1        | 13.0        | 153.14
>  67.1        | 13.0        | 153.14
>  68.2        | 13.0        | 153.14
>  69.1        | 13.0        | 153.14
>  70.1        | 14.0        | 153.112
> (21 rows)
>
> This is good information, because it tells us that major library
> versions change more often than collation versions, empirically-
> speaking.

Wow, nice discovery about 104 -> 14.  Yeah, I imagine we'll want some
kind of band-aid to tolerate that exact screwup and avoid spurious
warnings.

Bugs aside, that's quite a revealing table in other ways.  We can see:

* The version scheme changed completely in ICU 53.  This corresponds
to a major rewrite of the collation code, I see[1].

* The first component seems to be (UCOL_RUNTIME_VERSION << 4) + 9.
UCOL_RUNTIME_VERSION is in their uvernum.h, currently 9, was 8, bumped
between 54 and 55 (I see this in their commit log), corresponding to
the two possible numbers 137 and 153 that we see there.  I don't know
where the final 9 term is coming from but it looks stable since the v2
collation rewrite landed.

* The second component seems to be uca_version_major * 8 +
uca_version_minor (that's the Unicode Collation Algorithm version, and
so far always matches the Unicode version, visible in the output of
the other function).

* The values you showed for English don't have a third component, but
if you try some other locales like 'zh' you'll see the CLDR major
version in third position.  So I guess some locales depend on CLDR
data and others don't.

TL;DR it *looks* like the set of ingredients for the version string is:

* UCOL_RUNTIME_VERSION (rarely changes)
* UCA/Unicode major.minor version
* sometimes CLDR major version, not sure when
* 9

[1] https://icu.unicode.org/design/collation/v2



Re: Collation version tracking for macOS

From
Jeff Davis
Date:
On Mon, 2022-11-28 at 14:11 -0500, Robert Haas wrote:
> I don't really understand #1 or #5 well enough to have an educated
> opinion, but I do think that #1 seems a bit magical. It hopes that
> the
> combination of a collation name and a datcollversion will be
> sufficient to find exactly one matcing collation in a list of
> provided
> libraries. The advantage of that, as I understand it, is that if you
> do something to your system that causes the number of matches to go
> from one to zero, you can just throw another library on the pile and
> get the number back up to one. Woohoo! But there's a part of me that
> worries: what if the number goes up to two, and they're not all the
> same? Probably that's something that shouldn't happen, but if it does
> then I think there's kind of no way to fix it. With the other
> options,
> if there's some way to jigger the catalog state to match what you
> want
> to happen, you can always repair the situation somehow, because the
> library to be used for each collation is explicitly specified in some
> way, and you just have to get it to match what you want to have
> happen.

Not necessarily, #2-4 (at least as implemented in v7) can only load one
major version at a time, so can't specify minor versions:
https://www.postgresql.org/message-id/9f8e9b5a3352478d4cf7d6c0a5dd7e82496be4b6.camel@j-davis.com

With #1, you can provide control over the search order to find the
symbol you want. Granted, if you want to specify that different
collations look in different libraries for the same version, then it
won't work, because the search order is global -- is that what you're
worried about? If so, I think we need to compare it against the
downsides of #2-4, which in my opinion are more serious.

The first thing to sort out with options #2-4 is: what about minor
versions? V7 took the approach that only the major version matters.
That means that if you want to select a specific minor version, then
you are out of luck, because only one major at a time can be loaded,
globally. But paying attention to minor versions seems like a mess --
we'd need even more magical fallbacks that try later minor versions or
something.

Second, there is weirdness in the common case that a collation version
doesn't change between versions. Let's say you have a collation
"mycoll" with locale "en_US" and it's pointed at built-in library
version 64, with collation version 153.97. GUC
default_icu_library_version is set to 63. Then you upgrade the system
and ICU gets updated from 64 -> 65. Now, it can't find version 64 to
load, so it falls back to 63 (which has the wrong version 153.88), even
though 65 is just fine because it still offers that locale with version
153.97. (A similar problem exists when you remove a version of ICU from
icu_library_path, and another version suffices for all of your
collations.)

Thirdly, as I said earlier, it's just hard on the user to try to sort
out two different versions modeled in the database. Understanding
encodings and collations are hard enough, and then we introduce *two*
versions on top of that.

Fourth, I don't see what the point of ucol_getVersion() is in schemes
#2-4. All it does is control a WARNING, because throwing an error (at
least by default) would be too harsh, given that users have lived with
these risks for so long. But if all it does is throw a warning, what's
the point in modeling it in the catalog as though it's the most
important version?

Ultimately, I think collation version (as reported by
ucol_getVersion()) is the most accurate and least-surprising way to
match a library-provided collation with the collation in the catalog.
And it seems like we'd be using it in exactly the way the ICU
maintainers intend it to be used.

Of course, I cast my vote for #1 before I discovered this ICU bug
here: 
https://www.postgresql.org/message-id/0f7922d4f411376f420ec9139febeae4cdc748a6.camel@j-davis.com

That injects some doubt, to be sure. If I were to try to solve the
problems with #2-4, one approach might be to treat the built-in ICU
version differently from the ones in icu_library_path. Not quite sure,
I'd have to think more. But as of now, I'd still lean toward #1 until a
better option is presented.

Regards,
    Jeff Davis




Re: Collation version tracking for macOS

From
Jeff Davis
Date:
On Sat, 2022-11-26 at 18:27 +1300, Thomas Munro wrote:
> On Thu, Nov 24, 2022 at 5:48 PM Thomas Munro <thomas.munro@gmail.com>
> wrote:
> > On Thu, Nov 24, 2022 at 3:07 PM Jeff Davis <pgsql@j-davis.com>
> > wrote:
> > > I'd vote for 1 on the grounds that it's easier to document and
> > > understand a single collation version, which comes straight from
> > > ucol_getVersion(). This approach makes it a separate problem to
> > > find
> > > the collation version among whatever libraries the admin can
> > > provide;
> > > but adding some observability into the search should mitigate any
> > > confusion.
> >
> > OK, it sounds like I should code that up next.
>
> Here's the first iteration.

Thank you.

Proposed changes:

* I attached a first pass of some documentation.

* Should be another GUC to turn WARNING into an ERROR. Useful at least
for testing; perhaps too dangerous for production.

* The libraries should be loaded in a more diliberate order. The "*"
should be expanded in a descending fashion so that later versions are
preferred.

* GUCs should be validated.

* Should validate that loaded library has expected version.

* We need to revise or remove pg_collation_actual_version() and
pg_database_collation_actual_version().

* The GUCs are PGC_SUSET, but don't take effect because
icu_library_list_fully_loaded is never reset.

* The extra collations you're adding at bootstrap time are named based
on the library major version. I suppose it might be more "proper" to
name them based on the collation version, but that would be more
verbose, so I won't advocate for that. Just pointing it out.

* It looks hard (or impossible) to mix multiple ICU libraries with the
same major version and different minor versions. That's because,
e.g., libicui18n.so.63.1 links against libicuuc.63 and libicudata.63,
and when you install ICU 63.2, those dependencies get clobbered with
the 63.2 versions. That fails the sanity check I proposed above about
the library version number matching the requested library version
number. And it also just seems wrong -- why would you have minor-
version precision about an ICU library but then only major-version
precision about the ICU dependencies of that library? Doesn't that
defeat the whole purpose of this naming scheme? (Maybe another ICU
bug?).

Minor comments:

* ICU_I18N is defined in make_icu_library_name() but used outside of
it. One solution might be to have it return both library names to the
caller and rename it as make_icu_library_names().

* get_icu_function() could use a clarifying comment or a better name.
Something that communicates that you are looking for the function in
the given library with the given major version number (which may or may
not be needed depending on how the library was compiled).

* typo in comment over make_icu_collator:
s/u_getVersion/ucol_getVersion/

* The return value of make_icu_collator() seems backwards to me,
stylistically. I typically see the false-is-good pattern with integer
returns.

* weird bracketing style in get_icu_collator for the "else"

>   The version rosetta stone functions look like this:
>
> postgres=# select * from pg_icu_library_versions();
>  icu_version | unicode_version | cldr_version
> -------------+-----------------+--------------
>  67.1        | 13.0            | 37.0
>  63.1        | 11.0            | 34.0
>  57.1        | 8.0             | 29.0
> (3 rows)
>
> postgres=# select * from pg_icu_collation_versions('zh');
>  icu_version | uca_version | collator_version
> -------------+-------------+------------------
>  67.1        | 13.0        | 153.14.37
>  63.1        | 11.0        | 153.88.34
>  57.1        | 8.0         | 153.64.29
> (3 rows)

I like these functions.


--
Jeff Davis
PostgreSQL Contributor Team - AWS



Attachment

Re: Collation version tracking for macOS

From
Joe Conway
Date:
On 11/28/22 14:11, Robert Haas wrote:
> On Wed, Nov 23, 2022 at 12:09 AM Thomas Munro <thomas.munro@gmail.com> wrote:
>> OK.  Time for a new list of the various models we've discussed so far:
>>
>> 1.  search-by-collversion:  We introduce no new "library version"
>> concept to COLLATION and DATABASE object and little or no new syntax.
>>
>> 2.  lib-version-in-providers: We introduce a separate provider value
>> for each ICU version, for example ICU63, plus an unversioned ICU like
>> today.
>>
>> 3.  lib-version-in-attributes: We introduce daticuversion (alongside
>> datcollversion) and collicuversion (alongside collversion).  Similar
>> to the above, but it's a separate property and the provider is always
>> ICU.  New syntax for CREATE/ALTER COLLATION/DATABASE to set and change
>> ICU_VERSION.
>>
>> 4.  lib-version-in-locale:  "63:en" from earlier versions.  That was
>> mostly a strawman proposal to avoid getting bogged down in
>> syntax/catalogue/model change discussions while trying to prove that
>> dlopen would even work.  It doesn't sound like anyone really likes
>> this.
>>
>> 5.  lib-version-in-collversion:  We didn't explicitly discuss this
>> before, but you hinted at it: we could just use u_getVersion() in
>> [dat]collversion.
> 
> I'd like to vote against #3 at least in the form that's described
> here. If we had three more libraries providing collations, it's likely
> that they would need versioning, too. So if we add an explicit notion
> of provider version, then it ought not to be specific to libicu.

+many

> I think it's OK to decide that different library versions are
> different providers (your option #2), or that they are the same
> provider but give rise to different collations (your option #4), or
> that there can be multiple version of each collation which are
> distinguished by some additional provider version field (your #3 made
> more generic).

I think provider and collation version are distinct concepts. The 
provider ('c' versus 'i' for example) determines a unique code path in 
the backend due to different APIs, whereas collation version is related 
to a specific ordering given a set of characters.


> I don't really understand #1 or #5 well enough to have an educated
> opinion, but I do think that #1 seems a bit magical. It hopes that the
> combination of a collation name and a datcollversion will be
> sufficient to find exactly one matcing collation in a list of provided
> libraries. The advantage of that, as I understand it, is that if you
> do something to your system that causes the number of matches to go
> from one to zero, you can just throw another library on the pile and
> get the number back up to one. Woohoo! But there's a part of me that
> worries: what if the number goes up to two, and they're not all the
> same? Probably that's something that shouldn't happen, but if it does
> then I think there's kind of no way to fix it. With the other options,
> if there's some way to jigger the catalog state to match what you want
> to happen, you can always repair the situation somehow, because the
> library to be used for each collation is explicitly specified in some
> way, and you just have to get it to match what you want to have
> happen.

My vote is for something like #5. The collversion should indicate a 
specific immutable ordering behavior.


-- 
Joe Conway
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




Re: Collation version tracking for macOS

From
Robert Haas
Date:
On Mon, Nov 28, 2022 at 11:49 PM Jeff Davis <pgsql@j-davis.com> wrote:
> Not necessarily, #2-4 (at least as implemented in v7) can only load one
> major version at a time, so can't specify minor versions:
> https://www.postgresql.org/message-id/9f8e9b5a3352478d4cf7d6c0a5dd7e82496be4b6.camel@j-davis.com
>
> With #1, you can provide control over the search order to find the
> symbol you want. Granted, if you want to specify that different
> collations look in different libraries for the same version, then it
> won't work, because the search order is global -- is that what you're
> worried about? If so, I think we need to compare it against the
> downsides of #2-4, which in my opinion are more serious.

You know more about this than I do, for sure, so don't let my vote
back the project into a bad spot. But, yeah, the thing you mention
here is what I'm worried about. Without a way to force a certain
behavior for a certain particular collation, you don't have an escape
valve if the global library ordering isn't doing what you want. Your
argument seems to at least partly be that #1 will be more usable on
the whole, and that does seem like an important consideration. People
may have a lot of collations and adjusting them all individually could
be difficult and unpleasant. However, I think it's also worth asking
what options someone has if #1 can't be made to work due to a single
ordering controlling every collation.

It's entirely possible that the scenario I'm worried about is too
remote in practice to be concerned about. I don't know how this stuff
works well enough to be certain. It's just that, on the basis of
previous experience, (1) it's not that uncommon for people to actually
end up in situations that we thought shouldn't ever happen and (2)
code that deals with collations is more untrustworthy than average.

-- 
Robert Haas
EDB: http://www.enterprisedb.com



Re: Collation version tracking for macOS

From
Jeremy Schneider
Date:
On 11/28/22 6:54 PM, Jeff Davis wrote:

> 
> =# select * from pg_icu_collation_versions('en_US') order by
> icu_version;
>  icu_version | uca_version | collator_version 
> -------------+-------------+------------------
>  ...
>  67.1        | 13.0        | 153.14
>  68.2        | 13.0        | 153.14
>  69.1        | 13.0        | 153.14
>  70.1        | 14.0        | 153.112
> (21 rows)
> 
> This is good information, because it tells us that major library
> versions change more often than collation versions, empirically-
> speaking.


It seems to me that the collator_version field is not a good version
identifier to use.

Just taking a quick glance at the ICU home page right now, it shows that
all of the last 5 versions of ICU have included "additions and
corrections" to locale data itself, including 68 to 69 where the
collator version did not change.

Is it possible that this "collator_version" only reflects the code that
processes collation data to do comparisons/sorts, but it does not
reflect updates to the locale data itself?

https://icu.unicode.org/

ICU v72 -> CLDR v42
ICU v71 -> CLDR v41
ICU v70 -> CLDR v40
ICU v69 -> CLDR v39
ICU v68 -> CLDR v38

-Jeremy


-- 
http://about.me/jeremy_schneider




Re: Collation version tracking for macOS

From
Thomas Munro
Date:
On Wed, Nov 30, 2022 at 7:03 AM Jeremy Schneider
<schneider@ardentperf.com> wrote:
> It seems to me that the collator_version field is not a good version
> identifier to use.
>
> Just taking a quick glance at the ICU home page right now, it shows that
> all of the last 5 versions of ICU have included "additions and
> corrections" to locale data itself, including 68 to 69 where the
> collator version did not change.
>
> Is it possible that this "collator_version" only reflects the code that
> processes collation data to do comparisons/sorts, but it does not
> reflect updates to the locale data itself?

I think it also includes the CLDR version for *some* locales.  From a
quick look, that includes 'ar', 'ru', 'tr', 'zh'.  Jeff, would you
mind sharing the same table for one of those?  Perhaps 'en' really
does depend only on the UCA?



Re: Collation version tracking for macOS

From
Jeff Davis
Date:
On Tue, 2022-11-29 at 12:32 -0500, Robert Haas wrote:
> You know more about this than I do, for sure, so don't let my vote
> back the project into a bad spot.

I'm going back and forth myself. I haven't found a great answer here
yet.

>  But, yeah, the thing you mention
> here is what I'm worried about. Without a way to force a certain
> behavior for a certain particular collation, you don't have an escape
> valve if the global library ordering isn't doing what you want.

One bit of weirdness is that I may have found another ICU problem.
First, install 63.1, and you get (editing for clarity):

$ ls -l /path/to/libicui18n.so.63*
/path/to/libicui18n.so.63 -> libicui18n.so.63.1
/path/to/libicui18n.so.63.1

$ ls -l /path/to/libicuuc.so.63*
/path/to/libicuuc.so.63 -> libicuuc.so.63.1
/path/to/libicuuc.so.63.1

$ ls -l /path/to/libicudata.so.63*
/path/to/libicudata.so.63 -> libicudata.so.63.1
/path/to/lib/libicudata.so.63.1

$ ldd /path/to/libicui18n.so.63.1
        libicuuc.so.63 => /path/to/libicuuc.so.63
        libicudata.so.63 => /path/to/libicudata.so.63

OK, now install 63.2. Then you get:

$ ls -l /path/to/libicui18n.so.63*
/path/to/libicui18n.so.63 -> libicui18n.so.63.2
/path/to/libicui18n.so.63.1
/path/to/libicui18n.so.63.2

$ ls -l /path/to/libicuuc.so.63*
/path/to/libicuuc.so.63 -> libicuuc.so.63.2
/path/to/libicuuc.so.63.1
/path/to/libicuuc.so.63.2

$ ls -l /path/to/libicudata.so.63*
/path/to/libicudata.so.63 -> libicudata.so.63.2
/path/to/libicudata.so.63.1
/path/to/libicudata.so.63.2

$ ldd /path/to/libicui18n.so.63.2
        libicuuc.so.63 => /path/to/libicuuc.so.63
        libicudata.so.63 => /path/to/libicudata.so.63

The problem is that the specific minor version 63.1 depends on only the
major version of its ICU link dependencies. When loading
libicui18n.so.63.1, you are actually pulling in libicuuc.so.63.2 and
libicudata.so.63.2.

When I tried this with Thomas's patch, it caused some confusing
problems. I inserted a check that, when you open a library, that the
requested and reported versions match, and the check failed when
multiple minors are installed. In other words, opening
libicui18n.so.63.1 reports a version of 63.2!

(Note: I compiled ICU with --enable-rpath, but I don't think it
matters.)

Summary: even locking down to a minor version does not seem to identify
a specific ICU library, because its shared library dependencies do not
reference a specific minor version.

> It's entirely possible that the scenario I'm worried about is too
> remote in practice to be concerned about. I don't know how this stuff
> works well enough to be certain. It's just that, on the basis of
> previous experience, (1) it's not that uncommon for people to
> actually
> end up in situations that we thought shouldn't ever happen and (2)
> code that deals with collations is more untrustworthy than average.

Yeah...


--
Jeff Davis
PostgreSQL Contributor Team - AWS





Re: Collation version tracking for macOS

From
Jeff Davis
Date:
On Tue, 2022-11-29 at 11:27 -0500, Joe Conway wrote:
> My vote is for something like #5. The collversion should indicate a
> specific immutable ordering behavior.

Easier said than done:

https://www.postgresql.org/message-id/abddc35a7a447d93e2b8371a1a9052cb48866070.camel@j-davis.com

Even pointing at a specific minor version doesn't guarantee that
specific ICU code is loaded. It could also be a mix of different minor
versions that happen to be installed.

But if we ignore that problem for a moment, and assume that major
version is precise enough, let me make another proposal (not advocating
for this, but wanted to put it out there):

6. Create a new concept of a "locked down collation" that points at
some specific collation code (identified by some combination of library
version and collation version or whatever else can be used to identify
it). If a collation is locked down, it would never have a fallback or
any other magic, it would either find the code it's looking for, or
fail. If a collation is not locked down, it would look only in the
built-in ICU library, and warn if it detects some kind of change
(again, by whatever heuristic we think is reasonable).

#6 doesn't answer all of the problems I pointed out earlier:

https://www.postgresql.org/message-id/83faecb4a89dfb5794938e7b4d9f89daf4c5d631.camel@j-davis.com

but could be a better starting place for answers.


--
Jeff Davis
PostgreSQL Contributor Team - AWS





Re: Collation version tracking for macOS

From
Jeff Davis
Date:
On Wed, 2022-11-30 at 07:18 +1300, Thomas Munro wrote:
> On Wed, Nov 30, 2022 at 7:03 AM Jeremy Schneider
> <schneider@ardentperf.com> wrote:
> > It seems to me that the collator_version field is not a good
> > version
> > identifier to use.
> >
> > Just taking a quick glance at the ICU home page right now, it shows
> > that
> > all of the last 5 versions of ICU have included "additions and
> > corrections" to locale data itself, including 68 to 69 where the
> > collator version did not change.
> >
> > Is it possible that this "collator_version" only reflects the code
> > that
> > processes collation data to do comparisons/sorts, but it does not
> > reflect updates to the locale data itself?
>
> I think it also includes the CLDR version for *some* locales.  From a
> quick look, that includes 'ar', 'ru', 'tr', 'zh'.  Jeff, would you
> mind sharing the same table for one of those?  Perhaps 'en' really
> does depend only on the UCA?

=# select * from pg_icu_collation_versions('ar') order by icu_version;
 icu_version | uca_version | collator_version
-------------+-------------+------------------
 50.2        | 6.2         | 58.0.0.50
 51.3        | 6.2         | 58.0.0.50
 52.2        | 6.2         | 58.0.0.50
 53.2        | 6.3         | 137.51.25
 54.2        | 7.0         | 137.56.26
 55.2        | 7.0         | 153.56.27.1
 56.2        | 8.0         | 153.64.28
 57.2        | 8.0         | 153.64.29
 58.3        | 9.0         | 153.72.30.3
 59.2        | 9.0         | 153.72.31.1
 60.3        | 10.0        | 153.80.32.1
 61.2        | 10.0        | 153.80.33
 62.2        | 11.0        | 153.88.33.8
 63.2        | 11.0        | 153.88.34
 64.2        | 12.1        | 153.97.35.8
 65.1        | 12.1        | 153.97.36
 66.1        | 13.0        | 153.14.36.8
 67.1        | 13.0        | 153.14.37
 68.2        | 13.0        | 153.14.38.8
 69.1        | 13.0        | 153.14.39
 70.1        | 14.0        | 153.112.40
(21 rows)


=# select * from pg_icu_collation_versions('zh') order by icu_version;
 icu_version | uca_version | collator_version
-------------+-------------+------------------
 50.2        | 6.2         | 58.0.0.50
 51.3        | 6.2         | 58.0.0.50
 52.2        | 6.2         | 58.0.0.50
 53.2        | 6.3         | 137.51.25
 54.2        | 7.0         | 137.56.26
 55.2        | 7.0         | 153.56.27.1
 56.2        | 8.0         | 153.64.28
 57.2        | 8.0         | 153.64.29
 58.3        | 9.0         | 153.72.30.3
 59.2        | 9.0         | 153.72.31.1
 60.3        | 10.0        | 153.80.32.1
 61.2        | 10.0        | 153.80.33
 62.2        | 11.0        | 153.88.33.8
 63.2        | 11.0        | 153.88.34
 64.2        | 12.1        | 153.97.35.8
 65.1        | 12.1        | 153.97.36
 66.1        | 13.0        | 153.14.36.8
 67.1        | 13.0        | 153.14.37
 68.2        | 13.0        | 153.14.38.8
 69.1        | 13.0        | 153.14.39
 70.1        | 14.0        | 153.112.40
(21 rows)


--
Jeff Davis
PostgreSQL Contributor Team - AWS





Re: Collation version tracking for macOS

From
Joe Conway
Date:
On 11/29/22 13:59, Jeff Davis wrote:
> On Tue, 2022-11-29 at 11:27 -0500, Joe Conway wrote:
>> My vote is for something like #5. The collversion should indicate a 
>> specific immutable ordering behavior.
> 
> Easier said than done:
> https://www.postgresql.org/message-id/abddc35a7a447d93e2b8371a1a9052cb48866070.camel@j-davis.com
> 
> Even pointing at a specific minor version doesn't guarantee that
> specific ICU code is loaded. It could also be a mix of different minor
> versions that happen to be installed.

I understand that it is not easily done, but if the combination of 
collprovider + collversion does not represent specific immutable 
ordering behavior for a given locale, what value is there in tracking it?

-- 
Joe Conway
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




Re: Collation version tracking for macOS

From
Jeff Davis
Date:
On Mon, 2022-11-28 at 19:36 -0800, Jeff Davis wrote:
> On Mon, 2022-11-28 at 21:57 -0500, Robert Haas wrote:
> > That is ... astonishingly bad.
>
> https://unicode-org.atlassian.net/browse/CLDR-16175

Oops, reported in CLDR instead of ICU. Moved to:

https://unicode-org.atlassian.net/browse/ICU-22215


--
Jeff Davis
PostgreSQL Contributor Team - AWS





Re: Collation version tracking for macOS

From
Jeff Davis
Date:
On Tue, 2022-11-29 at 10:46 -0800, Jeff Davis wrote:
> One bit of weirdness is that I may have found another ICU problem.

Reported as:

https://unicode-org.atlassian.net/browse/ICU-22216


--
Jeff Davis
PostgreSQL Contributor Team - AWS





Re: Collation version tracking for macOS

From
Thomas Munro
Date:
On Wed, Nov 30, 2022 at 8:03 AM Jeff Davis <pgsql@j-davis.com> wrote:
> On Wed, 2022-11-30 at 07:18 +1300, Thomas Munro wrote:
> > I think it also includes the CLDR version for *some* locales.  From a
> > quick look, that includes 'ar', 'ru', 'tr', 'zh'.  Jeff, would you
> > mind sharing the same table for one of those?  Perhaps 'en' really
> > does depend only on the UCA?
>
> =# select * from pg_icu_collation_versions('ar') order by icu_version;
>  icu_version | uca_version | collator_version
> -------------+-------------+------------------
>  50.2        | 6.2         | 58.0.0.50
>  51.3        | 6.2         | 58.0.0.50
>  52.2        | 6.2         | 58.0.0.50
>  53.2        | 6.3         | 137.51.25
>  54.2        | 7.0         | 137.56.26
>  55.2        | 7.0         | 153.56.27.1
>  56.2        | 8.0         | 153.64.28
>  57.2        | 8.0         | 153.64.29
>  58.3        | 9.0         | 153.72.30.3
>  59.2        | 9.0         | 153.72.31.1
>  60.3        | 10.0        | 153.80.32.1
>  61.2        | 10.0        | 153.80.33
>  62.2        | 11.0        | 153.88.33.8
>  63.2        | 11.0        | 153.88.34
>  64.2        | 12.1        | 153.97.35.8
>  65.1        | 12.1        | 153.97.36
>  66.1        | 13.0        | 153.14.36.8
>  67.1        | 13.0        | 153.14.37
>  68.2        | 13.0        | 153.14.38.8
>  69.1        | 13.0        | 153.14.39
>  70.1        | 14.0        | 153.112.40
> (21 rows)

Thanks.  So now we can see that the CLDR minor version is there too.
At a guess, in ICU 60 and before, it was the 4th component directly,
and from ICU 61 on, it's shifted left 3 bits.  I guess that means
those CLDR-dependent locales have higher frequency collversion
changes, including everyday "apt-get upgrade" (no major OS upgrade
required), assuming that Debian et al take those minor upgrades, while
others like 'en' should be stable for the whole ICU major version's
lifetime, and even across some ICU major version upgrades, because the
Unicode/UCA version changes more slowly.

Those CLDR-dependent locales therefore present us with a problem: as
discussed a while back, it's impossible to install two minor versions
of the same ICU major version with packages, and as Jeff has pointed
out in recent emails, even if you compile them yourself (which no one
really expects users to do), it doesn't really work because the
SONAMEs only have the major version, so the various libraries
that make up ICU will not be able to open each other correctly
(they'll follow symlinks to an arbitrary minor version).  (These two
things are not unrelated.)  So I probably need to remove the code that
claimed to support minor version addressing and go back to the
previous thinking that major will have to be enough.

In terms of user experience, I think that might mean that users of
'zh' who encounter warnings after a minor upgrade would therefore
really only have the options of REFRESHing and rebuilding, or
downgrading the package, because there's no way for us to access the
older version.  Users of 'en' probably only encounter collversion
changes when moving between OS releases with an ICU major version
change, and then the various schemes in this thread can help them
avoid the need to rebuild, until they eventually want to, if ever.



Re: Collation version tracking for macOS

From
Robert Haas
Date:
On Tue, Nov 29, 2022 at 1:59 PM Jeff Davis <pgsql@j-davis.com> wrote:
> 6. Create a new concept of a "locked down collation" that points at
> some specific collation code (identified by some combination of library
> version and collation version or whatever else can be used to identify
> it). If a collation is locked down, it would never have a fallback or
> any other magic, it would either find the code it's looking for, or
> fail. If a collation is not locked down, it would look only in the
> built-in ICU library, and warn if it detects some kind of change
> (again, by whatever heuristic we think is reasonable).

It seems like it would be somewhat reasonable to allow varying levels
of specificity in saying which what suffix to append when calling
dlopen() on the ICU library. Like you could allow adding nothing,
which would find the system-default ICU, or you could add 53 to find
the default version of ICU 53, or you could 53.1 to pick a specific
minor version. The idea is that the symlinks in the filesystem would
be responsible for sorting out the meaning of the supplied string. The
way that minor versions work may preclude having this work as well as
one might hope, though.

I continue to be confused about why collation maintainers think that
it's OK to whack stuff around in minor versions. The thought that
people might use collations to sort data that needs to stay sorted
after upgrading the library seems to be an alien one, and it doesn't
really seem like libicu is a whole lot better than libc, either.

-- 
Robert Haas
EDB: http://www.enterprisedb.com



Re: Collation version tracking for macOS

From
Thomas Munro
Date:
On Wed, Nov 30, 2022 at 8:52 AM Robert Haas <robertmhaas@gmail.com> wrote:
> On Tue, Nov 29, 2022 at 1:59 PM Jeff Davis <pgsql@j-davis.com> wrote:
> > 6. Create a new concept of a "locked down collation" that points at
> > some specific collation code (identified by some combination of library
> > version and collation version or whatever else can be used to identify
> > it). If a collation is locked down, it would never have a fallback or
> > any other magic, it would either find the code it's looking for, or
> > fail. If a collation is not locked down, it would look only in the
> > built-in ICU library, and warn if it detects some kind of change
> > (again, by whatever heuristic we think is reasonable).
>
> It seems like it would be somewhat reasonable to allow varying levels
> of specificity in saying which what suffix to append when calling
> dlopen() on the ICU library. Like you could allow adding nothing,
> which would find the system-default ICU, or you could add 53 to find
> the default version of ICU 53, or you could 53.1 to pick a specific
> minor version. The idea is that the symlinks in the filesystem would
> be responsible for sorting out the meaning of the supplied string. The
> way that minor versions work may preclude having this work as well as
> one might hope, though.

I'm struggling to understand what's new about proposal #6.  The
earlier proposals except #1 already contemplated different levels of
locked-down-ness.  For example in the libversion-as-provider idea, we
said you could use just provider = ICU (warn me if the collverison
changes, but always use the "default" library and carry on, pretty
much like today except perhaps "the default" can be changed with a
GUC), or you could be more specific and say provider = ICU63.  (We
also mentioned ICU63_2 as a third level of specificity, but maybe
that's practically impossible.)  And it was the same for the other
ideas, just encoded in different ways.



Re: Collation version tracking for macOS

From
Jeff Davis
Date:
On Tue, 2022-11-29 at 14:34 -0500, Joe Conway wrote:
> I understand that it is not easily done, but if the combination of
> collprovider + collversion does not represent specific immutable
> ordering behavior for a given locale

Given the u_versionToString() bug, we know the version string could end
up being the same between two different collation versions (e.g.
153.104 and 153.14). So that really undermines the credibility of ICU's
collation versions (at least the strings, which is what we store in
collversion).

But if we ignore that bug, do we have evidence that the actual versions
could be the same for collations that sort differently? It's worth
exploring, to be sure, but right now I don't know of a case.

> , what value is there in tracking [collation version]?

Similarly, what is the value in tracking the library minor versions, if
when you open libicui18n.63.1, you may end up with a mix of code
between 63.1 and 63.2?

That doesn't mean it's impossible. We could attach collations to a
library major version, and tell administrators that once they install a
major version in icu_library_path, they never touch that major version
again (no updates or new minors, only new majors). #6 might be a good
approach to facilitate this best practice. We'd then probably need to
change collversion to be a library major version, and then come up with
a migration path from 15 -> 16. Or we could store both library major
version and collversion, and verify both.

--
Jeff Davis
PostgreSQL Contributor Team - AWS





Re: Collation version tracking for macOS

From
Jeff Davis
Date:
On Wed, 2022-11-30 at 08:41 +1300, Thomas Munro wrote:
> In terms of user experience, I think that might mean that users of
> 'zh' who encounter warnings after a minor upgrade would therefore
> really only have the options of REFRESHing and rebuilding, or
> downgrading the package, because there's no way for us to access the
> older version.  Users of 'en' probably only encounter collversion
> changes when moving between OS releases with an ICU major version
> change, and then the various schemes in this thread can help them
> avoid the need to rebuild, until they eventually want to, if ever.

I installed the first minor release for each major, and got some new
tables. I think we can all agree that it's a lot easier to work with
information once it's in table form.

Here's what I found for the 'ar' locale (firstminor/lastminor are the
icu library versions, firstcollversion/lastcollversion are their
respective collation versions for the given locale):

 firstminor | lastminor | firstcollversion | lastcollversion
------------+-----------+------------------+-----------------
 60.1       | 60.3      | 153.80.32        | 153.80.32.1
 64.1       | 64.2      | 153.96.35        | 153.97.35.8
 68.1       | 68.2      | 153.14.38        | 153.14.38.8
(3 rows)

For 'en':

 firstminor | lastminor | firstcollversion | lastcollversion
------------+-----------+------------------+-----------------
 64.1       | 64.2      | 153.96           | 153.97
(1 row)

And for 'zh':

 firstminor | lastminor | firstcollversion | lastcollversion
------------+-----------+------------------+-----------------
 60.1       | 60.3      | 153.80.32        | 153.80.32.1
 64.1       | 64.2      | 153.96.35        | 153.97.35.8
 68.1       | 68.2      | 153.14.38        | 153.14.38.8
(3 rows)

It looks like collation versions do change in minor releases. It looks
like it's *not* safe to lock a collation to a major version *if* that
major version could be updated to a new minor. And we can't lock to a
minor, as I said earlier. Therefore, once we lock a collation down to a
major release, we better keep that in the icu_library_path, and never
touch it, and never install a new minor for that major.

Then again, maybe some of these are just about how the version is
reported... maybe 153.80.32 and 153.80.32.1 are really the same
version? But 64.1 -> 64.2 looks like a real difference.

I suppose the next step is to test with actual data and find
differences?


--
Jeff Davis
PostgreSQL Contributor Team - AWS





Re: Collation version tracking for macOS

From
Thomas Munro
Date:
On Wed, Nov 30, 2022 at 9:59 AM Jeff Davis <pgsql@j-davis.com> wrote:
> Here's what I found for the 'ar' locale (firstminor/lastminor are the
> icu library versions, firstcollversion/lastcollversion are their
> respective collation versions for the given locale):
>
>  firstminor | lastminor | firstcollversion | lastcollversion
> ------------+-----------+------------------+-----------------
>  60.1       | 60.3      | 153.80.32        | 153.80.32.1
>  64.1       | 64.2      | 153.96.35        | 153.97.35.8
>  68.1       | 68.2      | 153.14.38        | 153.14.38.8
> (3 rows)

Right, this fits with what I said earlier: the third component is CLDR
major, fourth component is CLDR minor except from ICU 61 on the CLDR
minor is << 3'd (X.X.38.8 means CLDR 38.1).  I wrote something about
that particular CLDR upgrade that happened in ICU 68 back here, with a
link to the CLDR change list:

https://www.postgresql.org/message-id/CA+hUKGJxg6AbKC9RJ7r1ByVLtvVkThQV+RZO6BKVWYESPCp3Ug@mail.gmail.com

TL;DR that particular CLDR change didn't actually affect collations,
it affected other locale stuff we don't care about (timezones etc).
We probably have to assume that any CLDR change *might* affect us,
though, unless we can find a written policy somewhere that says CLDR
minor changes never change sort order.  But I wouldn't want to get
into 2nd guessing their ucol_getVersion() format, and if they knew
that minor changes didn't affect sort order they presumably wouldn't
have included it in the recipe, so I think we simply have to treat it
as opaque and assume that ucol_getVersion() change means what it says
on the tin: sort order might have changed.

> I suppose the next step is to test with actual data and find
> differences?

Easier to read the published CLDR deltas, but I'm not sure it'd tell
us much about what *could* happen in future releases...



Re: Collation version tracking for macOS

From
Jeff Davis
Date:
On Wed, 2022-11-30 at 09:00 +1300, Thomas Munro wrote:
> I'm struggling to understand what's new about proposal #6.

Perhaps it's just a slight variant; I'm not sure. It's not a complete
proposal yet.

The difference I had in mind is that it would treat the built-in ICU
differently from what is found in icu_library_path. I think that could
remove confusion over what happens when you upgrade the system's ICU
library.


--
Jeff Davis
PostgreSQL Contributor Team - AWS





Re: Collation version tracking for macOS

From
Thomas Munro
Date:
On Wed, Nov 30, 2022 at 8:38 AM Jeff Davis <pgsql@j-davis.com> wrote:
> On Tue, 2022-11-29 at 10:46 -0800, Jeff Davis wrote:
> > One bit of weirdness is that I may have found another ICU problem.
>
> Reported as:
>
> https://unicode-org.atlassian.net/browse/ICU-22216

I'm no expert on loader/linker arcana but I have a feeling this is a
dead end.  It's an ancient Unix or at least elf-era Unix convention
that SONAMEs have major versions only, because major versions are the
basis of ABI stability.

As a workaround with an already built ICU, I think you could use elf
editing tools like "patchelf" to change the SONAME and DT_NEEDED to
include the minor version.  Or you could convince the build/link
scripts to set them that way in the first place, but no distro would
want to do that as it would cause lots of executables to fail to load
when the next ICU minor comes out.



Re: Collation version tracking for macOS

From
Jeff Davis
Date:
On Wed, 2022-11-30 at 10:52 +1300, Thomas Munro wrote:
> On Wed, Nov 30, 2022 at 8:38 AM Jeff Davis <pgsql@j-davis.com> wrote:
> > On Tue, 2022-11-29 at 10:46 -0800, Jeff Davis wrote:
> > > One bit of weirdness is that I may have found another ICU
> > > problem.
> >
> > Reported as:
> >
> > https://unicode-org.atlassian.net/browse/ICU-22216
>
> I'm no expert on loader/linker arcana but I have a feeling this is a
> dead end.  It's an ancient Unix or at least elf-era Unix convention
> that SONAMEs have major versions only, because major versions are the
> basis of ABI stability.

It's possible that it's more a problem of how they are doing it: the
specific version is coming from a dependency rather than the library
itself. The results are surprising, so I figured it's worth a report.
Let's see what they say.

Regardless, even if they did make a change, it's not going to help us
anytime soon. We can't rely on any scheme that involves multiple minor
versions for a single major version being installed at once. That means
that, if you create a collation depending on ICU X.Y, and then it gets
upgraded to X.(Y+1), and you create another collation depending on that
library version, you are stuck.


--
Jeff Davis
PostgreSQL Contributor Team - AWS





Re: Collation version tracking for macOS

From
Jeff Davis
Date:
On Wed, 2022-11-30 at 10:29 +1300, Thomas Munro wrote:
> On Wed, Nov 30, 2022 at 9:59 AM Jeff Davis <pgsql@j-davis.com> wrote:
> > Here's what I found for the 'ar' locale (firstminor/lastminor are
> > the
> > icu library versions, firstcollversion/lastcollversion are their
> > respective collation versions for the given locale):
> >
> >  firstminor | lastminor | firstcollversion | lastcollversion
> > ------------+-----------+------------------+-----------------
> >  60.1       | 60.3      | 153.80.32        | 153.80.32.1
> >  64.1       | 64.2      | 153.96.35        | 153.97.35.8
> >  68.1       | 68.2      | 153.14.38        | 153.14.38.8
> > (3 rows)
>
> Right, this fits with what I said earlier: the third component is
> CLDR
> major, fourth component is CLDR minor except from ICU 61 on the CLDR
> minor is << 3'd (X.X.38.8 means CLDR 38.1).

What about 64.1 -> 64.2? That changed the *second* component from 96 ->
97. Are we agreed that collations can materially change in minor ICU
releases?


--
Jeff Davis
PostgreSQL Contributor Team - AWS





Re: Collation version tracking for macOS

From
Thomas Munro
Date:
On Wed, Nov 30, 2022 at 1:32 PM Jeff Davis <pgsql@j-davis.com> wrote:
> On Wed, 2022-11-30 at 10:29 +1300, Thomas Munro wrote:
> > On Wed, Nov 30, 2022 at 9:59 AM Jeff Davis <pgsql@j-davis.com> wrote:
> > > Here's what I found for the 'ar' locale (firstminor/lastminor are
> > > the
> > > icu library versions, firstcollversion/lastcollversion are their
> > > respective collation versions for the given locale):
> > >
> > >  firstminor | lastminor | firstcollversion | lastcollversion
> > > ------------+-----------+------------------+-----------------
> > >  60.1       | 60.3      | 153.80.32        | 153.80.32.1
> > >  64.1       | 64.2      | 153.96.35        | 153.97.35.8
> > >  68.1       | 68.2      | 153.14.38        | 153.14.38.8
> > > (3 rows)
> >
> > Right, this fits with what I said earlier: the third component is
> > CLDR
> > major, fourth component is CLDR minor except from ICU 61 on the CLDR
> > minor is << 3'd (X.X.38.8 means CLDR 38.1).
>
> What about 64.1 -> 64.2? That changed the *second* component from 96 ->
> 97. Are we agreed that collations can materially change in minor ICU
> releases?

That means that the Unicode/UCA version switched from 12 to 12.1, so
that's a confirmed sighting of a UCA minor version bump within one ICU
major version.  Let's see what the purpose of that Unicode minor
release was[1]:

"Unicode 12.1 adds exactly one character, for a total of 137,929 characters.

The new character added to Version 12.1 is:

U+32FF SQUARE ERA NAME REIWA

Version 12.1 adds that single character to enable software to be
rapidly updated to support the new Japanese era name in calendrical
systems and date formatting. The new Japanese era name was officially
announced on April 1, 2019, and is effective as of May 1, 2019."

Wow!

Wikipedia says[2] "the "rei" character 令 has never appeared before".

The sort order of characters that didn't previously exist is a special
topic.  In theory they can't hurt you because you shouldn't have been
using them, but PostgreSQL doesn't enforce that (other systems do), so
you could be exposed to a change from whatever default ordering the
non-existent codepoint had for random implementation reasons to some
deliberate ordering which may or may not be the same.

Are all Unicode/UCA minor versions of that type?  I dunno.  Something
to research, but [3] is far too vague and [4] is about other problems.

[1] https://unicode.org/versions/Unicode12.1.0/
[2] https://en.wikipedia.org/wiki/Reiwa
[3] https://www.unicode.org/versions/#major_minor
[4] https://www.unicode.org/policies/stability_policy.html



Re: Collation version tracking for macOS

From
Thomas Munro
Date:
On Wed, Nov 30, 2022 at 1:25 PM Jeff Davis <pgsql@j-davis.com> wrote:
> On Wed, 2022-11-30 at 10:52 +1300, Thomas Munro wrote:
> > On Wed, Nov 30, 2022 at 8:38 AM Jeff Davis <pgsql@j-davis.com> wrote:
> > > On Tue, 2022-11-29 at 10:46 -0800, Jeff Davis wrote:
> > > https://unicode-org.atlassian.net/browse/ICU-22216
> >
> > I'm no expert on loader/linker arcana but I have a feeling this is a
> > dead end.  It's an ancient Unix or at least elf-era Unix convention
> > that SONAMEs have major versions only, because major versions are the
> > basis of ABI stability.
>
> It's possible that it's more a problem of how they are doing it: the
> specific version is coming from a dependency rather than the library
> itself. The results are surprising, so I figured it's worth a report.
> Let's see what they say.
>
> Regardless, even if they did make a change, it's not going to help us
> anytime soon. We can't rely on any scheme that involves multiple minor
> versions for a single major version being installed at once. That means
> that, if you create a collation depending on ICU X.Y, and then it gets
> upgraded to X.(Y+1), and you create another collation depending on that
> library version, you are stuck.

Mainstream package maintainers aren't going to let that happen anyway
as discussed, so this would always be a fairly specialised concern.
Maybe someone in our community would be motivated to publish a repo
full of mutant packages that don't conflict with each other and that
have specially modified DT_NEEDED, or are rolled into one single
library so the DT_NEEDED problem goes away.



Re: Collation version tracking for macOS

From
Michael Paquier
Date:
On Wed, Nov 30, 2022 at 01:50:51PM +1300, Thomas Munro wrote:
> The new character added to Version 12.1 is:
>
> U+32FF SQUARE ERA NAME REIWA
>
> Version 12.1 adds that single character to enable software to be
> rapidly updated to support the new Japanese era name in calendrical
> systems and date formatting. The new Japanese era name was officially
> announced on April 1, 2019, and is effective as of May 1, 2019."
>
> Wow!

Wow++.  I didn't know this one.

> Wikipedia says[2] "the "rei" character 令 has never appeared before".

At least there was some time ahead to prepare for the switch from "平
成" to "令和".  Things were much "funnier" when the era has switched
from "昭和" to "平成", as the sudden death of the emperor has required
Japan to switch to a new calendar very suddenly back in the day..
I've heard this was quite a mess for folks in IT back then, especially
for public agencies.
--
Michael

Attachment

Re: Collation version tracking for macOS

From
Dagfinn Ilmari Mannsåker
Date:
Jeff Davis <pgsql@j-davis.com> writes:

> On Mon, 2022-11-28 at 19:36 -0800, Jeff Davis wrote:
>> On Mon, 2022-11-28 at 21:57 -0500, Robert Haas wrote:
>> > That is ... astonishingly bad.
>> 
>> https://unicode-org.atlassian.net/browse/CLDR-16175
>
> Oops, reported in CLDR instead of ICU. Moved to:
>
> https://unicode-org.atlassian.net/browse/ICU-22215

Out of morbid curiosity I went source diving, and the culprit is this
bit (which will also break if a version component ever goes above 999):

    /* write the decimal field value */
    field=versionArray[part];
    if(field>=100) {
        *versionString++=(char)('0'+field/100);
        field%=100;
    }
    if(field>=10) {
        *versionString++=(char)('0'+field/10);
        field%=10;
    }
    *versionString++=(char)('0'+field);

(https://sources.debian.org/src/icu/72.1-3/source/common/putil.cpp#L2308)

because apparently snprintf() is too hard?

- ilmari



Re: Collation version tracking for macOS

From
Thomas Munro
Date:
On Tue, Nov 29, 2022 at 7:51 PM Jeff Davis <pgsql@j-davis.com> wrote:
> On Sat, 2022-11-26 at 18:27 +1300, Thomas Munro wrote:
> > On Thu, Nov 24, 2022 at 5:48 PM Thomas Munro <thomas.munro@gmail.com>
> > wrote:
> > > On Thu, Nov 24, 2022 at 3:07 PM Jeff Davis <pgsql@j-davis.com>
> > > wrote:
> > > > I'd vote for 1 on the grounds that it's easier to document and
> > > > understand a single collation version, which comes straight from
> > > > ucol_getVersion(). This approach makes it a separate problem to
> > > > find
> > > > the collation version among whatever libraries the admin can
> > > > provide;
> > > > but adding some observability into the search should mitigate any
> > > > confusion.
> > >
> > > OK, it sounds like I should code that up next.
> >
> > Here's the first iteration.
>
> Thank you.

Thanks for the review.  Responses further down.  And thanks also for
the really interesting discussion about how the version numbers work
(or in some cases, don't work...), and practical packaging and linking
problems.

To have a hope of making something happen for PG16, which I think
means we need a serious contender patch in the next few weeks, we
really need to make some decisions.  I enjoyed trying out
search-by-collversion, but it's still not my favourite.  On the ballot
we have two main questions:

1.  Should we commit to search-by-collversion, or one of the explicit
library version ideas, and if the latter, which?
2.  Should we try to support being specific about minor versions (in
various different ways according to the choice made for #1)?

My tentative votes are:

1.  I think we should seriously consider provider = ICU63.  I still
think search-by-collversion is a little too magical, even though it
clearly can be made to work.  Of the non-magical systems, I think
encoding the choice of library into the provider name would avoid the
need to add a second confusing "X_version" concept alongside our
existing "X_version" columns in catalogues and DDL syntax, while still
making it super clear what is going on.  This would include adding DDL
commands so you can do ALTER DATABASE/COLLATION ... PROVIDER = ICU63
to make warnings go way.

2.  I think we should ignore minor versions for now (other than
reporting them in the relevant introspection functions), but not make
any choices that would prevent us from changing our mind about that in
a later release.  For example, having two levels of specificity ICU
and ICU68  in the libver-in-provider-name design wouldn't preclude us
from adding support for ICU68_2 later

I haven't actually tried that design out in code yet, but I'm willing
to try to code that up very soon.  So no new patch from me yet.  Does
anyone else want to express a view?

> Proposed changes:
>
> * I attached a first pass of some documentation.

Thanks.  Looks pretty good, and much of it would stay if we changed to
one of the other models.

> * Should be another GUC to turn WARNING into an ERROR. Useful at least
> for testing; perhaps too dangerous for production.

OK, will add that into the next version.

> * The libraries should be loaded in a more diliberate order. The "*"
> should be expanded in a descending fashion so that later versions are
> preferred.

Yeah, I agree.

> * GUCs should be validated.

Will do.

> * Should validate that loaded library has expected version.

Will do.

> * We need to revise or remove pg_collation_actual_version() and
> pg_database_collation_actual_version().

I never liked that use of the word "actual"...

> * The GUCs are PGC_SUSET, but don't take effect because
> icu_library_list_fully_loaded is never reset.

True.  Just rought edges because I was trying to prototype
search-by-collversion fast.  Will consider this for the next version.

> * The extra collations you're adding at bootstrap time are named based
> on the library major version. I suppose it might be more "proper" to
> name them based on the collation version, but that would be more
> verbose, so I won't advocate for that. Just pointing it out.

Ah, yes, the ones with names like "en-US-x-icu68".  I agree that made
a little less sense in the search-by-collversion patch.  Maybe we
wouldn't want these at all in the search-by-collversion model.  But I
think they're perfect the way they are in the provider = ICU68 model.
The other idea I considered ages ago was that we could use namespaces:
you could "icu68.en-US", or just "en-US" in some contexts to get what
your search path sees, but that all seemed a little too cute and not
really like anything else we do with system-created catalogues, so I
gave that idea up.

> * It looks hard (or impossible) to mix multiple ICU libraries with the
> same major version and different minor versions. That's because,
> e.g., libicui18n.so.63.1 links against libicuuc.63 and libicudata.63,
> and when you install ICU 63.2, those dependencies get clobbered with
> the 63.2 versions. That fails the sanity check I proposed above about
> the library version number matching the requested library version
> number. And it also just seems wrong -- why would you have minor-
> version precision about an ICU library but then only major-version
> precision about the ICU dependencies of that library? Doesn't that
> defeat the whole purpose of this naming scheme? (Maybe another ICU
> bug?).

I don't think it's a bug exactly.  That scheme is designed to
advertise ABI stability, and not intended to support parallel
installation of minor versions.  It does seem a little silly for
libraries that are shipped together as one atomic unit not to use
fully qualified dependency names, though.

I think there would be various technical solutions, if you're prepared
to give up existing ready-made packages and build stuff yourself.
Install them into different directories with different DT_RPATH so
they can't see each other (but then our icu_library_path needs to
support a list of paths or it won't find these ones which will have to
be not in the usual system path), or clobber the DT_NEEDED (but I
guess not the DT_SONAME) to mention the minor version, and equivalent
concepts for other non-elf systems (at a glance the same problem
applies on macOS), or re-roll the libraries into a single .so.  Or
convince them to support a single library build mode (maybe there is
one already?  I couldn't find it).

That's all a bit against the grain for now, and makes me want to
abandon the notion of minor versions completely for now but leave the
option open for later exploration.

In the meantime, I think the feature is still pretty useful.  For
example, it helps you with the common case of a major OS upgrade or
streaming replication across major OS versions: just find the right
.deb/rpm/whatever for the older one, and install it, until you're
ready to upgrade and REFRESH.  The story is not quite as good for
someone with an index full of Chinese or Turkish text who gets a
surprise warning after a minor apt-get update, because the Japanese
have decided to invent a new character.  We can't offer a nice
solution to that: they have to determine that it is safe to REFRESH to
clear the warning, with or without rebuild, or downgrade/pin the ICU
package until they are ready to REFRESH.  But that is already the case
today and this patch neither helps nor hinders.  The only reason we
didn't know about this pre-existing type of problem is because
(approximately) nobody uses ICU yet, because it wasn't available as a
database default yet.

> Minor comments:
>
> * ICU_I18N is defined in make_icu_library_name() but used outside of
> it. One solution might be to have it return both library names to the
> caller and rename it as make_icu_library_names().

Good idea, will do.

> * get_icu_function() could use a clarifying comment or a better name.
> Something that communicates that you are looking for the function in
> the given library with the given major version number (which may or may
> not be needed depending on how the library was compiled).

Agreed.

> * typo in comment over make_icu_collator:
> s/u_getVersion/ucol_getVersion/

Thanks.

> * The return value of make_icu_collator() seems backwards to me,
> stylistically. I typically see the false-is-good pattern with integer
> returns.

Agreed.

> * weird bracketing style in get_icu_collator for the "else"

Yep.

> >   The version rosetta stone functions look like this:
> >
> > postgres=# select * from pg_icu_library_versions();
> >  icu_version | unicode_version | cldr_version
> > -------------+-----------------+--------------
> >  67.1        | 13.0            | 37.0
> >  63.1        | 11.0            | 34.0
> >  57.1        | 8.0             | 29.0
> > (3 rows)
> >
> > postgres=# select * from pg_icu_collation_versions('zh');
> >  icu_version | uca_version | collator_version
> > -------------+-------------+------------------
> >  67.1        | 13.0        | 153.14.37
> >  63.1        | 11.0        | 153.88.34
> >  57.1        | 8.0         | 153.64.29
> > (3 rows)
>
> I like these functions.

Yeah, they've been quite educational.  Now I'm wondering what form
these functions would take in the provider = ICU68 patch.



Re: Collation version tracking for macOS

From
Robert Haas
Date:
On Sun, Dec 4, 2022 at 10:12 PM Thomas Munro <thomas.munro@gmail.com> wrote:
> My tentative votes are:
>
> 1.  I think we should seriously consider provider = ICU63.  I still
> think search-by-collversion is a little too magical, even though it
> clearly can be made to work.  Of the non-magical systems, I think
> encoding the choice of library into the provider name would avoid the
> need to add a second confusing "X_version" concept alongside our
> existing "X_version" columns in catalogues and DDL syntax, while still
> making it super clear what is going on.  This would include adding DDL
> commands so you can do ALTER DATABASE/COLLATION ... PROVIDER = ICU63
> to make warnings go way.

+1. I wouldn't lose any sleep if we picked a different non-magical
option, but I think this is probably my favorite of the
explicit-library-version options (though it is close) and I like it
better than search-by-collversion.

(It's possible that I'm wrong to like it better, but I do.)

> 2.  I think we should ignore minor versions for now (other than
> reporting them in the relevant introspection functions), but not make
> any choices that would prevent us from changing our mind about that in
> a later release.  For example, having two levels of specificity ICU
> and ICU68  in the libver-in-provider-name design wouldn't preclude us
> from adding support for ICU68_2 later

+1.

-- 
Robert Haas
EDB: http://www.enterprisedb.com



Re: Collation version tracking for macOS

From
Jeff Davis
Date:
On Mon, 2022-12-05 at 16:12 +1300, Thomas Munro wrote:
> 1.  I think we should seriously consider provider = ICU63.  I still
> think search-by-collversion is a little too magical, even though it
> clearly can be made to work.  Of the non-magical systems, I think
> encoding the choice of library into the provider name would avoid the
> need to add a second confusing "X_version" concept alongside our
> existing "X_version" columns in catalogues and DDL syntax, while
> still
> making it super clear what is going on.

As I understand it, this is #2 in your previous list?

Can we put the naming of the provider into the hands of the user, e.g.:

  CREATE COLLATION PROVIDER icu63 TYPE icu
    AS '/path/to/libicui18n.so.63', '/path/to/libicuuc.so.63';

In this model, icu would be a "provider kind" and icu63 would be the
specific provider, which is named by the user.

That seems like the least magical approach, to me. We need an ICU
library; the administrator gives us one that looks like ICU; and we're
happy.

It avoids a lot of the annoyances we're discussing, and puts the power
in the hands of the admin. If they want to allow minor version updates,
they specify the library with .so.63, and let the symlinking handle it.

Of course, we can still do some sanity checks (WARNINGs or ERRORs) when
we think something is going wrong; like the version of ICU is too new,
or the reported version (ucol_getVersion()) doesn't match what's in
collversion. But we basically get out of the business of understanding
ICU versioning and leave that up to the administrator.

It's easier to document, and would require fewer GUCs (if any). And it
avoids mixing version information from another project into our data
model.


--
Jeff Davis
PostgreSQL Contributor Team - AWS





Re: Collation version tracking for macOS

From
Joe Conway
Date:
On 12/5/22 12:41, Jeff Davis wrote:
> On Mon, 2022-12-05 at 16:12 +1300, Thomas Munro wrote:
>> 1.  I think we should seriously consider provider = ICU63.  I still
>> think search-by-collversion is a little too magical, even though it
>> clearly can be made to work.  Of the non-magical systems, I think
>> encoding the choice of library into the provider name would avoid the
>> need to add a second confusing "X_version" concept alongside our
>> existing "X_version" columns in catalogues and DDL syntax, while
>> still
>> making it super clear what is going on.
> 
> As I understand it, this is #2 in your previous list?
> 
> Can we put the naming of the provider into the hands of the user, e.g.:
> 
>    CREATE COLLATION PROVIDER icu63 TYPE icu
>      AS '/path/to/libicui18n.so.63', '/path/to/libicuuc.so.63';
> 
> In this model, icu would be a "provider kind" and icu63 would be the
> specific provider, which is named by the user.
> 
> That seems like the least magical approach, to me. We need an ICU
> library; the administrator gives us one that looks like ICU; and we're
> happy.

+1

I like this. The provider kind defines which path we take in our code, 
and the specific library unambiguously defines a specific collation 
behavior (I think, ignoring bugs?)

-- 
Joe Conway
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




Re: Collation version tracking for macOS

From
Thomas Munro
Date:
On Tue, Dec 6, 2022 at 6:45 AM Joe Conway <mail@joeconway.com> wrote:
> On 12/5/22 12:41, Jeff Davis wrote:
> > On Mon, 2022-12-05 at 16:12 +1300, Thomas Munro wrote:
> >> 1.  I think we should seriously consider provider = ICU63.  I still
> >> think search-by-collversion is a little too magical, even though it
> >> clearly can be made to work.  Of the non-magical systems, I think
> >> encoding the choice of library into the provider name would avoid the
> >> need to add a second confusing "X_version" concept alongside our
> >> existing "X_version" columns in catalogues and DDL syntax, while
> >> still
> >> making it super clear what is going on.
> >
> > As I understand it, this is #2 in your previous list?
> >
> > Can we put the naming of the provider into the hands of the user, e.g.:
> >
> >    CREATE COLLATION PROVIDER icu63 TYPE icu
> >      AS '/path/to/libicui18n.so.63', '/path/to/libicuuc.so.63';
> >
> > In this model, icu would be a "provider kind" and icu63 would be the
> > specific provider, which is named by the user.
> >
> > That seems like the least magical approach, to me. We need an ICU
> > library; the administrator gives us one that looks like ICU; and we're
> > happy.
>
> +1
>
> I like this. The provider kind defines which path we take in our code,
> and the specific library unambiguously defines a specific collation
> behavior (I think, ignoring bugs?)

OK, I'm going to see what happens if I try to wrangle that stuff into
a new catalogue table.



Re: Collation version tracking for macOS

From
Jeff Davis
Date:
On Tue, 2022-12-06 at 10:33 +1300, Thomas Munro wrote:
> OK, I'm going to see what happens if I try to wrangle that stuff into
> a new catalogue table.

I've been hacking on a major refactor of the locale-related code. I
attached my progress and I think several patches are ready.

The main motivation is that I was frustrated by the special cases
everywhere. I wanted to make it easier to hack on this code going
forward, now that we are adding even more complexity for multiple ICU
libraries.

I'm posting to this thread (rather than my previous refactoring
thread[1]) because a lot of the people interested in working on this
code are here. So, if you like (or don't like) the structure of these
changes, please let me know.

Changes:
  * Introduce pg_locale_internal.h to hide all USE_ICU code,
    including all callers of the ICU routines. The files that
    still need to include pg_locale_internal.h are:
    - pg_locale.c
    - regc_pg_locale.c
    - formatting.c
    - like.c
    - like_support.c
    - collationcmds.c
  * Other callers (in files that don't include 
    pg_locale_internal.h) don't need to branch based on the
    provider, platform, database encoding, USE_ICU,
    HAVE_LOCALE_T, etc.
  * ICU and libc are treated the same way in more places.
  * I made it so pg_locale_t is constructed first, then
    moved to TopMemoryContext, so that it won't leak in
    TopMemoryContext if errors are encountered.
  * Introduce pg_strcoll, pg_strncoll, pg_strxfrm, and pg_strnxfrm
    so that varlena/hash/verchar code doesn't worry about the
    details.
  * Add method structure pg_icu_library, borrowed from Thomas's
    patch, that provides one convenient place to provide
    multiple-ICU-library support.
  * Add a hook that allows you to fill in the pg_icu_library
    structure however you want while a pg_locale_t is being
    constructed. This allows do-it-yourself ICU library
    lockdown.

On the negative side, it increases the line count. Part of that is
because adding indirection for the ICU library is just more lines of
code, but a lot of it is just that I used a lot of smaller functions.
Perhaps my style is a bit verbose?

Even though we're close to consensus on how we should offer control
over the ICU libraries, having the hook may be useful for
experimentation, testing, or as a last resort. Right now the hook has
limited information to use to find the right library -- just the ICU
collation name and the version, because that's what we have in the
catalog. But I assume the patch Thomas is working on will change that.

Performance:

I did brief performance sanity tests on several paths and the results
are unremarkable (which is generally good for a refactor). On the path
I was watching most closely, ICU/UTF8/en-US-x-icu, it came in about 2%
faster, which was a pleasant surprise. This was true both when I
disabled abbreviated keys (to stress localized comparison paths) and
also with abbreviated keys enabled. My previous refactoring work[1]
ended up a percent or two slower. My guess right now is that I moved
some code around after I noticed that ICU accepts NUL-terminated
strings (by specifying the lenght as -1), and that helped. But I'll
need to profile and look more closely to be more certain of my results,
these are preliminary.

There are a few things that could be done differently:

  * I am still a bit confused about why someone would want a collation
with a different lc_collate and lc_ctype in libc; and assuming there is
a reason, why it can't be done with ICU. The way I did the refactoring
tries to accommodate them as different concepts, but I can rip that
out.

  * In theory, we could also support multilib libc, and an associated
get_libc_library() and hook, but there are a couple big challenges.
Firstly, if it's the default locale, it relies on setlocale(), so we'd
have to figure out what to do about that. Second, having an a second
version of glibc on your system is not as normal or trivial as having a
second version of ICU.

  * I made the hook simple, but all it can do is replace the ICU
library. It's possible that it would want to construct it's own entire
pg_locale_t for some reason, and keep more complex state in a private
pointer, or something like that. It seemed better to keep it simple,
but maybe someone would want more flexibility there?

  * I used the library indirection for pretty much all ICU calls,
including the ucnv_ and the uloc_ functions. I did this mainly because,
if we are so paranoid about ICU changing in subtle ways, we might as
well make it possible to lock down everything. I can rip this out, too,
but it didn't add many lines.

Loose ends:

  * I need to do something with get_collation_actual_version. I had an
earlier iteration that went through pg_newlocale() and then queried the
resulting pg_locale_t structure, but that changed the error paths in a
way that failed a couple tests, so I left that out.

  * Error paths could be improved further to make sure that libc
locale_t and UCollator structures are freed in error paths during
construction. I was thinking about using resowner for this, and then if
the pg_locale_t structure gets moved to TopMemoryContext, just doing a
ResourceOwnerForget. Alternatively, I could just be careful about the
error paths.

  * We'd need to adapt this and make sure it works with whatever scheme
we decide is best for finding the right library. I suspect this would
just be adding another parameter to get_icu_library (and the hook) to
represent the new collation provider Oid (and get_icu_library could use
that to look up the library names and load them).

Comments welcome.

[1]
https://www.postgresql.org/message-id/99aa79cceefd1fe84fda23510494b8fbb7ad1e70.camel@j-davis.com


--
Jeff Davis
PostgreSQL Contributor Team - AWS



Attachment

Re: Collation version tracking for macOS

From
Peter Eisentraut
Date:
On 05.12.22 22:33, Thomas Munro wrote:
> On Tue, Dec 6, 2022 at 6:45 AM Joe Conway <mail@joeconway.com> wrote:
>> On 12/5/22 12:41, Jeff Davis wrote:
>>> On Mon, 2022-12-05 at 16:12 +1300, Thomas Munro wrote:
>>>> 1.  I think we should seriously consider provider = ICU63.  I still
>>>> think search-by-collversion is a little too magical, even though it
>>>> clearly can be made to work.  Of the non-magical systems, I think
>>>> encoding the choice of library into the provider name would avoid the
>>>> need to add a second confusing "X_version" concept alongside our
>>>> existing "X_version" columns in catalogues and DDL syntax, while
>>>> still
>>>> making it super clear what is going on.
>>>
>>> As I understand it, this is #2 in your previous list?
>>>
>>> Can we put the naming of the provider into the hands of the user, e.g.:
>>>
>>>     CREATE COLLATION PROVIDER icu63 TYPE icu
>>>       AS '/path/to/libicui18n.so.63', '/path/to/libicuuc.so.63';
>>>
>>> In this model, icu would be a "provider kind" and icu63 would be the
>>> specific provider, which is named by the user.
>>>
>>> That seems like the least magical approach, to me. We need an ICU
>>> library; the administrator gives us one that looks like ICU; and we're
>>> happy.
>>
>> +1
>>
>> I like this. The provider kind defines which path we take in our code,
>> and the specific library unambiguously defines a specific collation
>> behavior (I think, ignoring bugs?)
> 
> OK, I'm going to see what happens if I try to wrangle that stuff into
> a new catalogue table.

I'm reviewing the commit fest entry 
https://commitfest.postgresql.org/41/3956/, which points to this thread. 
  It appears that the above patch did not come about in time.  The patch 
of record is now Jeff's refactoring patch, which is also tracked in 
another commit fest entry (https://commitfest.postgresql.org/41/4058/). 
So as a matter of procedure, we should probably close this commit fest 
entry for now.  (Maybe we should also use a different thread subject in 
the future.)

I have a few quick comments on the above syntax example:

There is currently a bunch of locale-using code that selects different 
code paths by "collation provider", i.e., a libc-based code path and an 
ICU-based code path (and sometimes also a default provider path).  The 
above proposal would shift the terminology and would probably require 
some churn at those sites, in that they would now have to select by 
"collation provider type".  We could probably avoid that by shifting the 
terms a bit, so instead of the suggested

provider type -> provider

we could use

provider -> version of that provider

(or some other actual term), which would leave the meaning of "provider" 
unchanged as far as locale-using code is concerned.  At least that's my 
expectation, since no code for this has been seen yet.  We should keep 
this in mind in any case.

Also, the above example exposes a lot of operating system level details. 
  This creates issues with dump/restore, which some of the earlier 
patches avoided by using a path-based approach, and it would also 
require some thoughts about permissions.  We probably want 
non-superusers to be able to interact with this system somehow, for 
upgrading (for some meaning of that action) indexes etc. without 
superuser access.  The more stuff from the OS we expose, the more stuff 
we have to be able to lock down again in a usable manner.

(The search-by-collversion approach can probably avoid those issues better.)



Re: Collation version tracking for macOS

From
Jeff Davis
Date:
Attached are a new set of patches, including a major enhancement: the
icu_multilib contrib module.

The pure refactoring patches have gone into this thread:

This patch series is dependent on those patches, and contains:

   0001: Support multiple ICU libraries
   0002: Add test

Attachment

Re: Collation version tracking for macOS

From
Jeff Davis
Date:
On Thu, 2023-01-19 at 00:11 -0800, Jeff Davis wrote:
> Attached are a new set of patches, including a major enhancement: the
> icu_multilib contrib module.

Attached rebased v8.

[ It looks like my email client truncated the last email somehow, in
case someone was wondering why it just stopped. ]

The big change is the introduction of the icu_multilib contrib module
which provides a lot of the functionality requested in this thread:

   * icu version stability, which allows you to "lock down" ICU to a
specific major and minor version (or major version only)
   * multi-lib ICU, which (if a GUC is set) will enable the "search by
collversion" behavior. Some doubts were raised about the wisdom of this
approach, but it's the only multi-lib solution we have without doing
some significant catalog work.

I rendered the HTML docs for icu_multilib and attached to this email to
make it easier to view.

icu_multilib assumes that the various ICU library versions are already
available in a single location, most likely installed with a package
manager. That location can be the same as the built-in ICU, or a
different location. Ideally, packagers would start to offer a few
"stable" versions of ICU that would be available for a long time, but
it will take a while for that to happen. So for now, it's up to the
user to figure out how to get the right versions of ICU on their system
and keep them there.

Automated tests of icu_multilib are a problem unless the one running
the tests is willing to compile the right versions of ICU (like I did).
But I at least have automated tests for the hooks by using the test
module test_collator_lib_hooks.

The v7 patches in this thread are dependent on the pure refactoring
patches in this CF entry:

   https://commitfest.postgresql.org/41/3935/

https://postgr.es/m/052a5ed874d110be2f3ae28752e363306b10966d.camel@j-davis.com

The requested functionality _not_ offered by icu_multilib is tying a
specific collation to a specific ICU version. A few variants were
proposed, the latest is to tie a collation to the library file itself
through the provider. That needs to be done with proper catalog support
in core. But I believe the work I've done here has made a lot of
progress in that direction, and also shows the versatility of the new
hook to solve at least some problems.



--
Jeff Davis
PostgreSQL Contributor Team - AWS



Attachment

Re: Collation version tracking for macOS

From
vignesh C
Date:
On Sat, 21 Jan 2023 at 02:24, Jeff Davis <pgsql@j-davis.com> wrote:
>
> On Thu, 2023-01-19 at 00:11 -0800, Jeff Davis wrote:
> > Attached are a new set of patches, including a major enhancement: the
> > icu_multilib contrib module.
>
> Attached rebased v8.
>
> [ It looks like my email client truncated the last email somehow, in
> case someone was wondering why it just stopped. ]
>
> The big change is the introduction of the icu_multilib contrib module
> which provides a lot of the functionality requested in this thread:
>
>    * icu version stability, which allows you to "lock down" ICU to a
> specific major and minor version (or major version only)
>    * multi-lib ICU, which (if a GUC is set) will enable the "search by
> collversion" behavior. Some doubts were raised about the wisdom of this
> approach, but it's the only multi-lib solution we have without doing
> some significant catalog work.
>
> I rendered the HTML docs for icu_multilib and attached to this email to
> make it easier to view.
>
> icu_multilib assumes that the various ICU library versions are already
> available in a single location, most likely installed with a package
> manager. That location can be the same as the built-in ICU, or a
> different location. Ideally, packagers would start to offer a few
> "stable" versions of ICU that would be available for a long time, but
> it will take a while for that to happen. So for now, it's up to the
> user to figure out how to get the right versions of ICU on their system
> and keep them there.
>
> Automated tests of icu_multilib are a problem unless the one running
> the tests is willing to compile the right versions of ICU (like I did).
> But I at least have automated tests for the hooks by using the test
> module test_collator_lib_hooks.
>
> The v7 patches in this thread are dependent on the pure refactoring
> patches in this CF entry:
>
>    https://commitfest.postgresql.org/41/3935/
>
> https://postgr.es/m/052a5ed874d110be2f3ae28752e363306b10966d.camel@j-davis.com
>
> The requested functionality _not_ offered by icu_multilib is tying a
> specific collation to a specific ICU version. A few variants were
> proposed, the latest is to tie a collation to the library file itself
> through the provider. That needs to be done with proper catalog support
> in core. But I believe the work I've done here has made a lot of
> progress in that direction, and also shows the versatility of the new
> hook to solve at least some problems.

This thread has been idle for a year now, It has stalled after a lot
of discussion.
@Jeff Davis: Do you want to try to restart the discussion by posting
an updated version and see what happens?

Regards,
Vignesh



Re: Collation version tracking for macOS

From
Jeff Davis
Date:
On Sat, 2024-01-20 at 07:40 +0530, vignesh C wrote:
> This thread has been idle for a year now, It has stalled after a lot
> of discussion.
> @Jeff Davis: Do you want to try to restart the discussion by posting
> an updated version and see what happens?

Thank you for following up. Yes, I'd like to find a path forward here,
but I need some validation from others on my approach.

I rendered the docs I wrote as an HTML page and attached it to this
thread, to make it easier for others to read and comment. It's
basically a tool for experts who are willing to devote effort to
managing their collations and ICU libraries. Is that what we want?

At an implementation level, did I get the extension APIs right? I
considered making the API simpler, but that would require the extension
to do quite a bit more work (including a lot of redundant work) to use
ICU properly.

Regards,
    Jeff Davis




Re: Collation version tracking for macOS

From
vignesh C
Date:
On Mon, 22 Jan 2024 at 00:28, Jeff Davis <pgsql@j-davis.com> wrote:
>
> On Sat, 2024-01-20 at 07:40 +0530, vignesh C wrote:
> > This thread has been idle for a year now, It has stalled after a lot
> > of discussion.
> > @Jeff Davis: Do you want to try to restart the discussion by posting
> > an updated version and see what happens?
>
> Thank you for following up. Yes, I'd like to find a path forward here,
> but I need some validation from others on my approach.

Let's start by posting a rebased version to fix the CFBot patch apply
issue as in [1]:

=== Applying patches on top of PostgreSQL commit ID
402388946fb3ac54f0fd5944d7e177ef7737eab2 ===
=== applying patch
./v8-0001-Support-multiple-ICU-collation-provider-libraries.patch
patching file src/backend/commands/collationcmds.c
Hunk #1 FAILED at 566.
....
1 out of 4 hunks FAILED -- saving rejects to file
src/backend/commands/collationcmds.c.rej
patching file src/backend/utils/adt/formatting.c
Hunk #1 succeeded at 1575 (offset 9 lines).
Hunk #2 succeeded at 1587 (offset 9 lines).
Hunk #3 succeeded at 1605 (offset 9 lines).
Hunk #4 succeeded at 1700 (offset 3 lines).
Hunk #5 succeeded at 1819 (offset -1 lines).
Hunk #6 succeeded at 1939 (offset -5 lines).
patching file src/backend/utils/adt/pg_locale.c
Hunk #1 FAILED at 70.
...
Hunk #31 FAILED at 2886.
Hunk #32 FAILED at 2902.
22 out of 32 hunks FAILED -- saving rejects to file
src/backend/utils/adt/pg_locale.c.rej

[1] - http://cfbot.cputube.org/patch_46_3956.log

Regards,
Vignesh



Re: Collation version tracking for macOS

From
Robert Haas
Date:
On Sun, Jan 21, 2024 at 1:58 PM Jeff Davis <pgsql@j-davis.com> wrote:
> I rendered the docs I wrote as an HTML page and attached it to this
> thread, to make it easier for others to read and comment. It's
> basically a tool for experts who are willing to devote effort to
> managing their collations and ICU libraries. Is that what we want?
>
> At an implementation level, did I get the extension APIs right? I
> considered making the API simpler, but that would require the extension
> to do quite a bit more work (including a lot of redundant work) to use
> ICU properly.

Not that I'm the most qualified person to have an opinion on this
topic, but did you intend to attach this stuff to this email, or is it
somewhere else?

--
Robert Haas
EDB: http://www.enterprisedb.com



Re: Collation version tracking for macOS

From
Jeff Davis
Date:
On Thu, 2024-02-01 at 15:58 -0500, Robert Haas wrote:
> Not that I'm the most qualified person to have an opinion on this
> topic, but did you intend to attach this stuff to this email, or is
> it
> somewhere else?

The previous patch is here:

https://www.postgresql.org/message-id/6f4a8c01a5cb1edf3a07d204c371fbddaef252f9.camel%40j-davis.com

And I attached the rendered HTML doc page, which conveniently renders
in the archives (thanks to web team -- I didn't know if that would
actually work until I tried it):

https://www.postgresql.org/message-id/attachment/142818/icu-multilib.html

For anyone interested in this work, the docs are the best place to
start.

I'm hesitant to put much more work into it (e.g. new patches, etc.)
without more feedback. Your opinion would certainly be valuable -- for
instance, when reading the docs, can you imagine yourself actually
using this if you ran into a collation versioning/migration problem?

Regards,
    Jeff Davis




Re: Collation version tracking for macOS

From
Robert Haas
Date:
On Sun, Feb 4, 2024 at 10:42 PM Jeff Davis <pgsql@j-davis.com> wrote:
> I'm hesitant to put much more work into it (e.g. new patches, etc.)
> without more feedback. Your opinion would certainly be valuable -- for
> instance, when reading the docs, can you imagine yourself actually
> using this if you ran into a collation versioning/migration problem?

I'm having some difficulty understanding what the docs are trying to
tell me. I think there are some issues with ordering and pacing.

"The icu_multilib module provides control over the version (or
versions) of the ICU provider library used by PostgreSQL, which can be
different from the version of ICU with which it was built. Collations
are a product of natural language, and natural language evolves over
time; but PostgreSQL depends on stable ordering for structures such as
indexes. Newer versions of ICU update the provided collators to adapt
to changes in natural language, so it's important to control when and
how those new versions of ICU are used to prevent problems such as
index corruption."

Check. So far, so good.

"This module assumes that the necessary versions of ICU are already
available, such as through the operating system's package manager; and
already properly installed in a single location accessible to
PostgreSQL. The configration variable icu_multilib.library_path should
be set to the location where these ICU library versions are
installed."

Here I feel we've skipped a few steps. I suggest postponing all
discussion of specific GUCs to a later point -- specifically the
configuration parameters section, which I think should actually be
F.19.1, with the use cases following that rather than preceding it. In
this introductory section, I suggest elaborating a bit more on what
problem we're trying to solve at a conceptual level. It feels like
we've gone straight from the very general issue (collation definitions
need to be stable but language isn't) to very specific (here's a GUC
that you can set to a pathname). I feel like the need for this module
should be more specifically motivated. Maybe something like:

1. Here's what we think your OS package manager is probably going to do.
2. That's going to interact with PostgreSQL in this way that I will
now describe.
3. See, that sucks, because of the stuff I said above about needing
stable collations!
4. But if you installed this module instead, then you could prevent
the things I said under #2 from happening.
5. Instead, you'd get this other behavior, which would make you happy.

I feel like I can almost piece together in my head how this is
supposed to work -- I think it's like "we expect the OS package
manager to drop all the ICU versions in the same directory via side by
side installs, and that works well for other programs because ... for
some mysterious reason they can latch onto the specific version they
were linked against ... but we can't or don't do that because ... I
guess we're dumber than those other pieces of software or
something???? ... so this module lets you ask for more sensible
behavior." But I think that could be spelled out a bit more clearly
and directly than this document seems to me to do.

I also wonder if we should be explaining why we don't get this right
out of the box. Like, if the normal behavior categorically sucks, why
do you have to install icu_multilib to get something else? Why not
make the multilib treatment the default? And if the normal behavior is
better for some cases and the icu_multilib behavior is better for
other cases, then maybe we ought to explain which one to use in which
scenario.

"icu_multilib must be loaded via shared_preload_libraries.
icu_multilib ignores any ICU library with a major version greater than
that with which PostgreSQL was built."

It's not clear from reading this whether the second sentence here is a
regrettable implementation restriction or design behavior. If it's
design behavior, what's the point of it?

--
Robert Haas
EDB: http://www.enterprisedb.com



Re: Collation version tracking for macOS

From
Jeff Davis
Date:
On Sun, 2024-02-11 at 22:04 +0530, Robert Haas wrote:
> 1. Here's what we think your OS package manager is probably going to
> do.
> 2. That's going to interact with PostgreSQL in this way that I will
> now describe.
> 3. See, that sucks, because of the stuff I said above about needing
> stable collations!
> 4. But if you installed this module instead, then you could prevent
> the things I said under #2 from happening.
> 5. Instead, you'd get this other behavior, which would make you
> happy.

I like that framing, thank you. I'll try to come up with something
there.

> I feel like I can almost piece together in my head how this is
> supposed to work -- I think it's like "we expect the OS package
> manager to drop all the ICU versions in the same directory via side
> by
> side installs, and that works well for other programs because ... for
> some mysterious reason they can latch onto the specific version they
> were linked against ... but we can't or don't do that because ... I
> guess we're dumber than those other pieces of software or
> something???? ... 

Postgres can and does latch on to the version of ICU it was compiled
against. It's a normal shared library dependency.

The problem is that databases -- and the file structures -- outlive a
particular version of Postgres. So if Postgres 16 is compiled against
ICU X and Postgres 17 is compiled against ICU Y, how do you upgrade
from 16 to 17? Postgres 17 will try to access the old file structures
using ICU Y, and they'll be corrupt.

What we want is the file structures that depend on ICU X to continue to
find ICU X even after you upgrade to Postgres 17, yet allow new
structures to be created using ICU Y. In other words, "multi-lib",
meaning that the same Postgres binary is linking to multiple versions
of ICU and the different versions for different structures. That would
allow users to recreate one index at a time to use ICU Y, until nothing
depends on ICU X any longer.

I should say this is not an easy process even if something like
icu_multilib is available. We don't have all of the information needed
in the catalog to track which structures depend on which versions of a
collation library, collation library version is itself not easy to
define, and it still involves rebuilding (or at least re-validating) a
lot of structures. This is a "make hard things possible" tool, and I
suspect only a handful of users would use it successfully to migrate to
new ICU versions.

More simply, some users might just want to lock down the version of ICU
to X, and just use that forever until they have a reason to change it.
icu_multilib can also facilitate that, though it's still not trivial.

> "icu_multilib must be loaded via shared_preload_libraries.
> icu_multilib ignores any ICU library with a major version greater
> than
> that with which PostgreSQL was built."
>
> It's not clear from reading this whether the second sentence here is
> a
> regrettable implementation restriction or design behavior. If it's
> design behavior, what's the point of it?

That restriction came from Thomas's (uncommitted) work on the same
problem. I believe the reasoning was that we don't know whether future
versions of ICU might break something that we're doing, though perhaps
there's a better way.

Regards,
    Jeff Davis





Re: Collation version tracking for macOS

From
Robert Haas
Date:
On Tue, Feb 13, 2024 at 1:55 AM Jeff Davis <pgsql@j-davis.com> wrote:
> Postgres can and does latch on to the version of ICU it was compiled
> against. It's a normal shared library dependency.
>
> The problem is that databases -- and the file structures -- outlive a
> particular version of Postgres. So if Postgres 16 is compiled against
> ICU X and Postgres 17 is compiled against ICU Y, how do you upgrade
> from 16 to 17? Postgres 17 will try to access the old file structures
> using ICU Y, and they'll be corrupt.
>
> What we want is the file structures that depend on ICU X to continue to
> find ICU X even after you upgrade to Postgres 17, yet allow new
> structures to be created using ICU Y. In other words, "multi-lib",
> meaning that the same Postgres binary is linking to multiple versions
> of ICU and the different versions for different structures. That would
> allow users to recreate one index at a time to use ICU Y, until nothing
> depends on ICU X any longer.

Ah, I see. At least, I think I do. I think some of this material could
be very usefully included into the first section of the doc you're
trying to write. What you say here makes it a lot easier to grasp the
motivation and use case for this code, at least for me.

--
Robert Haas
EDB: http://www.enterprisedb.com



Re: Collation version tracking for macOS

From
Thomas Munro
Date:
On Tue, Feb 13, 2024 at 9:25 AM Jeff Davis <pgsql@j-davis.com> wrote:
> On Sun, 2024-02-11 at 22:04 +0530, Robert Haas wrote:
> > "icu_multilib must be loaded via shared_preload_libraries.
> > icu_multilib ignores any ICU library with a major version greater
> > than
> > that with which PostgreSQL was built."
> >
> > It's not clear from reading this whether the second sentence here is
> > a
> > regrettable implementation restriction or design behavior. If it's
> > design behavior, what's the point of it?
>
> That restriction came from Thomas's (uncommitted) work on the same
> problem. I believe the reasoning was that we don't know whether future
> versions of ICU might break something that we're doing, though perhaps
> there's a better way.

Right, to spell that out more fully:  We compile and link against one
particular ICU library that is present at compile time, and there is a
place in that multi-lib patch that assigns the function pointers from
that version to variables of the function pointer type that we expect.
Compilation would fail if ICU ever changed relevant function
prototypes in a future release, and then we'd have to come up with
some trampoline/wrapper scheme to wallpaper over differences.  That's
why I think it's safe to use dlsym() to access function pointers for
versions up to and including the one whose headers we were compiled
against, but not later ones which we haven't tested in that way.

Sadly I won't be able to work on multi-lib ICU support again in this
cycle.  I think we managed to prove that dlopen works for this, and
learn some really interesting stuff about Unicode and ICU evolution,
but we still have to come up with the right model, catalogues and DDL
etc, for a nice user experience.  What I was most recently
experimenting with based on earlier discussions was the idea of
declaring separate providers: icu72 and icu68 could both exist and you
could create extra indexes and then drop the originals as a
no-downtime upgrade path.  I have a pet theory that you could usefully
support multi-version libc locales too if you're prepared to make
certain assumptions (short version: take the collation definition
files from any older version of your OS, compile with newer version's
localedef, give it a name like "en_US@ubuntu18", and assume/pray they
didn't change stuff that wasn't expressed in the definition file), so
I was working on a generalisation slightly wider than just
multi-version ICU.