Re: Collation version tracking for macOS - Mailing list pgsql-hackers

From Peter Geoghegan
Subject Re: Collation version tracking for macOS
Date
Msg-id CAH2-Wzmp7v_MYxnGe=kASV0ieL6jKoYfs7vS2NObL6tA7m9ptw@mail.gmail.com
Whole thread Raw
In response to Re: Collation version tracking for macOS  (Thomas Munro <thomas.munro@gmail.com>)
Responses Re: Collation version tracking for macOS
List pgsql-hackers
On Fri, Jun 10, 2022 at 6:48 PM Thomas Munro <thomas.munro@gmail.com> wrote:
> Executive summary of experiments so far: the "distinct collations"
> concept is quite simple and robust, but exposes all the versions to
> users and probably makes it really hard to upgrade (details not worked
> out), while the "time travelling collations" concept is nice for users
> but hard to pin down and prove correctness for since it seems to
> require dynamic scoping/global state changes affecting code in far
> away places.

It didn't really occur to me until now that the pg_dump problems that
come with the approach you outlined ("distinct collations") are likely
to be total blockers, and not just disadvantages. It's not just ICU
that prefers approximately correct behavior over throwing an "unknown
collation" error -- the same could be said for pg_dump itself. After
all, pg_dump doesn't care about collation versions -- except when run
in binary mode, for pg_upgrade, where it must satisfy the requirements
of pg_upgrade.

Even today we could be restoring to a server with an older ICU
version, where in general we might not get exactly the behavior the
user expects (though usually only when they've decided to use advanced
features like custom tailorings). So pg_dump already deliberately
disregards the ICU version, for essentially the same reasons that I
argued were good reasons upthread (when talking about a
multi-ICU-version Postgres via "time travelling collations").

Some more thoughts on "time travelling collations":

Doing a version switch in one atomic operation (like a special
REINDEX) isn't going to be practical. We need to be prepared for cases
where a database has a mix of indexes with old and new physical
collations. We certainly cannot allow queries to give wrong
answers...but I tend to doubt that (say) making merge joins work with
two indexes with different physical collations (though matching
logical collations) actually makes much sense. Maybe we can cut scope
in a pragmatic way instead.

The special REINDEX (or whatever) won't work as an atomic
operation...but that doesn't mean that the system as a whole will have
a mix of old and new physical collations forever, or even for very
long. So while everything still has to work correctly, mediocre
performance with certain kinds of plan shapes might be okay.

As you kind of said yourself yesterday, "time travelling collations"
could naturally have an invariant that worked at the index/constraint
level (not the collation level): any given index needs to use only one
underlying ICU version at the same time, no matter what. The invariant
could perhaps be leveraged in the planner -- perhaps we start out with
a working assumption that *every* index is on the newer ICU version
(when at least one index is on the lastest and "now current" version),
and work backwards by excluding indexes that still have old physical
collations. Defining the problem as a problem with old
indexes/constraints only seems like it might make things a lot easier.

-- 
Peter Geoghegan



pgsql-hackers by date:

Previous
From: Thomas Munro
Date:
Subject: Re: Collation version tracking for macOS
Next
From: Peter Geoghegan
Date:
Subject: Re: Collation version tracking for macOS