Re: Collation version tracking for macOS - Mailing list pgsql-hackers
From | Peter Geoghegan |
---|---|
Subject | Re: Collation version tracking for macOS |
Date | |
Msg-id | CAH2-Wzmp7v_MYxnGe=kASV0ieL6jKoYfs7vS2NObL6tA7m9ptw@mail.gmail.com Whole thread Raw |
In response to | Re: Collation version tracking for macOS (Thomas Munro <thomas.munro@gmail.com>) |
Responses |
Re: Collation version tracking for macOS
|
List | pgsql-hackers |
On Fri, Jun 10, 2022 at 6:48 PM Thomas Munro <thomas.munro@gmail.com> wrote: > Executive summary of experiments so far: the "distinct collations" > concept is quite simple and robust, but exposes all the versions to > users and probably makes it really hard to upgrade (details not worked > out), while the "time travelling collations" concept is nice for users > but hard to pin down and prove correctness for since it seems to > require dynamic scoping/global state changes affecting code in far > away places. It didn't really occur to me until now that the pg_dump problems that come with the approach you outlined ("distinct collations") are likely to be total blockers, and not just disadvantages. It's not just ICU that prefers approximately correct behavior over throwing an "unknown collation" error -- the same could be said for pg_dump itself. After all, pg_dump doesn't care about collation versions -- except when run in binary mode, for pg_upgrade, where it must satisfy the requirements of pg_upgrade. Even today we could be restoring to a server with an older ICU version, where in general we might not get exactly the behavior the user expects (though usually only when they've decided to use advanced features like custom tailorings). So pg_dump already deliberately disregards the ICU version, for essentially the same reasons that I argued were good reasons upthread (when talking about a multi-ICU-version Postgres via "time travelling collations"). Some more thoughts on "time travelling collations": Doing a version switch in one atomic operation (like a special REINDEX) isn't going to be practical. We need to be prepared for cases where a database has a mix of indexes with old and new physical collations. We certainly cannot allow queries to give wrong answers...but I tend to doubt that (say) making merge joins work with two indexes with different physical collations (though matching logical collations) actually makes much sense. Maybe we can cut scope in a pragmatic way instead. The special REINDEX (or whatever) won't work as an atomic operation...but that doesn't mean that the system as a whole will have a mix of old and new physical collations forever, or even for very long. So while everything still has to work correctly, mediocre performance with certain kinds of plan shapes might be okay. As you kind of said yourself yesterday, "time travelling collations" could naturally have an invariant that worked at the index/constraint level (not the collation level): any given index needs to use only one underlying ICU version at the same time, no matter what. The invariant could perhaps be leveraged in the planner -- perhaps we start out with a working assumption that *every* index is on the newer ICU version (when at least one index is on the lastest and "now current" version), and work backwards by excluding indexes that still have old physical collations. Defining the problem as a problem with old indexes/constraints only seems like it might make things a lot easier. -- Peter Geoghegan
pgsql-hackers by date: