proposal: change behavior on collation version mismatch - Mailing list pgsql-hackers

From Jeremy Schneider
Subject proposal: change behavior on collation version mismatch
Date
Msg-id 2a04654b-7f46-49ed-bd40-251561c5c2fc@amazon.com
Whole thread Raw
Responses Re: proposal: change behavior on collation version mismatch
Re: proposal: change behavior on collation version mismatch
Re: proposal: change behavior on collation version mismatch
List pgsql-hackers
I had some interesting conversations with a couple PostgreSQL community
members at PASS Data Summit the week before last about the collation
problem, and then - just in this last week - I saw two more people on
public channels hitting corruption problems. One person on the public
PostgreSQL Slack, we eventually figured out they had upgraded from
Ubuntu 18.04 to 22.04 which hits glibc 2.28; a second person here on the
pgsql-general list reported by Daniel Westermann and I assume
representing a client of dbi services [1]. Everyone who's been tracking
this over the past few years has seen the steady stream of quiet
complaints in public from people at almost every major PG company,
largely around the glibc 2.28 debacle.

I've been tracking the discussions around collation here on the lists
and I've had a number of conversations with folks working deeply in this
area inside and outside of AWS, and I was part of the effort to address
it at AWS since we first became aware of it many years ago.

It seems to me the general perspective on the mailing lists is that:

1) "collation changes are uncommon" (which is relatively correct)
2) "most users would rather have ease-of-use than 100% safety, since
it's uncommon"

And I think this led to the current behavior of issuing a warning rather
than an error, and providing a SQL command "alter ... refresh collation"
which simply instructs the database to permanently forget the warning
without changing anything. I agree that some users might prefer this
behavior, but I think businesses like banks or healthcare companies
would be appalled, and would prefer to do the extra work and have
certainty of avoiding small but known probabilities of silent data
corruption.

As I said on the pgsql-general thread: glibc 2.28 has certainly been the
most obvious and impactful case, so the focus is understandable, but
there's a bit of a myth in the general public that the problem is only
with glibc 2.28 (and not ICU or other glibc versions or data structures
other than indexes). Unfortunately, contrary to current popular belief,
the only truly safe way to update an operating system under PosgreSQL is
with logical dump/load or logical replication, or continuing to compile
and use the exact older version of ICU from the old OS (if you use ICU).
I think the ICU folks are generally careful enough that it'll be far
less likely for compiler changes and new compiler optimizations to
inadvertently change collation on newer operating systems and newer
build toolchains (eg. for strings with don't have linguistically defined
collation, like mixing characters from multiple languages and classes).

It's been two years now since I published the collation torture test
over on github, which directly compares 10 years of both glibc and ICU
changes and demonstrates clearly that both ICU and glibc libraries have
regular small changes, and both libraries have had at least one release
with a massive number of changes. [2]

I also published a blog post this past March with a step-by-step
reproducible demonstration of silent corruption without any indexes
involved by using ICU (not glibc) with an OS upgrade from Ubuntu 20.04
to 22.04. [3]  The warning was not even displayed to the user, because
it happened at connection time rather than query time.

That blog also listed many reasons that glibc & ICU regularly include
small changes and linked to real examples from ICU: new characters
(which use existing code points), fixing incorrect rules, governments or
universities clarifying rules, general improvements, and unintentional
changes from code changes or refactors (like glibc in Ubuntu 15.04
changing sort order for 22 thousand CJK characters, many years prior to
2.28).

My own personal opinion here is that PostgreSQL is on a clear trajectory
to soon be the dominant database of businesses like banks and healthcare
companies, and that the PostgreSQL default configuration with regard to
safety and durability should be bank defaults rather than "easy" defaults.

For this reason, I'd like to revisit two specific current behaviors of
PostgreSQL and get a sense of how strongly everyone feels about them.

First: I'd suggest that a collation version mismatch should cause an
ERROR rather than a WARNING by default. If we want to have a GUC that
allows warning behavior, I think that's OK but I think it should be
superuser-only and documented as a "developer" setting similar to
zero_damaged_pages.

Second: I'd suggest that all of the "alter ... refresh collation"
commands should be strictly superuser-only rather than
owner-of-collation-privs, and that they should be similarly documented
as something that is generally advised against and exists for
extraordinary circumstances.

I know these things have been discussed before, and I realize the
implications are important and inconvenient for many users, and also I
realize that I'm not often involved in discussions here on the hackers
email list. (I usually catch up on hackers from archives irregularly,
for areas I'm interested in, and I'm involved more regularly with public
slack and user groups.) But I'm a bit unsatisfied with the current state
of things and want to bring up the topic here and see what happens.

Respectfully,
Jeremy



1:

https://www.postgresql.org/message-id/flat/CA%2BfnDAZufFS-4-6%3DO3L%2BqG9iFT8tm6BvtZXNnSm1dkJ8GciCkA%40mail.gmail.com#beefde2f9e54dcee813a8f731993247d

2: https://github.com/ardentperf/glibc-unicode-sorting/

3: https://ardentperf.com/2023/03/26/did-postgres-lose-my-data/



-- 
Jeremy Schneider
Database and Performance Engineer
Amazon Web Services



pgsql-hackers by date:

Previous
From: Robert Haas
Date:
Subject: Re: trying again to get incremental backup
Next
From: Robert Haas
Date:
Subject: Re: Partial aggregates pushdown