proposal: change behavior on collation version mismatch - Mailing list pgsql-hackers
From | Jeremy Schneider |
---|---|
Subject | proposal: change behavior on collation version mismatch |
Date | |
Msg-id | 2a04654b-7f46-49ed-bd40-251561c5c2fc@amazon.com Whole thread Raw |
Responses |
Re: proposal: change behavior on collation version mismatch
Re: proposal: change behavior on collation version mismatch Re: proposal: change behavior on collation version mismatch |
List | pgsql-hackers |
I had some interesting conversations with a couple PostgreSQL community members at PASS Data Summit the week before last about the collation problem, and then - just in this last week - I saw two more people on public channels hitting corruption problems. One person on the public PostgreSQL Slack, we eventually figured out they had upgraded from Ubuntu 18.04 to 22.04 which hits glibc 2.28; a second person here on the pgsql-general list reported by Daniel Westermann and I assume representing a client of dbi services [1]. Everyone who's been tracking this over the past few years has seen the steady stream of quiet complaints in public from people at almost every major PG company, largely around the glibc 2.28 debacle. I've been tracking the discussions around collation here on the lists and I've had a number of conversations with folks working deeply in this area inside and outside of AWS, and I was part of the effort to address it at AWS since we first became aware of it many years ago. It seems to me the general perspective on the mailing lists is that: 1) "collation changes are uncommon" (which is relatively correct) 2) "most users would rather have ease-of-use than 100% safety, since it's uncommon" And I think this led to the current behavior of issuing a warning rather than an error, and providing a SQL command "alter ... refresh collation" which simply instructs the database to permanently forget the warning without changing anything. I agree that some users might prefer this behavior, but I think businesses like banks or healthcare companies would be appalled, and would prefer to do the extra work and have certainty of avoiding small but known probabilities of silent data corruption. As I said on the pgsql-general thread: glibc 2.28 has certainly been the most obvious and impactful case, so the focus is understandable, but there's a bit of a myth in the general public that the problem is only with glibc 2.28 (and not ICU or other glibc versions or data structures other than indexes). Unfortunately, contrary to current popular belief, the only truly safe way to update an operating system under PosgreSQL is with logical dump/load or logical replication, or continuing to compile and use the exact older version of ICU from the old OS (if you use ICU). I think the ICU folks are generally careful enough that it'll be far less likely for compiler changes and new compiler optimizations to inadvertently change collation on newer operating systems and newer build toolchains (eg. for strings with don't have linguistically defined collation, like mixing characters from multiple languages and classes). It's been two years now since I published the collation torture test over on github, which directly compares 10 years of both glibc and ICU changes and demonstrates clearly that both ICU and glibc libraries have regular small changes, and both libraries have had at least one release with a massive number of changes. [2] I also published a blog post this past March with a step-by-step reproducible demonstration of silent corruption without any indexes involved by using ICU (not glibc) with an OS upgrade from Ubuntu 20.04 to 22.04. [3] The warning was not even displayed to the user, because it happened at connection time rather than query time. That blog also listed many reasons that glibc & ICU regularly include small changes and linked to real examples from ICU: new characters (which use existing code points), fixing incorrect rules, governments or universities clarifying rules, general improvements, and unintentional changes from code changes or refactors (like glibc in Ubuntu 15.04 changing sort order for 22 thousand CJK characters, many years prior to 2.28). My own personal opinion here is that PostgreSQL is on a clear trajectory to soon be the dominant database of businesses like banks and healthcare companies, and that the PostgreSQL default configuration with regard to safety and durability should be bank defaults rather than "easy" defaults. For this reason, I'd like to revisit two specific current behaviors of PostgreSQL and get a sense of how strongly everyone feels about them. First: I'd suggest that a collation version mismatch should cause an ERROR rather than a WARNING by default. If we want to have a GUC that allows warning behavior, I think that's OK but I think it should be superuser-only and documented as a "developer" setting similar to zero_damaged_pages. Second: I'd suggest that all of the "alter ... refresh collation" commands should be strictly superuser-only rather than owner-of-collation-privs, and that they should be similarly documented as something that is generally advised against and exists for extraordinary circumstances. I know these things have been discussed before, and I realize the implications are important and inconvenient for many users, and also I realize that I'm not often involved in discussions here on the hackers email list. (I usually catch up on hackers from archives irregularly, for areas I'm interested in, and I'm involved more regularly with public slack and user groups.) But I'm a bit unsatisfied with the current state of things and want to bring up the topic here and see what happens. Respectfully, Jeremy 1: https://www.postgresql.org/message-id/flat/CA%2BfnDAZufFS-4-6%3DO3L%2BqG9iFT8tm6BvtZXNnSm1dkJ8GciCkA%40mail.gmail.com#beefde2f9e54dcee813a8f731993247d 2: https://github.com/ardentperf/glibc-unicode-sorting/ 3: https://ardentperf.com/2023/03/26/did-postgres-lose-my-data/ -- Jeremy Schneider Database and Performance Engineer Amazon Web Services
pgsql-hackers by date: