Re: Collation version tracking for macOS - Mailing list pgsql-hackers

From Jeremy Schneider
Subject Re: Collation version tracking for macOS
Msg-id 1874de62-6bec-4bc1-1d14-0a2730b125da@ardentperf.com
In response to Re: Collation version tracking for macOS  (Tom Lane <tgl@sss.pgh.pa.us>)
Responses Re: Collation version tracking for macOS
List pgsql-hackers
On 6/3/22 9:21 AM, Tom Lane wrote:
> 
> According to that document, they changed it in macOS 11, which came out
> a year and a half ago.  Given the lack of complaints, it doesn't seem
> like this is urgent enough to mandate a post-beta change that would
> have lots of downside (namely, false-positive warnings for every other
> macOS update).


Sorry, I'm going to rant for a minute... it is my very strong opinion
that using language like "false positive" here is misguided and dangerous.

If a new version of a sort order is released, for example when they recently
updated backwards-secondary sorting in French [CLDR-2905] or matching of
v and w in Swedish and Finnish [CLDR-7088], it is very dangerous to use
language like "false positive" to describe a database where there just
didn't happen to be any rows with accented French characters at the
point in time where PostgreSQL magically changed which version of sort
order it was using from the 2010 French version to the 2020 French version.

No other piece of software that calls itself a database would do what
PostgreSQL is doing: just give users a "warning" after suddenly changing
the sort order algorithm (most users won't even read warnings in their
logs). Oracle, DB2, SQL Server and even MySQL carefully version
collation data, hardcode a pseudo-linguistic collation into the DB (like
PG does for timezones), and if they provide updates to linguistic sort
order (from Unicode CLDR) then they allow the user to explicitly specify
which version of French or German ICU sorting they want to use.
Different versions are treated as different sort orders; they are not
conflated.

I have personally seen PostgreSQL databases where an update to an old
version of glibc was applied (I'm not even talking 2.28 here) and it
resulted in data loss because crash recovery couldn't replay WAL records and
the user had to do a PITR. That's aside from the more common issues of
segfaults or duplicate records that violate unique constraints or wrong
query results like missing data. And it's not just updates - people can
set up a hot standby on a different version and see many of these
problems too.

Collation versioning absolutely must be first class and directly
controlled by users, and it's very dangerous to allow users - at all -
to take an index and then use a different version than what the index
was built with.
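
To make the index point concrete, here is a toy sketch in Python. It is
not PostgreSQL code: key_v1 and key_v2 are made-up stand-ins for two
versions of one collation, and the "index" is just a sorted list searched
with binary search. It only illustrates how a lookup can miss an existing
row, and how a uniqueness check can then let a duplicate in, once the
comparison rule changes underneath a structure that was sorted with the
old rule.

```python
from bisect import bisect_left

def key_v1(s):
    """Hypothetical old collation version: plain code point order."""
    return (s,)

def key_v2(s):
    """Hypothetical new collation version: case-insensitive first."""
    return (s.lower(), s)

def build_index(rows, key):
    # The "index" is simply the rows sorted under one collation version.
    return sorted(rows, key=key)

def contains(index, value, key):
    # Binary search assumes the index is sorted according to `key`.
    keys = [key(v) for v in index]
    pos = bisect_left(keys, key(value))
    return pos < len(index) and index[pos] == value

def insert_unique(index, value, key):
    # Toy unique constraint: reject the value if the lookup finds it.
    if contains(index, value, key):
        raise ValueError("duplicate key")
    keys = [key(v) for v in index]
    index.insert(bisect_left(keys, key(value)), value)

# Built under v1, so the stored order is Banana, apple, cherry.
index = build_index(["apple", "Banana", "cherry"], key_v1)

found_old = contains(index, "apple", key_v1)  # searched with the rule it was built with
found_new = contains(index, "apple", key_v2)  # searched with the changed rule

# The uniqueness check relies on the same broken lookup.
insert_unique(index, "apple", key_v2)
duplicates = index.count("apple")
```

In this sketch found_old is True and found_new is False, and after the
insert the list holds "apple" twice. If the table had only contained
lowercase words, both rules would have agreed and nothing would have
looked wrong until the first mixed-case row arrived.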

Not to mention all the other places in the DB where collation is used...
partitioning, constraints, and any other place where persisted data can
make an assumption about any sort of string comparison.

It feels to me like we're still not really thinking clearly about this
within the PG community, and that the seriousness of this issue is not
fully understood.

-Jeremy Schneider


-- 
http://about.me/jeremy_schneider


