On Wed, Jun 8, 2022 at 4:02 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> I'm very skeptical of this process as being a reason to push users
> to reindex everything in sight. If U+NNNN was not a thing last year,
> there's no reason to expect that it appears in anyone's existing data,
> and therefore the fact that it sorts differently this year is a poor
> excuse for sounding time-to-reindex alarm bells.
That seems completely wrong to me. It's not like a new character shows
up and people wait to start using it until it makes its way into
everyone's collation data. That is emphatically not what happens, I
would say. What happens is that people upgrade their libc packages at
one time and their postgres packages at another time, and it's
unlikely that they have any idea which order they do or did those
things. Meanwhile, people start using all the latest emojis. The idea
that the average PostgreSQL user has any idea whether a certain emoji
shows up in the data set for the first time before or after they
install the libc version that knows about it seems absurd. We don't
even know how to figure out which emojis the installed libc supports
-- if we did, we could reject data that we don't know how to sort
properly instead of ending up with corrupted indexes later. The user
has no more ability to figure it out than we do, and even if they did,
they probably wouldn't want to compare their stream of input data to
their collation definitions using some process external to the database.
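
To make the failure mode concrete, here is a toy sketch in Python, not
PostgreSQL code. It assumes a made-up "old" ordering that sorts an
unrecognized character by code point and a made-up "new" ordering that
gives it a low weight. The names old_key, new_key and lookup are
hypothetical, and real libc collation rules are far more involved.

```python
import bisect

# U+1F9A4 stands in for "a character the old libc did not know about".
NEW_CHAR = "\U0001F9A4"

def old_key(s):
    # Hypothetical old collation: every character sorts by code point.
    return tuple(ord(c) for c in s)

def new_key(s):
    # Hypothetical new collation: the new character now sorts before everything.
    return tuple(0 if c == NEW_CHAR else ord(c) for c in s)

def lookup(index, value, key):
    # Binary search that assumes `index` is ordered according to `key`,
    # the same assumption a btree descent makes.
    keys = [key(v) for v in index]
    i = bisect.bisect_left(keys, key(value))
    return i < len(index) and index[i] == value

# The "index" was built while the old ordering was in effect.
index = sorted(["banana", NEW_CHAR, "apple", "cherry"], key=old_key)

found_before_upgrade = lookup(index, NEW_CHAR, old_key)
found_after_upgrade = lookup(index, NEW_CHAR, new_key)
```

The row is still physically present after the "upgrade", but the search
goes to the wrong place and misses it. Nothing in the data itself tells
the user that this has happened.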
--
Robert Haas
EDB: http://www.enterprisedb.com