Re: Change initdb default to the builtin collation provider - Mailing list pgsql-hackers

From Jeff Davis
Subject Re: Change initdb default to the builtin collation provider
Date
Msg-id be03b4f8e5ed147a3e733d9a73ddeb51e6b83ad4.camel@j-davis.com
In response to Re: Change initdb default to the builtin collation provider  (Robert Haas <robertmhaas@gmail.com>)
List pgsql-hackers
On Wed, 2026-03-11 at 08:47 -0400, Robert Haas wrote:
> At the end of the day, we're all just guessing.

Part of the reason for that is that changing collation is so difficult
that we have very few examples of users moving real workloads from one
collation to another.

>  My experience working
> for EDB is that we have a number of customers who care about sort
> order quite a lot, and we've had to sweat blood to make them happy.

Thank you. I have one burning question: for these users who care deeply
about sort order, which scenario best describes their needs?

  (a) they mostly work in a single locale (if so, does it match their
UNIX environment?); or

  (b) one locale (which one?) is good enough for a variety of locales
because even if it's not perfect, it's still better than ASCII; or

  (c) they somehow partition their data by locale and use multiple
locales; or

  (d) they have a variety of indexes on the same column using different
collations to satisfy queries from users in different locales

I have found it very difficult to get an answer to that question. When
I press users for details (in the sample of users I've been able to
reach), usually they back off on the need for sort order, and instead
focus on case insensitivity (in which case I suggest the builtin
C.UTF-8).

> And, on a personal level, I have a hard time understanding why anyone
> would be OK with a sort order that puts Álvaro after Zebra instead of
> between Alvaro and Beatriz, because that seems extremely frustrating.

I tend to agree, and I wish we had a way to handle this at a
"presentation" layer rather than pushing the whole thing down into
indexes (storage layer).

In theory, pushing collation down to indexes could offer performance
advantages, but in practice humans don't read a lot of data, so a
post-processing step would be efficient in most cases.
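
To make the "presentation layer" idea concrete, here is a rough sketch in
Python (not PostgreSQL code) of sorting a small result set after it has been
fetched. It assumes the rows arrive in code-point order, which is roughly what
a "C"-style collation yields for UTF-8 text. The helper approx_sort_key is
hypothetical, and stripping accents is only a crude stand-in for a real
locale-aware collation such as the one ICU provides.

```python
import unicodedata

# Sample names from the thread; "\u00c1" is the precomposed letter A-acute.
names = ["Zebra", "\u00c1lvaro", "Beatriz", "Alvaro"]


def approx_sort_key(s):
    """Hypothetical presentation-layer key: accent- and case-insensitive.

    Decomposes the string (NFD), drops combining marks, and casefolds.
    The original string is kept as a tie-breaker so the order is stable
    and deterministic. This is an approximation, not the Unicode
    Collation Algorithm.
    """
    decomposed = unicodedata.normalize("NFD", s)
    base = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return (base.casefold(), s)


# Code-point order: A-acute (U+00C1) sorts after every ASCII letter.
code_point_order = sorted(names)

# Post-processing step applied to the already-fetched rows.
presentation_order = sorted(names, key=approx_sort_key)

print(code_point_order)
print(presentation_order)
```

The cost of the second sort is proportional to the number of rows actually
shown to a person, which is the reason a post-processing step can be cheap
even when the index itself is in code-point order.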

> That's perfectly legitimate, but it's different from my
> experience. My experience is that when I tell people they can use
> collate "C" to speed up sorting, they tell me that's a stupid
> workaround that doesn't give them the answers that they want, which
> obviously colors my viewpoint on this question in the same way that
> your experiences color yours.

"C" is especially unappealing because it doesn't even get basic case
transformations right outside of ASCII.
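
As an illustration of that point, the following Python sketch emulates an
ASCII-only case mapping and compares it with a Unicode-aware one. The function
ascii_only_lower is a hypothetical stand-in written for this example; it is an
assumption about how an ASCII-only mapping behaves, not PostgreSQL's actual
implementation.

```python
def ascii_only_lower(s):
    """Hypothetical emulation of an ASCII-only lowercase mapping.

    Only the letters A-Z are mapped; every other character, including
    accented uppercase letters, is passed through unchanged.
    """
    return "".join(
        chr(ord(ch) + 32) if "A" <= ch <= "Z" else ch
        for ch in s
    )


word = "\u00c1LVARO"  # A-acute followed by "LVARO"

ascii_result = ascii_only_lower(word)  # the A-acute stays uppercase
unicode_result = word.lower()          # Python's Unicode-aware lowercase

print(ascii_result)
print(unicode_result)
```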

Regards,
    Jeff Davis
