Re: Built-in CTYPE provider - Mailing list pgsql-hackers

From Jeff Davis
Subject Re: Built-in CTYPE provider
Msg-id 3c1f4043bb4f76de78160f8afc8678eaa10b0e46.camel@j-davis.com
In response to Re: Built-in CTYPE provider  (Peter Eisentraut <peter@eisentraut.org>)
Responses Re: Built-in CTYPE provider
List pgsql-hackers

On Thu, 2024-01-18 at 13:53 +0100, Peter Eisentraut wrote:
> I think that would be a terrible direction to take, because it would
> regress the default sort order from "correct" to "useless".

I don't agree that the current default is "correct". There are a lot of
ways it can be wrong:

  * the environment variables at initdb time don't reflect what the
    users of the database actually want
  * there are so many different users using so many different
    applications connected to the database that no one "correct" sort
    order exists
  * libc has some implementation quirks
  * the version of Unicode that libc is based on is not what you expect
  * the version of libc is not what you expect
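
As a rough illustration of the Unicode-version point (a Python sketch, not
PostgreSQL or libc code): Python's standard unicodedata module ships both the
interpreter's current Unicode tables and a frozen Unicode 3.2.0 copy
(ucd_3_2_0). The helper name general_category_by_version and the sample
character U+20BF are chosen here purely for illustration. The same code point
can be classified differently depending on which version of the tables is
consulted, which is the kind of surprise an unexpected Unicode version can
cause.

```python
import unicodedata


def general_category_by_version(ch: str) -> dict:
    """Return the general category of `ch` under two Unicode versions.

    Hypothetical helper for illustration; not part of any PostgreSQL API.
    """
    return {
        # Frozen Unicode 3.2.0 tables bundled with Python.
        unicodedata.ucd_3_2_0.unidata_version: unicodedata.ucd_3_2_0.category(ch),
        # Whatever Unicode version this Python build was compiled against.
        unicodedata.unidata_version: unicodedata.category(ch),
    }


if __name__ == "__main__":
    # U+20BF BITCOIN SIGN was assigned long after Unicode 3.2, so the old
    # tables treat it as unassigned while newer tables classify it.
    for version, category in general_category_by_version("\u20bf").items():
        print(f"Unicode {version}: category {category}")
```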

> Aside from the overall message this sends about how PostgreSQL cares
> about locales and Unicode and such.

Unicode is primarily about the semantics of characters and their
relationships. The patches I propose here do a great job of that.

Collation (relationships between *strings*) is a part of Unicode, but
not the whole thing or even the main thing.

> Maybe you don't intend for this to be the default provider?

I am not proposing that this provider be the initdb-time default.

> But then who would really use it? I mean, sure, some people would, but
> how would you even explain, in practice, the particular niche of users
> or use cases?

It's for users who want to respect Unicode and support text from
international sources in their database, but who are not experts on the
subject and don't know precisely what they want or understand the
consequences. If and when such users do notice a problem with the sort
order, they'd handle it at that time (perhaps with a COLLATE clause, or
by sorting in the application).
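
To make the "sorting in the application" option concrete, here is a minimal
Python sketch (not PostgreSQL code). It assumes the database returns rows in
plain code point order, which is an assumption about what "basic collation"
would mean and is not stated in this message. The names code_point_sort,
app_sort_key and app_sort are hypothetical, and the accent-and-case-insensitive
key is just one possible application-side choice, not a full linguistic
collation.

```python
import unicodedata


def code_point_sort(words):
    """Sort by Unicode code point, which is how Python compares str values."""
    return sorted(words)


def app_sort_key(word):
    """Hypothetical application-side key: ignore accents and case first."""
    decomposed = unicodedata.normalize("NFD", word)
    base = "".join(c for c in decomposed if not unicodedata.combining(c))
    # Fall back to the original string so ties are broken deterministically.
    return (base.casefold(), word)


def app_sort(words):
    """Re-sort rows in the application using the key above."""
    return sorted(words, key=app_sort_key)


if __name__ == "__main__":
    words = ["Zebra", "apple", "\u00c9clair", "eclair"]
    print(code_point_sort(words))  # uppercase ASCII first, accented last
    print(app_sort(words))         # accents and case ignored at first level
```

The point of the sketch is only that a stable, simple order in the database
does not prevent an application from applying its own ordering where it
matters.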

> Maybe if this new provider would be called "minimal", it might
> describe the purpose better.

"Builtin" communicates that it's available everywhere (not a
dependency), that specific behaviors can be documented and tested, and
that behavior doesn't change within a major version. I want to
communicate all of those things.

> I could see a use for this builtin provider if it also included the
> default UCA collation (what COLLATE UNICODE does now).

I won't rule that out, but I'm not proposing that right now and my
proposal is already offering useful functionality.

> There would still be a risk with that approach, since it would
> permanently marginalize ICU functionality

Yeah, ICU already does a good job offering the root collation. I don't
think the builtin provider needs to do so.

> I would be curious what your overall vision is here?

Vision:

* The builtin provider will offer Unicode character semantics, basic
collation, platform-independence, and high performance. It can be used
on its own or in combination with ICU via the COLLATE clause.

* ICU offers COLLATE UNICODE, locale tailoring, case-insensitive
matching, and customization with rules. It's the solution for
everything from "slightly more advanced" to "very advanced".

* libc would be for databases serving applications on the same machine
where a matching sort order is helpful, risks to indexes are
acceptable, and performance is not important.
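
As a rough illustration of what "Unicode character semantics" buys compared
with an ASCII-only treatment (a Python sketch, not PostgreSQL code): the
ASCII-only helpers below are a hypothetical baseline introduced here for
contrast, and Python's own str.upper() and str.isalpha() stand in for
Unicode-aware behavior. Nothing here reflects the actual implementation in
the proposed patches.

```python
def ascii_only_upper(text: str) -> str:
    """Uppercase only a-z, leaving every other character untouched.

    Hypothetical baseline for contrast; not PostgreSQL code.
    """
    return "".join(
        chr(ord(c) - 32) if "a" <= c <= "z" else c
        for c in text
    )


def ascii_only_isalpha(ch: str) -> bool:
    """Treat only the 52 ASCII letters as alphabetic."""
    return ("a" <= ch <= "z") or ("A" <= ch <= "Z")


if __name__ == "__main__":
    word = "caf\u00e9"  # "cafe" with an acute accent on the final letter
    print(ascii_only_upper(word))  # ASCII-only mapping leaves U+00E9 alone
    print(word.upper())            # Unicode-aware mapping uppercases it
    print(ascii_only_isalpha("\u00e9"), "\u00e9".isalpha())
```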

> Is switching the default to ICU still your goal?  Or do you want the
> builtin provider to be the default?

It's hard to answer this question while initdb chooses the database
default collation based on the environment. Neither ICU nor the builtin
provider can reasonably handle whatever those environment variables
might be set to.

Stepping back from the focus on what initdb does, we should be
providing the right encouragement in documentation and packaging to
guide users toward the right provider based on their needs and the
vision outlined above.

Regards,
    Jeff Davis



