On Mon, 2024-01-08 at 17:17 -0800, Jeremy Schneider wrote:
> I agree with merging the threads, even though it makes for a larger
> patch set. It would be great to get a unified "builtin" provider in
> place for the next major.
I believe that's possible and that this proposal is quite close (hoping
to get something in this 'fest). The tables I'm introducing have
exhaustive test coverage, so there's not a lot of risk there. And the
builtin provider itself is an optional feature, so it won't be
disruptive.
>
> In the first list it seems that some callers might be influenced by a
> COLLATE clause or table definition while others always take the
> database default? It still seems a bit odd to me if different
> providers can be used for different parts of a single SQL statement.
Right, that can happen today, and my proposal doesn't change that.
Basically those are cases where the caller was never properly onboarded
to our collation system, like the ts_locale.c routines.
> Is there any reason we couldn't commit the minor cleanup (patch 0001)
> now? It's less than 200 lines and pretty straightforward.
Sure, I'll commit that fairly soon then.
> I wonder whether, after a year of running the builtin provider in
> production, we might consider adding to the builtin provider a few
> locales with simple but more reasonable ordering for European and
> Asian languages?
I won't rule that out completely, but there's a lot we would need to do
to get there. Even assuming we implement that perfectly, we'd need to
make sure it's a reasonable scope for Postgres as a project and that we
have more than one person willing to maintain it. Similar things have
been rejected before for similar reasons.
What I'm proposing for v17 is much simpler: basically some lookup
tables, which is just an extension of what we're already doing for
normalization.
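
To make "lookup tables" concrete, here is a minimal sketch of the general
technique: a sorted table of code point ranges searched with binary search,
plus an exhaustive check of every covered code point against a reference.
It is not code from the patch. The table contents, the names RANGES,
category() and mismatches(), and the use of Python's unicodedata module as
the reference are all assumptions made for illustration.

```python
from bisect import bisect_right
import unicodedata

# Hypothetical, hand-picked sample of (first, last, general category) ranges,
# sorted by first code point. A real table would be generated from the
# Unicode data files rather than written by hand.
RANGES = [
    (0x0041, 0x005A, "Lu"),  # A-Z
    (0x0061, 0x007A, "Ll"),  # a-z
    (0x00C0, 0x00D6, "Lu"),  # Latin-1 capital letters before U+00D7
    (0x0391, 0x03A1, "Lu"),  # Greek capital Alpha through Rho
]
STARTS = [first for first, _, _ in RANGES]


def category(cp):
    """Return the category for code point cp, or None if cp is not in the table."""
    # Find the last range whose first code point is <= cp.
    i = bisect_right(STARTS, cp) - 1
    if i >= 0 and cp <= RANGES[i][1]:
        return RANGES[i][2]
    return None


def mismatches():
    """Compare every code point covered by RANGES against unicodedata."""
    bad = []
    for first, last, _ in RANGES:
        for cp in range(first, last + 1):
            if category(cp) != unicodedata.category(chr(cp)):
                bad.append(cp)
    return bad
```

Because the table is finite and static, a check like mismatches() can walk
every entry, which is the sense in which exhaustive testing of such tables
is practical.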
> https://jeremyhussell.blogspot.com/2017/11/falsehoods-programmers-believe-about.html#main
>
> Make sure to click the link to show the counterexamples and
> discussion, that's the best part.
Yes, it can be hard to reason about this stuff, but I believe Unicode
provides a lot of good data and guidance to work from. If you think my
proposal relies on one of those assumptions, let me know.
To the extent that I do rely on any of those assumptions, it's mostly
to match libc's "C.UTF-8" behavior.
Regards,
Jeff Davis