Re: Order changes in PG16 since ICU introduction - Mailing list pgsql-hackers

From Jeff Davis
Subject Re: Order changes in PG16 since ICU introduction
Date
Msg-id 4e3281d1ba5950dbe3b5d3c8d511da4c5751217a.camel@j-davis.com
Whole thread Raw
In response to Re: Order changes in PG16 since ICU introduction  (Peter Eisentraut <peter.eisentraut@enterprisedb.com>)
List pgsql-hackers
On Mon, 2023-05-22 at 14:27 +0200, Peter Eisentraut wrote:
> The rules are for setting whatever sort order you like.  Maybe you
> want
> to sort + before - or whatever.  It's like, if you don't like it,
> build
> your own.

A build-your-own feature is fine, but it's not completely zero cost.

There some risk that rules specified for ICU version X fail to load for
ICU version Y. If that happens to your database default collation, you
are in big trouble. The risk of failing to load a language tag in a
later version, especially one returned by uloc_toLanguageTag() in
strict mode, is much lower. We can reduce the risk by allowing rules
only for CREATE COLLATION (not  CREATE DATABASE), and see what users do
with it first, and consider adding it to CREATE DATABASE later.

We can also try to explain in the docs that it's a build-it-yourself
kind of feature (use it if you see a purpose, otherwise ignore it),
though I'm not sure quite how we should word it.

And I'm skeptical that we don't have a single plausible end-to-end user
story. I just can't think of any reason someone would need something
like this, given how flexible the collation settings in the language
tags are. The best case I can think of is if someone is trying to make
an ICU collation that matches some non-ICU collation in another system,
which sounds hard; but perhaps it's reasonable to do in cases where it
just needs to work well-enough in some limited case.

Also, do we have an answer as to why specifying the rules as '' is not
the same as not specifying any rules[1]?

[1]
https://www.postgresql.org/message-id/36a6e89689716c2ca1fae8adc8e84601a041121c.camel@j-davis.com

> The co settings are just everything else.
> They are not parametric, they are just some other sort order that
> someone spelled out explicitly.

This sounds like another case where we can't really tell the user why
they would want to use a specific "co" setting; they should only use it
if they already know they want it. Is there some way we can word that
in the documentation so that people don't misuse them?

For instance, one of them is called "emoji". I'm sure a lot of
applications use emoji (or at least might encounter them), should they
always use co-emoji, or would some people who are using emoji not want
it? Can it be combined with "ks" or other "k*" settings?

What I'm trying to avoid is users seeing something in the documentation
and using it without it really being a good fit for their problem. Then
they see something unexpected, and need to rebuild all of their indexes
or something.

> > * I don't understand what "kc" means if "ks" is not set to
> > "level1".
>
> There is an example here:
> https://peter.eisentraut.org/blog/2023/05/16/overview-of-icu-collation-settings#colcaselevel

Interesting, thank you.

Regards,
    Jeff Davis




pgsql-hackers by date:

Previous
From: Bruce Momjian
Date:
Subject: Re: PG 16 draft release notes ready
Next
From: Erik Rijkers
Date:
Subject: Re: PostgreSQL 16 Beta 1 release announcement draft