Thread: [18] separate collation and ctype versions, and cleanup of pg_database locale fields

Definitions:

  - collation is text ordering and comparison
  - ctype affects case mapping (e.g. LOWER()) and pattern
    matching/regexes

Currently, there is only one version field, datcollversion, and it
represents the version of the collation. So, if your provider is libc
and datcollate is "C" and datctype is "en_US.utf8", then
datcollversion will always be NULL, even though the ctype behavior
still depends on the libc locale. Other providers use datlocale, which
is only one field, so it doesn't matter.

Given the discussion here:

https://www.postgresql.org/message-id/1078884.1721762815@sss.pgh.pa.us

it seems like it may be a good idea to version collation and ctype
separately. The ctype version is, more or less, the Unicode version,
and we know what that is for the builtin provider as well as ICU.
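
To make the idea concrete, here is a minimal sketch (in Python, not
PostgreSQL code) of what tracking the two versions separately could look
like. The names LocaleVersions and changed_components are hypothetical,
and Python's unicodedata.unidata_version merely stands in for whatever
Unicode version a provider would report:

```python
from dataclasses import dataclass
from typing import List, Optional
import unicodedata


@dataclass(frozen=True)
class LocaleVersions:
    """Hypothetical pair of versions recorded for one collation."""
    # None models "no version available", like a NULL datcollversion.
    collation_version: Optional[str]
    ctype_version: Optional[str]


def changed_components(recorded: LocaleVersions,
                       current: LocaleVersions) -> List[str]:
    """Return which of "collation" and "ctype" differ between the two."""
    changed = []
    if recorded.collation_version != current.collation_version:
        changed.append("collation")
    if recorded.ctype_version != current.ctype_version:
        changed.append("ctype")
    return changed


# Assumption: the Unicode version of Python's own tables plays the role
# of the provider's current ctype version.
recorded = LocaleVersions(collation_version=None, ctype_version="15.0.0")
current = LocaleVersions(collation_version=None,
                         ctype_version=unicodedata.unidata_version)
print(changed_components(recorded, current))
```

With a single version field, the "C" collation case above would record
nothing at all; with two fields, a Unicode upgrade can be reported as a
ctype change even when the collation version is absent.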

(Aside: ICU could theoretically report the same Unicode version and
still make some change that would affect us, but I have not observed
that to be the case. I use exhaustive code point coverage to test that
our Unicode functions return the same results as the corresponding ICU
functions when the Unicode version matches.)
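
The exhaustive comparison described in the aside can be sketched
roughly as follows. This is an illustration in Python rather than the
actual test code: find_mismatches and ascii_only_lower are hypothetical
names, and Python's str.lower stands in for one of the two
implementations being compared.

```python
from typing import Callable, Iterator, List


def all_code_points() -> Iterator[int]:
    """Yield every Unicode scalar value (code points minus surrogates)."""
    for cp in range(0x110000):
        if 0xD800 <= cp <= 0xDFFF:
            continue  # surrogates are not valid characters on their own
        yield cp


def find_mismatches(f: Callable[[str], str],
                    g: Callable[[str], str]) -> List[int]:
    """Return the code points on which the two mappings disagree."""
    return [cp for cp in all_code_points() if f(chr(cp)) != g(chr(cp))]


def ascii_only_lower(ch: str) -> str:
    """Deliberately limited mapping: lowercases ASCII, leaves the rest."""
    return ch.lower() if ch.isascii() else ch


# Two implementations that agree everywhere produce an empty list; any
# code point in the result is a behavior difference worth investigating.
print(len(find_mismatches(str.lower, str.lower)))
print(len(find_mismatches(str.lower, ascii_only_lower)))
```

Because the loop covers the whole code space, an empty result is a
statement about every character, not just a sample.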

Adding more collation fields is getting to be messy, though, because
they all have to be present in pg_database, as well. It's hard to move
those fields into pg_collation, because that's not a shared catalog, so
that could cause problems with CREATE/ALTER DATABASE. Is it worth
thinking about how we can clean this up, or should we just put up with
the idea that almost half the fields in pg_database will be locale-
related?

Regards,
    Jeff Davis

On Thu, 2024-07-25 at 13:29 -0700, Jeff Davis wrote:
> it may be a good idea to version collation and ctype
> separately. The ctype version is, more or less, the Unicode version,
> and we know what that is for the builtin provider as well as ICU.

Attached a rough patch for the purposes of discussion. It tracks the
ctype version separately, but doesn't do anything with it yet.

The main problem is that it's one more slightly confusing thing to
understand, especially in pg_database, because it's the ctype version of
the database default collation, which is not necessarily determined by
datctype.

Maybe we can do something with the naming or catalog representation to
make this more clear?

Regards,
    Jeff Davis

