Thread: Possible to create a hidden collation

Possible to create a hidden collation

From

Jeff Davis

Date:

11 May 2023, 23:45:38

Repro:

  create collation test(provider=icu,
    deterministic=false, locale='@colStrength=primary');
  create collation "en_US"(provider=icu,
    deterministic=false, locale='@colStrength=primary');

  select 'a' = 'A' collate test; -- true
  select 'a' = 'A' collate "en_US"; -- false
  drop collation "en_US"; -- drops built-in collation
  select 'a' = 'A' collate "en_US"; -- true

Explanation:

The second collation named "en_US" is hidden behind the built-in
collation "en_US" because the former is created with collencoding=-1
(as all icu collations are), and the latter is a libc collation with
collencoding equal to the current database encoding (which takes
precedence).

It's a minor bug, but could be surprising behavior.

Solution:

There's no reason to create user defined collations with collencoding=-
1. As far as I can tell, collencoding is only there to hide built-in
collations (inherited from template0) that aren't compatible with the
current database encoding. So we can just always create user-defined
collations with collencoding=GetDatabaseEncoding().

Patch attached (extracted from series posted elsewhere because this is
a bug in old versions). I don't intend to backport because the bug is
minor and it will affect catalog contents. If others agree with the
fix, then I'll also bump the catversion, though I'm not sure that it's
strictly necessary.

Regards,
    Jeff Davis

Attachment

v5-0001-For-user-defined-collations-never-set-collencodin.patch

Re: Possible to create a hidden collation

From

Tom Lane

Date:

12 May 2023, 03:02:14

Jeff Davis <pgsql@j-davis.com> writes:
> There's no reason to create user defined collations with collencoding=-
> 1. As far as I can tell, collencoding is only there to hide built-in
> collations (inherited from template0) that aren't compatible with the
> current database encoding. So we can just always create user-defined
> collations with collencoding=GetDatabaseEncoding().

What happens if somebody makes a user-defined collation in template0
and then clones that into a different encoding?  I guess the user-defined
collation would then be hidden/inaccessible in the clone DB, so maybe it's
fine.

            regards, tom lane

Re: Possible to create a hidden collation

From

Jeff Davis

Date:

12 May 2023, 03:27:04

On Thu, 2023-05-11 at 20:02 -0400, Tom Lane wrote:
> What happens if somebody makes a user-defined collation in template0
> and then clones that into a different encoding?

I didn't know changing template0 was supported behavior for normal
operation.

>   I guess the user-defined
> collation would then be hidden/inaccessible in the clone DB, so maybe
> it's
> fine.

Assuming we do support changes to template0, the fix would prevent
users from making a new ICU collation in template0 that is visible to
all encodings, other than calling pg_import_system_collations(). That
doesn't sound like a big problem; it's already impossible to do that
with libc collations.

Regards,
    Jeff Davis

Re: Possible to create a hidden collation

From

"Daniel Verite"

Date:

13 May 2023, 15:28:59

    Jeff Davis wrote:

> create collation test(provider=icu,
>    deterministic=false, locale='@colStrength=primary');
>  create collation "en_US"(provider=icu,
>    deterministic=false, locale='@colStrength=primary');
>
>  select 'a' = 'A' collate test; -- true
>  select 'a' = 'A' collate "en_US"; -- false
>  drop collation "en_US"; -- drops built-in collation
>  select 'a' = 'A' collate "en_US"; -- true
>
> Explanation:
>
> The second collation named "en_US" is hidden behind the built-in
> collation "en_US" because the former is created with collencoding=-1
> (as all icu collations are), and the latter is a libc collation with
> collencoding equal to the current database encoding (which takes
> precedence).
>
> It's a minor bug, but could be surprising behavior.

ISTM that this behavior is not due to collencoding=-1, but
to the custom "en_US" collation being in the "public" schema
whereas the built-in "en_US" is in "pg_catalog".

Assuming a default search_path,
-  create collation "en_US"(...) means
 create collation "public"."en_US"(...)
- select 'a' = 'A' collate "en_US"   means
 select 'a' = 'A' collate "pg_catalog"."en_US"
- drop collation "en_US" means
 drop collation "pg_catalog"."en_US"

So in practice the new collation en_US is not being seen
until the system collation is dropped, independently of
collencoding.

Also, the proposed patch doesn't appear to change the outcome
of the sequence of statements given as an example, which is both
expected considering the above, and surprising because you imply
that it should improve the user experience.


What I'm seeing after applying it:

Initial state:
postgres=# select * from pg_collation where collname='en_US' \gx
-[ RECORD 1 ]-------+-----------
oid            | 12419
collname        | en_US
collnamespace        | 11
collowner        | 10
collprovider        | c
collisdeterministic | t
collencoding        | 6
collcollate        | en_US.utf8
collctype        | en_US.utf8
colliculocale        |
collicurules        |
collversion        | 2.35


postgres=# create collation "en_US"(provider=icu,
postgres(# deterministic=false, locale='@colStrength=primary');
NOTICE:  using standard form "und-u-ks-level1" for locale
"@colStrength=primary"
CREATE COLLATION


postgres=# select * from pg_collation where collname='en_US' \gx
-[ RECORD 1 ]-------+----------------
oid            | 12419
collname        | en_US
collnamespace        | 11
collowner        | 10
collprovider        | c
collisdeterministic | t
collencoding        | 6
collcollate        | en_US.utf8
collctype        | en_US.utf8
colliculocale        |
collicurules        |
collversion        | 2.35
-[ RECORD 2 ]-------+----------------
oid            | 16388
collname        | en_US
collnamespace        | 2200
collowner        | 10
collprovider        | i
collisdeterministic | f
collencoding        | 6
collcollate        |
collctype        |
colliculocale        | und-u-ks-level1
collicurules        |
collversion        | 153.112

(I notice collencoding=6 for the new collation in the public namespace)


postgres=# select 'a' = 'A' collate "en_US";
 ?column?
----------
 f
(1 row)

(the libc collation is still used, because pg_catalog comes first)

postgres=# drop collation "en_US";
DROP COLLATION

postgres=# select * from pg_collation where collname='en_US' \gx
-[ RECORD 1 ]-------+----------------
oid            | 16388
collname        | en_US
collnamespace        | 2200
collowner        | 10
collprovider        | i
collisdeterministic | f
collencoding        | 6
collcollate        |
collctype        |
colliculocale        | und-u-ks-level1
collicurules        |
collversion        | 153.112

(the collation in pg_catalog has been dropped, so we're left with only
the custom collation).

So maybe it's better to set collencoding to the db encoding as done
by the patch, but it's not clear what concrete problem it solves.


Best regards,
--
Daniel Vérité
https://postgresql.verite.pro/
Twitter: @DanielVerite

Re: Possible to create a hidden collation

From

Jeff Davis

Date:

16 May 2023, 01:04:21

On Sat, 2023-05-13 at 14:28 +0200, Daniel Verite wrote:
> ISTM that this behavior is not due to collencoding=-1, but
> to the custom "en_US" collation being in the "public" schema
> whereas the built-in "en_US" is in "pg_catalog".

My mistake -- it looks like this is not a bug and I don't see a real
problem here. Thank you for looking.

> So maybe it's better to set collencoding to the db encoding as done
> by the patch, but it's not clear what concrete problem it solves.

I was a bit bothered by the inconsistency between libc and icu here,
and I still feel like this patch is probably the right thing to do.

If nothing else, right now the error messages are inconsistent:

  create collation test(provider='libc', locale='en_US.utf8');
  CREATE COLLATION
  create collation test(provider='libc', locale='en_US.utf8');
  ERROR:  collation "test" for encoding "UTF8" already exists
  create collation test(provider='icu', locale='und');
  ERROR:  collation "test" already exists

The patch fixes that. Not terribly important, but having a consistent
catalog representation seems like a good idea.

Regards,
    Jeff Davis