Thread: Possible to create a hidden collation
Repro: create collation test(provider=icu, deterministic=false, locale='@colStrength=primary'); create collation "en_US"(provider=icu, deterministic=false, locale='@colStrength=primary'); select 'a' = 'A' collate test; -- true select 'a' = 'A' collate "en_US"; -- false drop collation "en_US"; -- drops built-in collation select 'a' = 'A' collate "en_US"; -- true Explanation: The second collation named "en_US" is hidden behind the built-in collation "en_US" because the former is created with collencoding=-1 (as all icu collations are), and the latter is a libc collation with collencoding equal to the current database encoding (which takes precedence). It's a minor bug, but could be surprising behavior. Solution: There's no reason to create user defined collations with collencoding=- 1. As far as I can tell, collencoding is only there to hide built-in collations (inherited from template0) that aren't compatible with the current database encoding. So we can just always create user-defined collations with collencoding=GetDatabaseEncoding(). Patch attached (extracted from series posted elsewhere because this is a bug in old versions). I don't intend to backport because the bug is minor and it will affect catalog contents. If others agree with the fix, then I'll also bump the catversion, though I'm not sure that it's strictly necessary. Regards, Jeff Davis
Attachment
Jeff Davis <pgsql@j-davis.com> writes: > There's no reason to create user defined collations with collencoding=- > 1. As far as I can tell, collencoding is only there to hide built-in > collations (inherited from template0) that aren't compatible with the > current database encoding. So we can just always create user-defined > collations with collencoding=GetDatabaseEncoding(). What happens if somebody makes a user-defined collation in template0 and then clones that into a different encoding? I guess the user-defined collation would then be hidden/inaccessible in the clone DB, so maybe it's fine. regards, tom lane
On Thu, 2023-05-11 at 20:02 -0400, Tom Lane wrote: > What happens if somebody makes a user-defined collation in template0 > and then clones that into a different encoding? I didn't know changing template0 was supported behavior for normal operation. > I guess the user-defined > collation would then be hidden/inaccessible in the clone DB, so maybe > it's > fine. Assuming we do support changes to template0, the fix would prevent users from making a new ICU collation in template0 that is visible to all encodings, other than calling pg_import_system_collations(). That doesn't sound like a big problem; it's already impossible to do that with libc collations. Regards, Jeff Davis
Jeff Davis wrote: > create collation test(provider=icu, > deterministic=false, locale='@colStrength=primary'); > create collation "en_US"(provider=icu, > deterministic=false, locale='@colStrength=primary'); > > select 'a' = 'A' collate test; -- true > select 'a' = 'A' collate "en_US"; -- false > drop collation "en_US"; -- drops built-in collation > select 'a' = 'A' collate "en_US"; -- true > > Explanation: > > The second collation named "en_US" is hidden behind the built-in > collation "en_US" because the former is created with collencoding=-1 > (as all icu collations are), and the latter is a libc collation with > collencoding equal to the current database encoding (which takes > precedence). > > It's a minor bug, but could be surprising behavior. ISTM that this behavior is not due to collencoding=-1, but to the custom "en_US" collation being in the "public" schema whereas the built-in "en_US" is in "pg_catalog". Assuming a default search_path, - create collation "en_US"(...) means create collation "public"."en_US"(...) - select 'a' = 'A' collate "en_US" means select 'a' = 'A' collate "pg_catalog"."en_US" - drop collation "en_US" means drop collation "pg_catalog"."en_US" So in practice the new collation en_US is not being seen until the system collation is dropped, independently of collencoding. Also, the proposed patch doesn't appear to change the outcome of the sequence of statements given as an example, which is both expected considering the above, and surprising because you imply that it should improve the user experience. What I'm seeing after applying it: Initial state: postgres=# select * from pg_collation where collname='en_US' \gx -[ RECORD 1 ]-------+----------- oid | 12419 collname | en_US collnamespace | 11 collowner | 10 collprovider | c collisdeterministic | t collencoding | 6 collcollate | en_US.utf8 collctype | en_US.utf8 colliculocale | collicurules | collversion | 2.35 postgres=# create collation "en_US"(provider=icu, postgres(# deterministic=false, locale='@colStrength=primary'); NOTICE: using standard form "und-u-ks-level1" for locale "@colStrength=primary" CREATE COLLATION postgres=# select * from pg_collation where collname='en_US' \gx -[ RECORD 1 ]-------+---------------- oid | 12419 collname | en_US collnamespace | 11 collowner | 10 collprovider | c collisdeterministic | t collencoding | 6 collcollate | en_US.utf8 collctype | en_US.utf8 colliculocale | collicurules | collversion | 2.35 -[ RECORD 2 ]-------+---------------- oid | 16388 collname | en_US collnamespace | 2200 collowner | 10 collprovider | i collisdeterministic | f collencoding | 6 collcollate | collctype | colliculocale | und-u-ks-level1 collicurules | collversion | 153.112 (I notice collencoding=6 for the new collation in the public namespace) postgres=# select 'a' = 'A' collate "en_US"; ?column? ---------- f (1 row) (the libc collation is still used, because pg_catalog comes first) postgres=# drop collation "en_US"; DROP COLLATION postgres=# select * from pg_collation where collname='en_US' \gx -[ RECORD 1 ]-------+---------------- oid | 16388 collname | en_US collnamespace | 2200 collowner | 10 collprovider | i collisdeterministic | f collencoding | 6 collcollate | collctype | colliculocale | und-u-ks-level1 collicurules | collversion | 153.112 (the collation in pg_catalog has been dropped, so we're left with only the custom collation). So maybe it's better to set collencoding to the db encoding as done by the patch, but it's not clear what concrete problem it solves. Best regards, -- Daniel Vérité https://postgresql.verite.pro/ Twitter: @DanielVerite
On Sat, 2023-05-13 at 14:28 +0200, Daniel Verite wrote: > ISTM that this behavior is not due to collencoding=-1, but > to the custom "en_US" collation being in the "public" schema > whereas the built-in "en_US" is in "pg_catalog". My mistake -- it looks like this is not a bug and I don't see a real problem here. Thank you for looking. > So maybe it's better to set collencoding to the db encoding as done > by the patch, but it's not clear what concrete problem it solves. I was a bit bothered by the inconsistency between libc and icu here, and I still feel like this patch is probably the right thing to do. If nothing else, right now the error messages are inconsistent: create collation test(provider='libc', locale='en_US.utf8'); CREATE COLLATION create collation test(provider='libc', locale='en_US.utf8'); ERROR: collation "test" for encoding "UTF8" already exists create collation test(provider='icu', locale='und'); ERROR: collation "test" already exists The patch fixes that. Not terribly important, but having a consistent catalog representation seems like a good idea. Regards, Jeff Davis