Home > mailing lists

Re: Built-in CTYPE provider - Mailing list pgsql-hackers

From	Peter Eisentraut
Subject	Re: Built-in CTYPE provider
Date	February 7, 2024 09:53:36
Msg-id	612987b5-38fa-44c6-8b78-c0be5a177a70@eisentraut.org Whole thread Raw
In response to	Re: Built-in CTYPE provider (Jeff Davis <pgsql@j-davis.com>)
Responses	Re: Built-in CTYPE provider Re: Built-in CTYPE provider
List	pgsql-hackers

Tree view

Review of the v16 patch set:

(Btw., I suppose you started this patch series with 0002 because some 
0001 was committed earlier.  But I have found this rather confusing.  I 
think it's ok to renumber from 0001 for each new version.)

* v16-0002-Add-Unicode-property-tables.patch

Various comments are updated to include the term "character class".  I 
don't recognize that as an official Unicode term.  There are categories 
and properties.  Let's check this.

Some files need heavy pgindent and perltidy.  You were surely going to 
do this eventually, but maybe you want to do this sooner to check 
whether you like the results.

- src/common/unicode/Makefile

This patch series adds some new post-update-unicode tests.  Should we 
have a separate target for each or just one common "unicode test" 
target?  Not sure.

- .../generate-unicode_category_table.pl

The trailing commas handling ($firsttime etc.) is not necessary with 
C99.  The code can be simplified.

For this kind of code:

+print $OT <<"HEADER";

let's use a common marker like EOS instead of a different one for each 
block.  That just introduces unnecessary variations.

- src/common/unicode_category.c

The mask stuff at the top could use more explanation.  It's impossible
to figure out exactly what, say, PG_U_PC_MASK does.

Line breaks in the different pg_u_prop_* functions are gratuitously 
different.

Is it potentially confusing that only some pg_u_prop_* have a posix
variant?  Would it be better for a consistent interface to have a
"posix" argument for each and just ignore it if not used?  Not sure.

Let's use size_t instead of Size for new code.


* v16-0003-Add-unicode-case-mapping-tables-and-functions.patch

Several of the above points apply here analogously.


* v16-0004-Catalog-changes-preparing-for-builtin-collation-.patch

This is mostly a straightforward renaming patch, but there are some 
changes in initdb and pg_dump that pre-assume the changes in the next 
patch, like which locale columns apply for which providers.  I think it 
would be better for the historical record to make this a straight 
renaming patch and move those semantic changes to the next patch (or a 
separate intermediate patch, if you prefer).

- src/bin/psql/describe.c
- src/test/regress/expected/psql.out

This would be a good opportunity to improve the output columns for 
collations.  The updated view is now:

+ Schema | Name | Provider | Collate | Ctype | Locale | ICU Rules | 
Deterministic?
+--------+------+----------+---------+-------+--------+-----------+----------------

This is historically grown but suboptimal.  Why is Locale after Collate 
and Ctype, and why does it show both?  I think we could have just the 
Locale column, and if the libc provider is used with different 
collate/ctype (very rare!), we somehow write that into the single locale 
column.

(A change like this would be a separate patch.)


* v16-0005-Introduce-collation-provider-builtin-for-C-and-C.patch

About this initdb --builtin-locale option and analogous options 
elsewhere:  Maybe we should flip this around and provide a --libc-locale 
option, and have all the other providers just use the --locale option. 
This would be more consistent with the fact that it's libc that is 
special in this context.

Do we even need the "C" locale?  We have established that "C.UTF-8" is 
useful, but if that is easily available, who would need "C"?

Some changes in this patch appear to be just straight renamings, like in
src/backend/utils/init/postinit.c and 
src/bin/pg_upgrade/t/002_pg_upgrade.pl.  Maybe those should be put into 
the previous patch instead.

On the collation naming: My expectation would have been that the 
"C.UTF-8" locale would be exposed as the UCS_BASIC collation.  And the 
"C" locale as some other name (or not at all, see above).  You have this 
the other way around.

pgsql-hackers by date:

From: Amit Kapila
Date: 07 February 2024, 09:51:18
Subject: Re: Why is subscription/t/031_column_list.pl failing so much?

From: Erik Wienhold
Date: 07 February 2024, 09:54:21
Subject: Re: Psql meta-command conninfo+

Re: Built-in CTYPE provider - Mailing list pgsql-hackers

Previous

Next