Thread: Re: Collation & ctype method table, and extension hooks

Re: Collation & ctype method table, and extension hooks

From

Jeff Davis

Date:

26 October 2024, 01:42:36

On Thu, 2024-10-24 at 10:05 +0200, Andreas Karlsson wrote:
> Why is there no pg_locale_builtin.c?

Just that it would be a fairly small file, but I'm fine with doing
that.

> I think adding an assert to create_pg_locale() which enforces valid
> there is always a combination of ctype_is_c and casemap would be
> good,
> similar to the collate field.

Good idea.

> Why are casemap and ctype_methods not the same struct? They seem very
> closely related.

The code impact was in fairly different places, so it seemed like a
nice way to break it out. I could combine them, but it would be a
fairly large patch.

> This commit makes me tempted to handle the ctype_is_c logic for
> character classes also in callbacks and remove the if in functions
> like
> pg_wc_ispunct(). But this si something that would need to be
> benchmarked.

That's a good idea. The reason collate_is_c is important is because
there are quite a few caller-specific optimizations, but that doesn't
seem to be true of ctype_is_c.

> I wonder if the bitmask idea isn't terrible for the branch predictor
> and
> that me may want one function per character class, but this is yet
> again
> something we need to benchmark.

Agreed -- a lot of work has gone into optimizing the regex code, and we
don't want a perf regression there. But I'm also not sure exactly which
kinds of tests I should be running for that.

> Is there a reason we allocate the icu_provider in
> create_pg_locale_icu
> with MemoryContextAllocZero when we intialize everything anyway? And
> similar for other providers.

Allocating and zeroing is a good defense against new optional methods
and fields which can safely default to zero.

> = v6-0011-Introduce-hooks-for-creating-custom-pg_locale_t.patch
>
> Looks good but seems like a quite painful API to use.

How is it painful and can we make it better?

> > * Have a CREATE LOCALE PROVIDER command and make "provider" an Oid
> > rather than a char ('b'/'i'/'c'). The v6 patches brings us close to
> > this point, but I'm not sure if we want to go this far in v18.
>
> Probably necessary but I hate all the DDL commands the way to SQL
> standard is written forces us to add.

There is some precedent for a DDL-like thing without new grammar:
pg_replication_origin_create(). I don't have a strong opinion on
whether to do that or not.

>
Regards,
    Jeff Davis

Re: Collation & ctype method table, and extension hooks

From

Andreas Karlsson

Date:

01 November 2024, 16:08:07

On 10/26/24 12:42 AM, Jeff Davis wrote:
> On Thu, 2024-10-24 at 10:05 +0200, Andreas Karlsson wrote:
>> Why is there no pg_locale_builtin.c?
> 
> Just that it would be a fairly small file, but I'm fine with doing
> that.

I think adding such a small file would make life easier for people new 
to the collation part of the code base. It would be a nice symmetry 
between collation providers and where code for them can be found.

>> Why are casemap and ctype_methods not the same struct? They seem very
>> closely related.
> 
> The code impact was in fairly different places, so it seemed like a
> nice way to break it out. I could combine them, but it would be a
> fairly large patch.

For me combining them would make the intention of the code easier to 
understand since aren't the casemap functions just a set of "ctype_methods"?

>> This commit makes me tempted to handle the ctype_is_c logic for
>> character classes also in callbacks and remove the if in functions
>> like
>> pg_wc_ispunct(). But this si something that would need to be
>> benchmarked.
> 
> That's a good idea. The reason collate_is_c is important is because
> there are quite a few caller-specific optimizations, but that doesn't
> seem to be true of ctype_is_c.

Yeah, that was my though too but I have not confirmed it.

>> I wonder if the bitmask idea isn't terrible for the branch predictor
>> and
>> that me may want one function per character class, but this is yet
>> again
>> something we need to benchmark.
> 
> Agreed -- a lot of work has gone into optimizing the regex code, and we
> don't want a perf regression there. But I'm also not sure exactly which
> kinds of tests I should be running for that.

I think we should at least try to find the worst case to see how big the 
performance hit for that is. And then after that try to figure out a 
more typical case benchmark.

>> = v6-0011-Introduce-hooks-for-creating-custom-pg_locale_t.patch
>>
>> Looks good but seems like a quite painful API to use.
> 
> How is it painful and can we make it better?

The painful part was mostly just a reference to that without a catalog 
table where new providers can be added we would need to add collations 
for our new custom provider on some already existing provider and then 
do for example some pattern matching on the name of the new collation. 
Really ugly but works.

I am thinking of implementing ICU4x as an external extension to try out 
the hook, but for the in-core contrib module we likely want to use 
something which does not require an external dependency. Or what do you 
think?

Andreas

Re: Collation & ctype method table, and extension hooks

From

Jeff Davis

Date:

01 November 2024, 21:17:09

On Fri, 2024-11-01 at 14:08 +0100, Andreas Karlsson wrote:
> > Agreed -- a lot of work has gone into optimizing the regex code,
> > and we
> > don't want a perf regression there. But I'm also not sure exactly
> > which
> > kinds of tests I should be running for that.
>
> I think we should at least try to find the worst case to see how big
> the
> performance hit for that is. And then after that try to figure out a
> more typical case benchmark.

What I had in mind was:

  * a large table with a single ~100KiB text field
  * a scan with a case insensitive regex that uses some character
classes

Does that sound like a worst case?

> The painful part was mostly just a reference to that without a
> catalog
> table where new providers can be added we would need to add
> collations
> for our new custom provider on some already existing provider and
> then
> do for example some pattern matching on the name of the new
> collation.
> Really ugly but works.

To add a catalog table for the locale providers, the main challenge is
around the database default collation and, relatedly, initdb. Do you
have some ideas around that?

Regards,
    Jeff Davis