Thread: Re: Optimization for lower(), upper(), casefold() functions.

Re: Optimization for lower(), upper(), casefold() functions.

From
Jeff Davis
Date:
On Fri, 2025-03-14 at 15:00 +0300, Alexander Borisov wrote:
> I tried adding a loop to create tables, and everything looks fine
> (v7).
> Also removed unnecessary (hanging) global variables.

Changed. I used a loop more similar to your first one (hash of arrays),
and I left case_map_special outside of the loop. It a different type of
array: rather than mapping to a codepoint, it maps to another index,
and I think it's a different case (and needs a different comment).

>
> It's all about not getting too carried away. In my vision these
> subroutines (user-defined functions) will have to be moved to perl
> module (.pm) in the future, after the patch is committed, so that the
> code can be used for Normalization Forms.

I prefer to generalize when we have the other code in place. As it was,
it was a bit confusing why the extra arguments and subroutines were
there.


Also, make_ranges() doesn't seem to do what is described in the
comment: it produces a single range. I changed "> $limit" to ">=
$limit", but then it generates 3 ranges, not two. I rewrote it to be
more clear what's going on:

* I used new looping logic that, at least to me, seems a bit simpler.

* I changed the ranges from an array of 4 elements to a hash with 3
keys, which is more descriptive when using it.

* I changed the terminology of a range so that it's Start and End,
because $from and $to are used by branches() to mean something else.

In branch():

* I got rid of the third element "VALUE" from the range. It was used to
be the start of the next range, but there was already code in the
caller to look ahead to the next range, so it didn't serve any purpose.

* If we rely on some compiler optimizations, it might be possible to
simplify branch() a bit, but I'm fine with the way it's done.


When looking around, I didn't find a lot of material discussing this
generated-branches approach. While it's mentioned a few places, I
couldn't even find a consistent name for it. If you know of something
where I can read some more analysis, please send a link.


Committed. Thank you!

Regards,
    Jeff Davis




Re: Optimization for lower(), upper(), casefold() functions.

From
Tom Lane
Date:
Jeff Davis <pgsql@j-davis.com> writes:
> Committed. Thank you!

crake doesn't like your perl style:

./src/common/unicode/generate-unicode_case_table.pl: Loop iterator is not lexical at line 638, column 2.  See page 108
ofPBP.  ([Variables::RequireLexicalLoopIterators] Severity: 5) 


            regards, tom lane



Re: Optimization for lower(), upper(), casefold() functions.

From
Jeff Davis
Date:
On Sat, Mar 15, 2025 at 1:11 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
Jeff Davis <pgsql@j-davis.com> writes:
> Committed. Thank you!

crake doesn't like your perl style:

./src/common/unicode/generate-unicode_case_table.pl: Loop iterator is not lexical at line 638, column 2.  See page 108 of PBP.

I suppose pgperltidy didn't catch that. I will fix it shortly.

Regards,
      Jeff Davis

Re: Optimization for lower(), upper(), casefold() functions.

From
Tom Lane
Date:
Jeff Davis <pgsql@j-davis.com> writes:
> On Sat, Mar 15, 2025 at 1:11 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> crake doesn't like your perl style:
>> ./src/common/unicode/generate-unicode_case_table.pl: Loop iterator is not
>> lexical at line 638, column 2.  See page 108 of PBP.

> I suppose pgperltidy didn't catch that. I will fix it shortly.

crake is running perlcritic, which is a whole different animal.

            regards, tom lane



Re: Optimization for lower(), upper(), casefold() functions.

From
Alexander Borisov
Date:
15.03.2025 23:07, Jeff Davis wrote:
> On Fri, 2025-03-14 at 15:00 +0300, Alexander Borisov wrote:
>> I tried adding a loop to create tables, and everything looks fine
>> (v7).

[...]

> I prefer to generalize when we have the other code in place. As it was,
> it was a bit confusing why the extra arguments and subroutines were
> there.

Thanks Jeff for the review of my patch!

> Also, make_ranges() doesn't seem to do what is described in the
> comment: it produces a single range. I changed "> $limit" to ">=
> $limit", but then it generates 3 ranges, not two. I rewrote it to be
> more clear what's going on:

[...]

I had some comments, but since the patch is already committed,
it's fine.

> When looking around, I didn't find a lot of material discussing this
> generated-branches approach. While it's mentioned a few places, I
> couldn't even find a consistent name for it. If you know of something
> where I can read some more analysis, please send a link.

Unfortunately, I can't send any links. I came up with this myself, for
my project https://github.com/lexbor/lexbor. I saw that Postgres could
improve the approach to Normalization Form, then I noticed that case is
not all good either. That's why I took it up.
In other words, this approach is purely my idea, and it is showing well.
I really like Postgres and want to help you improve it, and I'm very
good at Unicode.

I feel like my knowledge will be useful for Postgres.

Thank you Jeff again for your time.


Regards,
Alexander Borisov




Re: Optimization for lower(), upper(), casefold() functions.

From
Tom Lane
Date:
One more thing: I observe that headerscheck is now unhappy:

$ src/tools/pginclude/headerscheck
In file included from /tmp/headerscheck.yOpahZ/test.c:2:
./src/include/common/unicode_case_table.h:8598:24: warning: 'casekind_map' defined but not used [-Wunused-variable]
 static const pg_wchar *casekind_map[NCaseKind] =
                        ^~~~~~~~~~~~

It's not apparent to me why that table needs to be in a header
file and not in the sole user .c file?

Also, probably better to make it const:

-static const pg_wchar *casekind_map[NCaseKind] =
+static const pg_wchar * const casekind_map[NCaseKind] =

            regards, tom lane



Re: Optimization for lower(), upper(), casefold() functions.

From
Jeff Davis
Date:
On Tue, 2025-03-18 at 11:11 -0400, Tom Lane wrote:
> It's not apparent to me why that table needs to be in a header
> file and not in the sole user .c file?

Thank you, fixed.

> Also, probably better to make it const:
>
> -static const pg_wchar *casekind_map[NCaseKind] =
> +static const pg_wchar * const casekind_map[NCaseKind] =

Fixed also (except pgindent had a slightly different opinion about
spaces).

Was this a general suggestion, or did you see something in particular
that would make it more optimizable this way?

Regards,
    Jeff Davis




Re: Optimization for lower(), upper(), casefold() functions.

From
Tom Lane
Date:
Jeff Davis <pgsql@j-davis.com> writes:
> On Tue, 2025-03-18 at 11:11 -0400, Tom Lane wrote:
>> Also, probably better to make it const:
>> 
>> -static const pg_wchar *casekind_map[NCaseKind] =
>> +static const pg_wchar * const casekind_map[NCaseKind] =

> Was this a general suggestion, or did you see something in particular
> that would make it more optimizable this way?

No, just a general style position that tables that aren't supposed
to change should be const.  Cases like this are a tad insidious
because it looks like you did make the table const, only you didn't.

            regards, tom lane