Re: Optimization of the is_normalized() function. - Mailing list pgsql-hackers

From Heikki Linnakangas
Subject Re: Optimization of the is_normalized() function.
Msg-id 2f0ff747-0f8f-4627-9022-51eb59ee27c0@iki.fi
In response to Re: Optimization of the is_normalized() function.  (Alexander Borisov <lex.borisov@gmail.com>)
List pgsql-hackers
On 21/10/2025 19:17, Alexander Borisov wrote:
> I continue to work on optimizations and improvements in Unicode.
> This patch changes the approach to storing data to determine whether
> a code point is normalized (YES, NO, MAYBE).
> In other words, we are talking about Normalization Quick Check.
> 
> Currently, a perfect hash is used to store knowledge about the
> code point. I suggest abandoning the use of perfect hash and storing
> code point data in bits packed into uint8.
> This will give us almost direct access to the data.
> 
> The essence is simple.
> 1. Take the code point and divide it by 8. This gives us the index
>    in the uint8 table.
> 2. We need to get the bit for the code point. Take the code point
>    modulo 8 to get the desired offset.
> 
> Since we need to store one of three values (YES, NO, MAYBE) for each
> code point, we are forced to use two bits. So everything is simply
> multiplied by 2.

I like the simplicity of that.

> This is how obtaining a value for a code point looks in code:
> 
> uc = ch * 2;
> index = uc / 8;
> bit = uc % 8;
> 
> found = UnicodeNormProps_NFC_QC[index];
> found >>= bit;
> found &= ~(PG_UINT8_MAX << 2);

This way of formulating it seems overly complicated, though. But that's a 
minor detail.

> In terms of speed:
> 
> Used code points: 0-9, a-z, A-Z (ASCII)
> With patch: tps = 256.279737
> Without patch: tps = 218.902135

That's nice, but ASCII codepoints are handled by the "ch < 
UNICODE_NFKC_QC_MIN" fastpath, and we could easily add such a fastpath 
to the perfect hash implementation too.

> Used code points: а-я, А-Я (Cyrillic)
> With patch: tps = 156.979941
> Without patch: tps = 146.339438

I was hoping for more gain here...

The new tables are larger, which might explain some of that. And that 
will hurt more in real-world workloads, as the tables consume cache 
space that could be used for other things.

Perhaps a 2-level radix tree would be more efficient here too? There are 
long ranges of YES values in the table, which could be collapsed 
together in a radix tree.

> Thoughts out loud:
> It is also known that in Postgres, Normalization Quick Check does
> not perform a full check for NFD and NFKD, because the hash tables
> became too large; for those forms, the result MAYBE is always returned.
> We could implement a full check for the NFD and NFKD forms using the
> approach described above. In that case, the binary would grow by 99040
> bytes, which may be acceptable considering that this is only used in
> the backend.

I don't have a good feeling for how common these operations are, so I'm 
not sure whether it's worth the tradeoff.

One thought with these quick check tables is that they don't need to be 
100% accurate: we can replace any YES or NO with MAYBE and still get a 
correct result. We could take advantage of that and compress the tables 
in a lossy fashion, or even use something like a bloom filter here. 
Again, I'm not sure what the right tradeoff would be.

- Heikki



