Hi, hackers!
I continue to work on optimizations and improvements in Unicode.
This patch changes the approach to storing data to determine whether
a code point is normalized (YES, NO, MAYBE).
In other words, we are talking about Normalization Quick Check.
Currently, a perfect hash is used to store knowledge about the
code point. I suggest abandoning the use of perfect hash and storing
code point data in bits packed into uint8.
This will give us almost direct access to the data.
The essence is simple.
1. Take the code point and divide it by 8. This gives us the index
in the uint8 table.
2. We need to get the bit for the code point. Take the code point
modulo 8 to get the desired offset.
Since we need to store one of three values (YES, NO, MAYBE) for each
code point, we are forced to use two bits. So everything is simply
multiplied by 2.
This is how obtaining a value by code point looks like in code:
uc = ch * 2;
index = uc / 8;
bit = uc % 8;
found = UnicodeNormProps_NFC_QC[index];
found >>= bit;
found &= ~(PG_UINT8_MAX << 2);
In terms of speed:
Used code points: 0-9, a-z, A-Z (ASCII)
With patch: tps = 256.279737
Without patch: tps = 218.902135
Used code points: а-я, А-Я (Cyrillic)
With patch: tps = 156.979941
Without patch: tps = 146.339438
Tested using pgbench.
# 620KB
select
is_normalized('0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ...',
'NFC');
# 1280KB
select
is_normalized('абвгдежзийклмнопрстуфхцчшщъыьэюяАБВГДЕЖЗИЙКЛМНОПРСТУФХЦЧШЩЪЫЬЭЮЯ...',
'NFC');
Thoughts out loud:
It is also known that in Postgres, Normalization Quick Check does
not perform a fair check for NFD and NFKD because the hash tables became
too large. For these forms, the result MAYBE is always returned.
We can implement a fair check for NFD and NFKD forms using the specified
approach. In this case, the binary file will increase in size by 99040
bytes. Considering that this is only used on the backend.
P.S.:
If you review and accept the patches [0] that accelerate
Unicode Normalization Forms, the speed of Quick Check will automatically
increase.
[0]
https://www.postgresql.org/message-id/flat/844d3dd7-2955-4794-95d1-7f4c13cb89fc%40gmail.com
--
Best regards,
Alexander Borisov