On Tue, 2025-07-08 at 22:42 +0300, Alexander Borisov wrote:
> Version 3 patches. In version 2 "make -s headerscheck" did not work.

I ran my own performance tests. What I did was get some test data from
ICU v76.1 by doing:

  cat icu4j/perf-tests/data/collation/Test* \
    | uconv -f utf-8 -t utf-8 -x nfc > ~/strings.nfc.txt
  cat icu4j/perf-tests/data/collation/Test* \
    | uconv -f utf-8 -t utf-8 -x nfd > ~/strings.nfd.txt
  export NORM_PERF_NFC_FILE=~/strings.nfc.txt
  export NORM_PERF_NFD_FILE=~/strings.nfd.txt

The first is about 8MB, the second 9MB (because NFD is slightly
larger).

Then I added some testing code to norm_test.c. It's not intended for
committing, just to run the test. Note that it requires setting
environment variables to find the input files.
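
A harness of this kind can be sketched roughly as follows. This is only
a minimal illustration, not the actual code added to norm_test.c:
read_file, load_from_env, time_iterations and dummy_normalize are
hypothetical names, dummy_normalize merely stands in for the real PG or
ICU normalization call, and the use of clock() (CPU time) is an
assumption about how the timing might be done.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

/* Read the whole file into a malloc'd, NUL-terminated buffer (NULL on error). */
static char *
read_file(const char *path, size_t *len)
{
	FILE	   *f = fopen(path, "rb");
	char	   *buf;
	long		size;

	if (f == NULL)
		return NULL;
	if (fseek(f, 0, SEEK_END) != 0 || (size = ftell(f)) < 0)
	{
		fclose(f);
		return NULL;
	}
	rewind(f);
	buf = malloc((size_t) size + 1);
	if (buf == NULL || fread(buf, 1, (size_t) size, f) != (size_t) size)
	{
		free(buf);
		fclose(f);
		return NULL;
	}
	fclose(f);
	buf[size] = '\0';
	*len = (size_t) size;
	return buf;
}

/* Load the file named by an environment variable, e.g. NORM_PERF_NFC_FILE. */
static char *
load_from_env(const char *envvar, size_t *len)
{
	const char *path = getenv(envvar);

	return (path != NULL) ? read_file(path, len) : NULL;
}

/* Hypothetical callback type standing in for a normalization routine. */
typedef void (*norm_fn) (const char *input, size_t len);

static volatile unsigned long sink; /* keeps the dummy work from being optimized away */

/* Placeholder workload: NOT a real normalizer, it just touches every byte. */
static void
dummy_normalize(const char *input, size_t len)
{
	unsigned long sum = 0;

	for (size_t i = 0; i < len; i++)
		sum += (unsigned char) input[i];
	sink = sum;
}

/* Run fn over the input 'iterations' times; return elapsed CPU seconds. */
static double
time_iterations(norm_fn fn, const char *input, size_t len, int iterations)
{
	clock_t		start;

	assert(iterations > 0);
	start = clock();
	for (int i = 0; i < iterations; i++)
		fn(input, len);
	return (double) (clock() - start) / CLOCKS_PER_SEC;
}
```

A caller would do something like load_from_env("NORM_PERF_NFC_FILE",
&len), bail out if that returns NULL, and then pass the buffer to
time_iterations() with 100 iterations, once per normalizer under test.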

If patch v3j-0001 is applied, it's using perfect hashing. If patches
v3j-0002-4 are applied, it's using your code. In either case it
compares with ICU.

Results with perfect hashing (100 iterations):

  Normalization from NFC to NFD with PG:    010.009
  Normalization from NFC to NFD with ICU:   001.580
  Normalization from NFC to NFKD with PG:   009.376
  Normalization from NFC to NFKD with ICU:  000.857
  Normalization from NFD to NFC with PG:    016.026
  Normalization from NFD to NFC with ICU:   001.205
  Normalization from NFD to NFKC with PG:   015.903
  Normalization from NFD to NFKC with ICU:  000.654

Results with your code (100 iterations):

  Normalization from NFC to NFD with PG:    004.626
  Normalization from NFC to NFD with ICU:   001.577
  Normalization from NFC to NFKD with PG:   004.024
  Normalization from NFC to NFKD with ICU:  000.861
  Normalization from NFD to NFC with PG:    006.846
  Normalization from NFD to NFC with ICU:   001.209
  Normalization from NFD to NFKC with PG:   006.655
  Normalization from NFD to NFKC with ICU:  000.651

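To make the comparison easier to read, the ratios can be computed
directly from the figures above: the new code comes out a bit over 2x
faster than perfect hashing in every case, while ICU remains roughly 3x
to 10x faster than the new code. The snippet below is just a sketch of
that arithmetic; the struct and function names (Row, ratio,
print_ratios) are made up for illustration, and the ICU column uses the
numbers from the second run.

```c
#include <stdio.h>

/* One benchmark case, with timings copied from the results above. */
typedef struct
{
	const char *label;
	double		pg_hash;	/* PG with perfect hashing */
	double		pg_new;		/* PG with the new patches */
	double		icu;		/* ICU, from the second run */
} Row;

static const Row rows[] = {
	{"NFC to NFD", 10.009, 4.626, 1.577},
	{"NFC to NFKD", 9.376, 4.024, 0.861},
	{"NFD to NFC", 16.026, 6.846, 1.209},
	{"NFD to NFKC", 15.903, 6.655, 0.651},
};

/* How many times slower 'slow' is than 'fast'. */
static double
ratio(double slow, double fast)
{
	return slow / fast;
}

/* Print, per case, the speedup over perfect hashing and the remaining gap to ICU. */
static void
print_ratios(FILE *out)
{
	size_t		nrows = sizeof(rows) / sizeof(rows[0]);

	for (size_t i = 0; i < nrows; i++)
		fprintf(out, "%-12s hash/new: %5.2fx   new/ICU: %5.2fx\n",
				rows[i].label,
				ratio(rows[i].pg_hash, rows[i].pg_new),
				ratio(rows[i].pg_new, rows[i].icu));
}
```

Calling print_ratios(stdout) prints one line per normalization case.
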
Your patches are a major improvement, but I'm trying to figure out why
ICU still wins by so much. Thoughts? I didn't investigate much myself
yet, so it's quite possible there's a bug in my test or something.

Regards,
	Jeff Davis