Re: Improve the performance of Unicode Normalization Forms. - Mailing list pgsql-hackers

From Jeff Davis
Subject Re: Improve the performance of Unicode Normalization Forms.
Msg-id adffa1fbdb867d5a11c9a8211cde3bdb1e208823.camel@j-davis.com
In response to Re: Improve the performance of Unicode Normalization Forms.  (Alexander Borisov <lex.borisov@gmail.com>)
Responses Re: Improve the performance of Unicode Normalization Forms.
List pgsql-hackers
On Tue, 2025-07-08 at 22:42 +0300, Alexander Borisov wrote:
> Version 3 patches. In version 2 "make -s headerscheck" did not work.

I ran my own performance tests. What I did was get some test data from
ICU v76.1 by doing:

  cat icu4j/perf-tests/data/collation/Test* \
    | uconv -f utf-8 -t utf-8 -x nfc > ~/strings.nfc.txt

  cat icu4j/perf-tests/data/collation/Test* \
    | uconv -f utf-8 -t utf-8 -x nfd > ~/strings.nfd.txt

  export NORM_PERF_NFC_FILE=~/strings.nfc.txt
  export NORM_PERF_NFD_FILE=~/strings.nfd.txt

The first is about 8MB, the second 9MB (because NFD is slightly
larger).

Then I added some testing code to norm_test.c. It's not intended for
committing, just to run the test. Note that it requires setting
environment variables to find the input files.
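For readers who want to reproduce something similar, here is a minimal sketch of that kind of harness. It is not the code I added to norm_test.c. The names read_file, time_iterations, norm_fn and dummy_normalize are hypothetical, and dummy_normalize is only a stand-in for a real PG or ICU normalization call. It also assumes CPU time from clock() is an acceptable measure.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

/* Hypothetical callback: normalize len bytes of UTF-8 starting at buf. */
typedef void (*norm_fn) (const char *buf, size_t len);

/* Read a whole file into a malloc'd, NUL-terminated buffer; NULL on failure. */
char *
read_file(const char *path, size_t *len)
{
    FILE       *f;
    char       *buf = NULL;
    long        size = 0;

    /* path may be NULL if the environment variable is not set */
    if (path == NULL || (f = fopen(path, "rb")) == NULL)
        return NULL;

    if (fseek(f, 0, SEEK_END) == 0 && (size = ftell(f)) >= 0 &&
        fseek(f, 0, SEEK_SET) == 0 &&
        (buf = malloc((size_t) size + 1)) != NULL)
    {
        if (fread(buf, 1, (size_t) size, f) == (size_t) size)
        {
            buf[size] = '\0';
            *len = (size_t) size;
        }
        else
        {
            free(buf);
            buf = NULL;
        }
    }
    fclose(f);
    return buf;
}

/* Run fn over the buffer `iterations` times; return CPU seconds used. */
double
time_iterations(norm_fn fn, const char *buf, size_t len, int iterations)
{
    clock_t     start;

    assert(fn != NULL && iterations > 0);
    start = clock();
    for (int i = 0; i < iterations; i++)
        fn(buf, len);
    return (double) (clock() - start) / CLOCKS_PER_SEC;
}

/* Stand-in workload (assumption): touches every byte and counts calls. */
volatile unsigned long sink = 0;
int         norm_calls = 0;

void
dummy_normalize(const char *buf, size_t len)
{
    for (size_t i = 0; i < len; i++)
        sink += (unsigned char) buf[i];
    norm_calls++;
}
```

A driver would call read_file(getenv("NORM_PERF_NFC_FILE"), &len), then pass the buffer to time_iterations(..., 100) once per implementation and print the elapsed seconds.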

If patch v3j-0001 is applied, the test uses perfect hashing. If patches
v3j-0002-4 are applied, it uses your code. In either case, it
compares against ICU.

Results with perfect hashing (100 iterations):

  Normalization from NFC to  NFD with  PG: 010.009
  Normalization from NFC to  NFD with ICU: 001.580
  Normalization from NFC to NFKD with  PG: 009.376
  Normalization from NFC to NFKD with ICU: 000.857
  Normalization from NFD to  NFC with  PG: 016.026
  Normalization from NFD to  NFC with ICU: 001.205
  Normalization from NFD to NFKC with  PG: 015.903
  Normalization from NFD to NFKC with ICU: 000.654

Results with your code (100 iterations):

  Normalization from NFC to  NFD with  PG: 004.626
  Normalization from NFC to  NFD with ICU: 001.577
  Normalization from NFC to NFKD with  PG: 004.024
  Normalization from NFC to NFKD with ICU: 000.861
  Normalization from NFD to  NFC with  PG: 006.846
  Normalization from NFD to  NFC with ICU: 001.209
  Normalization from NFD to NFKC with  PG: 006.655
  Normalization from NFD to NFKC with ICU: 000.651
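
To put the two tables side by side, the ratios can be computed with a small sketch like the one below. The timings are copied from the results above (the ICU column is taken from the second run). The struct and function names are hypothetical and exist only for this illustration.

```c
#include <assert.h>
#include <stdio.h>

/* One row per conversion; times are the reported values for 100 iterations. */
typedef struct
{
    const char *label;
    double      hash_time;      /* PG with perfect hashing */
    double      patched_time;   /* PG with the v3 patches */
    double      icu_time;       /* ICU, from the second run */
} norm_timing;

const norm_timing timings[] = {
    {"NFC -> NFD", 10.009, 4.626, 1.577},
    {"NFC -> NFKD", 9.376, 4.024, 0.861},
    {"NFD -> NFC", 16.026, 6.846, 1.209},
    {"NFD -> NFKC", 15.903, 6.655, 0.651},
};

const int   n_timings = sizeof(timings) / sizeof(timings[0]);

/* How many times faster `after` is than `before`. */
double
speedup(double before, double after)
{
    assert(after > 0.0);
    return before / after;
}

/* Print patched-vs-hashing and ICU-vs-patched ratios for each row. */
void
print_speedups(void)
{
    for (int i = 0; i < n_timings; i++)
        printf("%-12s patched vs hash: %.2fx, ICU vs patched: %.2fx\n",
               timings[i].label,
               speedup(timings[i].hash_time, timings[i].patched_time),
               speedup(timings[i].patched_time, timings[i].icu_time));
}
```

Calling print_speedups() prints one line per conversion, which makes it easier to see where the remaining gap to ICU is largest.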

Your patches are a major improvement, but I'm trying to figure out why
ICU still wins by so much. Thoughts? I didn't investigate much myself
yet, so it's quite possible there's a bug in my test or something.

Regards,
    Jeff Davis


