Re: Improve the performance of Unicode Normalization Forms. - Mailing list pgsql-hackers

From Alexander Borisov
Subject Re: Improve the performance of Unicode Normalization Forms.
Msg-id 7859e5ef-a574-4199-a69b-6fee26711521@gmail.com
In response to Re: Improve the performance of Unicode Normalization Forms.  (Alexander Borisov <lex.borisov@gmail.com>)
List pgsql-hackers
Hi, Jeff, hackers!

As promised, here is the refactoring of the C code for Unicode
Normalization Forms.

In general terms, here's what has changed:
1. Recursion has been removed; the data is now generated by
     a Perl script.
2. Memory is no longer allocated as uint32 for the entire size;
     instead, uint8 is allocated for the entire size for the CCC cache,
     which boosts performance significantly.
3. The code for the unicode_normalize() function has been completely
     rewritten.
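
To illustrate item 2, here is a minimal sketch of a one-byte-per-code-point
CCC cache used for canonical reordering. It is not the code from the patch:
toy_get_ccc() and toy_canonical_order() are hypothetical names, and the tiny
switch stands in for the generated lookup tables. The assumption is that the
cache only needs to hold the Canonical Combining Class, which always fits in
a uint8.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical stand-in for the generated tables: returns the Canonical
 * Combining Class (CCC) for a few code points, 0 (starter) for the rest.
 */
static uint8_t
toy_get_ccc(uint32_t cp)
{
    switch (cp)
    {
        case 0x0301: return 230;    /* COMBINING ACUTE ACCENT */
        case 0x0323: return 220;    /* COMBINING DOT BELOW */
        case 0x0327: return 202;    /* COMBINING CEDILLA */
        default:     return 0;      /* treated as a starter */
    }
}

/*
 * Put combining marks into canonical order in place.  "ccc" is a
 * caller-supplied cache with one byte per code point, so each class is
 * looked up only once.
 */
static void
toy_canonical_order(uint32_t *cps, uint8_t *ccc, size_t len)
{
    assert(len == 0 || (cps != NULL && ccc != NULL));

    /* Fill the CCC cache once. */
    for (size_t i = 0; i < len; i++)
        ccc[i] = toy_get_ccc(cps[i]);

    /* Stable insertion sort; starters (ccc == 0) are never moved or crossed. */
    for (size_t i = 1; i < len; i++)
    {
        uint32_t cp = cps[i];
        uint8_t  cls = ccc[i];
        size_t   j = i;

        if (cls == 0)
            continue;

        while (j > 0 && ccc[j - 1] > cls)
        {
            cps[j] = cps[j - 1];
            ccc[j] = ccc[j - 1];
            j--;
        }
        cps[j] = cp;
        ccc[j] = cls;
    }
}
```

For example, the sequence U+0061 U+0301 U+0323 is reordered to
U+0061 U+0323 U+0301, because class 220 sorts before class 230. The cache
costs one byte per code point instead of four.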

I am confident that we have achieved excellent results.

Jeff's test:
Without patch:
     Normalization from NFC to  NFD with  PG: 009.121
     Normalization from NFC to NFKD with  PG: 009.048
     Normalization from NFD to  NFC with  PG: 014.525
     Normalization from NFD to NFKC with  PG: 014.380

With patch:
     Normalization from NFC to  NFD with  PG: 001.580
     Normalization from NFC to NFKD with  PG: 001.634
     Normalization from NFD to  NFC with  PG: 002.979
     Normalization from NFD to NFKC with  PG: 003.050

Test with ICU (with patch and ICU):
     Normalization from NFC to  NFD with  PG: 001.580
     Normalization from NFC to  NFD with ICU: 001.880
     Normalization from NFC to NFKD with  PG: 001.634
     Normalization from NFC to NFKD with ICU: 001.857

     Normalization from NFD to  NFC with  PG: 002.979
     Normalization from NFD to  NFC with ICU: 001.144
     Normalization from NFD to NFKC with  PG: 003.050
     Normalization from NFD to NFKC with ICU: 001.260

pgbench:
The test files were run through pgbench. They contain all code points
that need to be normalized.

NFC:
     Patch: tps = 9701.568161
     Without: tps = 6820.828104

NFD:
     Patch: tps = 2707.155148
     Without: tps = 1745.949174

NFKC:
     Patch: tps = 9893.952804
     Without: tps = 6697.358888

NFKD:
     Patch: tps = 2580.785909
     Without: tps = 1521.058417

To make the comparison with ICU fair, I corrected Jeff's patch:
we calculate the size of the final buffer ourselves, so I made ICU
do the same.

That is:
Get size:
     length = unorm_normalize(u_input, -1, form, 0, NULL, 0, &status);
Normalize:
     unorm_normalize(u_input, -1, form, 0, u_result, length, &status);

Otherwise, we were handing ICU one huge preallocated buffer to write
into, while our own code calculates the size of the buffer it needs.
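
The "get size, then normalize" pattern can be sketched without ICU as
follows. This is only an illustration under stated assumptions:
toy_decompose() and toy_decompose_alloc() are hypothetical, and the only
mapping known to the toy function is U+00E9 to U+0065 U+0301. With ICU
itself, the sizing call with a NULL buffer reports U_BUFFER_OVERFLOW_ERROR
in status, so status should be reset to U_ZERO_ERROR before the second call.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/*
 * Hypothetical toy decomposition.  Writes at most "cap" code points to
 * "out" and always returns the length the full result needs, so calling
 * it with out = NULL and cap = 0 only measures.
 */
static size_t
toy_decompose(const uint32_t *in, size_t in_len, uint32_t *out, size_t cap)
{
    size_t n = 0;

    for (size_t i = 0; i < in_len; i++)
    {
        if (in[i] == 0x00E9)    /* e with acute -> e + combining acute */
        {
            if (n < cap)
                out[n] = 0x0065;
            n++;
            if (n < cap)
                out[n] = 0x0301;
            n++;
        }
        else
        {
            if (n < cap)
                out[n] = in[i];
            n++;
        }
    }
    return n;
}

/*
 * Two passes: measure first, then allocate exactly that much and fill it.
 * Returns NULL on allocation failure; the caller frees the result.
 */
static uint32_t *
toy_decompose_alloc(const uint32_t *in, size_t in_len, size_t *out_len)
{
    size_t    need = toy_decompose(in, in_len, NULL, 0);
    uint32_t *out = malloc((need ? need : 1) * sizeof(uint32_t));
    size_t    written;

    if (out == NULL)
        return NULL;

    written = toy_decompose(in, in_len, out, need);
    assert(written == need);
    (void) written;             /* silence unused warning under NDEBUG */

    *out_len = need;
    return out;
}
```

Both sides of the benchmark then pay for the same work: one sizing pass,
one exact allocation, and one writing pass.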


-- 
Regards,
Alexander Borisov