Re: speed up unicode normalization quick check - Mailing list pgsql-hackers

From Mark Dilger
Subject Re: speed up unicode normalization quick check
Date
Msg-id B999AEF6-24A1-40E3-B548-7D28043EDCCE@enterprisedb.com
Whole thread Raw
In response to Re: speed up unicode normalization quick check  (John Naylor <john.naylor@2ndquadrant.com>)
Responses Re: speed up unicode normalization quick check  (John Naylor <john.naylor@2ndquadrant.com>)
List pgsql-hackers

> On Sep 18, 2020, at 9:41 AM, John Naylor <john.naylor@2ndquadrant.com> wrote:
>
> Attached is version 4, which excludes the output file from pgindent,
> to match recent commit 74d4608f5. Since it won't be indented again, I
> also tweaked the generator script to match pgindent for the typedef,
> since we don't want to lose what pgindent has fixed already. This last
> part isn't new to v4, but I thought I'd highlight it anyway.
>
> --
> John Naylor                https://www.2ndQuadrant.com/
> PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
>
<v4-0001-Tweak-the-set-of-candidate-multipliers-for-genera.patch><v4-0002-Use-perfect-hashing-for-NFC-Unicode-normalization.patch><v4-0003-Use-perfect-hashing-for-NKFC-Unicode-normalizatio.patch>

0002 and 0003 look good to me.  I like the way you cleaned up a bit with the unicode_norm_props struct, which makes the
codea bit more tidy, and on my compiler under -O2 it does not generate any extra runtime dereferences, as the compiler
cansee through the struct just fine.  My only concern would be if some other compilers might not see through the
struct,resulting in a runtime performance cost?  I wouldn't even ask, except that qc_hash_lookup is called in a fairly
tightloop. 

To clarify, the following changes to the generated code which remove the struct and corresponding dereferences (not
intendedas a patch submission) cause zero bytes of change in the compiled output for me on mac/clang, which is good,
andgenerate inconsequential changes on linux/gcc, which is also good, but I wonder if that is true for all compilers.
Inyour commit message for 0001 you mentioned testing on a multiplicity of compilers.  Did you do that for 0002 and 0003
aswell? 

diff --git a/src/common/unicode_norm.c b/src/common/unicode_norm.c
index 1714837e64..976b96e332 100644
--- a/src/common/unicode_norm.c
+++ b/src/common/unicode_norm.c
@@ -476,8 +476,11 @@ qc_compare(const void *p1, const void *p2)
        return (v1 - v2);
 }

-static const pg_unicode_normprops *
-qc_hash_lookup(pg_wchar ch, const unicode_norm_info * norminfo)
+static inline const pg_unicode_normprops *
+qc_hash_lookup(pg_wchar ch,
+                          const pg_unicode_normprops *normprops,
+                          qc_hash_func hash,
+                          int num_normprops)
 {
        int                     h;
        uint32          hashkey;
@@ -487,21 +490,21 @@ qc_hash_lookup(pg_wchar ch, const unicode_norm_info * norminfo)
         * in network order.
         */
        hashkey = htonl(ch);
-       h = norminfo->hash(&hashkey);
+       h = hash(&hashkey);

        /* An out-of-range result implies no match */
-       if (h < 0 || h >= norminfo->num_normprops)
+       if (h < 0 || h >= num_normprops)
                return NULL;

        /*
         * Since it's a perfect hash, we need only match to the specific codepoint
         * it identifies.
         */
-       if (ch != norminfo->normprops[h].codepoint)
+       if (ch != normprops[h].codepoint)
                return NULL;

        /* Success! */
-       return &norminfo->normprops[h];
+       return &normprops[h];
 }

 /*
@@ -518,7 +521,10 @@ qc_is_allowed(UnicodeNormalizationForm form, pg_wchar ch)
        switch (form)
        {
                case UNICODE_NFC:
-                       found = qc_hash_lookup(ch, &UnicodeNormInfo_NFC_QC);
+                       found = qc_hash_lookup(ch,
+                                                                  UnicodeNormProps_NFC_QC,
+                                                                  NFC_QC_hash_func,
+                                                                  NFC_QC_num_normprops);
                        break;
                case UNICODE_NFKC:
                        found = bsearch(&key,
diff --git a/src/include/common/unicode_normprops_table.h b/src/include/common/unicode_normprops_table.h
index 5e1e382af5..38300cfa12 100644
--- a/src/include/common/unicode_normprops_table.h
+++ b/src/include/common/unicode_normprops_table.h
@@ -13,13 +13,6 @@ typedef struct
        signed int      quickcheck:4;   /* really UnicodeNormalizationQC */
 } pg_unicode_normprops;

-typedef struct
-{
-       const pg_unicode_normprops *normprops;
-       qc_hash_func hash;
-       int                     num_normprops;
-} unicode_norm_info;
-
 static const pg_unicode_normprops UnicodeNormProps_NFC_QC[] = {
        {0x0300, UNICODE_NORM_QC_MAYBE},
        {0x0301, UNICODE_NORM_QC_MAYBE},
@@ -1583,12 +1576,6 @@ NFC_QC_hash_func(const void *key)
        return h[a % 2463] + h[b % 2463];
 }

-static const unicode_norm_info UnicodeNormInfo_NFC_QC = {
-       UnicodeNormProps_NFC_QC,
-       NFC_QC_hash_func,
-       1231
-};
-
 static const pg_unicode_normprops UnicodeNormProps_NFKC_QC[] = {
        {0x00A0, UNICODE_NORM_QC_NO},
        {0x00A8, UNICODE_NORM_QC_NO},


--
Mark Dilger
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company






pgsql-hackers by date:

Previous
From: Tom Lane
Date:
Subject: Re: XversionUpgrade tests broken by postfix operator removal
Next
From: Justin Pryzby
Date:
Subject: Re: Probable documentation errors or improvements