Speeding up ruleutils' name de-duplication code, redux - Mailing list pgsql-hackers

From Tom Lane
Subject Speeding up ruleutils' name de-duplication code, redux
Date
Msg-id 2885468.1722291250@sss.pgh.pa.us
Whole thread Raw
List pgsql-hackers
When deparsing queries or expressions, ruleutils.c has to generate
unique names for RTEs and columns of RTEs.  (Often, they're unique
already, but this isn't reliably true.)  The original logic for that
involved just strcmp'ing a proposed name against all the ones already
assigned, which obviously is O(N^2) in the number of names being
considered.  Back in commit 8004953b5, we fixed that problem for
generation of unique RTE names, by using a hash table to remember the
already-assigned names.  However, that commit did not touch the logic
for de-duplicating the column names within each RTE, explaining
    
    In principle the same problem applies to the column-name-de-duplication
    code; but in practice that seems to be less of a problem, first because
    N is limited since we don't support extremely wide tables, and second
    because duplicate column names within an RTE are fairly rare, so that in
    practice the cost is more like O(N^2) not O(N^3).  It would be very much
    messier to fix the column-name code, so for now I've left that alone.

But I think the time has come to do something about it.  In [1]
I presented this Perl script to generate a database that gives
pg_upgrade indigestion:

-----
for (my $i = 0; $i < 100; $i++)
{
    print "CREATE TABLE test_inh_check$i (\n";
    for (my $j = 0; $j < 1000; $j++)
    {
        print "a$j float check (a$j > 10.2),\n";
    }
    print "b float);\n";
    print "CREATE TABLE test_inh_check_child$i() INHERITS(test_inh_check$i);\n";
}
-----

On my development machine, it takes over 14 minutes to pg_upgrade
this, and it turns out that that time is largely spent in column
name de-duplication while deparsing the CHECK constraints.  The
attached patch reduces that to about 3m45s.

(I think that we ought to reconsider MergeConstraintsIntoExisting's
use of deparsing to compare check constraints: it'd be faster and
probably more reliable to apply attnum translation to one parsetree
and then use equal().  But that's a matter for a different patch, and
this patch would still be useful for the pg_dump side of the problem.)

I was able to avoid a lot of the complexity I'd feared before by not
attempting to use hashing during set_using_names(), which only has to
consider columns merged by USING clauses, so it shouldn't have enough
of a performance problem to be worth touching.  The hashing code needs
to be optional anyway because it's unlikely to be a win for narrow
tables, so we can simply ignore it until we reach the potentially
expensive steps.  Also, things are already factored in such a way that
we only need to have one hashtable at a time, so this shouldn't cause
any large memory bloat.

I'll park this in the next CF.

            regards, tom lane

[1] https://www.postgresql.org/message-id/2422717.1722201869%40sss.pgh.pa.us

diff --git a/src/backend/utils/adt/ruleutils.c b/src/backend/utils/adt/ruleutils.c
index 653685bffc..9eb4b858ff 100644
--- a/src/backend/utils/adt/ruleutils.c
+++ b/src/backend/utils/adt/ruleutils.c
@@ -223,6 +223,10 @@ typedef struct
  * of aliases to columns of the right input.  Thus, positions in the printable
  * column alias list are not necessarily one-for-one with varattnos of the
  * JOIN, so we need a separate new_colnames[] array for printing purposes.
+ *
+ * Finally, when dealing with wide tables we risk O(N^2) costs in assigning
+ * non-duplicate column names.  We ameliorate that by using a hash table that
+ * holds all the strings appearing in colnames, new_colnames, and parentUsing.
  */
 typedef struct
 {
@@ -290,6 +294,15 @@ typedef struct
     int           *leftattnos;        /* left-child varattnos of join cols, or 0 */
     int           *rightattnos;    /* right-child varattnos of join cols, or 0 */
     List       *usingNames;        /* names assigned to merged columns */
+
+    /*
+     * Hash table holding copies of all the strings appearing in this struct's
+     * colnames, new_colnames, and parentUsing.  We use a hash table only for
+     * sufficiently wide relations, and only during the colname-assignment
+     * functions set_relation_column_names and set_join_column_names;
+     * otherwise, names_hash is NULL.
+     */
+    HTAB       *names_hash;        /* entries are just strings */
 } deparse_columns;

 /* This macro is analogous to rt_fetch(), but for deparse_columns structs */
@@ -375,6 +388,9 @@ static bool colname_is_unique(const char *colname, deparse_namespace *dpns,
 static char *make_colname_unique(char *colname, deparse_namespace *dpns,
                                  deparse_columns *colinfo);
 static void expand_colnames_array_to(deparse_columns *colinfo, int n);
+static void build_colinfo_names_hash(deparse_columns *colinfo);
+static void add_to_names_hash(deparse_columns *colinfo, const char *name);
+static void destroy_colinfo_names_hash(deparse_columns *colinfo);
 static void identify_join_columns(JoinExpr *j, RangeTblEntry *jrte,
                                   deparse_columns *colinfo);
 static char *get_rtable_name(int rtindex, deparse_context *context);
@@ -4136,6 +4152,10 @@ has_dangerous_join_using(deparse_namespace *dpns, Node *jtnode)
  *
  * parentUsing is a list of all USING aliases assigned in parent joins of
  * the current jointree node.  (The passed-in list must not be modified.)
+ *
+ * Note that we do not use per-deparse_columns hash tables in this function.
+ * The number of names that need to be assigned should be small enough that
+ * we don't need to trouble with that.
  */
 static void
 set_using_names(deparse_namespace *dpns, Node *jtnode, List *parentUsing)
@@ -4411,6 +4431,9 @@ set_relation_column_names(deparse_namespace *dpns, RangeTblEntry *rte,
     colinfo->new_colnames = (char **) palloc(ncolumns * sizeof(char *));
     colinfo->is_new_col = (bool *) palloc(ncolumns * sizeof(bool));

+    /* If the RTE is wide enough, use a hash table to avoid O(N^2) costs */
+    build_colinfo_names_hash(colinfo);
+
     /*
      * Scan the columns, select a unique alias for each one, and store it in
      * colinfo->colnames and colinfo->new_colnames.  The former array has NULL
@@ -4446,6 +4469,7 @@ set_relation_column_names(deparse_namespace *dpns, RangeTblEntry *rte,
             colname = make_colname_unique(colname, dpns, colinfo);

             colinfo->colnames[i] = colname;
+            add_to_names_hash(colinfo, colname);
         }

         /* Put names of non-dropped columns in new_colnames[] too */
@@ -4459,6 +4483,9 @@ set_relation_column_names(deparse_namespace *dpns, RangeTblEntry *rte,
             changed_any = true;
     }

+    /* We're now done needing the colinfo's names_hash */
+    destroy_colinfo_names_hash(colinfo);
+
     /*
      * Set correct length for new_colnames[] array.  (Note: if columns have
      * been added, colinfo->num_cols includes them, which is not really quite
@@ -4529,6 +4556,9 @@ set_join_column_names(deparse_namespace *dpns, RangeTblEntry *rte,
     expand_colnames_array_to(colinfo, noldcolumns);
     Assert(colinfo->num_cols == noldcolumns);

+    /* If the RTE is wide enough, use a hash table to avoid O(N^2) costs */
+    build_colinfo_names_hash(colinfo);
+
     /*
      * Scan the join output columns, select an alias for each one, and store
      * it in colinfo->colnames.  If there are USING columns, set_using_names()
@@ -4566,6 +4596,7 @@ set_join_column_names(deparse_namespace *dpns, RangeTblEntry *rte,
         if (rte->alias == NULL)
         {
             colinfo->colnames[i] = real_colname;
+            add_to_names_hash(colinfo, real_colname);
             continue;
         }

@@ -4582,6 +4613,7 @@ set_join_column_names(deparse_namespace *dpns, RangeTblEntry *rte,
             colname = make_colname_unique(colname, dpns, colinfo);

             colinfo->colnames[i] = colname;
+            add_to_names_hash(colinfo, colname);
         }

         /* Remember if any assigned aliases differ from "real" name */
@@ -4680,6 +4712,7 @@ set_join_column_names(deparse_namespace *dpns, RangeTblEntry *rte,
             }
             else
                 colinfo->new_colnames[j] = child_colname;
+            add_to_names_hash(colinfo, colinfo->new_colnames[j]);
         }

         colinfo->is_new_col[j] = leftcolinfo->is_new_col[jc];
@@ -4729,6 +4762,7 @@ set_join_column_names(deparse_namespace *dpns, RangeTblEntry *rte,
             }
             else
                 colinfo->new_colnames[j] = child_colname;
+            add_to_names_hash(colinfo, colinfo->new_colnames[j]);
         }

         colinfo->is_new_col[j] = rightcolinfo->is_new_col[jc];
@@ -4743,6 +4777,9 @@ set_join_column_names(deparse_namespace *dpns, RangeTblEntry *rte,
     Assert(j == nnewcolumns);
 #endif

+    /* We're now done needing the colinfo's names_hash */
+    destroy_colinfo_names_hash(colinfo);
+
     /*
      * For a named join, print column aliases if we changed any from the child
      * names.  Unnamed joins cannot print aliases.
@@ -4765,38 +4802,59 @@ colname_is_unique(const char *colname, deparse_namespace *dpns,
     int            i;
     ListCell   *lc;

-    /* Check against already-assigned column aliases within RTE */
-    for (i = 0; i < colinfo->num_cols; i++)
-    {
-        char       *oldname = colinfo->colnames[i];
-
-        if (oldname && strcmp(oldname, colname) == 0)
-            return false;
-    }
-
     /*
-     * If we're building a new_colnames array, check that too (this will be
-     * partially but not completely redundant with the previous checks)
+     * If we have a hash table, consult that instead of linearly scanning the
+     * colinfo's strings.
      */
-    for (i = 0; i < colinfo->num_new_cols; i++)
+    if (colinfo->names_hash)
     {
-        char       *oldname = colinfo->new_colnames[i];
-
-        if (oldname && strcmp(oldname, colname) == 0)
+        if (hash_search(colinfo->names_hash,
+                        colname,
+                        HASH_FIND,
+                        NULL) != NULL)
             return false;
     }
-
-    /* Also check against USING-column names that must be globally unique */
-    foreach(lc, dpns->using_names)
+    else
     {
-        char       *oldname = (char *) lfirst(lc);
+        /* Check against already-assigned column aliases within RTE */
+        for (i = 0; i < colinfo->num_cols; i++)
+        {
+            char       *oldname = colinfo->colnames[i];

-        if (strcmp(oldname, colname) == 0)
-            return false;
+            if (oldname && strcmp(oldname, colname) == 0)
+                return false;
+        }
+
+        /*
+         * If we're building a new_colnames array, check that too (this will
+         * be partially but not completely redundant with the previous checks)
+         */
+        for (i = 0; i < colinfo->num_new_cols; i++)
+        {
+            char       *oldname = colinfo->new_colnames[i];
+
+            if (oldname && strcmp(oldname, colname) == 0)
+                return false;
+        }
+
+        /*
+         * Also check against names already assigned for parent-join USING
+         * cols
+         */
+        foreach(lc, colinfo->parentUsing)
+        {
+            char       *oldname = (char *) lfirst(lc);
+
+            if (strcmp(oldname, colname) == 0)
+                return false;
+        }
     }

-    /* Also check against names already assigned for parent-join USING cols */
-    foreach(lc, colinfo->parentUsing)
+    /*
+     * Also check against USING-column names that must be globally unique.
+     * These are not hashed, but there should be few of them.
+     */
+    foreach(lc, dpns->using_names)
     {
         char       *oldname = (char *) lfirst(lc);

@@ -4864,6 +4922,90 @@ expand_colnames_array_to(deparse_columns *colinfo, int n)
     }
 }

+/*
+ * build_colinfo_names_hash: optionally construct a hash table for colinfo
+ */
+static void
+build_colinfo_names_hash(deparse_columns *colinfo)
+{
+    HASHCTL        hash_ctl;
+    int            i;
+    ListCell   *lc;
+
+    /*
+     * Use a hash table only for RTEs with at least 32 columns.  (The cutoff
+     * is somewhat arbitrary, but let's choose it so that this code does get
+     * exercised in the regression tests.)
+     */
+    if (colinfo->num_cols < 32)
+        return;
+
+    /*
+     * Set up the hash table.  The entries are just strings with no other
+     * payload.
+     */
+    hash_ctl.keysize = NAMEDATALEN;
+    hash_ctl.entrysize = NAMEDATALEN;
+    hash_ctl.hcxt = CurrentMemoryContext;
+    colinfo->names_hash = hash_create("deparse_columns names",
+                                      colinfo->num_cols + colinfo->num_new_cols,
+                                      &hash_ctl,
+                                      HASH_ELEM | HASH_STRINGS | HASH_CONTEXT);
+
+    /*
+     * Preload the hash table with any names already present (these would have
+     * come from set_using_names).
+     */
+    for (i = 0; i < colinfo->num_cols; i++)
+    {
+        char       *oldname = colinfo->colnames[i];
+
+        if (oldname)
+            add_to_names_hash(colinfo, oldname);
+    }
+
+    for (i = 0; i < colinfo->num_new_cols; i++)
+    {
+        char       *oldname = colinfo->new_colnames[i];
+
+        if (oldname)
+            add_to_names_hash(colinfo, oldname);
+    }
+
+    foreach(lc, colinfo->parentUsing)
+    {
+        char       *oldname = (char *) lfirst(lc);
+
+        add_to_names_hash(colinfo, oldname);
+    }
+}
+
+/*
+ * add_to_names_hash: add a string to the names_hash, if we're using one
+ */
+static void
+add_to_names_hash(deparse_columns *colinfo, const char *name)
+{
+    if (colinfo->names_hash)
+        (void) hash_search(colinfo->names_hash,
+                           name,
+                           HASH_ENTER,
+                           NULL);
+}
+
+/*
+ * destroy_colinfo_names_hash: destroy hash table when done with it
+ */
+static void
+destroy_colinfo_names_hash(deparse_columns *colinfo)
+{
+    if (colinfo->names_hash)
+    {
+        hash_destroy(colinfo->names_hash);
+        colinfo->names_hash = NULL;
+    }
+}
+
 /*
  * identify_join_columns: figure out where columns of a join come from
  *

pgsql-hackers by date:

Previous
From: Dean Rasheed
Date:
Subject: Re: Optimize mul_var() for var1ndigits >= 8
Next
From: Daniel Gustafsson
Date:
Subject: Re: ecdh support causes unnecessary roundtrips