Re: Speeding up ruleutils' name de-duplication code, redux - Mailing list pgsql-hackers

From Tom Lane
Subject Re: Speeding up ruleutils' name de-duplication code, redux
Date
Msg-id 3472644.1725980806@sss.pgh.pa.us
Whole thread Raw
In response to Speeding up ruleutils' name de-duplication code, redux  (Tom Lane <tgl@sss.pgh.pa.us>)
Responses Re: Speeding up ruleutils' name de-duplication code, redux
List pgsql-hackers
David Rowley <dgrowleyml@gmail.com> writes:
> On Tue, 30 Jul 2024 at 10:14, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> On my development machine, it takes over 14 minutes to pg_upgrade
>> this, and it turns out that that time is largely spent in column
>> name de-duplication while deparsing the CHECK constraints.  The
>> attached patch reduces that to about 3m45s.

> I looked at the patch and tried it out.

Thanks for looking!

> This gives me what I'd expect to see. I wanted to ensure the point
> where you're switching to the hashing method was about the right
> place. It seems to be, at least for my test.

Yeah, I was just going by gut feel there.  It's good to have some
numbers showing it's not a totally silly choice.

> Perhaps you don't think it's worth the additional complexity, but I
> see that in both locations you're calling build_colinfo_names_hash(),
> it's done just after a call to expand_colnames_array_to(). I wondered
> if it was worthwhile unifying both of those functions maybe with a new
> name so that you don't need to loop over the always NULL element of
> the colnames[] array when building the hash table. This is likely
> quite a small overhead compared to the quadratic search you've
> removed, so it might not move the needle any. I just wanted to point
> it out as I've little else I can find to comment on.

Hmm, but there are quite a few expand_colnames_array_to calls that
are not associated with build_colinfo_names_hash.  On the whole it
feels like those are separate concerns that are better kept separate.

We could accomplish what you suggest by re-ordering the calls so that
we build the hash table before enlarging the array.  0001 attached
is the same as before (modulo line number changes from being rebased
up to HEAD) and then 0002 implements this idea on top.  On the whole
though I find 0002 fairly ugly and would prefer to stick to 0001.
I really doubt that scanning any newly-created column positions is
going to take long enough to justify intertwining things like this.

            regards, tom lane

diff --git a/src/backend/utils/adt/ruleutils.c b/src/backend/utils/adt/ruleutils.c
index ee1b7f3dc9..badbf111ee 100644
--- a/src/backend/utils/adt/ruleutils.c
+++ b/src/backend/utils/adt/ruleutils.c
@@ -224,6 +224,10 @@ typedef struct
  * of aliases to columns of the right input.  Thus, positions in the printable
  * column alias list are not necessarily one-for-one with varattnos of the
  * JOIN, so we need a separate new_colnames[] array for printing purposes.
+ *
+ * Finally, when dealing with wide tables we risk O(N^2) costs in assigning
+ * non-duplicate column names.  We ameliorate that by using a hash table that
+ * holds all the strings appearing in colnames, new_colnames, and parentUsing.
  */
 typedef struct
 {
@@ -291,6 +295,15 @@ typedef struct
     int           *leftattnos;        /* left-child varattnos of join cols, or 0 */
     int           *rightattnos;    /* right-child varattnos of join cols, or 0 */
     List       *usingNames;        /* names assigned to merged columns */
+
+    /*
+     * Hash table holding copies of all the strings appearing in this struct's
+     * colnames, new_colnames, and parentUsing.  We use a hash table only for
+     * sufficiently wide relations, and only during the colname-assignment
+     * functions set_relation_column_names and set_join_column_names;
+     * otherwise, names_hash is NULL.
+     */
+    HTAB       *names_hash;        /* entries are just strings */
 } deparse_columns;

 /* This macro is analogous to rt_fetch(), but for deparse_columns structs */
@@ -376,6 +389,9 @@ static bool colname_is_unique(const char *colname, deparse_namespace *dpns,
 static char *make_colname_unique(char *colname, deparse_namespace *dpns,
                                  deparse_columns *colinfo);
 static void expand_colnames_array_to(deparse_columns *colinfo, int n);
+static void build_colinfo_names_hash(deparse_columns *colinfo);
+static void add_to_names_hash(deparse_columns *colinfo, const char *name);
+static void destroy_colinfo_names_hash(deparse_columns *colinfo);
 static void identify_join_columns(JoinExpr *j, RangeTblEntry *jrte,
                                   deparse_columns *colinfo);
 static char *get_rtable_name(int rtindex, deparse_context *context);
@@ -4133,6 +4149,10 @@ has_dangerous_join_using(deparse_namespace *dpns, Node *jtnode)
  *
  * parentUsing is a list of all USING aliases assigned in parent joins of
  * the current jointree node.  (The passed-in list must not be modified.)
+ *
+ * Note that we do not use per-deparse_columns hash tables in this function.
+ * The number of names that need to be assigned should be small enough that
+ * we don't need to trouble with that.
  */
 static void
 set_using_names(deparse_namespace *dpns, Node *jtnode, List *parentUsing)
@@ -4408,6 +4428,9 @@ set_relation_column_names(deparse_namespace *dpns, RangeTblEntry *rte,
     colinfo->new_colnames = (char **) palloc(ncolumns * sizeof(char *));
     colinfo->is_new_col = (bool *) palloc(ncolumns * sizeof(bool));

+    /* If the RTE is wide enough, use a hash table to avoid O(N^2) costs */
+    build_colinfo_names_hash(colinfo);
+
     /*
      * Scan the columns, select a unique alias for each one, and store it in
      * colinfo->colnames and colinfo->new_colnames.  The former array has NULL
@@ -4443,6 +4466,7 @@ set_relation_column_names(deparse_namespace *dpns, RangeTblEntry *rte,
             colname = make_colname_unique(colname, dpns, colinfo);

             colinfo->colnames[i] = colname;
+            add_to_names_hash(colinfo, colname);
         }

         /* Put names of non-dropped columns in new_colnames[] too */
@@ -4456,6 +4480,9 @@ set_relation_column_names(deparse_namespace *dpns, RangeTblEntry *rte,
             changed_any = true;
     }

+    /* We're now done needing the colinfo's names_hash */
+    destroy_colinfo_names_hash(colinfo);
+
     /*
      * Set correct length for new_colnames[] array.  (Note: if columns have
      * been added, colinfo->num_cols includes them, which is not really quite
@@ -4526,6 +4553,9 @@ set_join_column_names(deparse_namespace *dpns, RangeTblEntry *rte,
     expand_colnames_array_to(colinfo, noldcolumns);
     Assert(colinfo->num_cols == noldcolumns);

+    /* If the RTE is wide enough, use a hash table to avoid O(N^2) costs */
+    build_colinfo_names_hash(colinfo);
+
     /*
      * Scan the join output columns, select an alias for each one, and store
      * it in colinfo->colnames.  If there are USING columns, set_using_names()
@@ -4563,6 +4593,7 @@ set_join_column_names(deparse_namespace *dpns, RangeTblEntry *rte,
         if (rte->alias == NULL)
         {
             colinfo->colnames[i] = real_colname;
+            add_to_names_hash(colinfo, real_colname);
             continue;
         }

@@ -4579,6 +4610,7 @@ set_join_column_names(deparse_namespace *dpns, RangeTblEntry *rte,
             colname = make_colname_unique(colname, dpns, colinfo);

             colinfo->colnames[i] = colname;
+            add_to_names_hash(colinfo, colname);
         }

         /* Remember if any assigned aliases differ from "real" name */
@@ -4677,6 +4709,7 @@ set_join_column_names(deparse_namespace *dpns, RangeTblEntry *rte,
             }
             else
                 colinfo->new_colnames[j] = child_colname;
+            add_to_names_hash(colinfo, colinfo->new_colnames[j]);
         }

         colinfo->is_new_col[j] = leftcolinfo->is_new_col[jc];
@@ -4726,6 +4759,7 @@ set_join_column_names(deparse_namespace *dpns, RangeTblEntry *rte,
             }
             else
                 colinfo->new_colnames[j] = child_colname;
+            add_to_names_hash(colinfo, colinfo->new_colnames[j]);
         }

         colinfo->is_new_col[j] = rightcolinfo->is_new_col[jc];
@@ -4740,6 +4774,9 @@ set_join_column_names(deparse_namespace *dpns, RangeTblEntry *rte,
     Assert(j == nnewcolumns);
 #endif

+    /* We're now done needing the colinfo's names_hash */
+    destroy_colinfo_names_hash(colinfo);
+
     /*
      * For a named join, print column aliases if we changed any from the child
      * names.  Unnamed joins cannot print aliases.
@@ -4762,38 +4799,59 @@ colname_is_unique(const char *colname, deparse_namespace *dpns,
     int            i;
     ListCell   *lc;

-    /* Check against already-assigned column aliases within RTE */
-    for (i = 0; i < colinfo->num_cols; i++)
-    {
-        char       *oldname = colinfo->colnames[i];
-
-        if (oldname && strcmp(oldname, colname) == 0)
-            return false;
-    }
-
     /*
-     * If we're building a new_colnames array, check that too (this will be
-     * partially but not completely redundant with the previous checks)
+     * If we have a hash table, consult that instead of linearly scanning the
+     * colinfo's strings.
      */
-    for (i = 0; i < colinfo->num_new_cols; i++)
+    if (colinfo->names_hash)
     {
-        char       *oldname = colinfo->new_colnames[i];
-
-        if (oldname && strcmp(oldname, colname) == 0)
+        if (hash_search(colinfo->names_hash,
+                        colname,
+                        HASH_FIND,
+                        NULL) != NULL)
             return false;
     }
-
-    /* Also check against USING-column names that must be globally unique */
-    foreach(lc, dpns->using_names)
+    else
     {
-        char       *oldname = (char *) lfirst(lc);
+        /* Check against already-assigned column aliases within RTE */
+        for (i = 0; i < colinfo->num_cols; i++)
+        {
+            char       *oldname = colinfo->colnames[i];

-        if (strcmp(oldname, colname) == 0)
-            return false;
+            if (oldname && strcmp(oldname, colname) == 0)
+                return false;
+        }
+
+        /*
+         * If we're building a new_colnames array, check that too (this will
+         * be partially but not completely redundant with the previous checks)
+         */
+        for (i = 0; i < colinfo->num_new_cols; i++)
+        {
+            char       *oldname = colinfo->new_colnames[i];
+
+            if (oldname && strcmp(oldname, colname) == 0)
+                return false;
+        }
+
+        /*
+         * Also check against names already assigned for parent-join USING
+         * cols
+         */
+        foreach(lc, colinfo->parentUsing)
+        {
+            char       *oldname = (char *) lfirst(lc);
+
+            if (strcmp(oldname, colname) == 0)
+                return false;
+        }
     }

-    /* Also check against names already assigned for parent-join USING cols */
-    foreach(lc, colinfo->parentUsing)
+    /*
+     * Also check against USING-column names that must be globally unique.
+     * These are not hashed, but there should be few of them.
+     */
+    foreach(lc, dpns->using_names)
     {
         char       *oldname = (char *) lfirst(lc);

@@ -4861,6 +4919,90 @@ expand_colnames_array_to(deparse_columns *colinfo, int n)
     }
 }

+/*
+ * build_colinfo_names_hash: optionally construct a hash table for colinfo
+ */
+static void
+build_colinfo_names_hash(deparse_columns *colinfo)
+{
+    HASHCTL        hash_ctl;
+    int            i;
+    ListCell   *lc;
+
+    /*
+     * Use a hash table only for RTEs with at least 32 columns.  (The cutoff
+     * is somewhat arbitrary, but let's choose it so that this code does get
+     * exercised in the regression tests.)
+     */
+    if (colinfo->num_cols < 32)
+        return;
+
+    /*
+     * Set up the hash table.  The entries are just strings with no other
+     * payload.
+     */
+    hash_ctl.keysize = NAMEDATALEN;
+    hash_ctl.entrysize = NAMEDATALEN;
+    hash_ctl.hcxt = CurrentMemoryContext;
+    colinfo->names_hash = hash_create("deparse_columns names",
+                                      colinfo->num_cols + colinfo->num_new_cols,
+                                      &hash_ctl,
+                                      HASH_ELEM | HASH_STRINGS | HASH_CONTEXT);
+
+    /*
+     * Preload the hash table with any names already present (these would have
+     * come from set_using_names).
+     */
+    for (i = 0; i < colinfo->num_cols; i++)
+    {
+        char       *oldname = colinfo->colnames[i];
+
+        if (oldname)
+            add_to_names_hash(colinfo, oldname);
+    }
+
+    for (i = 0; i < colinfo->num_new_cols; i++)
+    {
+        char       *oldname = colinfo->new_colnames[i];
+
+        if (oldname)
+            add_to_names_hash(colinfo, oldname);
+    }
+
+    foreach(lc, colinfo->parentUsing)
+    {
+        char       *oldname = (char *) lfirst(lc);
+
+        add_to_names_hash(colinfo, oldname);
+    }
+}
+
+/*
+ * add_to_names_hash: add a string to the names_hash, if we're using one
+ */
+static void
+add_to_names_hash(deparse_columns *colinfo, const char *name)
+{
+    if (colinfo->names_hash)
+        (void) hash_search(colinfo->names_hash,
+                           name,
+                           HASH_ENTER,
+                           NULL);
+}
+
+/*
+ * destroy_colinfo_names_hash: destroy hash table when done with it
+ */
+static void
+destroy_colinfo_names_hash(deparse_columns *colinfo)
+{
+    if (colinfo->names_hash)
+    {
+        hash_destroy(colinfo->names_hash);
+        colinfo->names_hash = NULL;
+    }
+}
+
 /*
  * identify_join_columns: figure out where columns of a join come from
  *
diff --git a/src/backend/utils/adt/ruleutils.c b/src/backend/utils/adt/ruleutils.c
index badbf111ee..63b09805e0 100644
--- a/src/backend/utils/adt/ruleutils.c
+++ b/src/backend/utils/adt/ruleutils.c
@@ -4407,6 +4407,10 @@ set_relation_column_names(deparse_namespace *dpns, RangeTblEntry *rte,
         }
     }

+    /* If the RTE is wide enough, use a hash table to avoid O(N^2) costs */
+    if (ncolumns >= 32)
+        build_colinfo_names_hash(colinfo);
+
     /*
      * Ensure colinfo->colnames has a slot for each column.  (It could be long
      * enough already, if we pushed down a name for the last column.)  Note:
@@ -4428,9 +4432,6 @@ set_relation_column_names(deparse_namespace *dpns, RangeTblEntry *rte,
     colinfo->new_colnames = (char **) palloc(ncolumns * sizeof(char *));
     colinfo->is_new_col = (bool *) palloc(ncolumns * sizeof(bool));

-    /* If the RTE is wide enough, use a hash table to avoid O(N^2) costs */
-    build_colinfo_names_hash(colinfo);
-
     /*
      * Scan the columns, select a unique alias for each one, and store it in
      * colinfo->colnames and colinfo->new_colnames.  The former array has NULL
@@ -4542,6 +4543,11 @@ set_join_column_names(deparse_namespace *dpns, RangeTblEntry *rte,
     leftcolinfo = deparse_columns_fetch(colinfo->leftrti, dpns);
     rightcolinfo = deparse_columns_fetch(colinfo->rightrti, dpns);

+    /* If the RTE is wide enough, use a hash table to avoid O(N^2) costs */
+    noldcolumns = list_length(rte->eref->colnames);
+    if (noldcolumns >= 32)
+        build_colinfo_names_hash(colinfo);
+
     /*
      * Ensure colinfo->colnames has a slot for each column.  (It could be long
      * enough already, if we pushed down a name for the last column.)  Note:
@@ -4549,13 +4555,9 @@ set_join_column_names(deparse_namespace *dpns, RangeTblEntry *rte,
      * were when the query was parsed, but we'll deal with that below.  We
      * only need entries in colnames for pre-existing columns.
      */
-    noldcolumns = list_length(rte->eref->colnames);
     expand_colnames_array_to(colinfo, noldcolumns);
     Assert(colinfo->num_cols == noldcolumns);

-    /* If the RTE is wide enough, use a hash table to avoid O(N^2) costs */
-    build_colinfo_names_hash(colinfo);
-
     /*
      * Scan the join output columns, select an alias for each one, and store
      * it in colinfo->colnames.  If there are USING columns, set_using_names()
@@ -4920,7 +4922,10 @@ expand_colnames_array_to(deparse_columns *colinfo, int n)
 }

 /*
- * build_colinfo_names_hash: optionally construct a hash table for colinfo
+ * build_colinfo_names_hash: construct a hash table for colinfo
+ *
+ * We use this only for sufficiently wide RTEs: currently, those with at
+ * least 32 columns.  That check is made at the callers though.
  */
 static void
 build_colinfo_names_hash(deparse_columns *colinfo)
@@ -4929,14 +4934,6 @@ build_colinfo_names_hash(deparse_columns *colinfo)
     int            i;
     ListCell   *lc;

-    /*
-     * Use a hash table only for RTEs with at least 32 columns.  (The cutoff
-     * is somewhat arbitrary, but let's choose it so that this code does get
-     * exercised in the regression tests.)
-     */
-    if (colinfo->num_cols < 32)
-        return;
-
     /*
      * Set up the hash table.  The entries are just strings with no other
      * payload.

pgsql-hackers by date:

Previous
From: Greg Sabino Mullane
Date:
Subject: Re: Jargon and acronyms on this mailing list
Next
From: Nathan Bossart
Date:
Subject: Re: Proposal to Enable/Disable Index using ALTER INDEX