Re: Reducing connection overhead in pg_upgrade compat check phase - Mailing list pgsql-hackers

From Justin Pryzby
Subject Re: Reducing connection overhead in pg_upgrade compat check phase
Msg-id 20230218204221.GI1653@telsasoft.com
In response to Reducing connection overhead in pg_upgrade compat check phase  (Daniel Gustafsson <daniel@yesql.se>)
List pgsql-hackers
On Fri, Feb 17, 2023 at 10:44:49PM +0100, Daniel Gustafsson wrote:
> When adding a check to pg_upgrade a while back I noticed in a profile that the
> cluster compatibility check phase spend a lot of time in connectToServer.  Some
> of this can be attributed to data type checks which each run serially in turn
> connecting to each database to run the check, and this seemed like a place
> where we can do better.

 src/bin/pg_upgrade/check.c      | 371 +++++++++++++++---------------
 src/bin/pg_upgrade/pg_upgrade.h |  28 ++-
 src/bin/pg_upgrade/version.c    | 394 ++++++++++++++------------------
 3 files changed, 373 insertions(+), 420 deletions(-)

And saves 50 LOC.

The stated goal of the patch is to reduce overhead.  But it only updates
a couple functions, and there are (I think) nine functions which loop
around all DBs.  If you want to reduce the overhead, I assumed you'd
cache the DB connection for all tests ... but then I tried it, and first
ran into max_connections, and then ran into EMFILE.  Which is probably
enough to kill my idea.

But maybe the existing patch could be phrased in terms of moving all the
per-db checks from functions into data structures (which has its own
merits).  Then there could be a single loop over the DBs which executes
all the checks.  The test runner could also test the major version and
handle the textfile output.

However (as Nathan mentioned) what's currently done shows *all* the
problems of a given type - if there were 9 DBs with 99 relations with
OIDs, it'd show all of them at once.  It'd be a big step backwards to
only show problems for the first problematic DB.

But maybe that's another opportunity to do better.  Right now, if I
run pg_upgrade, it'll show all the failing objects, but only for the
first check that fails.  After fixing them, it might tell me about a 2nd
failing check.  I've never run into multiple types of failing checks,
but I do know that needing to re-run pg_upgrade is annoying (see
3c0471b5f).

You talked about improving the two data type tests, which aren't
conditional on a maximum server version.  The smallest improvement is
when only those two checks are run (like on a developer upgrade
v16=>v16).  But when more checks are run during a production upgrade
like v13=>v16, you'd see a larger gain.

I fooled around with that idea in the attached patch.  I have no
particular interest in optimizing --check for large numbers of DBs, so
I'm not planning to pursue it further, but maybe it'll be useful to you.

About your original patch:

+static DataTypesUsageChecks data_types_usage_checks[] = {
+    /*
+     * Look for composite types that were made during initdb *or* belong to
+     * information_schema; that's important in case information_schema was
+     * dropped and reloaded.
+     *
+     * The cutoff OID here should match the source cluster's value of
+     * FirstNormalObjectId.  We hardcode it rather than using that C #define
+     * because, if that #define is ever changed, our own version's value is
+     * NOT what to use.  Eventually we may need a test on the source cluster's
+     * version to select the correct value.
+     */
+    {"Checking for system-defined composite types in user tables",
+     "tables_using_composite.txt",

I think this might be cleaner using "designated initializer" struct
initialization, rather than a positional, comma-separated list.

Maybe instead of putting all checks into an array of
DataTypesUsageChecks, they should be defined in separate arrays, and
then an array defined with the list of checks?

+             * If the check failed, terminate the umbrella status and print
+             * the specific status line of the check to indicate which it was
+             * before terminating with the detailed error message.
+             */
+            if (found)
+            {
+                PQfinish(conn);
 
-    base_query = psprintf("SELECT '%s'::pg_catalog.regtype AS oid",
-                          type_name);
+                report_status(PG_REPORT, "failed");
+                prep_status("%s", cur_check->status);
+                pg_log(PG_REPORT, "fatal");
+                pg_fatal("%s    %s", cur_check->fatal_check, output_path);
+            }

I think this loses the message localization/translation that currently
exists.  It could be written as prep_status(cur_check->status) or
prep_status("%s", _(cur_check->status)).  And _(cur_check->fatal_check).

-- 
Justin

