Thread: Can we get rid of repeated queries from pg_dump?
Hi,
I have the following case: a local pg_dump (v15) connecting to a remote
PostgreSQL (v12). I'm trying to get just the schema (pg_dump -s).

It's taking very long, which is kinda OK given that there is long distance
and latency, but I got curious and checked the queries that pg_dump was
running (select * from pg_stat_activity where application_name = 'pg_dump').
And I noticed that many of these queries repeat many times.

The ones that I noticed were:

SELECT pg_catalog.format_type('2281'::pg_catalog.oid, NULL)

around the time that

SELECT proretset, prosrc, probin, provolatile, proisstrict, prosecdef, lanname, proconfig, procost, prorows, pg_catalog.pg_get_function_arguments(p.oid) AS funcargs, pg_catalog.pg_get_function_identity_arguments(p.oid) AS funciargs, pg_catalog.pg_get_function_result(p.oid) AS funcresult, proleakproof, array_to_string(protrftypes, ' ') AS protrftypes, proparallel, prokind, prosupport, NULL AS prosqlbody FROM pg_catalog.pg_proc p, pg_catalog.pg_language l WHERE p.oid = '60188'::pg_catalog.oid AND l.oid = p.prolang

was called too.

It seems that for every function, pg_dump is getting its data, and then runs
format_type on each parameter/output type? I'm mostly guessing here, as
I didn't read the code.

Wouldn't it be possible to get all the type formats at once, and cache them
in pg_dump? Or at the very least reuse already-received information?

Unfortunately it seems I can't run pg_dump closer to the db server, and the
latency of queries is killing me. It's been 15 minutes, and pg_dump (called:
pg_dump -v -s -f schema.dump, with env variables configuring the db
connection) hasn't written even a single byte to schema.dump.

depesz
hubert depesz lubaczewski <depesz@depesz.com> writes:
> It seems that for every function, pg_dump is getting it's data, and then
> runs format_type on each parameter/output type? I'm mostly guessing
> here, as I didn't read the code.
> Wouldn't it be possible to get all type formats at once, and cache them
> in pg_dump? Or at the very least reuse already received information?

Send a patch ...

			regards, tom lane
On Thu, Aug 26, 2021 at 10:02:07AM -0400, Tom Lane wrote:
> hubert depesz lubaczewski <depesz@depesz.com> writes:
> > It seems that for every function, pg_dump is getting it's data, and then
> > runs format_type on each parameter/output type? I'm mostly guessing
> > here, as I didn't read the code.
> > Wouldn't it be possible to get all type formats at once, and cache them
> > in pg_dump? Or at the very least reuse already received information?
> Send a patch ...

Yeah, that's not going to work, my C skills are next-to-none :(

I guess I'll have to wait till someone else will assume it's a problem,
someone with skills to do something about it.

Best regards,

depesz
hubert depesz lubaczewski <depesz@depesz.com> writes:
> On Thu, Aug 26, 2021 at 10:02:07AM -0400, Tom Lane wrote:
>> hubert depesz lubaczewski <depesz@depesz.com> writes:
>>> Wouldn't it be possible to get all type formats at once, and cache them
>>> in pg_dump? Or at the very least reuse already received information?

>> Send a patch ...

> Yeah, that's not going to work, my C skills are next-to-none :(
> I guess I'll have to wait till someone else will assume it's a problem,
> someone with skills to do something about it.

Well, you could move it forward by doing the legwork to identify which
queries are worth merging.  Is it really sane to do a global "select
format_type() from pg_type" query and save all the results on the client
side?  I wonder whether there are cases where that'd be a net loss.
You could do the experimentation to figure that out without necessarily
having the C skills to make pg_dump actually do it.

			regards, tom lane
On Thu, Aug 26, 2021 at 10:20:29AM -0400, Tom Lane wrote:
> hubert depesz lubaczewski <depesz@depesz.com> writes:
> > On Thu, Aug 26, 2021 at 10:02:07AM -0400, Tom Lane wrote:
> >> hubert depesz lubaczewski <depesz@depesz.com> writes:
> >>> Wouldn't it be possible to get all type formats at once, and cache them
> >>> in pg_dump? Or at the very least reuse already received information?
>
> >> Send a patch ...
>
> > Yeah, that's not going to work, my C skills are next-to-none :(
> > I guess I'll have to wait till someone else will assume it's a problem,
> > someone with skills to do something about it.
>
> Well, you could move it forward by doing the legwork to identify which
> queries are worth merging.  Is it really sane to do a global "select

Sure. On it. Will report back when I'll have more info.

Best regards,

depesz
On 8/26/21 1:44 AM, hubert depesz lubaczewski wrote:
> Hi,
> I have following case: local pg_dump (v15) connecting to remote
> PostgreSQL (v12).

So you are using a dev version of pg_dump or is that a typo?

> It's been 15 minutes, and pg_dump (called: pg_dump -v -s -f schema.dump,
> with env variables configuring db connection) hasn't written even single
> byte to schema.dump)

What happens if you run without the -v?

> depesz

--
Adrian Klaver
adrian.klaver@aklaver.com
On Thu, Aug 26, 2021 at 07:34:26AM -0700, Adrian Klaver wrote:
> On 8/26/21 1:44 AM, hubert depesz lubaczewski wrote:
> > Hi,
> > I have following case: local pg_dump (v15) connecting to remote
> > PostgreSQL (v12).
> So you are using a dev version of pg_dump or is that a typo?

Yes. I'm running pg_dump from my computer to (very) remote db server.

> > It's been 15 minutes, and pg_dump (called: pg_dump -v -s -f schema.dump,
> > with env variables configuring db connection) hasn't written even single
> > byte to schema.dump)
> What happens if you run without the -v?

Well, I guess it works, but with no output I can't judge how fast.
Definitely doesn't seem to be going any faster.

Best regards,

depesz
On 8/26/21 7:35 AM, hubert depesz lubaczewski wrote:
> On Thu, Aug 26, 2021 at 07:34:26AM -0700, Adrian Klaver wrote:
>> On 8/26/21 1:44 AM, hubert depesz lubaczewski wrote:
>>> Hi,
>>> I have following case: local pg_dump (v15) connecting to remote
>>> PostgreSQL (v12).
>> So you are using a dev version of pg_dump or is that a typo?
>
> Yes. I'm running pg_dump from my computer to (very) remote db server.

SSHing and dumping on the remote is out as a short term solution?

> Well, I guess it works, but with no output I can't judge how fast.
> Definitely doesn't seem to be going any faster.

Unknown slow, that didn't help.

> Best regards,
>
> depesz

--
Adrian Klaver
adrian.klaver@aklaver.com
On Thu, Aug 26, 2021 at 07:46:46AM -0700, Adrian Klaver wrote:
> On 8/26/21 7:35 AM, hubert depesz lubaczewski wrote:
> > On Thu, Aug 26, 2021 at 07:34:26AM -0700, Adrian Klaver wrote:
> > > On 8/26/21 1:44 AM, hubert depesz lubaczewski wrote:
> > > > Hi,
> > > > I have following case: local pg_dump (v15) connecting to remote
> > > > PostgreSQL (v12).
> > > So you are using a dev version of pg_dump or is that a typo?
> >
> > Yes. I'm running pg_dump from my computer to (very) remote db server.
> SSHing and dumping on the remote is out as a short term solution?

As I mentioned in the original post - I can't run pg_dump closer to the
server. SSH is not available, at least for me.

Anyway - I got the dump, so I am good for now, but I think that this could
be improved, so I'll work on getting some stats on queries.

Best regards,

depesz
On Thu, Aug 26, 2021 at 10:20:29AM -0400, Tom Lane wrote:
> Well, you could move it forward by doing the legwork to identify which
> queries are worth merging.  Is it really sane to do a global "select
> format_type() from pg_type" query and save all the results on the client
> side?  I wonder whether there are cases where that'd be a net loss.
> You could do the experimentation to figure that out without necessarily
> having the C skills to make pg_dump actually do it.

So, I got some info.

First, some stats. The DB contains:
- 14 extensions
- 1 aggregate
- 107 functions
- 5 schemas
- 5 sequences
- 188 logged tables
- 1 unlogged table
- 206 "normal" indexes
- 30 unique indexes
- 15 materialized views
- 16 triggers
- 87 types
- 26 views

pg_dump -s of it is ~ 670kB.

Interestingly, while dumping (pg_dump -s -v), we can see progress going on,
and then, after:

====
...
pg_dump: reading publications
pg_dump: reading publication membership
pg_dump: reading subscriptions
pg_dump: reading dependency data
pg_dump: saving encoding = UTF8
pg_dump: saving standard_conforming_strings = on
pg_dump: saving search_path =
====

it stops (progress visible in console). And then, in pg logs I see queries
like:

#v+
SELECT proretset, prosrc, probin, provolatile, proisstrict, prosecdef, lanname, proconfig, procost, prorows, pg_catalog.pg_get_function_arguments(p.oid) AS funcargs, pg_catalog.pg_get_function_identity_arguments(p.oid) AS funciargs, pg_catalog.pg_get_function_result(p.oid) AS funcresult, proleakproof, array_to_string(protrftypes, ' ') AS protrftypes, proparallel, prokind, prosupport, NULL AS prosqlbody FROM pg_catalog.pg_proc p, pg_catalog.pg_language l WHERE p.oid = '43875'::pg_catalog.oid AND l.oid = p.prolang
#v-

Now for query stats.

To dump it all, pg_dump needed 9173 queries (logged by
log_min_duration_statement = 0 for this user).

I extracted all queries to separate files, and made stats. In total there
were only 4257 unique queries.

Then I checked for repeated queries. Top 10 most repeated offenders were:

615 times : SELECT pg_catalog.format_type('25'::pg_catalog.oid, NULL)
599 times : SELECT pg_catalog.format_type('23'::pg_catalog.oid, NULL)
579 times : SELECT pg_catalog.format_type('2281'::pg_catalog.oid, NULL)
578 times : SELECT pg_catalog.format_type('41946'::pg_catalog.oid, NULL)
523 times : SELECT pg_catalog.format_type('701'::pg_catalog.oid, NULL)
459 times : SELECT pg_catalog.format_type('42923'::pg_catalog.oid, NULL)
258 times : SELECT pg_catalog.format_type('16'::pg_catalog.oid, NULL)
176 times : SELECT pg_catalog.format_type('19'::pg_catalog.oid, NULL)
110 times : SELECT pg_catalog.format_type('21'::pg_catalog.oid, NULL)
106 times : SELECT pg_catalog.format_type('42604'::pg_catalog.oid, NULL)

In total, there were 5000 queries:
SELECT pg_catalog.format_type('[0-9]+'::pg_catalog.oid, NULL)

But there were only 83 separate oids that were scanned.

The only other repeated command was:
SELECT pg_catalog.set_config('search_path', '', false);
and it was called only twice.

Based on my reading of queries in order it seems to follow the pattern of:

One call for:

SELECT proretset, prosrc, probin, provolatile, proisstrict, prosecdef, lanname, proconfig, procost, prorows, pg_catalog.pg_get_function_arguments(p.oid) AS funcargs, pg_catalog.pg_get_function_identity_arguments(p.oid) AS funciargs, pg_catalog.pg_get_function_result(p.oid) AS funcresult, proleakproof, array_to_string(protrftypes, ' ') AS protrftypes, proparallel, prokind, prosupport, NULL AS prosqlbody FROM pg_catalog.pg_proc p, pg_catalog.pg_language l WHERE p.oid = 'SOME_NUMBER'::pg_catalog.oid AND l.oid = p.prolang

and then one or more:

SELECT pg_catalog.format_type('SOME_NUMBER'::pg_catalog.oid, NULL)

In one case, after the proc query, there were 94 consecutive
pg_catalog.format_type queries.

I hope it helps.

Best regards,

depesz
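[Editor's note: the per-oid tally described above can be reproduced from a statement log with a short script along these lines. The sample log lines below are stand-ins for the real server log, so this is just a sketch of the method, not the exact commands depesz used.]

```python
import re
from collections import Counter

# Hypothetical sample of logged pg_dump statements; in practice these would
# be read from the server log produced with log_min_duration_statement = 0.
log_lines = [
    "SELECT pg_catalog.format_type('25'::pg_catalog.oid, NULL)",
    "SELECT pg_catalog.format_type('25'::pg_catalog.oid, NULL)",
    "SELECT pg_catalog.format_type('23'::pg_catalog.oid, NULL)",
    "SELECT pg_catalog.set_config('search_path', '', false);",
]

fmt = re.compile(r"SELECT pg_catalog\.format_type\('(\d+)'::pg_catalog\.oid, NULL\)")

# Bucket the format_type() calls by the oid they ask about.
per_oid = Counter(m.group(1) for line in log_lines if (m := fmt.match(line)))

total_format_type_calls = sum(per_oid.values())
distinct_oids = len(per_oid)
```

Running this over the full log yields exactly the numbers quoted above: the total matched count versus the number of distinct oids.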
On Thu, 2021-08-26 at 18:06 +0200, hubert depesz lubaczewski wrote:
> Now for query stats.
>
> To dump it all, pg_dump needed 9173 queries (logged by
> log_min_duration_statement = 0 for this user).
>
> I extracted all queries to separate files, and made stats. In total there were
> only 4257 unique queries.
>
> Then I checked for repeated queries. Top 10 most repeated offenders were:
>
> 615 times : SELECT pg_catalog.format_type('25'::pg_catalog.oid, NULL)
> 599 times : SELECT pg_catalog.format_type('23'::pg_catalog.oid, NULL)
> 579 times : SELECT pg_catalog.format_type('2281'::pg_catalog.oid, NULL)
> 578 times : SELECT pg_catalog.format_type('41946'::pg_catalog.oid, NULL)
> 523 times : SELECT pg_catalog.format_type('701'::pg_catalog.oid, NULL)
> 459 times : SELECT pg_catalog.format_type('42923'::pg_catalog.oid, NULL)
> 258 times : SELECT pg_catalog.format_type('16'::pg_catalog.oid, NULL)
> 176 times : SELECT pg_catalog.format_type('19'::pg_catalog.oid, NULL)
> 110 times : SELECT pg_catalog.format_type('21'::pg_catalog.oid, NULL)
> 106 times : SELECT pg_catalog.format_type('42604'::pg_catalog.oid, NULL)
>
> In total, there were 5000 queries:
> SELECT pg_catalog.format_type('[0-9]+'::pg_catalog.oid, NULL)
>
> But there were only 83 separate oids that were scanned.

That is a strong argument for using a hash table to cache the types.

Yours,
Laurenz Albe

--
Cybertec | https://www.cybertec-postgresql.com
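[Editor's note: the caching idea amounts to memoizing format_type() results on the client. A toy sketch, in Python rather than pg_dump's C, with a stand-in function for the server round trip, shows how the observed 5000 calls would collapse to one query per distinct oid:]

```python
# Illustrative sketch of client-side caching of format_type() results;
# this is not pg_dump's actual implementation.
server_queries = 0

def format_type_from_server(oid):
    """Stand-in for sending SELECT pg_catalog.format_type(oid, NULL) over the wire."""
    global server_queries
    server_queries += 1
    return f"type_{oid}"  # hypothetical result, real output would be e.g. "text"

cache = {}

def get_formatted_type_name(oid):
    # Only the first lookup of each oid hits the server; repeats hit the cache.
    if oid not in cache:
        cache[oid] = format_type_from_server(oid)
    return cache[oid]

# 5000 lookups spread over only 83 distinct oids (the numbers depesz
# reported) cost 83 round trips instead of 5000.
for i in range(5000):
    get_formatted_type_name(i % 83)
```

The same effect could be had with a hash table keyed by type oid on the C side, or by prefetching the results for every pg_type row up front.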
Laurenz Albe <laurenz.albe@cybertec.at> writes:
> On Thu, 2021-08-26 at 18:06 +0200, hubert depesz lubaczewski wrote:
>> In total, there were 5000 queries:
>> SELECT pg_catalog.format_type('[0-9]+'::pg_catalog.oid, NULL)
>> But there were only 83 separate oids that were scanned.

> That is a strong argument for using a hash table to cache the types.

Those queries are coming from getFormattedTypeName(), which is used
for function arguments and the like.  I'm not quite sure why Hubert
is seeing 5000 such calls in a database with only ~100 functions;
surely they don't all have an average of 50 arguments?

I experimented with the attached, very quick-n-dirty patch to collect
format_type results during the initial scan of pg_type, instead.  On the
regression database in HEAD, it reduces the number of queries pg_dump
issues from 3260 to 2905; but I'm having a hard time detecting any net
performance change.

(This is not meant for commit as-is; notably, I didn't bother to fix
getTypes' code paths for pre-9.6 servers.  It should be fine for
performance testing though.)

			regards, tom lane

diff --git a/src/bin/pg_dump/pg_dump.c b/src/bin/pg_dump/pg_dump.c
index 6adbd20778..efe1be533f 100644
--- a/src/bin/pg_dump/pg_dump.c
+++ b/src/bin/pg_dump/pg_dump.c
@@ -5088,6 +5088,7 @@ getTypes(Archive *fout, int *numTypes)
 	int			i_oid;
 	int			i_typname;
 	int			i_typnamespace;
+	int			i_ftypname;
 	int			i_typacl;
 	int			i_rtypacl;
 	int			i_inittypacl;
@@ -5129,6 +5130,7 @@ getTypes(Archive *fout, int *numTypes)
 		appendPQExpBuffer(query, "SELECT t.tableoid, t.oid, t.typname, "
 						  "t.typnamespace, "
+						  "pg_catalog.format_type(t.oid, NULL) AS ftypname, "
 						  "%s AS typacl, "
 						  "%s AS rtypacl, "
 						  "%s AS inittypacl, "
@@ -5211,6 +5213,7 @@ getTypes(Archive *fout, int *numTypes)
 	i_oid = PQfnumber(res, "oid");
 	i_typname = PQfnumber(res, "typname");
 	i_typnamespace = PQfnumber(res, "typnamespace");
+	i_ftypname = PQfnumber(res, "ftypname");
 	i_typacl = PQfnumber(res, "typacl");
 	i_rtypacl = PQfnumber(res, "rtypacl");
 	i_inittypacl = PQfnumber(res, "inittypacl");
@@ -5232,6 +5235,7 @@ getTypes(Archive *fout, int *numTypes)
 		tyinfo[i].dobj.name = pg_strdup(PQgetvalue(res, i, i_typname));
 		tyinfo[i].dobj.namespace =
 			findNamespace(atooid(PQgetvalue(res, i, i_typnamespace)));
+		tyinfo[i].ftypname = pg_strdup(PQgetvalue(res, i, i_ftypname));
 		tyinfo[i].rolname = pg_strdup(PQgetvalue(res, i, i_rolname));
 		tyinfo[i].typacl = pg_strdup(PQgetvalue(res, i, i_typacl));
 		tyinfo[i].rtypacl = pg_strdup(PQgetvalue(res, i, i_rtypacl));
@@ -18892,12 +18896,11 @@ findDumpableDependencies(ArchiveHandle *AH, const DumpableObject *dobj,
  *
  * This does not guarantee to schema-qualify the output, so it should not
  * be used to create the target object name for CREATE or ALTER commands.
- *
- * TODO: there might be some value in caching the results.
  */
 static char *
 getFormattedTypeName(Archive *fout, Oid oid, OidOptions opts)
 {
+	TypeInfo   *typeInfo;
 	char	   *result;
 	PQExpBuffer query;
 	PGresult   *res;
@@ -18910,6 +18913,16 @@ getFormattedTypeName(Archive *fout, Oid oid, OidOptions opts)
 		return pg_strdup("NONE");
 	}

+	typeInfo = findTypeByOid(oid);
+	if (typeInfo)
+		return pg_strdup(typeInfo->ftypname);
+
+	/*
+	 * Fall back to asking the server.  XXX we could only reach this code if
+	 * getTypes() didn't see the type, which should never happen.  If it did
+	 * happen, it'd imply that we don't have proper dependencies for the
+	 * object we're trying to describe.  Maybe just throw an error instead?
+	 */
 	query = createPQExpBuffer();

 	appendPQExpBuffer(query, "SELECT pg_catalog.format_type('%u'::pg_catalog.oid, NULL)",
 					  oid);
diff --git a/src/bin/pg_dump/pg_dump.h b/src/bin/pg_dump/pg_dump.h
index f5e170e0db..29af845ece 100644
--- a/src/bin/pg_dump/pg_dump.h
+++ b/src/bin/pg_dump/pg_dump.h
@@ -166,9 +166,11 @@ typedef struct _typeInfo
 	DumpableObject dobj;

 	/*
-	 * Note: dobj.name is the pg_type.typname entry.  format_type() might
-	 * produce something different than typname
+	 * Note: dobj.name is the raw pg_type.typname entry.  ftypname is the
+	 * result of format_type(), which will be quoted if needed, and might be
+	 * schema-qualified too.
 	 */
+	char	   *ftypname;
 	char	   *rolname;		/* name of owner, or empty string */
 	char	   *typacl;
 	char	   *rtypacl;
On 8/27/21 2:23 PM, Tom Lane wrote:
> Laurenz Albe <laurenz.albe@cybertec.at> writes:
>> On Thu, 2021-08-26 at 18:06 +0200, hubert depesz lubaczewski wrote:
>>> In total, there were 5000 queries:
>>> SELECT pg_catalog.format_type('[0-9]+'::pg_catalog.oid, NULL)
>>> But there were only 83 separate oids that were scanned.
>
>> That is a strong argument for using a hash table to cache the types.
>
> Those queries are coming from getFormattedTypeName(), which is used
> for function arguments and the like.  I'm not quite sure why Hubert
> is seeing 5000 such calls in a database with only ~100 functions;
> surely they don't all have an average of 50 arguments?

Could be. From the stats post:

"Based on my reading of queries in order it seems to follow the pattern of:

One call for:

SELECT proretset, prosrc, probin, provolatile, proisstrict, prosecdef, lanname, proconfig, procost, prorows, pg_catalog.pg_get_function_arguments(p.oid) AS funcargs, pg_catalog.pg_get_function_identity_arguments(p.oid) AS funciargs, pg_catalog.pg_get_function_result(p.oid) AS funcresult, proleakproof, array_to_string(protrftypes, ' ') AS protrftypes, proparallel, prokind, prosupport, NULL AS prosqlbody FROM pg_catalog.pg_proc p, pg_catalog.pg_language l WHERE p.oid = 'SOME_NUMBER'::pg_catalog.oid AND l.oid = p.prolang

and then one or more:

SELECT pg_catalog.format_type('SOME_NUMBER'::pg_catalog.oid, NULL)

In one case, after the proc query, there were 94 consecutive
pg_catalog.format_type queries."

> I experimented with the attached, very quick-n-dirty patch to collect
> format_type results during the initial scan of pg_type, instead.  On the
> regression database in HEAD, it reduces the number of queries pg_dump
> issues from 3260 to 2905; but I'm having a hard time detecting any net
> performance change.
>
> (This is not meant for commit as-is; notably, I didn't bother to fix
> getTypes' code paths for pre-9.6 servers.  It should be fine for
> performance testing though.)
>
> 			regards, tom lane

--
Adrian Klaver
adrian.klaver@aklaver.com
Greetings,

* Tom Lane (tgl@sss.pgh.pa.us) wrote:
> Laurenz Albe <laurenz.albe@cybertec.at> writes:
> > On Thu, 2021-08-26 at 18:06 +0200, hubert depesz lubaczewski wrote:
> >> In total, there were 5000 queries:
> >> SELECT pg_catalog.format_type('[0-9]+'::pg_catalog.oid, NULL)
> >> But there were only 83 separate oids that were scanned.
>
> > That is a strong argument for using a hash table to cache the types.
>
> Those queries are coming from getFormattedTypeName(), which is used
> for function arguments and the like.  I'm not quite sure why Hubert
> is seeing 5000 such calls in a database with only ~100 functions;
> surely they don't all have an average of 50 arguments?
>
> I experimented with the attached, very quick-n-dirty patch to collect
> format_type results during the initial scan of pg_type, instead.  On the
> regression database in HEAD, it reduces the number of queries pg_dump
> issues from 3260 to 2905; but I'm having a hard time detecting any net
> performance change.

Seems like the issue here is mainly just the latency of each query being
rather high compared to most use-cases, so local testing where there's
basically zero latency wouldn't see any change in timing, but throw a
trans-atlantic or worse amount of latency between the system running
pg_dump and the PG server and you'd see notable wall-clock savings in
time.

Only took a quick look but generally +1 on reducing the number of
queries that pg_dump is doing, and the changes suggested looked good to
me.

Thanks,

Stephen
Stephen Frost <sfrost@snowman.net> writes:
> * Tom Lane (tgl@sss.pgh.pa.us) wrote:
>> I experimented with the attached, very quick-n-dirty patch to collect
>> format_type results during the initial scan of pg_type, instead.  On the
>> regression database in HEAD, it reduces the number of queries pg_dump
>> issues from 3260 to 2905; but I'm having a hard time detecting any net
>> performance change.

> Seems like the issue here is mainly just the latency of each query being
> rather high compared to most use-cases, so local testing where there's
> basically zero latency wouldn't see any change in timing, but throw a
> trans-atlantic or worse amount of latency between the system running
> pg_dump and the PG server and you'd see notable wall-clock savings in
> time.

Yeah.  What I was more concerned about was the potential downside
of running format_type() for each pg_type row, even though we might
use only a few of those results.  The fact that I'm *not* seeing
a performance hit with a local server is encouraging from that
standpoint.

			regards, tom lane
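[Editor's note: the trade-off under discussion can be put in rough numbers. Every figure below is made up for illustration (0.1 s per round trip; query counts taken from earlier in the thread): per-oid caching saves round trips in proportion to the repeats, while the cost of eagerly running format_type() for every pg_type row only shows up when the server is local and latency is near zero.]

```python
# Back-of-the-envelope latency model; all numbers are hypothetical.
latency = 0.1            # seconds per round trip on a high-latency link
lazy_queries = 5000      # one format_type() query per lookup (as depesz observed)
cached_queries = 83      # one query per distinct oid, with a client-side cache
prefetch_queries = 1     # one bulk query fetching format_type() for all of pg_type

lazy_cost = lazy_queries * latency          # current behavior
cached_cost = cached_queries * latency      # with memoization
prefetch_cost = prefetch_queries * latency  # with up-front collection

savings = lazy_cost - cached_cost
```

Under these assumed numbers the lazy approach spends minutes on format_type() round trips alone, while either caching or prefetching reduces that to seconds; the question Tom raises is whether the prefetch's extra server-side work ever outweighs that on a zero-latency link.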
Greetings,

* Tom Lane (tgl@sss.pgh.pa.us) wrote:
> Stephen Frost <sfrost@snowman.net> writes:
> > * Tom Lane (tgl@sss.pgh.pa.us) wrote:
> >> I experimented with the attached, very quick-n-dirty patch to collect
> >> format_type results during the initial scan of pg_type, instead.  On the
> >> regression database in HEAD, it reduces the number of queries pg_dump
> >> issues from 3260 to 2905; but I'm having a hard time detecting any net
> >> performance change.
>
> > Seems like the issue here is mainly just the latency of each query being
> > rather high compared to most use-cases, so local testing where there's
> > basically zero latency wouldn't see any change in timing, but throw a
> > trans-atlantic or worse amount of latency between the system running
> > pg_dump and the PG server and you'd see notable wall-clock savings in
> > time.
>
> Yeah.  What I was more concerned about was the potential downside
> of running format_type() for each pg_type row, even though we might
> use only a few of those results.  The fact that I'm *not* seeing
> a performance hit with a local server is encouraging from that
> standpoint.

Ah, yes, agreed.

Thanks!

Stephen
Adrian Klaver <adrian.klaver@aklaver.com> writes:
> On 8/27/21 2:23 PM, Tom Lane wrote:
>> Those queries are coming from getFormattedTypeName(), which is used
>> for function arguments and the like.  I'm not quite sure why Hubert
>> is seeing 5000 such calls in a database with only ~100 functions;
>> surely they don't all have an average of 50 arguments?

> Could be.

Maybe.  I'm disturbed by the discrepancy between my result (about 10%
of pg_dump's queries are these) and Hubert's (over 50% are).  I'd like
to know the reason for that before we push forward.

			regards, tom lane
On Fri, Aug 27, 2021 at 05:23:23PM -0400, Tom Lane wrote:
> I experimented with the attached, very quick-n-dirty patch to collect
> format_type results during the initial scan of pg_type, instead.  On the
> regression database in HEAD, it reduces the number of queries pg_dump
> issues from 3260 to 2905; but I'm having a hard time detecting any net
> performance change.
> (This is not meant for commit as-is; notably, I didn't bother to fix
> getTypes' code paths for pre-9.6 servers.  It should be fine for
> performance testing though.)

Hi,
thanks a lot for this. Will test and report back, most likely on Monday,
though.

Best regards,

depesz
On Fri, Aug 27, 2021 at 05:23:23PM -0400, Tom Lane wrote:
> Those queries are coming from getFormattedTypeName(), which is used
> for function arguments and the like.  I'm not quite sure why Hubert
> is seeing 5000 such calls in a database with only ~100 functions;
> surely they don't all have an average of 50 arguments?

Oh, missed that part. So I checked. In the meantime I got a -Fc dump, so:

#v+
=$ pg_restore -l schema.dump | \
    grep -P '^\d*; \d+ \d+ FUNCTION ' | \
    sed 's/^[^(]*(//; s/)[^)]*$//' | \
    awk -F, '{print NF}' | \
    sort -n | uniq -c
     23 0
     52 1
     21 2
      8 3
      1 4
      2 5
#v-

23 functions with 0 arguments, 52 with 1, and the max is 5 arguments - two
functions have these.

Not sure if it matters, but there is a lot of enums. 83 of them. And they
have up to 250 elements (2 such types).

Best regards,

depesz
hubert depesz lubaczewski <depesz@depesz.com> writes:
> On Fri, Aug 27, 2021 at 05:23:23PM -0400, Tom Lane wrote:
>> Those queries are coming from getFormattedTypeName(), which is used
>> for function arguments and the like.  I'm not quite sure why Hubert
>> is seeing 5000 such calls in a database with only ~100 functions;
>> surely they don't all have an average of 50 arguments?

> 23 functions with 0 arguments, 52 with 1, and the max is 5 arguments - two
> functions have these.
> Not sure if it matters but there is a lot of enums. 83 of them. And they have
> up to 250 elements (2 such types).

Hmm, no, I don't see any getFormattedTypeName calls in dumpEnumType.
There are two of 'em in dumpCast though.  Does this DB by chance have
a ton of user-defined casts?

			regards, tom lane
Here is a second patch, quite independent of the first one, that
gets rid of some other repetitive queries.  On the regression database,
the number of queries needed to do "pg_dump -s regression" drops from
3260 to 2589, and on my machine it takes 1.8 sec instead of 2.1 sec.

What's attacked here is a fairly silly decision in getPolicies()
to query pg_policy once per table, when we could do so just once.
It might have been okay if we skipped the per-table query for
tables that lack policies, but it's not clear to me that we can
know that without looking into pg_policy.  In any case I doubt
this is ever going to be less efficient than the original coding.

			regards, tom lane

diff --git a/src/bin/pg_dump/pg_dump.c b/src/bin/pg_dump/pg_dump.c
index 6adbd20778..befe68de1a 100644
--- a/src/bin/pg_dump/pg_dump.c
+++ b/src/bin/pg_dump/pg_dump.c
@@ -3656,7 +3656,7 @@ dumpBlobs(Archive *fout, const void *arg)

 /*
  * getPolicies
- *	  get information about policies on a dumpable table.
+ *	  get information about all RLS policies on dumpable tables.
  */
 void
 getPolicies(Archive *fout, TableInfo tblinfo[], int numTables)
@@ -3666,6 +3666,7 @@ getPolicies(Archive *fout, TableInfo tblinfo[], int numTables)
 	PolicyInfo *polinfo;
 	int			i_oid;
 	int			i_tableoid;
+	int			i_polrelid;
 	int			i_polname;
 	int			i_polcmd;
 	int			i_polpermissive;
@@ -3681,6 +3682,10 @@ getPolicies(Archive *fout, TableInfo tblinfo[], int numTables)

 	query = createPQExpBuffer();

+	/*
+	 * First, check which tables have RLS enabled.  We represent RLS being
+	 * enabled on a table by creating a PolicyInfo object with null polname.
+	 */
 	for (i = 0; i < numTables; i++)
 	{
 		TableInfo  *tbinfo = &tblinfo[i];
@@ -3689,15 +3694,6 @@ getPolicies(Archive *fout, TableInfo tblinfo[], int numTables)
 		if (!(tbinfo->dobj.dump & DUMP_COMPONENT_POLICY))
 			continue;

-		pg_log_info("reading row security enabled for table \"%s.%s\"",
-					tbinfo->dobj.namespace->dobj.name,
-					tbinfo->dobj.name);
-
-		/*
-		 * Get row security enabled information for the table.  We represent
-		 * RLS being enabled on a table by creating a PolicyInfo object with
-		 * null polname.
-		 */
 		if (tbinfo->rowsec)
 		{
 			/*
@@ -3719,51 +3715,35 @@ getPolicies(Archive *fout, TableInfo tblinfo[], int numTables)
 			polinfo->polqual = NULL;
 			polinfo->polwithcheck = NULL;
 		}
+	}

-		pg_log_info("reading policies for table \"%s.%s\"",
-					tbinfo->dobj.namespace->dobj.name,
-					tbinfo->dobj.name);
-
-		resetPQExpBuffer(query);
-
-		/* Get the policies for the table. */
-		if (fout->remoteVersion >= 100000)
-			appendPQExpBuffer(query,
-							  "SELECT oid, tableoid, pol.polname, pol.polcmd, pol.polpermissive, "
-							  "CASE WHEN pol.polroles = '{0}' THEN NULL ELSE "
-							  "   pg_catalog.array_to_string(ARRAY(SELECT pg_catalog.quote_ident(rolname) from pg_catalog.pg_roles WHERE oid = ANY(pol.polroles)), ', ') END AS polroles, "
-							  "pg_catalog.pg_get_expr(pol.polqual, pol.polrelid) AS polqual, "
-							  "pg_catalog.pg_get_expr(pol.polwithcheck, pol.polrelid) AS polwithcheck "
-							  "FROM pg_catalog.pg_policy pol "
-							  "WHERE polrelid = '%u'",
-							  tbinfo->dobj.catId.oid);
-		else
-			appendPQExpBuffer(query,
-							  "SELECT oid, tableoid, pol.polname, pol.polcmd, 't' as polpermissive, "
-							  "CASE WHEN pol.polroles = '{0}' THEN NULL ELSE "
-							  "   pg_catalog.array_to_string(ARRAY(SELECT pg_catalog.quote_ident(rolname) from pg_catalog.pg_roles WHERE oid = ANY(pol.polroles)), ', ') END AS polroles, "
-							  "pg_catalog.pg_get_expr(pol.polqual, pol.polrelid) AS polqual, "
-							  "pg_catalog.pg_get_expr(pol.polwithcheck, pol.polrelid) AS polwithcheck "
-							  "FROM pg_catalog.pg_policy pol "
-							  "WHERE polrelid = '%u'",
-							  tbinfo->dobj.catId.oid);
-		res = ExecuteSqlQuery(fout, query->data, PGRES_TUPLES_OK);
+	/*
+	 * Now, read all RLS policies, and create PolicyInfo objects for all those
+	 * that are of interest.
+	 */
+	pg_log_info("reading row-level security policies");

-		ntups = PQntuples(res);
+	printfPQExpBuffer(query,
+					  "SELECT oid, tableoid, pol.polrelid, pol.polname, pol.polcmd, ");
+	if (fout->remoteVersion >= 100000)
+		appendPQExpBuffer(query, "pol.polpermissive, ");
+	else
+		appendPQExpBuffer(query, "'t' as polpermissive, ");
+	appendPQExpBuffer(query,
+					  "CASE WHEN pol.polroles = '{0}' THEN NULL ELSE "
+					  "   pg_catalog.array_to_string(ARRAY(SELECT pg_catalog.quote_ident(rolname) from pg_catalog.pg_roles WHERE oid = ANY(pol.polroles)), ', ') END AS polroles, "
+					  "pg_catalog.pg_get_expr(pol.polqual, pol.polrelid) AS polqual, "
+					  "pg_catalog.pg_get_expr(pol.polwithcheck, pol.polrelid) AS polwithcheck "
+					  "FROM pg_catalog.pg_policy pol");

-		if (ntups == 0)
-		{
-			/*
-			 * No explicit policies to handle (only the default-deny policy,
-			 * which is handled as part of the table definition).  Clean up
-			 * and return.
-			 */
-			PQclear(res);
-			continue;
-		}
+	res = ExecuteSqlQuery(fout, query->data, PGRES_TUPLES_OK);

+	ntups = PQntuples(res);
+	if (ntups > 0)
+	{
 		i_oid = PQfnumber(res, "oid");
 		i_tableoid = PQfnumber(res, "tableoid");
+		i_polrelid = PQfnumber(res, "polrelid");
 		i_polname = PQfnumber(res, "polname");
 		i_polcmd = PQfnumber(res, "polcmd");
 		i_polpermissive = PQfnumber(res, "polpermissive");
@@ -3775,6 +3755,16 @@ getPolicies(Archive *fout, TableInfo tblinfo[], int numTables)

 		for (j = 0; j < ntups; j++)
 		{
+			Oid			polrelid = atooid(PQgetvalue(res, j, i_polrelid));
+			TableInfo  *tbinfo = findTableByOid(polrelid);
+
+			/*
+			 * Ignore row security on tables not to be dumped.  (This will
+			 * result in some harmless wasted slots in polinfo[].)
+			 */
+			if (!(tbinfo->dobj.dump & DUMP_COMPONENT_POLICY))
+				continue;
+
 			polinfo[j].dobj.objType = DO_POLICY;
 			polinfo[j].dobj.catId.tableoid =
 				atooid(PQgetvalue(res, j, i_tableoid));
@@ -3804,8 +3794,10 @@ getPolicies(Archive *fout, TableInfo tblinfo[], int numTables)
 			polinfo[j].polwithcheck
 				= pg_strdup(PQgetvalue(res, j, i_polwithcheck));
 		}
-		PQclear(res);
 	}
+
+	PQclear(res);
+
 	destroyPQExpBuffer(query);
 }
You guys are brilliant!
Regards,
Gus
On Sat, Aug 28, 2021 at 6:26 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Here is a second patch, quite independent of the first one, that
> gets rid of some other repetitive queries.  On the regression database,
> the number of queries needed to do "pg_dump -s regression" drops from
> 3260 to 2589, and on my machine it takes 1.8 sec instead of 2.1 sec.
>
> What's attacked here is a fairly silly decision in getPolicies()
> to query pg_policy once per table, when we could do so just once.
> It might have been okay if we skipped the per-table query for
> tables that lack policies, but it's not clear to me that we can
> know that without looking into pg_policy.  In any case I doubt
> this is ever going to be less efficient than the original coding.
>
> 			regards, tom lane
On 2021-Aug-28, Tom Lane wrote:
> Here is a second patch, quite independent of the first one, that
> gets rid of some other repetitive queries.

Another pointlessly repetitive query is in getTriggers, which we run
once per table to be dumped containing triggers.  We could reduce that
by running it in bulk for many relations at a time.  I suppose it's
normally not hurtful, but as we grow the number of partitions we allow
it's going to become a problem.

No patch from me for now; if someone wants to volunteer one, it looks
simple enough ...

--
Álvaro Herrera              Valdivia, Chile  —  https://www.EnterpriseDB.com/
Alvaro Herrera <alvherre@alvh.no-ip.org> writes:
> Another pointlessly repetitive query is in getTriggers, which we run
> once per table to be dumped containing triggers.  We could reduce that
> by running it in bulk for many relations at a time.  I suppose it's
> normally not hurtful, but as we grow the number of partitions we allow
> it's going to become a problem.

Perhaps.  In the regression database, only ~10% of the tables have
triggers, so it's likely not going to yield any measurable gain there.
But databases that make heavier use of foreign keys might see a win.

Another thing I've wondered about before is whether it could make sense
to read pg_attribute once rather than once per table.  There might be
a fair amount of wasted work if the dump is selective, and in big DBs
the sheer size of that result could be a problem.  But those reads are
definitely way up there on the number-of-queries scale.

			regards, tom lane
Greetings,

* Tom Lane (tgl@sss.pgh.pa.us) wrote:
> Alvaro Herrera <alvherre@alvh.no-ip.org> writes:
> > Another pointlessly repetitive query is in getTriggers, which we run
> > once per table to be dumped containing triggers.  We could reduce that
> > by running it in bulk for many relations at a time.  I suppose it's
> > normally not hurtful, but as we grow the number of partitions we allow
> > it's going to become a problem.
>
> Perhaps.  In the regression database, only ~10% of the tables have
> triggers, so it's likely not going to yield any measurable gain there.
> But databases that make heavier use of foreign keys might see a win.

It sure seems like in just about all cases fewer queries is going to
be better.

> Another thing I've wondered about before is whether it could make sense
> to read pg_attribute once rather than once per table.  There might be
> a fair amount of wasted work if the dump is selective, and in big DBs
> the sheer size of that result could be a problem.  But those reads are
> definitely way up there on the number-of-queries scale.

Yeah, I've thought about this before too.  Would sure be nice if there
was a way that we could query the catalog selectively based on the
options the user has passed in but do so in as few queries as possible.

Thanks,

Stephen
On Fri, Aug 27, 2021 at 05:23:23PM -0400, Tom Lane wrote:
> I experimented with the attached, very quick-n-dirty patch to collect
> format_type results during the initial scan of pg_type, instead.  On the
> regression database in HEAD, it reduces the number of queries pg_dump
> issues from 3260 to 2905; but I'm having a hard time detecting any net
> performance change.

Hi,
So, I applied it to brand new HEAD from git. Result:

From a total of 9173 queries it went down to 4178.

Originally 5000 type queries, now 19!

This is actually strange given that previously it was querying about 83
separate type oids. But, as far as I was able to check with "pg_restore -l"
(from -Fc dump), results are the same.

Dump time down from 17m 22s to 8m 12s.

Then, I applied the patch from
https://www.postgresql.org/message-id/1082810.1630189581%40sss.pgh.pa.us
without removing the first one, as you said they are quite independent.

With both patches applied I got 3884 queries total, and a dump from the
original db in 7m 35s.

So this clearly helps. A LOT.

But since we're looking at it, and with both patches applied, I looked at
the next most common query. Which is:

#v+
SELECT proretset, prosrc, probin, provolatile, proisstrict, prosecdef,
       lanname, proconfig, procost, prorows,
       pg_catalog.pg_get_function_arguments(p.oid) AS funcargs,
       pg_catalog.pg_get_function_identity_arguments(p.oid) AS funciargs,
       pg_catalog.pg_get_function_result(p.oid) AS funcresult,
       proleakproof, array_to_string(protrftypes, ' ') AS protrftypes,
       proparallel, prokind, prosupport, NULL AS prosqlbody
FROM pg_catalog.pg_proc p, pg_catalog.pg_language l
WHERE p.oid = '25491'::pg_catalog.oid AND l.oid = p.prolang
#v-

From the 3884 queries in the current pg_dump (with both patches applied),
these queries were called 1804 times. All of these calls were with
different oids, so it's possible that there is nothing to be done about
it, but figured I'll let you know.

The thing is - even though it was called 1804 times, the dump contains data
only about 107 functions (pg_restore -l schema.dump | grep -c FUNCTION), so
it kinda seems that 94% of these calls are not needed.

Anyway, even if we can't get any help for function queries, an improvement
of over 50% is great.

Best regards,

depesz
hubert depesz lubaczewski <depesz@depesz.com> writes:
> On Fri, Aug 27, 2021 at 05:23:23PM -0400, Tom Lane wrote:
>> I experimented with the attached, very quick-n-dirty patch to collect
>> format_type results during the initial scan of pg_type, instead.

> So, I applied it to brand new HEAD from git. Result:
> From a total of 9173 queries it went down to 4178.
> Originally 5000 type queries, now 19!
> This is actually strange given that previously it was querying about 83
> separate type oids. But, as far as I was able to check with
> "pg_restore -l" (from -Fc dump), results are the same.

Hm.  So we're still no wiser than before about how such a small (in
terms of number of objects) database could have produced so many
getFormattedTypeName calls.

Plus, this result raises a new question: with the patch, I think you
shouldn't have seen *any* queries of that form.  Where are the 19
survivors coming from?

I don't suppose you could send me a schema-only dump of that
database, off-list?  I'm now quite curious.

> But since we're looking at it, and with both patches applied, I looked
> at the next most common query. Which is:
> [ collection of details about a function ]
> The thing is - even though it was called 1804 times, dump contains data only
> about 107 functions (pg_restore -l schema.dump | grep -c FUNCTION), so it kinda
> seems that 94% of these calls is not needed.

Hm.  It's not doing that for *every* row in pg_proc, at least.
I speculate that it is collecting and then not printing the info
about functions that are in extensions --- can you check on
how many there are of those?

(Actually, if you've got a whole lot of objects inside extensions,
maybe that explains the 5000 calls?)

			regards, tom lane
On Mon, Aug 30, 2021 at 10:11:22AM -0400, Tom Lane wrote:
> I don't suppose you could send me a schema-only dump of that
> database, off-list?  I'm now quite curious.

Asked the owners for their permission.

> > The thing is - even though it was called 1804 times, dump contains data only
> > about 107 functions (pg_restore -l schema.dump | grep -c FUNCTION), so it kinda
> > seems that 94% of these calls is not needed.
> Hm.  It's not doing that for *every* row in pg_proc, at least.
> I speculate that it is collecting and then not printing the info
> about functions that are in extensions --- can you check on
> how many there are of those?
> (Actually, if you've got a whole lot of objects inside extensions,
> maybe that explains the 5000 calls?)

Well, not sure if that's a lot, but: there are 15 extensions, including
plpgsql.

SELECT count(*)
FROM pg_catalog.pg_depend
WHERE refclassid = 'pg_catalog.pg_extension'::pg_catalog.regclass
  AND deptype = 'e';

returns 2110 objects:

SELECT classid::regclass, count(*)
FROM pg_catalog.pg_depend
WHERE refclassid = 'pg_catalog.pg_extension'::pg_catalog.regclass
  AND deptype = 'e'
GROUP BY 1 ORDER BY 1;

         classid         │ count
─────────────────────────┼───────
 pg_type                 │    31
 pg_proc                 │  1729
 pg_class                │    61
 pg_foreign_data_wrapper │     1
 pg_cast                 │    30
 pg_language             │     1
 pg_opclass              │    73
 pg_operator             │   111
 pg_opfamily             │    73
(9 rows)

Best regards,

depesz
hubert depesz lubaczewski <depesz@depesz.com> writes:
> On Mon, Aug 30, 2021 at 10:11:22AM -0400, Tom Lane wrote:
>> I speculate that it is collecting and then not printing the info
>> about functions that are in extensions --- can you check on
>> how many there are of those?

>          classid         │ count
> ─────────────────────────┼───────
>  pg_type                 │    31
>  pg_proc                 │  1729
>  pg_class                │    61
>  pg_foreign_data_wrapper │     1
>  pg_cast                 │    30
>  pg_language             │     1
>  pg_opclass              │    73
>  pg_operator             │   111
>  pg_opfamily             │    73
> (9 rows)

Ah-hah.  Those 1729 extension-owned functions account nicely
for the extra probes into pg_proc, and I bet they are causing
the unexplained getFormattedTypeName calls too.  So the *real*
problem here seems to be that we're doing too much work on
objects that are not going to be dumped because they are
extension members.  I'll take a look at that later if nobody
beats me to it.

			regards, tom lane
I wrote:
> Ah-hah.  Those 1729 extension-owned functions account nicely
> for the extra probes into pg_proc, and I bet they are causing
> the unexplained getFormattedTypeName calls too.  So the
> *real* problem here seems to be that we're doing too much
> work on objects that are not going to be dumped because they
> are extension members.  I'll take a look at that later if
> nobody beats me to it.

I took a quick look at this, and it seems to be mostly the fault of
the DUMP_COMPONENT refactorization that was done awhile ago.  We create
DumpableObjects for all these objects, because we need those to track
dependencies.  When we arrive at dumpFunc() for an extension-owned
object, it has the DUMP_COMPONENT_SECLABEL and DUMP_COMPONENT_POLICY
flag bits set, whether or not the function actually has any such
properties.  This causes dumpFunc to run through its data collection
query, even though nothing at all is going to get output.

I see that the reason those flags become set is that
checkExtensionMembership does this for an extension member:

    dobj->dump = ext->dobj.dump_contains & (DUMP_COMPONENT_ACL |
                                            DUMP_COMPONENT_SECLABEL |
                                            DUMP_COMPONENT_POLICY);

There is logic elsewhere that causes the DUMP_COMPONENT_ACL flag
to get cleared if there's no interesting ACL for the object, but
I see no such logic for SECLABEL or POLICY.  That omission is
costing us an awful lot of wasted queries in any database with
a lot of extension-owned objects.

I'm quite allergic to the way that the ACL logic is implemented anyhow,
as there seem to be N copies of essentially identical logic, not to
mention all the inefficient left joins and subqueries that were added
to the fundamental data-gathering queries --- which are only supposed
to find out which objects we want to dump, not expensively collect
scads of detail about every object in the catalogs.  I think this is
less in need of a tweak than "burn it to the ground and start over".
I wonder if we can't get to a place where there's only one query
that actually looks into pg_init_privs, more like the way we do it
for descriptions and seclabels (not that the seclabel code is perfect,
as we've found here).

Anyway, it doesn't look like there's much hope of improving this
aspect without a significant rewrite.  One band-aidy idea is that
we could check --no-security-labels earlier and not allow that flag
bit to become set in the first place, but that only helps if the
user gives that flag, which few would.

(I'm also wondering more than a little bit why we're allowing
DUMP_COMPONENT_POLICY to become set on objects that aren't tables.)

			regards, tom lane
hubert depesz lubaczewski <depesz@depesz.com> writes:
> On Mon, Aug 30, 2021 at 10:11:22AM -0400, Tom Lane wrote:
>> I don't suppose you could send me a schema-only dump of that
>> database, off-list?  I'm now quite curious.

> Asked the owners for their permission.

BTW, I think you can skip that part now --- it seems like the
extensions sufficiently explain the extra queries.

			regards, tom lane
I wrote:
> Anyway, it doesn't look like there's much hope of improving this
> aspect without a significant rewrite.

Just to close out this thread: I've now posted such a rewrite at
https://www.postgresql.org/message-id/2273648.1634764485%40sss.pgh.pa.us

			regards, tom lane
On Wed, Oct 20, 2021 at 05:46:01PM -0400, Tom Lane wrote:
> I wrote:
> > Anyway, it doesn't look like there's much hope of improving this
> > aspect without a significant rewrite.
>
> Just to close out this thread: I've now posted such a rewrite at
> https://www.postgresql.org/message-id/2273648.1634764485%40sss.pgh.pa.us

That looks amazing. Thanks a lot.

Best regards,

depesz