Thread: Avoid useless retrieval of defaults and check constraints in pg_dump -a
Hi, Currently, getTableAttrs() always retrieves info about columns defaults and check constraints, while this will never be used if --data-only option if used. This seems like a waste of resources, so here's a patch to skip those parts when the DDL won't be generated.
Attachment
Re: Avoid useless retrieval of defaults and check constraints in pg_dump -a
From
Justin Pryzby
Date:
On Sun, Jul 12, 2020 at 07:48:50AM +0200, Julien Rouhaud wrote: > Currently, getTableAttrs() always retrieves info about columns defaults and > check constraints, while this will never be used if --data-only option if used. > This seems like a waste of resources, so here's a patch to skip those parts > when the DDL won't be generated. Note that the speed of default and constraint handling has come up before: https://www.postgresql.org/message-id/flat/CAMkU%3D1xPqHP%3D7YPeChq6n1v_qd4WGf%2BZvtnR-b%2BgyzFqtJqMMQ%40mail.gmail.com https://www.postgresql.org/message-id/CAMkU=1x-e+maqefhM1yMeSiJ8J9Z+SJHgW7c9bqo3E3JMG4iJA@mail.gmail.com I'd pointed out that a significant fraction of our pg_upgrade time was in pg_dump, due to having wide tables with many child tables, and "default 0" on every column. (I've since dropped our defaults so this is no longer an issue here). It appears your patch would avoid doing unnecessary work in the --data-only case, but it wouldn't help the pg_upgrade case. -- Justin
Re: Avoid useless retrieval of defaults and check constraints in pg_dump -a
From
Julien Rouhaud
Date:
On Sun, Jul 12, 2020 at 4:29 PM Justin Pryzby <pryzby@telsasoft.com> wrote: > > On Sun, Jul 12, 2020 at 07:48:50AM +0200, Julien Rouhaud wrote: > > Currently, getTableAttrs() always retrieves info about columns defaults and > > check constraints, while this will never be used if --data-only option if used. > > This seems like a waste of resources, so here's a patch to skip those parts > > when the DDL won't be generated. > > Note that the speed of default and constraint handling has come up before: > https://www.postgresql.org/message-id/flat/CAMkU%3D1xPqHP%3D7YPeChq6n1v_qd4WGf%2BZvtnR-b%2BgyzFqtJqMMQ%40mail.gmail.com > https://www.postgresql.org/message-id/CAMkU=1x-e+maqefhM1yMeSiJ8J9Z+SJHgW7c9bqo3E3JMG4iJA@mail.gmail.com Oh, I wasn't aware of that. > I'd pointed out that a significant fraction of our pg_upgrade time was in > pg_dump, due to having wide tables with many child tables, and "default 0" on > every column. (I've since dropped our defaults so this is no longer an issue > here). > > It appears your patch would avoid doing unnecessary work in the --data-only > case, but it wouldn't help the pg_upgrade case. Indeed. Making the schema part faster would probably require a bigger refactoring. I'm wondering if we could introduce some facility to temporarily deny any DDL change, so that the initial pg_dump -s done by pg_upgrade can be performed before shutting down the instance. Note that those extraneous queries were found while trying to dump data out of a corrupted database. The issue wasn't an excessive runtime but corrupted catalog entries, so bypassing this code (since I was only interested in the data anyway) was simpler than trying to recover yet other corrupted rows.
Re: Avoid useless retrieval of defaults and check constraints in pg_dump -a
From
Daniel Gustafsson
Date:
> On 12 Jul 2020, at 07:48, Julien Rouhaud <rjuju123@gmail.com> wrote: > Currently, getTableAttrs() always retrieves info about columns defaults and > check constraints, while this will never be used if --data-only option if used. > This seems like a waste of resources, so here's a patch to skip those parts > when the DDL won't be generated. Given how unintrusive this optimization is, +1 from me to go ahead with this patch. pg_dump tests pass. Personally I would've updated the nearby comments to reflect why the check for dataOnly is there, but MMV there. I'm moving this patch to Ready for Committer. I'm fairly sure there is a lot more we can do to improve the performance of data-only dumps, but this nicely chips away at the problem. cheers ./daniel
Re: Avoid useless retrieval of defaults and check constraints in pg_dump -a
From
Michael Paquier
Date:
On Tue, Jul 14, 2020 at 11:14:50AM +0200, Julien Rouhaud wrote: > Note that those extraneous queries were found while trying to dump > data out of a corrupted database. The issue wasn't an excessive > runtime but corrupted catalog entries, so bypassing this code (since I > was only interested in the data anyway) was simpler than trying to > recover yet other corrupted rows. Yeah, I don't see actually why this argument can prevent us from doing a micro optimization if it proves to work correctly. -- Michael
Attachment
Re: Avoid useless retrieval of defaults and check constraints in pg_dump -a
From
Michael Paquier
Date:
On Thu, Sep 10, 2020 at 02:31:32PM +0200, Daniel Gustafsson wrote: > Given how unintrusive this optimization is, +1 from me to go ahead with this > patch. pg_dump tests pass. Personally I would've updated the nearby comments > to reflect why the check for dataOnly is there, but MMV there. I'm moving this > patch to Ready for Committer. We need two comments here. I would suggest something like: "Skip default/check for a data-only dump, as this is only needed for dumps of the table schema." Better wording is of course welcome. > I'm fairly sure there is a lot more we can do to improve the performance of > data-only dumps, but this nicely chips away at the problem. I was looking at that, and wondered about cases like the following, artistic, thing: CREATE FUNCTION check_data_zz() RETURNS boolean LANGUAGE sql STABLE STRICT AS $$select count(a) > 0 from zz$$; CREATE TABLE public.yy ( a integer, CONSTRAINT yy_check CHECK (check_data_zz()) ); CREATE TABLE zz (a integer); INSERT INTO zz VALUES (1); INSERT INTO yy VALUES (1); Even on HEAD, this causes the data load to fail because yy's data is inserted before zz, so keeping track of the CHECK dependency could have made sense for --data-only if we were to make a better work at detecting the dependency between both tables and made sure that zz data needs to appear before yy, but it is not like this would happen easily in pg_dump, and we document it this way (see the warning about dump/reload in ddl.sgml for check constraints). In short, I think that this patch looks like a good thing to have. -- Michael
Attachment
Michael Paquier <michael@paquier.xyz> writes: > On Tue, Jul 14, 2020 at 11:14:50AM +0200, Julien Rouhaud wrote: >> Note that those extraneous queries were found while trying to dump >> data out of a corrupted database. The issue wasn't an excessive >> runtime but corrupted catalog entries, so bypassing this code (since I >> was only interested in the data anyway) was simpler than trying to >> recover yet other corrupted rows. > Yeah, I don't see actually why this argument can prevent us from doing > a micro optimization if it proves to work correctly. The main thing I'm wondering about is whether not fetching these objects could lead to failing to detect an important dependency chain. IIRC, pg_dump simply ignores pg_depend entries that mention objects it has not loaded, so there is a possible mechanism for that. However, it's hard to see how a --data-only dump could end up choosing an invalid dump order on that basis. It doesn't seem like safe load orders for the table data objects could depend on what is referenced by defaults or CHECK constraints. But ... I've only spent a few minutes thinking about it, so maybe I'm missing something. (Note that we disallow sub-queries in CHECK constraints, and also disclaim responsibility for what happens if you cheat by hiding the subquery in a function. So while it's certainly possible to build CHECK constraints that only work if table X is loaded before table Y, pg_dump already doesn't guarantee that'll work, --data-only or otherwise.) regards, tom lane
Re: Avoid useless retrieval of defaults and check constraints in pg_dump -a
From
Michael Paquier
Date:
On Mon, Sep 14, 2020 at 10:56:01PM -0400, Tom Lane wrote: > (Note that we disallow sub-queries in CHECK constraints, and also > disclaim responsibility for what happens if you cheat by hiding > the subquery in a function. So while it's certainly possible to > build CHECK constraints that only work if table X is loaded before > table Y, pg_dump already doesn't guarantee that'll work, --data-only > or otherwise.) Yep, exactly what I was thinking upthread by cheating with a schema having cross-table references in a check constraint. -- Michael
Attachment
Re: Avoid useless retrieval of defaults and check constraints in pg_dump -a
From
Julien Rouhaud
Date:
On Tue, Sep 15, 2020 at 4:48 AM Michael Paquier <michael@paquier.xyz> wrote: > > On Thu, Sep 10, 2020 at 02:31:32PM +0200, Daniel Gustafsson wrote: > > Given how unintrusive this optimization is, +1 from me to go ahead with this > > patch. pg_dump tests pass. Personally I would've updated the nearby comments > > to reflect why the check for dataOnly is there, but MMV there. I'm moving this > > patch to Ready for Committer. > > We need two comments here. I would suggest something like: > "Skip default/check for a data-only dump, as this is only needed for > dumps of the table schema." > > Better wording is of course welcome. FWIW I'm fine with those news comments!
Re: Avoid useless retrieval of defaults and check constraints in pg_dump -a
From
Michael Paquier
Date:
On Tue, Sep 15, 2020 at 10:46:56AM +0200, Julien Rouhaud wrote: > FWIW I'm fine with those news comments! Okay, I got again on this one today and finished by committing the patch as of 5423853. -- Michael
Attachment
Re: Avoid useless retrieval of defaults and check constraints in pg_dump -a
From
Julien Rouhaud
Date:
On Wed, Sep 16, 2020 at 3:06 PM Michael Paquier <michael@paquier.xyz> wrote: > > On Tue, Sep 15, 2020 at 10:46:56AM +0200, Julien Rouhaud wrote: > > FWIW I'm fine with those news comments! > > Okay, I got again on this one today and finished by committing the > patch as of 5423853. Thanks Michael!