Thread: pg_dump restore time and Foreign Keys

pg_dump restore time and Foreign Keys

From
Simon Riggs
Date:
pg_dump restore times can be high when they include many ALTER TABLE ADD
FORIEGN KEY statements, since each statement checks the data to see if
it is fully valid in all cases.

I've been asked "why we run that at all?", since if we dumped the tables
together, we already know they match.

If we had a way of pg_dump passing on the information that the test
already passes, we would be able to skip the checks.

Proposal:

* Introduce a new mode for ALTER TABLE ADD FOREIGN KEY [WITHOUT CHECK];
When we run WITHOUT CHECK, iff both the source and target table are
newly created in this transaction, then we skip the check. If the check
is skipped we mark the constraint as being unchecked, so we can tell
later if this has been used.

* Have pg_dump write the new syntax into its dumps, when both the source
and target table are dumped in same run

I'm guessing that the WITHOUT CHECK option would not be acceptable as an
unprotected trap for our lazy and wicked users. :-)

-- Simon Riggs           www.2ndQuadrant.comPostgreSQL Training, Services and Support



Re: pg_dump restore time and Foreign Keys

From
"Heikki Linnakangas"
Date:
Simon Riggs wrote:
> I'm guessing that the WITHOUT CHECK option would not be acceptable as an
> unprotected trap for our lazy and wicked users. :-)

Yes, that sounds scary.

Instead, I'd suggest finding ways to speed up the ALTER TABLE ADD 
FOREIGN KEY. Or speeding up COPY into a table with foreign keys already 
defined. For example, you might want to build an in-memory hash table of 
the keys in the target table, instead of issuing a query on each INSERT, 
if the target table isn't huge.

Nothing beats the speed of simply not checking the constraint, of 
course, but I'd hate to lose the protection it gives.

--   Heikki Linnakangas  EnterpriseDB   http://www.enterprisedb.com


Re: pg_dump restore time and Foreign Keys

From
Simon Riggs
Date:
On Thu, 2008-06-05 at 10:19 +0300, Heikki Linnakangas wrote:
> Simon Riggs wrote:
> > I'm guessing that the WITHOUT CHECK option would not be acceptable as an
> > unprotected trap for our lazy and wicked users. :-)
> 
> Yes, that sounds scary.
> 
> Instead, I'd suggest finding ways to speed up the ALTER TABLE ADD 
> FOREIGN KEY. 

I managed a suggestion for improving it for integers only, but if
anybody has any other ideas, I'm all ears. 

> Or speeding up COPY into a table with foreign keys already 
> defined. For example, you might want to build an in-memory hash table of 
> the keys in the target table, instead of issuing a query on each INSERT, 
> if the target table isn't huge.

No, that's not the problem, but I agree that is a problem also.

> Nothing beats the speed of simply not checking the constraint, of 
> course, but I'd hate to lose the protection it gives.

Are you saying you don't like the rest of the proposal, or just don't
like the idea of having that added as an unprotected option, but find
the proposal acceptable? 

-- Simon Riggs           www.2ndQuadrant.comPostgreSQL Training, Services and Support



Re: pg_dump restore time and Foreign Keys

From
Richard Huxton
Date:
Simon Riggs wrote:
> 
> If we had a way of pg_dump passing on the information that the test
> already passes, we would be able to skip the checks.
> 
> Proposal:
> 
> * Introduce a new mode for ALTER TABLE ADD FOREIGN KEY [WITHOUT CHECK];

> * Have pg_dump write the new syntax into its dumps, when both the source
> and target table are dumped in same I've been known to manually tweak dumps before now. I can see me 
forgetting this.

What about pg_dump writing out a row-count and MD5 of the rows in the 
COPY (just a textual calculation). Iff the restore checksum matches the 
dump checksum for both tables then the foreign-keys can be skipped.

If the restore checksum doesn't match the dump then it can issue a 
warning, but continue and run the full fkey check.

--   Richard Huxton  Archonet Ltd


Re: pg_dump restore time and Foreign Keys

From
Andrew Dunstan
Date:

Simon Riggs wrote:
> pg_dump restore times can be high when they include many ALTER TABLE ADD
> FORIEGN KEY statements, since each statement checks the data to see if
> it is fully valid in all cases.
>
> I've been asked "why we run that at all?", since if we dumped the tables
> together, we already know they match.
>
> If we had a way of pg_dump passing on the information that the test
> already passes, we would be able to skip the checks.
>
> Proposal:
>
> * Introduce a new mode for ALTER TABLE ADD FOREIGN KEY [WITHOUT CHECK];
> When we run WITHOUT CHECK, iff both the source and target table are
> newly created in this transaction, then we skip the check. If the check
> is skipped we mark the constraint as being unchecked, so we can tell
> later if this has been used.
>
> * Have pg_dump write the new syntax into its dumps, when both the source
> and target table are dumped in same run
>
> I'm guessing that the WITHOUT CHECK option would not be acceptable as an
> unprotected trap for our lazy and wicked users. :-)
>   

This whole proposal would be a major footgun which would definitely be 
abused, IMNSHO.

I think Heikki's idea of speeding up the check using a hash table of the 
foreign keys possibly has merit.

cheers

andrew


Re: pg_dump restore time and Foreign Keys

From
"Heikki Linnakangas"
Date:
Simon Riggs wrote:
> Are you saying you don't like the rest of the proposal, or just don't
> like the idea of having that added as an unprotected option, but find
> the proposal acceptable? 

I don't like the idea of having an unprotected option. If we were going 
to have one, I wouldn't bother with the extra checks you proposed; it's 
going to be unsafe anyway.

--   Heikki Linnakangas  EnterpriseDB   http://www.enterprisedb.com


Re: pg_dump restore time and Foreign Keys

From
Simon Riggs
Date:
On Thu, 2008-06-05 at 07:57 -0400, Andrew Dunstan wrote:
> 
> Simon Riggs wrote:
> > pg_dump restore times can be high when they include many ALTER TABLE ADD
> > FORIEGN KEY statements, since each statement checks the data to see if
> > it is fully valid in all cases.
> >
> > I've been asked "why we run that at all?", since if we dumped the tables
> > together, we already know they match.
> >
> > If we had a way of pg_dump passing on the information that the test
> > already passes, we would be able to skip the checks.
> >
> > Proposal:
> >
> > * Introduce a new mode for ALTER TABLE ADD FOREIGN KEY [WITHOUT CHECK];
> > When we run WITHOUT CHECK, iff both the source and target table are
> > newly created in this transaction, then we skip the check. If the check
> > is skipped we mark the constraint as being unchecked, so we can tell
> > later if this has been used.
> >
> > * Have pg_dump write the new syntax into its dumps, when both the source
> > and target table are dumped in same run
> >
> > I'm guessing that the WITHOUT CHECK option would not be acceptable as an
> > unprotected trap for our lazy and wicked users. :-)
> >   
> 
> This whole proposal would be a major footgun which would definitely be 
> abused, IMNSHO.

OK, understood. Two negatives is enough to sink it.

> I think Heikki's idea of speeding up the check using a hash table of the 
> foreign keys possibly has merit.

The query is sent through SPI, so if there was a way to speed this up,
we would already be using it implicitly. If we find a way to speed up
joins it will improve the FK check also.

The typical join plan for the check query is already a hash join,
assuming the target table is small enough. If not, its a huge sort/merge
join. So in a way, we already follow the suggestion.

-- Simon Riggs           www.2ndQuadrant.comPostgreSQL Training, Services and Support



Re: pg_dump restore time and Foreign Keys

From
"Heikki Linnakangas"
Date:
Simon Riggs wrote:
> On Thu, 2008-06-05 at 10:19 +0300, Heikki Linnakangas wrote:
>> Simon Riggs wrote:
>>> I'm guessing that the WITHOUT CHECK option would not be acceptable as an
>>> unprotected trap for our lazy and wicked users. :-)
>> Yes, that sounds scary.
>>
>> Instead, I'd suggest finding ways to speed up the ALTER TABLE ADD 
>> FOREIGN KEY. 
> 
> I managed a suggestion for improving it for integers only, but if
> anybody has any other ideas, I'm all ears. 

Well, one idea would be to allow adding multiple foreign keys in one 
command, and checking them all at once with one SQL query instead of one 
per foreign key. Right now we need one seq scan over the table per 
foreign key, by checking all references at once we would only need one 
seq scan to check them all.

>> Or speeding up COPY into a table with foreign keys already 
>> defined. For example, you might want to build an in-memory hash table of 
>> the keys in the target table, instead of issuing a query on each INSERT, 
>> if the target table isn't huge.
> 
> No, that's not the problem, but I agree that is a problem also.

It is related, because if we can make COPY into a table with foreign 
keys fast enough, we could rearrange dumps so that foreign keys are 
created before loading data. That would save the seqscan over the table 
altogether.

Thinking about this idea a bit more, instead of loading the whole target 
table into memory, it would probably make more sense to keep a hash 
table as just a cache of the most recent keys that have been referenced.

--   Heikki Linnakangas  EnterpriseDB   http://www.enterprisedb.com


Re: pg_dump restore time and Foreign Keys

From
Simon Riggs
Date:
On Thu, 2008-06-05 at 16:01 +0300, Heikki Linnakangas wrote:
> Simon Riggs wrote:
> > On Thu, 2008-06-05 at 10:19 +0300, Heikki Linnakangas wrote:
> >> Simon Riggs wrote:
> >>> I'm guessing that the WITHOUT CHECK option would not be acceptable as an
> >>> unprotected trap for our lazy and wicked users. :-)
> >> Yes, that sounds scary.
> >>
> >> Instead, I'd suggest finding ways to speed up the ALTER TABLE ADD 
> >> FOREIGN KEY. 
> > 
> > I managed a suggestion for improving it for integers only, but if
> > anybody has any other ideas, I'm all ears. 
> 
> Well, one idea would be to allow adding multiple foreign keys in one 
> command, and checking them all at once with one SQL query instead of one 
> per foreign key. Right now we need one seq scan over the table per 
> foreign key, by checking all references at once we would only need one 
> seq scan to check them all.

No need. Just parallelise the restore with concurrent psql. Which would
speed up the index creation also. Does Greg have plans for further work?

> >> Or speeding up COPY into a table with foreign keys already 
> >> defined. For example, you might want to build an in-memory hash table of 
> >> the keys in the target table, instead of issuing a query on each INSERT, 
> >> if the target table isn't huge.
> > 
> > No, that's not the problem, but I agree that is a problem also.
> 
> It is related, because if we can make COPY into a table with foreign 
> keys fast enough, we could rearrange dumps so that foreign keys are 
> created before loading data. That would save the seqscan over the table 
> altogether.

True.

> Thinking about this idea a bit more, instead of loading the whole target 
> table into memory, it would probably make more sense to keep a hash 
> table as just a cache of the most recent keys that have been referenced.

If you can think of a way of improving hash joins generally, then it
will work for this specific case also.

-- Simon Riggs           www.2ndQuadrant.comPostgreSQL Training, Services and Support



Re: pg_dump restore time and Foreign Keys

From
"Heikki Linnakangas"
Date:
Simon Riggs wrote:
> On Thu, 2008-06-05 at 16:01 +0300, Heikki Linnakangas wrote:
>> Well, one idea would be to allow adding multiple foreign keys in one 
>> command, and checking them all at once with one SQL query instead of one 
>> per foreign key. Right now we need one seq scan over the table per 
>> foreign key, by checking all references at once we would only need one 
>> seq scan to check them all.
> 
> No need. Just parallelise the restore with concurrent psql. Which would
> speed up the index creation also.

True, you could do that.

>  Does Greg have plans for further work?

I believe he's busy with other stuff at the moment.

>> Thinking about this idea a bit more, instead of loading the whole target 
>> table into memory, it would probably make more sense to keep a hash 
>> table as just a cache of the most recent keys that have been referenced.
> 
> If you can think of a way of improving hash joins generally, then it
> will work for this specific case also.

Individual RI checks performed on inserts/COPY don't do a hash join. The 
bulk check done by ALTER TABLE ADD FOREIGN KEY does, but that's 
different issue.

This hash table would be a specific trick to speed up RI checks. If 
you're anyway I/O bound, it wouldn't help, and you'd already be better 
off creating the foreign key first and loading the data after that.

--   Heikki Linnakangas  EnterpriseDB   http://www.enterprisedb.com


Re: pg_dump restore time and Foreign Keys

From
Robert Treat
Date:
On Thursday 05 June 2008 08:56:35 Simon Riggs wrote:
> On Thu, 2008-06-05 at 07:57 -0400, Andrew Dunstan wrote:
> > Simon Riggs wrote:
> > > pg_dump restore times can be high when they include many ALTER TABLE
> > > ADD FORIEGN KEY statements, since each statement checks the data to see
> > > if it is fully valid in all cases.
> > >
> > > I've been asked "why we run that at all?", since if we dumped the
> > > tables together, we already know they match.
> > >
> > > If we had a way of pg_dump passing on the information that the test
> > > already passes, we would be able to skip the checks.
> > >
> > > Proposal:
> > >
> > > * Introduce a new mode for ALTER TABLE ADD FOREIGN KEY [WITHOUT CHECK];
> > > When we run WITHOUT CHECK, iff both the source and target table are
> > > newly created in this transaction, then we skip the check. If the check
> > > is skipped we mark the constraint as being unchecked, so we can tell
> > > later if this has been used.
> > >
> > > * Have pg_dump write the new syntax into its dumps, when both the
> > > source and target table are dumped in same run
> > >
> > > I'm guessing that the WITHOUT CHECK option would not be acceptable as
> > > an unprotected trap for our lazy and wicked users. :-)
> >
> > This whole proposal would be a major footgun which would definitely be
> > abused, IMNSHO.
>
> OK, understood. Two negatives is enough to sink it.
>

Heh, I would have argued that the idea should go the other way and just make 
this part of the normal syntax.  Oracle DBA's have been doing this for years 
(MS SQL supports it too actually) and it really helps working around having 
to hold locks on large relations for lengthy periods of times. Heck, I'd like 
to see a no check option for all constraints really. 

-- 
Robert Treat
Build A Brighter LAMP :: Linux Apache {middleware} PostgreSQL


Re: pg_dump restore time and Foreign Keys

From
Simon Riggs
Date:
On Sat, 2008-06-07 at 13:08 -0400, Robert Treat wrote:
> On Thursday 05 June 2008 08:56:35 Simon Riggs wrote:
> > On Thu, 2008-06-05 at 07:57 -0400, Andrew Dunstan wrote:

> Heh, I would have argued that the idea should go the other way and
> just make this part of the normal syntax.  Oracle DBA's have been
> doing this for years (MS SQL supports it too actually) and it really
> helps working around having to hold locks on large relations for
> lengthy periods of times. Heck, I'd like to see a no check option for
> all constraints really. 

Interesting that SQL Server does it also.

Holding the lock for a long period is just one more problem. :-)

I'm always torn between the I-know-what-Im-doing-so-give-me-the-option
viewpoint and the some-dumbass-will-abuse-it viewpoint. I see the
results of both viewpoints daily.

Perhaps we need a GUC that says expert_mode = on. In expert_mode we are
allowed to do a range of things that are normally avoided - there would
be an explicit list. Managers can then take a single considered decision
as to whether the situation warrants extreme action and their DBA is
good enough to handle it. That might resolve our continued angst about
whether our users our smart enough to avoid the gotchas, or just smart
enough to win a DBA's Darwin Award. 

The UNIX philosophy has always been to allow the power to exist, yet
seek to minimise the number of people who exercise it. Another idea
might be to make such command options superuser only, to ensure the
power is available, yet only in the hands of, by-definition, the trusted
few.

-- Simon Riggs           www.2ndQuadrant.comPostgreSQL Training, Services and Support



Re: pg_dump restore time and Foreign Keys

From
Andrew Dunstan
Date:

Simon Riggs wrote:
> On Sat, 2008-06-07 at 13:08 -0400, Robert Treat wrote:
>   
>> On Thursday 05 June 2008 08:56:35 Simon Riggs wrote:
>>     
>>> On Thu, 2008-06-05 at 07:57 -0400, Andrew Dunstan wrote:
>>>       
>
>   
>> Heh, I would have argued that the idea should go the other way and
>> just make this part of the normal syntax.  Oracle DBA's have been
>> doing this for years (MS SQL supports it too actually) and it really
>> helps working around having to hold locks on large relations for
>> lengthy periods of times. Heck, I'd like to see a no check option for
>> all constraints really. 
>>     
>
> Interesting that SQL Server does it also.
>
> Holding the lock for a long period is just one more problem. :-)
>
> I'm always torn between the I-know-what-Im-doing-so-give-me-the-option
> viewpoint and the some-dumbass-will-abuse-it viewpoint. I see the
> results of both viewpoints daily.
>
> Perhaps we need a GUC that says expert_mode = on. In expert_mode we are
> allowed to do a range of things that are normally avoided - there would
> be an explicit list. Managers can then take a single considered decision
> as to whether the situation warrants extreme action and their DBA is
> good enough to handle it. That might resolve our continued angst about
> whether our users our smart enough to avoid the gotchas, or just smart
> enough to win a DBA's Darwin Award. 
>
> The UNIX philosophy has always been to allow the power to exist, yet
> seek to minimise the number of people who exercise it. Another idea
> might be to make such command options superuser only, to ensure the
> power is available, yet only in the hands of, by-definition, the trusted
> few.
>
>   

If we go down this road then I would far rather we tried to devise some 
safe (or semi-safe) way of doing it instead of simply providing expert 
(a.k.a. footgun) mode.

For instance, I'm wondering if we could do something with checksums of 
the input lines or something else that would make this difficult to do 
in circumstances other than pg_restore.

cheers

andrew


Re: pg_dump restore time and Foreign Keys

From
Tom Lane
Date:
Simon Riggs <simon@2ndquadrant.com> writes:
> Perhaps we need a GUC that says expert_mode = on. ...  Another idea
> might be to make such command options superuser only, to ensure the
> power is available, yet only in the hands of, by-definition, the trusted
> few.

This all seems pretty useless, as the sort of user most likely to shoot
himself in the foot will also always be running as superuser.

I'd much rather see us expend more effort on speeding up the checks
than open holes in the system.
        regards, tom lane


Re: pg_dump restore time and Foreign Keys

From
Robert Treat
Date:
On Saturday 07 June 2008 16:22:56 Tom Lane wrote:
> Simon Riggs <simon@2ndquadrant.com> writes:
> > Perhaps we need a GUC that says expert_mode = on. ...  Another idea
> > might be to make such command options superuser only, to ensure the
> > power is available, yet only in the hands of, by-definition, the trusted
> > few.
>
> This all seems pretty useless, as the sort of user most likely to shoot
> himself in the foot will also always be running as superuser.
>

yeah, i'm not a big fan of "set enable_footgun=true" since the people likely 
to get tripped up are going to blindly enable these modes. 

otoh, if we do such a thing, i would be a big fan of calling 
it "enable_footgun" :-)

> I'd much rather see us expend more effort on speeding up the checks
> than open holes in the system.
>

and i'm sure no one is against that idea, but you're never going to be able to 
match the performance of just avoiding the check. 

-- 
Robert Treat
Build A Brighter LAMP :: Linux Apache {middleware} PostgreSQL


Re: pg_dump restore time and Foreign Keys

From
Tom Lane
Date:
Robert Treat <xzilla@users.sourceforge.net> writes:
> and i'm sure no one is against that idea, but you're never going to be able to 
> match the performance of just avoiding the check. 

We'll never be able to match the performance of not having transactions,
either, but the community has never for a moment considered having a
"no transactions" mode.
        regards, tom lane


Re: pg_dump restore time and Foreign Keys

From
Robert Treat
Date:
On Sunday 08 June 2008 20:12:15 Tom Lane wrote:
> Robert Treat <xzilla@users.sourceforge.net> writes:
> > and i'm sure no one is against that idea, but you're never going to be
> > able to match the performance of just avoiding the check.
>
> We'll never be able to match the performance of not having transactions,
> either, but the community has never for a moment considered having a
> "no transactions" mode.
>

it's unclear what a "no transaction" mode would mean, but I'd be willing to 
guess some people have consider aspects of it (we've just never had 
agreement)

-- 
Robert Treat
Build A Brighter LAMP :: Linux Apache {middleware} PostgreSQL


Re: pg_dump restore time and Foreign Keys

From
Decibel!
Date:
On Jun 7, 2008, at 2:00 PM, Andrew Dunstan wrote:
> If we go down this road then I would far rather we tried to devise  
> some safe (or semi-safe) way of doing it instead of simply  
> providing expert (a.k.a. footgun) mode.
>
> For instance, I'm wondering if we could do something with checksums  
> of the input lines or something else that would make this difficult  
> to do in circumstances other than pg_restore.


Yes, but that provides no help at all outside of pg_dump. Being able  
to add a FK with NO CHECK would be tremendously useful outside of  
pg_dump. Actually, in the interest of stating the problem and not the  
solution, what we need is a way to add FKs that doesn't lock  
everything up to perform the key checks. Perhaps there is some semi- 
safe way that the constraint could be added and the checks done in  
the background...

As for the footgun aspect, are we the enterprise-class OSS database  
or the one that caters itself to noobs that will go out of their way  
to make life hard on themselves? I'm all in favor of not adding  
footguns that don't have value, but this one holds a lot of value for  
anyone trying to maintain a large database in a 24/7 environment. To  
put this in perspective, the amount of revenue we would loose from  
adding just one FK to one of our larger tables would more than cover  
paying someone to develop this feature.
-- 
Decibel!, aka Jim C. Nasby, Database Architect  decibel@decibel.org
Give your computer some brain candy! www.distributed.net Team #1828



Re: pg_dump restore time and Foreign Keys

From
Andrew Dunstan
Date:

Decibel! wrote:
>
>
> Yes, but that provides no help at all outside of pg_dump. Being able 
> to add a FK with NO CHECK would be tremendously useful outside of 
> pg_dump. Actually, in the interest of stating the problem and not the 
> solution, what we need is a way to add FKs that doesn't lock 
> everything up to perform the key checks. Perhaps there is some 
> semi-safe way that the constraint could be added and the checks done 
> in the background...

I had some thoughts along the same lines.

But how do you propose to recover when the check fails? What should 
pg_restore do if the dump is corrupt causing an FK check to fail?

I suppose we could have some sort of marking for FK constraints along 
the lines of {checked, unchecked, invalid}.

>
> As for the footgun aspect, are we the enterprise-class OSS database or 
> the one that caters itself to noobs that will go out of their way to 
> make life hard on themselves? 

We are the database that tries very hard to keep its promises. If you 
want to change or relax those promises then the implications need to be 
very very clear.

> I'm all in favor of not adding footguns that don't have value, but 
> this one holds a lot of value for anyone trying to maintain a large 
> database in a 24/7 environment. To put this in perspective, the amount 
> of revenue we would loose from adding just one FK to one of our larger 
> tables would more than cover paying someone to develop this feature.
>


Come up with a good proposal and I'm your man :-)  I haven't seen one yet.

cheers

andrew


Re: pg_dump restore time and Foreign Keys

From
Tom Lane
Date:
Decibel! <decibel@decibel.org> writes:
> Actually, in the interest of stating the problem and not the  
> solution, what we need is a way to add FKs that doesn't lock  
> everything up to perform the key checks.

Ah, finally a useful comment.  I think it might be possible to do an
"add FK concurrently" type of command that would take exclusive lock
for just long enough to add the triggers, then scan the tables with just
AccessShareLock to see if the existing rows meet the constraint, and
if so finally mark the constraint "valid".  Meanwhile the constraint
would be enforced against newly-added rows by the triggers, so nothing
gets missed.  You'd still get a small hiccup in system performance
from the transient exclusive lock, but nothing like as bad as it is
now.  Would that solve your problem?
        regards, tom lane


Re: pg_dump restore time and Foreign Keys

From
Simon Riggs
Date:
On Mon, 2008-06-09 at 10:57 -0400, Tom Lane wrote:
> Decibel! <decibel@decibel.org> writes:
> > Actually, in the interest of stating the problem and not the  
> > solution, what we need is a way to add FKs that doesn't lock  
> > everything up to perform the key checks.
> 
> Ah, finally a useful comment.  I think it might be possible to do an
> "add FK concurrently" type of command that would take exclusive lock
> for just long enough to add the triggers, then scan the tables with just
> AccessShareLock to see if the existing rows meet the constraint, and
> if so finally mark the constraint "valid".  Meanwhile the constraint
> would be enforced against newly-added rows by the triggers, so nothing
> gets missed.  You'd still get a small hiccup in system performance
> from the transient exclusive lock, but nothing like as bad as it is
> now.  Would that solve your problem?

That's good, but it doesn't solve the original user complaint about
needing to re-run many, many large queries to which we already know the
answer.

-- Simon Riggs           www.2ndQuadrant.comPostgreSQL Training, Services and Support



Re: pg_dump restore time and Foreign Keys

From
Andrew Dunstan
Date:

Simon Riggs wrote:
> On Mon, 2008-06-09 at 10:57 -0400, Tom Lane wrote:
>   
>> Decibel! <decibel@decibel.org> writes:
>>     
>>> Actually, in the interest of stating the problem and not the  
>>> solution, what we need is a way to add FKs that doesn't lock  
>>> everything up to perform the key checks.
>>>       
>> Ah, finally a useful comment.  I think it might be possible to do an
>> "add FK concurrently" type of command that would take exclusive lock
>> for just long enough to add the triggers, then scan the tables with just
>> AccessShareLock to see if the existing rows meet the constraint, and
>> if so finally mark the constraint "valid".  Meanwhile the constraint
>> would be enforced against newly-added rows by the triggers, so nothing
>> gets missed.  You'd still get a small hiccup in system performance
>> from the transient exclusive lock, but nothing like as bad as it is
>> now.  Would that solve your problem?
>>     
>
> That's good, but it doesn't solve the original user complaint about
> needing to re-run many, many large queries to which we already know the
> answer.
>
>   

But we don't know it for dead sure, we only think we do. What if the 
data for one or other of the tables is corrupted? We'll end up with data 
we believe is consistent but in fact is not, ISTM. If you can somehow 
guarantee the integrity of data in both tables then we might be 
justified in assuming that the FK constraint will be consistent - that's 
why I suggested some sort of checksum mechanism might serve the purpose.

cheers

andrew


Re: pg_dump restore time and Foreign Keys

From
Tom Lane
Date:
Simon Riggs <simon@2ndquadrant.com> writes:
> On Mon, 2008-06-09 at 10:57 -0400, Tom Lane wrote:
>> Ah, finally a useful comment.  I think it might be possible to do an
>> "add FK concurrently" type of command that would take exclusive lock

> That's good, but it doesn't solve the original user complaint about
> needing to re-run many, many large queries to which we already know the
> answer.

No, we are running a large query to which the user *thinks* he knows the
answer.  There are any number of reasons why he might be wrong.
        regards, tom lane


Re: pg_dump restore time and Foreign Keys

From
Simon Riggs
Date:
On Mon, 2008-06-09 at 11:23 -0400, Andrew Dunstan wrote:
> 
> Simon Riggs wrote:
> > On Mon, 2008-06-09 at 10:57 -0400, Tom Lane wrote:
> >   
> >> Decibel! <decibel@decibel.org> writes:
> >>     
> >>> Actually, in the interest of stating the problem and not the  
> >>> solution, what we need is a way to add FKs that doesn't lock  
> >>> everything up to perform the key checks.
> >>>       
> >> Ah, finally a useful comment.  I think it might be possible to do an
> >> "add FK concurrently" type of command that would take exclusive lock
> >> for just long enough to add the triggers, then scan the tables with just
> >> AccessShareLock to see if the existing rows meet the constraint, and
> >> if so finally mark the constraint "valid".  Meanwhile the constraint
> >> would be enforced against newly-added rows by the triggers, so nothing
> >> gets missed.  You'd still get a small hiccup in system performance
> >> from the transient exclusive lock, but nothing like as bad as it is
> >> now.  Would that solve your problem?
> >>     
> >
> > That's good, but it doesn't solve the original user complaint about
> > needing to re-run many, many large queries to which we already know the
> > answer.
> >
> But we don't know it for dead sure, we only think we do. What if the 
> data for one or other of the tables is corrupted? We'll end up with data 
> we believe is consistent but in fact is not, ISTM. If you can somehow 
> guarantee the integrity of data in both tables then we might be 
> justified in assuming that the FK constraint will be consistent - that's 
> why I suggested some sort of checksum mechanism might serve the purpose.

Agreed.

Can we get COPY to output the checksum of its output as part of the
command tag? How else can we return the checksum? In $file.cksum for any
given output file?

We can then use an explicit checksum option in the COPY when we reload,
with CHECKSUM option.

-- Simon Riggs           www.2ndQuadrant.comPostgreSQL Training, Services and Support



Re: pg_dump restore time and Foreign Keys

From
Simon Riggs
Date:
On Mon, 2008-06-09 at 11:33 -0400, Tom Lane wrote:
> Simon Riggs <simon@2ndquadrant.com> writes:
> > On Mon, 2008-06-09 at 10:57 -0400, Tom Lane wrote:
> >> Ah, finally a useful comment.  I think it might be possible to do an
> >> "add FK concurrently" type of command that would take exclusive lock
> 
> > That's good, but it doesn't solve the original user complaint about
> > needing to re-run many, many large queries to which we already know the
> > answer.
> 
> No, we are running a large query to which the user *thinks* he knows the
> answer.  There are any number of reasons why he might be wrong.

Of course. I should have said "to which we already know the answer" to
indicate I'm passing on others' criticisms of us.

-- Simon Riggs           www.2ndQuadrant.comPostgreSQL Training, Services and Support



Re: pg_dump restore time and Foreign Keys

From
Tom Lane
Date:
Simon Riggs <simon@2ndquadrant.com> writes:
> On Mon, 2008-06-09 at 11:33 -0400, Tom Lane wrote:
>> No, we are running a large query to which the user *thinks* he knows the
>> answer.  There are any number of reasons why he might be wrong.

> Of course. I should have said "to which we already know the answer" to
> indicate I'm passing on others' criticisms of us.

[ shrug... ]  We don't know the answer either, and anyone who says
we do is merely betraying his ignorance of the number of ways to load
a foot-gun.

I don't have any confidence in the "checksum" proposal either, as it's
still naively assuming that changes in the data are the only possible
problem.  Consider that you are loading the data into a new database,
which might be running under a different locale setting, might contain
a new implementation of a datatype with subtly (or not so subtly)
different semantics, or might just already contain data in the target
tables.  pg_dump scripts are not nearly bright enough to defend against
these types of threats.
        regards, tom lane


Re: pg_dump restore time and Foreign Keys

From
"Filip Rembiałkowski"
Date:


2008/6/9 Simon Riggs <simon@2ndquadrant.com>:

On Mon, 2008-06-09 at 10:57 -0400, Tom Lane wrote:
> Decibel! <decibel@decibel.org> writes:
> > Actually, in the interest of stating the problem and not the
> > solution, what we need is a way to add FKs that doesn't lock
> > everything up to perform the key checks.
>
> Ah, finally a useful comment.  I think it might be possible to do an
> "add FK concurrently" type of command that would take exclusive lock
> for just long enough to add the triggers, then scan the tables with just
> AccessShareLock to see if the existing rows meet the constraint, and
> if so finally mark the constraint "valid".  Meanwhile the constraint
> would be enforced against newly-added rows by the triggers, so nothing
> gets missed.  You'd still get a small hiccup in system performance
> from the transient exclusive lock, but nothing like as bad as it is
> now.  Would that solve your problem?

That's good, but it doesn't solve the original user complaint about
needing to re-run many, many large queries to which we already know the
answer.

just a guess, but maybe "create FK concurrently" feature combined with "synchronized scan" feature _does_ resolve original problem.

if you run many "create FK concurrently" one after another, wouldn't the seq scan be reused?












--
 Simon Riggs           www.2ndQuadrant.com
 PostgreSQL Training, Services and Support


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers



--
Filip Rembiałkowski

Re: pg_dump restore time and Foreign Keys

From
Andrew Dunstan
Date:

Simon Riggs wrote:
>> But we don't know it for dead sure, we only think we do. What if the 
>> data for one or other of the tables is corrupted? We'll end up with data 
>> we believe is consistent but in fact is not, ISTM. If you can somehow 
>> guarantee the integrity of data in both tables then we might be 
>> justified in assuming that the FK constraint will be consistent - that's 
>> why I suggested some sort of checksum mechanism might serve the purpose.
>>     
>
> Agreed.
>
> Can we get COPY to output the checksum of its output as part of the
> command tag? How else can we return the checksum? In $file.cksum for any
> given output file?
>   

It seems a reasonable idea to use the command tag, unless that's going 
to break lots of stuff. I think the only thing we can usefully checksum 
is the output lines in the client encoding.

> We can then use an explicit checksum option in the COPY when we reload,
> with CHECKSUM option.
>
>   

We need rather more than this to make sure your facility isn't abused. 
That's the part that I haven't been able to think of a good answer for 
(yet).

cheers

andrew


Re: pg_dump restore time and Foreign Keys

From
Robert Treat
Date:
On Monday 09 June 2008 11:59:27 Tom Lane wrote:
> Simon Riggs <simon@2ndquadrant.com> writes:
> > On Mon, 2008-06-09 at 11:33 -0400, Tom Lane wrote:
> >> No, we are running a large query to which the user *thinks* he knows the
> >> answer.  There are any number of reasons why he might be wrong.
> >
> > Of course. I should have said "to which we already know the answer" to
> > indicate I'm passing on others' criticisms of us.
>
> [ shrug... ]  We don't know the answer either, and anyone who says
> we do is merely betraying his ignorance of the number of ways to load
> a foot-gun.
>

I think the more realistic scenario (based on the FK idea) is that you want to 
prevent any future rows from coming without validating the FK, and you're 
willing to clean up any violators after the fact, since you can make that 
an "out of the critical path" operation.

if you extend this to a more general "create constraint concurrently" (to 
handle normal constraint, not null constraints, etc...), it would certainly 
be a big win, and i think most would see it as a reasonable compromise. 

-- 
Robert Treat
Build A Brighter LAMP :: Linux Apache {middleware} PostgreSQL


Re: pg_dump restore time and Foreign Keys

From
Simon Riggs
Date:
On Mon, 2008-06-09 at 12:37 -0400, Robert Treat wrote:
> On Monday 09 June 2008 11:59:27 Tom Lane wrote:
> > Simon Riggs <simon@2ndquadrant.com> writes:
> > > On Mon, 2008-06-09 at 11:33 -0400, Tom Lane wrote:
> > >> No, we are running a large query to which the user *thinks* he knows the
> > >> answer.  There are any number of reasons why he might be wrong.
> > >
> > > Of course. I should have said "to which we already know the answer" to
> > > indicate I'm passing on others' criticisms of us.
> >
> > [ shrug... ]  We don't know the answer either, and anyone who says
> > we do is merely betraying his ignorance of the number of ways to load
> > a foot-gun.
> >
> 
> I think the more realistic scenario (based on the FK idea) is that you want to 
> prevent any future rows from coming without validating the FK, and you're 
> willing to clean up any violators after the fact, since you can make that 
> an "out of the critical path" operation.
> 
> if you extend this to a more general "create constraint concurrently" (to 
> handle normal constraint, not null constraints, etc...), it would certainly 
> be a big win, and i think most would see it as a reasonable compromise. 

Agreed. I think the "out of the critical path" action is more likely to
be the intended path rather than the "never check at all" route.

If we break down the action into two parts.

ALTER TABLE ... ADD CONSTRAINT foo FOREIGN KEY ... NOVALIDATE;
which holds exclusive lock, but only momentarily
After this runs any new data is validated at moment of data change, but
the older data has yet to be validated.

ALTER TABLE ... VALIDATE CONSTRAINT foo
which runs lengthy check, though only grabs lock as last part of action

This way we have the ability to add them concurrently if we choose by
running one after the other, or we can run first part only for now and
run the other one at a more convenient moment.

On a full set of checks on a large complex database can easily take
hours or even days.

We should allow this. It's not a footgun, its an honest attempt by
people to add RI checks to their database. The only other alternative
for some people is to not add FKs at all, which is also a footgun, but
we don't seem bothered that they might take that option.

-- Simon Riggs           www.2ndQuadrant.comPostgreSQL Training, Services and Support



Re: pg_dump restore time and Foreign Keys

From
Alvaro Herrera
Date:
Simon Riggs wrote:

> If we break down the action into two parts.
> 
> ALTER TABLE ... ADD CONSTRAINT foo FOREIGN KEY ... NOVALIDATE;
> which holds exclusive lock, but only momentarily
> After this runs any new data is validated at moment of data change, but
> the older data has yet to be validated.
> 
> ALTER TABLE ... VALIDATE CONSTRAINT foo
> which runs lengthy check, though only grabs lock as last part of action

The problem I see with this approach in general (two-phase FK creation)
is that you have to keep the same transaction for the first and second
command, but you really want concurrent backends to see the tuple for
the not-yet-validated constraint row.

Another benefit that could arise from this is that the hypothetical
VALIDATE CONSTRAINT step could validate more than one constraint at a
time, possibly processing all the constraints with a single table scan.

Perhaps VALIDATE CONSTRAINT could be handled as an automatic commit-time
action.

-- 
Alvaro Herrera                                http://www.CommandPrompt.com/
PostgreSQL Replication, Consulting, Custom Development, 24x7 support


Re: pg_dump restore time and Foreign Keys

From
Gregory Stark
Date:
"Alvaro Herrera" <alvherre@commandprompt.com> writes:

> The problem I see with this approach in general (two-phase FK creation)
> is that you have to keep the same transaction for the first and second
> command, but you really want concurrent backends to see the tuple for
> the not-yet-validated constraint row.

Do you? It seems like having a constraint which is enforced on any new
operations but which doesn't guarantee that existing records satisfy it is a
useful feature in itself -- separating the two concepts "this property is true
for all records" and "any action taken must leave the record with this
property"

ISTM you can validate an "invalid" constraint using any snapshot taken at any
time >= the original snapshot. As long as the constraint is being enforced for
all transactions which start after the validating snapshot's xmin then when
it's done it can know the constraint is valid.

Taking a lock on the table to create the constraint certainly leaves that
property fulfilled. Actually it seems we could not take any lock and just
check when it comes time to do the validation that the snapshot's xmin is >=
the xmin on the constraint. I'm starting to get leery of all these tightly
argued bits of logic though. Each one on its own is safe but the resulting
system is getting to be quite complex.

> Another benefit that could arise from this is that the hypothetical
> VALIDATE CONSTRAINT step could validate more than one constraint at a
> time, possibly processing all the constraints with a single table scan.

Interesting.

> Perhaps VALIDATE CONSTRAINT could be handled as an automatic commit-time
> action.

I don't really like this, at least not as the only option, because as I said
above and Robert Treat also said, it could be useful to have the constraint in
place for new operations but check it for the existing data at some later
date. (Or even never)

--  Gregory Stark EnterpriseDB          http://www.enterprisedb.com Ask me about EnterpriseDB's Slony Replication
support!


Re: pg_dump restore time and Foreign Keys

From
Simon Riggs
Date:
On Mon, 2008-06-09 at 14:07 -0400, Alvaro Herrera wrote:
> Simon Riggs wrote:
> 
> > If we break down the action into two parts.
> > 
> > ALTER TABLE ... ADD CONSTRAINT foo FOREIGN KEY ... NOVALIDATE;
> > which holds exclusive lock, but only momentarily
> > After this runs any new data is validated at moment of data change, but
> > the older data has yet to be validated.
> > 
> > ALTER TABLE ... VALIDATE CONSTRAINT foo
> > which runs lengthy check, though only grabs lock as last part of action
> 
> The problem I see with this approach in general (two-phase FK creation)
> is that you have to keep the same transaction for the first and second
> command, but you really want concurrent backends to see the tuple for
> the not-yet-validated constraint row.

Well, they *must* be in separate transactions if we are to avoid holding
an AccessExclusiveLock while we perform the check. Plus the whole idea
is to perform the second part at some other non-critical time, though we
all agree that never performing the check at all is foolhardy.

Maybe we say that you can defer the check, but after a while autovacuum
runs it for you if you haven't done so. It would certainly be useful to
run the VALIDATE part as a background task with vacuum wait enabled.

> Another benefit that could arise from this is that the hypothetical
> VALIDATE CONSTRAINT step could validate more than one constraint at a
> time, possibly processing all the constraints with a single table scan.

Good thought, though not as useful for FK checks.

-- Simon Riggs           www.2ndQuadrant.comPostgreSQL Training, Services and Support



Re: pg_dump restore time and Foreign Keys

From
Alvaro Herrera
Date:
Simon Riggs wrote:

> Maybe we say that you can defer the check, but after a while autovacuum
> runs it for you if you haven't done so. It would certainly be useful to
> run the VALIDATE part as a background task with vacuum wait enabled.

It would be useful if there was anywhere to report the error to, or an
action that could be taken automatically.

-- 
Alvaro Herrera                                http://www.CommandPrompt.com/
PostgreSQL Replication, Consulting, Custom Development, 24x7 support