Thread: pg_dump restore time and Foreign Keys
pg_dump restore times can be high when the dump includes many ALTER TABLE ADD FOREIGN KEY statements, since each statement checks the data to see if it is fully valid in all cases.

I've been asked "why do we run that at all?", since if we dumped the tables together, we already know they match.

If we had a way for pg_dump to pass on the information that the test already passes, we would be able to skip the checks.

Proposal:

* Introduce a new mode for ALTER TABLE ADD FOREIGN KEY [WITHOUT CHECK]. When we run WITHOUT CHECK, iff both the source and target table are newly created in this transaction, then we skip the check. If the check is skipped we mark the constraint as being unchecked, so we can tell later if this has been used.

* Have pg_dump write the new syntax into its dumps, when both the source and target table are dumped in the same run.

I'm guessing that the WITHOUT CHECK option would not be acceptable as an unprotected trap for our lazy and wicked users. :-)

--
Simon Riggs
www.2ndQuadrant.com
PostgreSQL Training, Services and Support
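As a sketch, a dump under this proposal might look like the following; the WITHOUT CHECK clause is the hypothetical syntax being proposed here, and the tables and data are made up:

    BEGIN;
    CREATE TABLE customer (id integer PRIMARY KEY);
    CREATE TABLE orders (id integer PRIMARY KEY, customer_id integer);
    COPY customer (id) FROM stdin;
    1
    2
    \.
    COPY orders (id, customer_id) FROM stdin;
    10	1
    11	2
    \.
    -- Both tables were created in this transaction, so the check
    -- would be skipped and the constraint marked as unchecked.
    ALTER TABLE orders ADD FOREIGN KEY (customer_id)
        REFERENCES customer (id) WITHOUT CHECK;
    COMMIT;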
Simon Riggs wrote:
> I'm guessing that the WITHOUT CHECK option would not be acceptable as an
> unprotected trap for our lazy and wicked users. :-)

Yes, that sounds scary.

Instead, I'd suggest finding ways to speed up the ALTER TABLE ADD FOREIGN KEY, or speeding up COPY into a table with foreign keys already defined. For example, you might want to build an in-memory hash table of the keys in the target table, instead of issuing a query on each INSERT, if the target table isn't huge.

Nothing beats the speed of simply not checking the constraint, of course, but I'd hate to lose the protection it gives.

--
Heikki Linnakangas
EnterpriseDB   http://www.enterprisedb.com
On Thu, 2008-06-05 at 10:19 +0300, Heikki Linnakangas wrote:
> Simon Riggs wrote:
> > I'm guessing that the WITHOUT CHECK option would not be acceptable as an
> > unprotected trap for our lazy and wicked users. :-)
>
> Yes, that sounds scary.
>
> Instead, I'd suggest finding ways to speed up the ALTER TABLE ADD
> FOREIGN KEY.

I managed a suggestion for improving it for integers only, but if anybody has any other ideas, I'm all ears.

> Or speeding up COPY into a table with foreign keys already
> defined. For example, you might want to build an in-memory hash table of
> the keys in the target table, instead of issuing a query on each INSERT,
> if the target table isn't huge.

No, that's not the problem, but I agree that is a problem also.

> Nothing beats the speed of simply not checking the constraint, of
> course, but I'd hate to lose the protection it gives.

Are you saying you don't like the rest of the proposal, or just don't like the idea of having that added as an unprotected option, but find the proposal acceptable?

--
Simon Riggs
www.2ndQuadrant.com
PostgreSQL Training, Services and Support
Simon Riggs wrote:
> If we had a way of pg_dump passing on the information that the test
> already passes, we would be able to skip the checks.
>
> Proposal:
>
> * Introduce a new mode for ALTER TABLE ADD FOREIGN KEY [WITHOUT CHECK];
> * Have pg_dump write the new syntax into its dumps, when both the source
> and target table are dumped in the same run

I've been known to manually tweak dumps before now. I can see me forgetting this.

What about pg_dump writing out a row-count and MD5 of the rows in the COPY (just a textual calculation)? Iff the restore checksum matches the dump checksum for both tables, then the foreign-key checks can be skipped. If the restore checksum doesn't match the dump, it can issue a warning, but continue and run the full fkey check.

--
Richard Huxton
Archonet Ltd
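As a rough illustration of the textual calculation being suggested, the row count and checksum can already be computed over COPY output with existing tools (the table and database names are made up):

    $ psql -t -c "SELECT count(*) FROM customer" mydb
    $ psql -c "COPY customer TO STDOUT" mydb | md5sum

The proposal would have pg_dump embed the equivalent figures in the dump itself, with the restore recomputing and comparing them.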
Simon Riggs wrote:
> pg_dump restore times can be high when the dump includes many ALTER TABLE ADD
> FOREIGN KEY statements, since each statement checks the data to see if
> it is fully valid in all cases.
>
> I've been asked "why do we run that at all?", since if we dumped the tables
> together, we already know they match.
>
> If we had a way for pg_dump to pass on the information that the test
> already passes, we would be able to skip the checks.
>
> Proposal:
>
> * Introduce a new mode for ALTER TABLE ADD FOREIGN KEY [WITHOUT CHECK];
> When we run WITHOUT CHECK, iff both the source and target table are
> newly created in this transaction, then we skip the check. If the check
> is skipped we mark the constraint as being unchecked, so we can tell
> later if this has been used.
>
> * Have pg_dump write the new syntax into its dumps, when both the source
> and target table are dumped in the same run.
>
> I'm guessing that the WITHOUT CHECK option would not be acceptable as an
> unprotected trap for our lazy and wicked users. :-)

This whole proposal would be a major footgun which would definitely be abused, IMNSHO.

I think Heikki's idea of speeding up the check using a hash table of the foreign keys possibly has merit.

cheers

andrew
Simon Riggs wrote:
> Are you saying you don't like the rest of the proposal, or just don't
> like the idea of having that added as an unprotected option, but find
> the proposal acceptable?

I don't like the idea of having an unprotected option. If we were going to have one, I wouldn't bother with the extra checks you proposed; it's going to be unsafe anyway.

--
Heikki Linnakangas
EnterpriseDB   http://www.enterprisedb.com
On Thu, 2008-06-05 at 07:57 -0400, Andrew Dunstan wrote:
> Simon Riggs wrote:
> > pg_dump restore times can be high when the dump includes many ALTER TABLE ADD
> > FOREIGN KEY statements, since each statement checks the data to see if
> > it is fully valid in all cases.
> >
> > I've been asked "why do we run that at all?", since if we dumped the tables
> > together, we already know they match.
> >
> > If we had a way for pg_dump to pass on the information that the test
> > already passes, we would be able to skip the checks.
> >
> > Proposal:
> >
> > * Introduce a new mode for ALTER TABLE ADD FOREIGN KEY [WITHOUT CHECK];
> > When we run WITHOUT CHECK, iff both the source and target table are
> > newly created in this transaction, then we skip the check. If the check
> > is skipped we mark the constraint as being unchecked, so we can tell
> > later if this has been used.
> >
> > * Have pg_dump write the new syntax into its dumps, when both the source
> > and target table are dumped in the same run.
> >
> > I'm guessing that the WITHOUT CHECK option would not be acceptable as an
> > unprotected trap for our lazy and wicked users. :-)
>
> This whole proposal would be a major footgun which would definitely be
> abused, IMNSHO.

OK, understood. Two negatives is enough to sink it.

> I think Heikki's idea of speeding up the check using a hash table of the
> foreign keys possibly has merit.

The query is sent through SPI, so if there was a way to speed this up, we would already be using it implicitly. If we find a way to speed up joins, it will improve the FK check also.

The typical join plan for the check query is already a hash join, assuming the target table is small enough. If not, it's a huge sort/merge join. So in a way, we already follow the suggestion.

--
Simon Riggs
www.2ndQuadrant.com
PostgreSQL Training, Services and Support
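For reference, the validation query that ALTER TABLE ADD FOREIGN KEY issues through SPI is shaped roughly like this (the exact generated text varies by version; table and column names are illustrative, and any returned row is a violation):

    SELECT fk.customer_id
    FROM ONLY orders fk
        LEFT OUTER JOIN ONLY customer pk ON pk.id = fk.customer_id
    WHERE pk.id IS NULL AND fk.customer_id IS NOT NULL;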
Simon Riggs wrote:
> On Thu, 2008-06-05 at 10:19 +0300, Heikki Linnakangas wrote:
>> Simon Riggs wrote:
>>> I'm guessing that the WITHOUT CHECK option would not be acceptable as an
>>> unprotected trap for our lazy and wicked users. :-)
>>
>> Yes, that sounds scary.
>>
>> Instead, I'd suggest finding ways to speed up the ALTER TABLE ADD
>> FOREIGN KEY.
>
> I managed a suggestion for improving it for integers only, but if
> anybody has any other ideas, I'm all ears.

Well, one idea would be to allow adding multiple foreign keys in one command, and checking them all at once with one SQL query instead of one per foreign key. Right now we need one seq scan over the table per foreign key; by checking all references at once we would only need one seq scan to check them all.

>> Or speeding up COPY into a table with foreign keys already
>> defined. For example, you might want to build an in-memory hash table of
>> the keys in the target table, instead of issuing a query on each INSERT,
>> if the target table isn't huge.
>
> No, that's not the problem, but I agree that is a problem also.

It is related, because if we can make COPY into a table with foreign keys fast enough, we could rearrange dumps so that foreign keys are created before loading data. That would save the seqscan over the table altogether.

Thinking about this idea a bit more, instead of loading the whole target table into memory, it would probably make more sense to keep a hash table as just a cache of the most recent keys that have been referenced.

--
Heikki Linnakangas
EnterpriseDB   http://www.enterprisedb.com
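A sketch of how several foreign keys might be validated with a single scan of the referencing table, per the idea above (hypothetical schema; not a query the backend currently generates):

    SELECT f.a_id, f.b_id
    FROM ONLY fact f
        LEFT OUTER JOIN ONLY dim_a a ON a.id = f.a_id
        LEFT OUTER JOIN ONLY dim_b b ON b.id = f.b_id
    WHERE (a.id IS NULL AND f.a_id IS NOT NULL)
       OR (b.id IS NULL AND f.b_id IS NOT NULL);

One sequential scan over fact feeds both checks, so each additional reference costs little more than the first.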
On Thu, 2008-06-05 at 16:01 +0300, Heikki Linnakangas wrote:
> Simon Riggs wrote:
> > On Thu, 2008-06-05 at 10:19 +0300, Heikki Linnakangas wrote:
> >> Simon Riggs wrote:
> >>> I'm guessing that the WITHOUT CHECK option would not be acceptable as an
> >>> unprotected trap for our lazy and wicked users. :-)
> >>
> >> Yes, that sounds scary.
> >>
> >> Instead, I'd suggest finding ways to speed up the ALTER TABLE ADD
> >> FOREIGN KEY.
> >
> > I managed a suggestion for improving it for integers only, but if
> > anybody has any other ideas, I'm all ears.
>
> Well, one idea would be to allow adding multiple foreign keys in one
> command, and checking them all at once with one SQL query instead of one
> per foreign key. Right now we need one seq scan over the table per
> foreign key; by checking all references at once we would only need one
> seq scan to check them all.

No need. Just parallelise the restore with concurrent psql, which would speed up the index creation also. Does Greg have plans for further work?

> >> Or speeding up COPY into a table with foreign keys already
> >> defined. For example, you might want to build an in-memory hash table of
> >> the keys in the target table, instead of issuing a query on each INSERT,
> >> if the target table isn't huge.
> >
> > No, that's not the problem, but I agree that is a problem also.
>
> It is related, because if we can make COPY into a table with foreign
> keys fast enough, we could rearrange dumps so that foreign keys are
> created before loading data. That would save the seqscan over the table
> altogether.

True.

> Thinking about this idea a bit more, instead of loading the whole target
> table into memory, it would probably make more sense to keep a hash
> table as just a cache of the most recent keys that have been referenced.

If you can think of a way of improving hash joins generally, then it will work for this specific case also.

--
Simon Riggs
www.2ndQuadrant.com
PostgreSQL Training, Services and Support
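The concurrent-psql approach might look something like this, assuming the post-data section of the dump has been split into independent scripts by hand (file and database names are made up):

    # Run independent ALTER TABLE / CREATE INDEX batches in parallel.
    psql -f post_data_part1.sql mydb &
    psql -f post_data_part2.sql mydb &
    wait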
Simon Riggs wrote:
> On Thu, 2008-06-05 at 16:01 +0300, Heikki Linnakangas wrote:
>> Well, one idea would be to allow adding multiple foreign keys in one
>> command, and checking them all at once with one SQL query instead of one
>> per foreign key. Right now we need one seq scan over the table per
>> foreign key; by checking all references at once we would only need one
>> seq scan to check them all.
>
> No need. Just parallelise the restore with concurrent psql, which would
> speed up the index creation also.

True, you could do that.

> Does Greg have plans for further work?

I believe he's busy with other stuff at the moment.

>> Thinking about this idea a bit more, instead of loading the whole target
>> table into memory, it would probably make more sense to keep a hash
>> table as just a cache of the most recent keys that have been referenced.
>
> If you can think of a way of improving hash joins generally, then it
> will work for this specific case also.

Individual RI checks performed on inserts/COPY don't do a hash join. The bulk check done by ALTER TABLE ADD FOREIGN KEY does, but that's a different issue. This hash table would be a specific trick to speed up RI checks.

If you're I/O bound anyway, it wouldn't help, and you'd already be better off creating the foreign key first and loading the data after that.

--
Heikki Linnakangas
EnterpriseDB   http://www.enterprisedb.com
On Thursday 05 June 2008 08:56:35 Simon Riggs wrote:
> On Thu, 2008-06-05 at 07:57 -0400, Andrew Dunstan wrote:
> > Simon Riggs wrote:
> > > pg_dump restore times can be high when the dump includes many ALTER TABLE
> > > ADD FOREIGN KEY statements, since each statement checks the data to see
> > > if it is fully valid in all cases.
> > >
> > > I've been asked "why do we run that at all?", since if we dumped the
> > > tables together, we already know they match.
> > >
> > > If we had a way for pg_dump to pass on the information that the test
> > > already passes, we would be able to skip the checks.
> > >
> > > Proposal:
> > >
> > > * Introduce a new mode for ALTER TABLE ADD FOREIGN KEY [WITHOUT CHECK];
> > > When we run WITHOUT CHECK, iff both the source and target table are
> > > newly created in this transaction, then we skip the check. If the check
> > > is skipped we mark the constraint as being unchecked, so we can tell
> > > later if this has been used.
> > >
> > > * Have pg_dump write the new syntax into its dumps, when both the
> > > source and target table are dumped in the same run.
> > >
> > > I'm guessing that the WITHOUT CHECK option would not be acceptable as
> > > an unprotected trap for our lazy and wicked users. :-)
> >
> > This whole proposal would be a major footgun which would definitely be
> > abused, IMNSHO.
>
> OK, understood. Two negatives is enough to sink it.

Heh, I would have argued that the idea should go the other way and just make this part of the normal syntax. Oracle DBAs have been doing this for years (MS SQL supports it too, actually) and it really helps working around having to hold locks on large relations for lengthy periods of time. Heck, I'd like to see a no-check option for all constraints, really.

--
Robert Treat
Build A Brighter LAMP :: Linux Apache {middleware} PostgreSQL
On Sat, 2008-06-07 at 13:08 -0400, Robert Treat wrote:
> On Thursday 05 June 2008 08:56:35 Simon Riggs wrote:
> > On Thu, 2008-06-05 at 07:57 -0400, Andrew Dunstan wrote:
>
> Heh, I would have argued that the idea should go the other way and
> just make this part of the normal syntax. Oracle DBAs have been
> doing this for years (MS SQL supports it too, actually) and it really
> helps working around having to hold locks on large relations for
> lengthy periods of time. Heck, I'd like to see a no-check option for
> all constraints, really.

Interesting that SQL Server does it also.

Holding the lock for a long period is just one more problem. :-)

I'm always torn between the I-know-what-I'm-doing-so-give-me-the-option viewpoint and the some-dumbass-will-abuse-it viewpoint. I see the results of both viewpoints daily.

Perhaps we need a GUC that says expert_mode = on. In expert_mode we are allowed to do a range of things that are normally avoided - there would be an explicit list. Managers can then take a single considered decision as to whether the situation warrants extreme action and their DBA is good enough to handle it. That might resolve our continued angst about whether our users are smart enough to avoid the gotchas, or just smart enough to win a DBA's Darwin Award.

The UNIX philosophy has always been to allow the power to exist, yet seek to minimise the number of people who exercise it. Another idea might be to make such command options superuser-only, to ensure the power is available, yet only in the hands of, by definition, the trusted few.

--
Simon Riggs
www.2ndQuadrant.com
PostgreSQL Training, Services and Support
Simon Riggs wrote:
> On Sat, 2008-06-07 at 13:08 -0400, Robert Treat wrote:
>> On Thursday 05 June 2008 08:56:35 Simon Riggs wrote:
>>> On Thu, 2008-06-05 at 07:57 -0400, Andrew Dunstan wrote:
>>
>> Heh, I would have argued that the idea should go the other way and
>> just make this part of the normal syntax. Oracle DBAs have been
>> doing this for years (MS SQL supports it too, actually) and it really
>> helps working around having to hold locks on large relations for
>> lengthy periods of time. Heck, I'd like to see a no-check option for
>> all constraints, really.
>
> Interesting that SQL Server does it also.
>
> Holding the lock for a long period is just one more problem. :-)
>
> I'm always torn between the I-know-what-I'm-doing-so-give-me-the-option
> viewpoint and the some-dumbass-will-abuse-it viewpoint. I see the
> results of both viewpoints daily.
>
> Perhaps we need a GUC that says expert_mode = on. In expert_mode we are
> allowed to do a range of things that are normally avoided - there would
> be an explicit list. Managers can then take a single considered decision
> as to whether the situation warrants extreme action and their DBA is
> good enough to handle it. That might resolve our continued angst about
> whether our users are smart enough to avoid the gotchas, or just smart
> enough to win a DBA's Darwin Award.
>
> The UNIX philosophy has always been to allow the power to exist, yet
> seek to minimise the number of people who exercise it. Another idea
> might be to make such command options superuser-only, to ensure the
> power is available, yet only in the hands of, by definition, the trusted
> few.

If we go down this road then I would far rather we tried to devise some safe (or semi-safe) way of doing it instead of simply providing expert (a.k.a. footgun) mode.

For instance, I'm wondering if we could do something with checksums of the input lines or something else that would make this difficult to do in circumstances other than pg_restore.

cheers

andrew
Simon Riggs <simon@2ndquadrant.com> writes:
> Perhaps we need a GUC that says expert_mode = on. ... Another idea
> might be to make such command options superuser-only, to ensure the
> power is available, yet only in the hands of, by definition, the trusted
> few.

This all seems pretty useless, as the sort of user most likely to shoot himself in the foot will also always be running as superuser.

I'd much rather see us expend more effort on speeding up the checks than open holes in the system.

			regards, tom lane
On Saturday 07 June 2008 16:22:56 Tom Lane wrote:
> Simon Riggs <simon@2ndquadrant.com> writes:
> > Perhaps we need a GUC that says expert_mode = on. ... Another idea
> > might be to make such command options superuser-only, to ensure the
> > power is available, yet only in the hands of, by definition, the trusted
> > few.
>
> This all seems pretty useless, as the sort of user most likely to shoot
> himself in the foot will also always be running as superuser.

yeah, i'm not a big fan of "set enable_footgun=true" since the people likely to get tripped up are going to blindly enable these modes. otoh, if we do such a thing, i would be a big fan of calling it "enable_footgun" :-)

> I'd much rather see us expend more effort on speeding up the checks
> than open holes in the system.

and i'm sure no one is against that idea, but you're never going to be able to match the performance of just avoiding the check.

--
Robert Treat
Build A Brighter LAMP :: Linux Apache {middleware} PostgreSQL
Robert Treat <xzilla@users.sourceforge.net> writes:
> and i'm sure no one is against that idea, but you're never going to be able to
> match the performance of just avoiding the check.

We'll never be able to match the performance of not having transactions, either, but the community has never for a moment considered having a "no transactions" mode.

			regards, tom lane
On Sunday 08 June 2008 20:12:15 Tom Lane wrote:
> Robert Treat <xzilla@users.sourceforge.net> writes:
> > and i'm sure no one is against that idea, but you're never going to be
> > able to match the performance of just avoiding the check.
>
> We'll never be able to match the performance of not having transactions,
> either, but the community has never for a moment considered having a
> "no transactions" mode.

it's unclear what a "no transactions" mode would mean, but I'd be willing to guess some people have considered aspects of it (we've just never had agreement)

--
Robert Treat
Build A Brighter LAMP :: Linux Apache {middleware} PostgreSQL
On Jun 7, 2008, at 2:00 PM, Andrew Dunstan wrote:
> If we go down this road then I would far rather we tried to devise
> some safe (or semi-safe) way of doing it instead of simply
> providing expert (a.k.a. footgun) mode.
>
> For instance, I'm wondering if we could do something with checksums
> of the input lines or something else that would make this difficult
> to do in circumstances other than pg_restore.

Yes, but that provides no help at all outside of pg_dump. Being able to add a FK with NO CHECK would be tremendously useful outside of pg_dump. Actually, in the interest of stating the problem and not the solution, what we need is a way to add FKs that doesn't lock everything up to perform the key checks. Perhaps there is some semi-safe way that the constraint could be added and the checks done in the background...

As for the footgun aspect, are we the enterprise-class OSS database or the one that caters itself to noobs that will go out of their way to make life hard on themselves? I'm all in favor of not adding footguns that don't have value, but this one holds a lot of value for anyone trying to maintain a large database in a 24/7 environment. To put this in perspective, the amount of revenue we would lose from adding just one FK to one of our larger tables would more than cover paying someone to develop this feature.

--
Decibel!, aka Jim C. Nasby, Database Architect   decibel@decibel.org
Give your computer some brain candy! www.distributed.net Team #1828
Decibel! wrote:
> Yes, but that provides no help at all outside of pg_dump. Being able
> to add a FK with NO CHECK would be tremendously useful outside of
> pg_dump. Actually, in the interest of stating the problem and not the
> solution, what we need is a way to add FKs that doesn't lock
> everything up to perform the key checks. Perhaps there is some
> semi-safe way that the constraint could be added and the checks done
> in the background...

I had some thoughts along the same lines. But how do you propose to recover when the check fails? What should pg_restore do if the dump is corrupt causing an FK check to fail?

I suppose we could have some sort of marking for FK constraints along the lines of {checked, unchecked, invalid}.

> As for the footgun aspect, are we the enterprise-class OSS database or
> the one that caters itself to noobs that will go out of their way to
> make life hard on themselves?

We are the database that tries very hard to keep its promises. If you want to change or relax those promises then the implications need to be very very clear.

> I'm all in favor of not adding footguns that don't have value, but
> this one holds a lot of value for anyone trying to maintain a large
> database in a 24/7 environment. To put this in perspective, the amount
> of revenue we would lose from adding just one FK to one of our larger
> tables would more than cover paying someone to develop this feature.

Come up with a good proposal and I'm your man :-) I haven't seen one yet.

cheers

andrew
Decibel! <decibel@decibel.org> writes:
> Actually, in the interest of stating the problem and not the
> solution, what we need is a way to add FKs that doesn't lock
> everything up to perform the key checks.

Ah, finally a useful comment. I think it might be possible to do an "add FK concurrently" type of command that would take exclusive lock for just long enough to add the triggers, then scan the tables with just AccessShareLock to see if the existing rows meet the constraint, and if so finally mark the constraint "valid". Meanwhile the constraint would be enforced against newly-added rows by the triggers, so nothing gets missed.

You'd still get a small hiccup in system performance from the transient exclusive lock, but nothing like as bad as it is now. Would that solve your problem?

			regards, tom lane
On Mon, 2008-06-09 at 10:57 -0400, Tom Lane wrote:
> Decibel! <decibel@decibel.org> writes:
> > Actually, in the interest of stating the problem and not the
> > solution, what we need is a way to add FKs that doesn't lock
> > everything up to perform the key checks.
>
> Ah, finally a useful comment. I think it might be possible to do an
> "add FK concurrently" type of command that would take exclusive lock
> for just long enough to add the triggers, then scan the tables with just
> AccessShareLock to see if the existing rows meet the constraint, and
> if so finally mark the constraint "valid". Meanwhile the constraint
> would be enforced against newly-added rows by the triggers, so nothing
> gets missed. You'd still get a small hiccup in system performance
> from the transient exclusive lock, but nothing like as bad as it is
> now. Would that solve your problem?

That's good, but it doesn't solve the original user complaint about needing to re-run many, many large queries to which we already know the answer.

--
Simon Riggs
www.2ndQuadrant.com
PostgreSQL Training, Services and Support
Simon Riggs wrote:
> On Mon, 2008-06-09 at 10:57 -0400, Tom Lane wrote:
>> Decibel! <decibel@decibel.org> writes:
>>> Actually, in the interest of stating the problem and not the
>>> solution, what we need is a way to add FKs that doesn't lock
>>> everything up to perform the key checks.
>>
>> Ah, finally a useful comment. I think it might be possible to do an
>> "add FK concurrently" type of command that would take exclusive lock
>> for just long enough to add the triggers, then scan the tables with just
>> AccessShareLock to see if the existing rows meet the constraint, and
>> if so finally mark the constraint "valid". Meanwhile the constraint
>> would be enforced against newly-added rows by the triggers, so nothing
>> gets missed. You'd still get a small hiccup in system performance
>> from the transient exclusive lock, but nothing like as bad as it is
>> now. Would that solve your problem?
>
> That's good, but it doesn't solve the original user complaint about
> needing to re-run many, many large queries to which we already know the
> answer.

But we don't know it for dead sure, we only think we do. What if the data for one or other of the tables is corrupted? We'll end up with data we believe is consistent but in fact is not, ISTM. If you can somehow guarantee the integrity of data in both tables then we might be justified in assuming that the FK constraint will be consistent - that's why I suggested some sort of checksum mechanism might serve the purpose.

cheers

andrew
Simon Riggs <simon@2ndquadrant.com> writes:
> On Mon, 2008-06-09 at 10:57 -0400, Tom Lane wrote:
>> Ah, finally a useful comment. I think it might be possible to do an
>> "add FK concurrently" type of command that would take exclusive lock
>
> That's good, but it doesn't solve the original user complaint about
> needing to re-run many, many large queries to which we already know the
> answer.

No, we are running a large query to which the user *thinks* he knows the answer. There are any number of reasons why he might be wrong.

			regards, tom lane
On Mon, 2008-06-09 at 11:23 -0400, Andrew Dunstan wrote:
> Simon Riggs wrote:
> > On Mon, 2008-06-09 at 10:57 -0400, Tom Lane wrote:
> >> Ah, finally a useful comment. I think it might be possible to do an
> >> "add FK concurrently" type of command that would take exclusive lock
> >> for just long enough to add the triggers, then scan the tables with just
> >> AccessShareLock to see if the existing rows meet the constraint, and
> >> if so finally mark the constraint "valid".
> >
> > That's good, but it doesn't solve the original user complaint about
> > needing to re-run many, many large queries to which we already know the
> > answer.
>
> But we don't know it for dead sure, we only think we do. What if the
> data for one or other of the tables is corrupted? We'll end up with data
> we believe is consistent but in fact is not, ISTM. If you can somehow
> guarantee the integrity of data in both tables then we might be
> justified in assuming that the FK constraint will be consistent - that's
> why I suggested some sort of checksum mechanism might serve the purpose.

Agreed.

Can we get COPY to output the checksum of its output as part of the command tag? How else can we return the checksum? In $file.cksum for any given output file?

We can then use an explicit CHECKSUM option in the COPY when we reload.

--
Simon Riggs
www.2ndQuadrant.com
PostgreSQL Training, Services and Support
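A sketch of how the round trip might look; the extended command tag and the CHECKSUM option are both hypothetical, and the checksum value is made up:

    -- On dump: COPY reports a checksum of the text it wrote.
    COPY customer TO '/backup/customer.dat';
    -- hypothetical command tag: COPY 100000 CHECKSUM 2f1d...

    -- On reload: supply the expected value; COPY verifies it.
    COPY customer FROM '/backup/customer.dat' CHECKSUM '2f1d...';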
On Mon, 2008-06-09 at 11:33 -0400, Tom Lane wrote:
> Simon Riggs <simon@2ndquadrant.com> writes:
> > On Mon, 2008-06-09 at 10:57 -0400, Tom Lane wrote:
> >> Ah, finally a useful comment. I think it might be possible to do an
> >> "add FK concurrently" type of command that would take exclusive lock
>
> > That's good, but it doesn't solve the original user complaint about
> > needing to re-run many, many large queries to which we already know the
> > answer.
>
> No, we are running a large query to which the user *thinks* he knows the
> answer. There are any number of reasons why he might be wrong.

Of course. I should have said "to which we already know the answer" to indicate I'm passing on others' criticisms of us.

--
Simon Riggs
www.2ndQuadrant.com
PostgreSQL Training, Services and Support
Simon Riggs <simon@2ndquadrant.com> writes:
> On Mon, 2008-06-09 at 11:33 -0400, Tom Lane wrote:
>> No, we are running a large query to which the user *thinks* he knows the
>> answer. There are any number of reasons why he might be wrong.
>
> Of course. I should have said "to which we already know the answer" to
> indicate I'm passing on others' criticisms of us.

[ shrug... ] We don't know the answer either, and anyone who says we do is merely betraying his ignorance of the number of ways to load a foot-gun.

I don't have any confidence in the "checksum" proposal either, as it's still naively assuming that changes in the data are the only possible problem. Consider that you are loading the data into a new database, which might be running under a different locale setting, might contain a new implementation of a datatype with subtly (or not so subtly) different semantics, or might just already contain data in the target tables. pg_dump scripts are not nearly bright enough to defend against these types of threats.

			regards, tom lane
2008/6/9 Simon Riggs <simon@2ndquadrant.com>:
> On Mon, 2008-06-09 at 10:57 -0400, Tom Lane wrote:
>> Decibel! <decibel@decibel.org> writes:
>>> Actually, in the interest of stating the problem and not the
>>> solution, what we need is a way to add FKs that doesn't lock
>>> everything up to perform the key checks.
>>
>> Ah, finally a useful comment. I think it might be possible to do an
>> "add FK concurrently" type of command that would take exclusive lock
>> for just long enough to add the triggers, then scan the tables with just
>> AccessShareLock to see if the existing rows meet the constraint, and
>> if so finally mark the constraint "valid". Meanwhile the constraint
>> would be enforced against newly-added rows by the triggers, so nothing
>> gets missed. You'd still get a small hiccup in system performance
>> from the transient exclusive lock, but nothing like as bad as it is
>> now. Would that solve your problem?
>
> That's good, but it doesn't solve the original user complaint about
> needing to re-run many, many large queries to which we already know the
> answer.

Just a guess, but maybe the "create FK concurrently" feature combined with the "synchronized scan" feature _does_ resolve the original problem. If you run many "create FK concurrently" commands one after another, wouldn't the seq scan be reused?

--
Filip Rembiałkowski
Simon Riggs wrote:
>> But we don't know it for dead sure, we only think we do. What if the
>> data for one or other of the tables is corrupted? We'll end up with data
>> we believe is consistent but in fact is not, ISTM. If you can somehow
>> guarantee the integrity of data in both tables then we might be
>> justified in assuming that the FK constraint will be consistent - that's
>> why I suggested some sort of checksum mechanism might serve the purpose.
>
> Agreed.
>
> Can we get COPY to output the checksum of its output as part of the
> command tag? How else can we return the checksum? In $file.cksum for any
> given output file?

It seems a reasonable idea to use the command tag, unless that's going to break lots of stuff. I think the only thing we can usefully checksum is the output lines in the client encoding.

> We can then use an explicit CHECKSUM option in the COPY when we reload.

We need rather more than this to make sure your facility isn't abused. That's the part that I haven't been able to think of a good answer for (yet).

cheers

andrew
On Monday 09 June 2008 11:59:27 Tom Lane wrote:
> Simon Riggs <simon@2ndquadrant.com> writes:
> > On Mon, 2008-06-09 at 11:33 -0400, Tom Lane wrote:
> >> No, we are running a large query to which the user *thinks* he knows the
> >> answer. There are any number of reasons why he might be wrong.
> >
> > Of course. I should have said "to which we already know the answer" to
> > indicate I'm passing on others' criticisms of us.
>
> [ shrug... ] We don't know the answer either, and anyone who says
> we do is merely betraying his ignorance of the number of ways to load
> a foot-gun.

I think the more realistic scenario (based on the FK idea) is that you want to prevent any future rows from coming in without validating the FK, and you're willing to clean up any violators after the fact, since you can make that an "out of the critical path" operation.

if you extend this to a more general "create constraint concurrently" (to handle normal constraints, not null constraints, etc...), it would certainly be a big win, and i think most would see it as a reasonable compromise.

--
Robert Treat
Build A Brighter LAMP :: Linux Apache {middleware} PostgreSQL
On Mon, 2008-06-09 at 12:37 -0400, Robert Treat wrote:
> On Monday 09 June 2008 11:59:27 Tom Lane wrote:
> > [ shrug... ] We don't know the answer either, and anyone who says
> > we do is merely betraying his ignorance of the number of ways to load
> > a foot-gun.
>
> I think the more realistic scenario (based on the FK idea) is that you want to
> prevent any future rows from coming in without validating the FK, and you're
> willing to clean up any violators after the fact, since you can make that
> an "out of the critical path" operation.
>
> if you extend this to a more general "create constraint concurrently" (to
> handle normal constraints, not null constraints, etc...), it would certainly
> be a big win, and i think most would see it as a reasonable compromise.

Agreed. I think the "out of the critical path" action is more likely to be the intended path, rather than the "never check at all" route.

If we break down the action into two parts:

ALTER TABLE ... ADD CONSTRAINT foo FOREIGN KEY ... NOVALIDATE;
which holds an exclusive lock, but only momentarily. After this runs, any new data is validated at the moment of data change, but the older data has yet to be validated.

ALTER TABLE ... VALIDATE CONSTRAINT foo;
which runs the lengthy check, though only grabs the lock as the last part of the action.

This way we have the ability to add them concurrently if we choose, by running one after the other, or we can run the first part only for now and run the second at a more convenient moment.

A full set of checks on a large complex database can easily take hours or even days. We should allow this. It's not a footgun, it's an honest attempt by people to add RI checks to their database. The only other alternative for some people is to not add FKs at all, which is also a footgun, but we don't seem bothered that they might take that option.

--
Simon Riggs
www.2ndQuadrant.com
PostgreSQL Training, Services and Support
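A sketch of the proposed two-step sequence; the NOVALIDATE and VALIDATE CONSTRAINT syntax is what is being proposed here, not existing grammar, and the table and constraint names are made up:

    -- Step 1: momentary exclusive lock; from here on, new and
    -- changed rows are checked, but older rows are not yet validated.
    ALTER TABLE orders
        ADD CONSTRAINT orders_customer_fk
        FOREIGN KEY (customer_id) REFERENCES customer (id) NOVALIDATE;

    -- Step 2: later, at a convenient time, run the lengthy check
    -- over the older data; the lock is taken only at the end.
    ALTER TABLE orders VALIDATE CONSTRAINT orders_customer_fk;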
Simon Riggs wrote:
> If we break down the action into two parts:
>
> ALTER TABLE ... ADD CONSTRAINT foo FOREIGN KEY ... NOVALIDATE;
> which holds an exclusive lock, but only momentarily. After this runs,
> any new data is validated at the moment of data change, but the older
> data has yet to be validated.
>
> ALTER TABLE ... VALIDATE CONSTRAINT foo;
> which runs the lengthy check, though only grabs the lock as the last
> part of the action.

The problem I see with this approach in general (two-phase FK creation) is that you have to keep the same transaction for the first and second command, but you really want concurrent backends to see the tuple for the not-yet-validated constraint row.

Another benefit that could arise from this is that the hypothetical VALIDATE CONSTRAINT step could validate more than one constraint at a time, possibly processing all the constraints with a single table scan.

Perhaps VALIDATE CONSTRAINT could be handled as an automatic commit-time action.

--
Alvaro Herrera   http://www.CommandPrompt.com/
PostgreSQL Replication, Consulting, Custom Development, 24x7 support
"Alvaro Herrera" <alvherre@commandprompt.com> writes: > The problem I see with this approach in general (two-phase FK creation) > is that you have to keep the same transaction for the first and second > command, but you really want concurrent backends to see the tuple for > the not-yet-validated constraint row. Do you? It seems like having a constraint which is enforced on any new operations but which doesn't guarantee that existing records satisfy it is a useful feature in itself -- separating the two concepts "this property is true for all records" and "any action taken must leave the record with this property" ISTM you can validate an "invalid" constraint using any snapshot taken at any time >= the original snapshot. As long as the constraint is being enforced for all transactions which start after the validating snapshot's xmin then when it's done it can know the constraint is valid. Taking a lock on the table to create the constraint certainly leaves that property fulfilled. Actually it seems we could not take any lock and just check when it comes time to do the validation that the snapshot's xmin is >= the xmin on the constraint. I'm starting to get leery of all these tightly argued bits of logic though. Each one on its own is safe but the resulting system is getting to be quite complex. > Another benefit that could arise from this is that the hypothetical > VALIDATE CONSTRAINT step could validate more than one constraint at a > time, possibly processing all the constraints with a single table scan. Interesting. > Perhaps VALIDATE CONSTRAINT could be handled as an automatic commit-time > action. I don't really like this, at least not as the only option, because as I said above and Robert Treat also said, it could be useful to have the constraint in place for new operations but check it for the existing data at some later date. (Or even never) -- Gregory Stark EnterpriseDB http://www.enterprisedb.com Ask me about EnterpriseDB's Slony Replication support!
On Mon, 2008-06-09 at 14:07 -0400, Alvaro Herrera wrote:
> Simon Riggs wrote:
> > If we break down the action into two parts:
> >
> > ALTER TABLE ... ADD CONSTRAINT foo FOREIGN KEY ... NOVALIDATE;
> > which holds an exclusive lock, but only momentarily. After this runs,
> > any new data is validated at the moment of data change, but the older
> > data has yet to be validated.
> >
> > ALTER TABLE ... VALIDATE CONSTRAINT foo;
> > which runs the lengthy check, though only grabs the lock as the last
> > part of the action.
>
> The problem I see with this approach in general (two-phase FK creation)
> is that you have to keep the same transaction for the first and second
> command, but you really want concurrent backends to see the tuple for
> the not-yet-validated constraint row.

Well, they *must* be in separate transactions if we are to avoid holding an AccessExclusiveLock while we perform the check. Plus the whole idea is to perform the second part at some other non-critical time, though we all agree that never performing the check at all is foolhardy.

Maybe we say that you can defer the check, but after a while autovacuum runs it for you if you haven't done so. It would certainly be useful to run the VALIDATE part as a background task with vacuum wait enabled.

> Another benefit that could arise from this is that the hypothetical
> VALIDATE CONSTRAINT step could validate more than one constraint at a
> time, possibly processing all the constraints with a single table scan.

Good thought, though not as useful for FK checks.

--
Simon Riggs
www.2ndQuadrant.com
PostgreSQL Training, Services and Support
Simon Riggs wrote:
> Maybe we say that you can defer the check, but after a while autovacuum
> runs it for you if you haven't done so. It would certainly be useful to
> run the VALIDATE part as a background task with vacuum wait enabled.

It would be useful if there was anywhere to report the error to, or an action that could be taken automatically.

--
Alvaro Herrera   http://www.CommandPrompt.com/
PostgreSQL Replication, Consulting, Custom Development, 24x7 support