Thread: Unexpected VACUUM FULL failure
This is a bit disturbing: *** ./expected/vacuum.out Sat Jul 20 00:58:14 2002 --- ./results/vacuum.out Wed Aug 8 20:16:45 2007 *************** *** 50,55 **** --- 50,56 ---- DELETE FROM vactst WHERE i != 0; VACUUM FULL vactst; + ERROR: HEAP_MOVED_OFF was expected DELETE FROM vactst; SELECT * FROM vactst; i ====================================================================== This is today's CVS HEAD, plus some startup/shutdown logic changes in postmaster.c that hardly seem like they could be related. I couldn't reproduce it in a few tries. A reasonable guess is that it's triggered by autovacuum deciding to vacuum the table sometime before the VACUUM FULL starts. Anyone want to try to reproduce it? regards, tom lane
Tom Lane wrote: > This is a bit disturbing: > > *** ./expected/vacuum.out Sat Jul 20 00:58:14 2002 > --- ./results/vacuum.out Wed Aug 8 20:16:45 2007 > *************** > *** 50,55 **** > --- 50,56 ---- > > DELETE FROM vactst WHERE i != 0; > VACUUM FULL vactst; > + ERROR: HEAP_MOVED_OFF was expected > DELETE FROM vactst; > SELECT * FROM vactst; > i > > ====================================================================== > > This is today's CVS HEAD, plus some startup/shutdown logic changes in > postmaster.c that hardly seem like they could be related. > > I couldn't reproduce it in a few tries. A reasonable guess is that > it's triggered by autovacuum deciding to vacuum the table sometime > before the VACUUM FULL starts. Anyone want to try to reproduce it? Hum, aren't vacuums supposed to be blocked by each other? Maybe this is about a toast table not being locked enough against concurrent vacuuming of the main table. I'm currently away on vacation, which is why I've missed all the stuff going on here. Sorry for not letting people know. -- Alvaro Herrera http://www.CommandPrompt.com/ The PostgreSQL Company - Command Prompt, Inc.
Alvaro Herrera <alvherre@commandprompt.com> writes: > Tom Lane wrote: >> I couldn't reproduce it in a few tries. A reasonable guess is that >> it's triggered by autovacuum deciding to vacuum the table sometime >> before the VACUUM FULL starts. Anyone want to try to reproduce it? > Hum, aren't vacuums supposed to be blocked by each other? Sure. I'm not thinking it's a case of concurrent vacuums (if it is, we've got worse problems than anyone imagined), but rather that the autovac left the table in a state that exposes a bug in the subsequent VACUUM FULL. Since we've whacked the tqual.c logic around recently, the problem might actually lie there... regards, tom lane
I wrote: > ... Since we've whacked the tqual.c logic around recently, > the problem might actually lie there... In fact, I bet this is a result of the async-commit patch. The places where vacuum.c bleats "HEAP_MOVED_OFF was expected" are all places where it is looking at a tuple not marked XMIN_COMMITTED; it expects that after its first pass over the table, *every* tuple is either XMIN_COMMITTED or one that it moved. Async commit changed tqual.c so that tuples that are in fact known committed might not get marked XMIN_COMMITTED right away. The patch tries to prevent this from happening within VACUUM FULL by means of /* * VACUUM FULL assumes that all tuple states are well-known prior to * moving tuples around --- see comment "knowndead" in repair_frag(), * as well as simplifications in tqual.c. So before we start we must * ensure that anyasynchronously-committed transactions with changes * against this table have been flushed to disk. It's sufficientto do * this once after we've acquired AccessExclusiveLock. */ XLogAsyncCommitFlush(); but I bet lunch that that's not good enough. I still haven't reproduced it, but I'm thinking that the inexact bookkeeping that we created for clog page LSNs allows tuples to not get marked if the right sort of timing of concurrent transactions happens. Not sure about the best solution for this. regards, tom lane
On Wed, 2007-08-08 at 23:23 -0400, Tom Lane wrote: > I wrote: > > ... Since we've whacked the tqual.c logic around recently, > > the problem might actually lie there... > > In fact, I bet this is a result of the async-commit patch. The places > where vacuum.c bleats "HEAP_MOVED_OFF was expected" are all places where > it is looking at a tuple not marked XMIN_COMMITTED; it expects that > after its first pass over the table, *every* tuple is either > XMIN_COMMITTED or one that it moved. Async commit changed tqual.c > so that tuples that are in fact known committed might not get marked > XMIN_COMMITTED right away. The patch tries to prevent this from > happening within VACUUM FULL by means of > > /* > * VACUUM FULL assumes that all tuple states are well-known prior to > * moving tuples around --- see comment "known dead" in repair_frag(), > * as well as simplifications in tqual.c. So before we start we must > * ensure that any asynchronously-committed transactions with changes > * against this table have been flushed to disk. It's sufficient to do > * this once after we've acquired AccessExclusiveLock. > */ > XLogAsyncCommitFlush(); > > but I bet lunch that that's not good enough. I still haven't reproduced > it, but I'm thinking that the inexact bookkeeping that we created for > clog page LSNs allows tuples to not get marked if the right sort of > timing of concurrent transactions happens. > > Not sure about the best solution for this. Good hunch. I plugged this hole earlier, but on further inspection I can see the plug wasn't wide enough. XLogAsyncCommitFlush() is good enough, but HeapTupleSatisfiesVacuum() still allowed the inexact bookkeeping to sometimes skip hint bit setting, when executed with concurrent transactions touching other tables. ISTM that if we call HeapTupleSatisfiesVacuum() with an additional boolean parameter, force, we can tell VF to always set the hint bits in every case, not just HEAP_MOVED_IN and HEAP_MOVED_OUT. Patch enclosed, but a little crufty. Gotta run now, talk later. -- Simon Riggs EnterpriseDB http://www.enterprisedb.com
Attachment
"Simon Riggs" <simon@2ndquadrant.com> writes: > Good hunch. I plugged this hole earlier, but on further inspection I can > see the plug wasn't wide enough. XLogAsyncCommitFlush() is good enough, > but HeapTupleSatisfiesVacuum() still allowed the inexact bookkeeping to > sometimes skip hint bit setting, when executed with concurrent > transactions touching other tables. > ISTM that if we call HeapTupleSatisfiesVacuum() with an additional > boolean parameter, force, we can tell VF to always set the hint bits in > every case, not just HEAP_MOVED_IN and HEAP_MOVED_OUT. Surely this approach is no good: won't it allow hint bits to reach disk in advance of their transaction? I think it'd be safer, and a lot less ugly, to recast the tests in VACUUM FULL. If we make the first pass clear any old MOVED_IN/MOVED_OUT bits then the last pass can key off those instead of assuming that XMIN_COMMITTED is set everywhere. Then we'd not need XLogAsyncCommitFlush, which is a kluge anyway. regards, tom lane
"Tom Lane" <tgl@sss.pgh.pa.us> writes: > "Simon Riggs" <simon@2ndquadrant.com> writes: > >> ISTM that if we call HeapTupleSatisfiesVacuum() with an additional >> boolean parameter, force, we can tell VF to always set the hint bits in >> every case, not just HEAP_MOVED_IN and HEAP_MOVED_OUT. > > Surely this approach is no good: won't it allow hint bits to reach disk > in advance of their transaction? I don't think so since it sounds like he's saying to still sync the log and VACUUM FULL has an exclusive lock on the table. So any committed (or aborted) changes it sees in the table must have been committed or aborted before the log sync. -- Gregory Stark EnterpriseDB http://www.enterprisedb.com
Gregory Stark wrote: > "Tom Lane" <tgl@sss.pgh.pa.us> writes: > >> "Simon Riggs" <simon@2ndquadrant.com> writes: >> >>> ISTM that if we call HeapTupleSatisfiesVacuum() with an additional >>> boolean parameter, force, we can tell VF to always set the hint bits in >>> every case, not just HEAP_MOVED_IN and HEAP_MOVED_OUT. >> Surely this approach is no good: won't it allow hint bits to reach disk >> in advance of their transaction? > > I don't think so since it sounds like he's saying to still sync the log and > VACUUM FULL has an exclusive lock on the table. So any committed (or aborted) > changes it sees in the table must have been committed or aborted before the > log sync. Hint bit updates are not WAL-logged, so there's no mechanism to keep the data page from hitting the disk before the COMMIT record. That's the reason why we can't just set the hint bits for async committed transactions in the first place. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
Heikki Linnakangas <heikki@enterprisedb.com> writes: > Gregory Stark wrote: >> I don't think so since it sounds like he's saying to still sync the log and >> VACUUM FULL has an exclusive lock on the table. So any committed (or aborted) >> changes it sees in the table must have been committed or aborted before the >> log sync. > Hint bit updates are not WAL-logged, so there's no mechanism to keep the > data page from hitting the disk before the COMMIT record. That's the > reason why we can't just set the hint bits for async committed > transactions in the first place. Well the theory was that the flush at the start would ensure that VF's first scan could always set the hint bits. But I remembered this morning why I felt uneasy about that: it's not guaranteed true for system catalogs. We sometimes release locks on catalogs before committing. (Whether that is a good idea is a different discussion.) What I'm now thinking we should do is to have the first scan check whether XMIN_COMMITTED is actually set (after HeapTupleSatisfiesVacuum claims it's good), and abandon defrag if not, the same as we already do in the other corner cases where we find not-guaranteed-committed tuples (see INSERT_IN_PROGRESS/DELETE_IN_PROGRESS cases in scan_heap). If we do that, we don't actually need XLogAsyncCommitFlush() at the start. I'm inclined to remove it just because it's ugly. Comments? BTW, something else that just penetrated my brain is that this failure can only happen with synchronous_commit = off. In the sync-commit case, even though we have inexact bookkeeping for clog LSNs, it's always true that every LSN recorded for a clog page will have been flushed first. So we will always think we can set hint bits even though we might be testing an LSN later than the particular transaction in question. I just rechecked and verified that I'd been (without really thinking about it) running my test build with synchronous_commit = off for the past few days. If I happened to see this in one of the relatively small number of parallel regression sets I've run since then, then it should have been obvious in the buildfarm. The reason it wasn't was that none of the buildfarm machines are testing async-commit. We need to do something about that. regards, tom lane
On Fri, 2007-08-10 at 11:11 -0400, Tom Lane wrote: > Heikki Linnakangas <heikki@enterprisedb.com> writes: > > Gregory Stark wrote: > >> I don't think so since it sounds like he's saying to still sync the log and > >> VACUUM FULL has an exclusive lock on the table. So any committed (or aborted) > >> changes it sees in the table must have been committed or aborted before the > >> log sync. > > > Hint bit updates are not WAL-logged, so there's no mechanism to keep the > > data page from hitting the disk before the COMMIT record. That's the > > reason why we can't just set the hint bits for async committed > > transactions in the first place. > > Well the theory was that the flush at the start would ensure that VF's > first scan could always set the hint bits. But I remembered this > morning why I felt uneasy about that: it's not guaranteed true for > system catalogs. We sometimes release locks on catalogs before > committing. (Whether that is a good idea is a different discussion.) Yes, I see comments in the VF code along those lines. Seems like we should have code comments explaining exactly where we do this, why and which things it causes issues for. > What I'm now thinking we should do is to have the first scan check > whether XMIN_COMMITTED is actually set (after HeapTupleSatisfiesVacuum > claims it's good), and abandon defrag if not, the same as we already do > in the other corner cases where we find not-guaranteed-committed tuples > (see INSERT_IN_PROGRESS/DELETE_IN_PROGRESS cases in scan_heap). I now think we *must* do this for system catalogs, or something else at least. Personally I would prefer preventing async commits from occurring when a system catalog has been touched at all, which would make the VF situation and a few other situations go away entirely. However, I see no reason to do as you suggest for other tables. It seems like this would be an annoying feature to have VF sometimes fail, depending upon the timing of other transactions. There would be a rare failure if an async commit touched an early block in a table immediately prior to a VF. It's rare but possible. This would be extremely annoying if a DBA ran a job to VF a table overnight and then came back in to see the ERROR message. It is fairly easy to code it this way though, if you really think this is the best way. > If we do that, we don't actually need XLogAsyncCommitFlush() at the start. > I'm inclined to remove it just because it's ugly. Comments? I'm not clear why XLogAsyncCommitFlush() is ugly. It's one line of code that doesn't make anything else any harder. Apart from system catalogs, doing it that way would be error free for all normal VFs. > BTW, something else that just penetrated my brain is that this failure > can only happen with synchronous_commit = off. In the sync-commit case, > even though we have inexact bookkeeping for clog LSNs, it's always true > that every LSN recorded for a clog page will have been flushed first. > So we will always think we can set hint bits even though we might be > testing an LSN later than the particular transaction in question. > I just rechecked and verified that I'd been (without really thinking > about it) running my test build with synchronous_commit = off for the > past few days. If I happened to see this in one of the relatively small > number of parallel regression sets I've run since then, then it should > have been obvious in the buildfarm. The reason it wasn't was that none > of the buildfarm machines are testing async-commit. We need to do > something about that. Yes, this issue effects async commit only. -- Simon Riggs EnterpriseDB http://www.enterprisedb.com
"Simon Riggs" <simon@2ndquadrant.com> writes: > On Fri, 2007-08-10 at 11:11 -0400, Tom Lane wrote: >> If we do that, we don't actually need XLogAsyncCommitFlush() at the start. >> I'm inclined to remove it just because it's ugly. Comments? > I'm not clear why XLogAsyncCommitFlush() is ugly. It's one line of code > that doesn't make anything else any harder. Apart from system catalogs, > doing it that way would be error free for all normal VFs. Please recall that the failure that started this thread was on a regular user table. To do what you want will introduce what I'm now thinking is an unacceptable amount of fragility. In particular your patch of yesterday to force hint-bit setting in VF scares the heck out of me. The reason why XLogAsyncCommitFlush() is ugly is that it doesn't actually accomplish a darn thing, as we now see from this failure: it does not guarantee that hint bits will get set, because of the inexact bookkeeping in clog.c. It might marginally improve the probability that they get set, but that's all. The reason I want to take it out is mainly that its very existence tempts people to make the same mistake that was made here. > I now think we *must* do this for system catalogs, or something else at > least. Personally I would prefer preventing async commits from occurring > when a system catalog has been touched at all, which would make the VF > situation and a few other situations go away entirely. I think that that is completely the wrong line of thought: we should be making async commit work everywhere, or as nearly so as possible, not inventing arbitrary restrictions to place on it. The restriction you suggest here would probably cost more performance (by forcing sync commits) than could ever be gained back by being able to assume hint bits are set. In the patch as committed, I took out most of the restrictions you had on async commit --- the only utility commands that force sync commit are the ones that have direct effects on the filesystem. Another argument is that VACUUM FULL is a dinosaur that should probably go away entirely someday (a view I believe you share); it should therefore not be allowed to drive the design of other parts of the system. regards, tom lane
"Tom Lane" <tgl@sss.pgh.pa.us> writes: > The reason why XLogAsyncCommitFlush() is ugly is that it doesn't > actually accomplish a darn thing, as we now see from this failure: > it does not guarantee that hint bits will get set, because of the > inexact bookkeeping in clog.c. It might marginally improve the > probability that they get set, but that's all. The reason I want > to take it out is mainly that its very existence tempts people to make > the same mistake that was made here. I don't understand your reasoning here and I would like to understand it so if you don't mind I would like to see if I can work out what you're talking about. Regardless of this point I think the impact of what you were proposing to do to VF instead was much less than Simon seemed to think it was so it seems like a perfectly acceptable solution. As far as I understand the Xlog flush in combination with keeping an exclusive lock on table and always holding locks until the end of the transaction make forcing the hint bits entirely safe. The fragility you see comes from depending on how those three things interact and the difficulty in ensuring that all of those properties are always true. By "marginally improve the probability" you're making a judgement of the probability that programmers will manage to maintain all those properties consistently, not about the probability that the race will occur at run-time? -- Gregory Stark EnterpriseDB http://www.enterprisedb.com
"Tom Lane" <tgl@sss.pgh.pa.us> writes: > Another argument is that VACUUM FULL is a dinosaur that should probably > go away entirely someday (a view I believe you share); it should > therefore not be allowed to drive the design of other parts of the > system. Incidentally, every time it comes up we recommend using CLUSTER or ALTER TABLE. And explaining the syntax for ALTER TABLE is always a bit fiddly. I wonder if it would make sense to add a "VACUUM REWRITE" which just did the same as the noop ALTER TABLE we're recommending people do anyways. Then we could have a HINT from VACUUM FULL which suggests considering VACUUM REWRITE. I would think this would be 8.4 stuff except if all we want it to do is a straight noop alter table it's pretty trivial. The hardest part is coming up with a name for it. -- Gregory Stark EnterpriseDB http://www.enterprisedb.com
On Fri, Aug 10, 2007 at 07:53:06PM +0100, Gregory Stark wrote: > > "Tom Lane" <tgl@sss.pgh.pa.us> writes: > > > Another argument is that VACUUM FULL is a dinosaur that should probably > > go away entirely someday (a view I believe you share); it should > > therefore not be allowed to drive the design of other parts of the > > system. > > Incidentally, every time it comes up we recommend using CLUSTER or ALTER > TABLE. And explaining the syntax for ALTER TABLE is always a bit fiddly. I > wonder if it would make sense to add a "VACUUM REWRITE" which just did the > same as the noop ALTER TABLE we're recommending people do anyways. Then we > could have a HINT from VACUUM FULL which suggests considering VACUUM REWRITE. > > I would think this would be 8.4 stuff except if all we want it to do is a > straight noop alter table it's pretty trivial. The hardest part is coming up > with a name for it. One question... should we have a vacuum variant that also reindexes? Or does that just naturally fall out of the rewrite? BTW, rewrite sounds fine to me... anything but full, which is constantly confused with a "full database vacuum". -- Decibel!, aka Jim Nasby decibel@decibel.org EnterpriseDB http://enterprisedb.com 512.569.9461 (cell)
On Fri, 2007-08-10 at 13:47 -0400, Tom Lane wrote: > "Simon Riggs" <simon@2ndquadrant.com> writes: > > On Fri, 2007-08-10 at 11:11 -0400, Tom Lane wrote: > >> If we do that, we don't actually need XLogAsyncCommitFlush() at the start. > >> I'm inclined to remove it just because it's ugly. Comments? > > > I'm not clear why XLogAsyncCommitFlush() is ugly. It's one line of code > > that doesn't make anything else any harder. Apart from system catalogs, > > doing it that way would be error free for all normal VFs. > > Please recall that the failure that started this thread was on a regular > user table. To do what you want will introduce what I'm now thinking > is an unacceptable amount of fragility. In particular your patch of > yesterday to force hint-bit setting in VF scares the heck out of me. > > The reason why XLogAsyncCommitFlush() is ugly is that it doesn't > actually accomplish a darn thing, as we now see from this failure: > it does not guarantee that hint bits will get set, because of the > inexact bookkeeping in clog.c. It might marginally improve the > probability that they get set, but that's all. The reason I want > to take it out is mainly that its very existence tempts people to make > the same mistake that was made here. I think this *can* work, but I accept you don't like it, even if I'm not sure why. The bug was in the assumption that all flushed async commits can be confirmed to be flushed, which isn't true, not in the flush itself. If VACUUM FULL has problems with catalog tables, then that is an existing bug, not one introduced by the async patch. Catalog tables might be unlocked and yet have uncommitted transactions in them, which violates the assumption in the current VF code that all hint bits will be set prior to repair_frag(). But the comments in vacuum.c line 1821 say "A tuple is either: (a) a tuple in a system catalog, inserted or deleted by a not yet committed transaction... in case (a) we would not be in repair_frag() at all" (it doesn't give a reason). If the current comments are correct, then we're OK to fix this by having a special case HeapTupleSatisfies, maybe coded slightly differently. If the current comments are false, in that (a) is possible, then VF has a pre-existing bug, that *must* be fixed in the way you suggest, but that has nothing to do with async. ...but... > Another argument is that VACUUM FULL is a dinosaur that should probably > go away entirely someday (a view I believe you share); it should > therefore not be allowed to drive the design of other parts of the > system. You're right. Let's make that day today. I vote to completely replace VF now with a cluster-style rebuild of the table. That way we won't have to keep fixing something we're gonna kill anyway. It would be better to release 8.3 with a new, clean, fast implementation of VF, than to release it with code that we *think* is right, but has so far proved a source of some truly obscure bugs. -- Simon Riggs EnterpriseDB http://www.enterprisedb.com
"Simon Riggs" <simon@2ndquadrant.com> writes: > It would be better to release 8.3 with a new, clean, fast implementation > of VF, than to release it with code that we *think* is right, but has so > far proved a source of some truly obscure bugs. Well building a suitable replacement for VACUUM FULL is more work than I was proposing. The no-op ALTER TABLE rebuild still has the disadvantage of requiring 2x the space taken by the table (and potentially more if you have large indexes). I also think it's a safer course of action to create a new command, spend one or two releases improving it until we're sure it meets everyone's needs, then drop the old command. -- Gregory Stark EnterpriseDB http://www.enterprisedb.com
Gregory Stark <stark@enterprisedb.com> writes: > "Simon Riggs" <simon@2ndquadrant.com> writes: >> It would be better to release 8.3 with a new, clean, fast implementation >> of VF, than to release it with code that we *think* is right, but has so >> far proved a source of some truly obscure bugs. > Well building a suitable replacement for VACUUM FULL is more work than I was > proposing. Reimplementing VACUUM FULL for 8.3 is right out :-(. I do want to ship a release in calendar 2007 ... > I also think it's a safer course of action to create a new command, spend one > or two releases improving it until we're sure it meets everyone's needs, then > drop the old command. Agreed. While I'd like to see V.F. dead, we need a proven replacement first. (At the same time, we shouldn't spend more effort on V.F. than we have to.) regards, tom lane
Gregory Stark <stark@enterprisedb.com> writes: > Incidentally, every time it comes up we recommend using CLUSTER or ALTER > TABLE. And explaining the syntax for ALTER TABLE is always a bit fiddly. I > wonder if it would make sense to add a "VACUUM REWRITE" which just did the > same as the noop ALTER TABLE we're recommending people do anyways. Then we > could have a HINT from VACUUM FULL which suggests considering VACUUM REWRITE. Not that syntax, please :-(. The trouble with VACUUM [adjective] is that "adjective" has to become a fully reserved keyword, else the parser can't tell it from a table name. This is all right for FULL because that's a reserved word anyway due to the outer join syntax, but I really don't want to do it for any words that aren't otherwise reserved. regards, tom lane
Gregory Stark <stark@enterprisedb.com> writes: > "Tom Lane" <tgl@sss.pgh.pa.us> writes: >> The reason why XLogAsyncCommitFlush() is ugly is that it doesn't >> actually accomplish a darn thing, as we now see from this failure: >> it does not guarantee that hint bits will get set, because of the >> inexact bookkeeping in clog.c. It might marginally improve the >> probability that they get set, but that's all. The reason I want >> to take it out is mainly that its very existence tempts people to make >> the same mistake that was made here. > I don't understand your reasoning here and I would like to understand it so if > you don't mind I would like to see if I can work out what you're talking > about. Well, both Simon and I thought that XLogAsyncCommitFlush would allow VACUUM FULL to assume that hint bits were good. We were both wrong about that, and in hindsight that goes against the whole trend of PG development on this topic: hint bits are hints, full stop. I think that having XLogAsyncCommitFlush in the API will just encourage people to use it when they shouldn't. > As far as I understand the Xlog flush in combination with keeping an exclusive > lock on table and always holding locks until the end of the transaction make > forcing the hint bits entirely safe. I don't currently see a hole in that line of reasoning, but it's a bit longer chain of reasoning than I like for a critical correctness property. Especially when we have a boatload of code that violates the last assumption, including deeply-embedded APIs (heap_close) that make it easy to violate the assumption. We could maybe go out next week and fix all the core code to not release any locks early, but what of third-party backend add-ons? Worse, might we not regret the change later due to increased deadlock risks? > By "marginally improve the probability" you're making a judgement of the > probability that programmers will manage to maintain all those properties > consistently, not about the probability that the race will occur at run-time? No, I was thinking about the latter. The current CVS tip code doesn't have a huge window between logically committing a transaction and being able to set hint bits for it. In situations where you think "hmm, I just made a big update, maybe a VACUUM FULL would be a good idea", you'll be quite safe by the time you've managed to type VACUUM FULL. A V.F. that is automatically issued while a lot of other stuff is going on will be at larger risk of having the defragment step disabled, but at the same time it's not obvious that anyone will care much. regards, tom lane