Thread: [HACKERS] ginInsertCleanup called from vacuum could still miss tuples to be deleted
[HACKERS] ginInsertCleanup called from vacuum could still miss tuples to be deleted
From
Masahiko Sawada
Date:
Hi,

Commit e2c79e14 prevented multiple cleanup processes for the pending list in a GIN index. But I think there is still a possibility that vacuum could miss tuples to be deleted if someone else is cleaning up the pending list.

In ginInsertCleanup(), we lock the GIN meta page with LockPage() and could wait for a concurrent cleanup process if stats == NULL. And the source code comment says this happens when ginInsertCleanup() is called by [auto]vacuum/analyze or gin_clean_pending_list(). I agree with this behavior. However, looking at the callers, stats is NULL only either if the pending list exceeds the threshold during insertion, or if only analyzing is performed by an autovacuum worker or the ANALYZE command. So I think it should be inVacuum = (stats != NULL) instead. Also, we might want autoanalyze and the ANALYZE command to wait for a concurrent process as well. The attached patch fixes these two issues. If this is a bug we should back-patch it to 9.6.

void
ginInsertCleanup(GinState *ginstate, bool full_clean,
                 bool fill_fsm, IndexBulkDeleteResult *stats)
{
(snip)
    bool        inVacuum = (stats == NULL);

    /*
     * We would like to prevent concurrent cleanup process. For that we will
     * lock metapage in exclusive mode using LockPage() call. Nobody other
     * will use that lock for metapage, so we keep possibility of concurrent
     * insertion into pending list
     */

    if (inVacuum)
    {
        /*
         * We are called from [auto]vacuum/analyze or gin_clean_pending_list()
         * and we would like to wait concurrent cleanup to finish.
         */
        LockPage(index, GIN_METAPAGE_BLKNO, ExclusiveLock);
        workMemory =
            (IsAutoVacuumWorkerProcess() && autovacuum_work_mem != -1) ?
            autovacuum_work_mem : maintenance_work_mem;
    }
    else
    {
        /*
         * We are called from regular insert and if we see concurrent cleanup
         * just exit in hope that concurrent process will clean up pending
         * list.
         */
        if (!ConditionalLockPage(index, GIN_METAPAGE_BLKNO, ExclusiveLock))
            return;
        workMemory = work_mem;
    }

Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Attachment
Re: [HACKERS] ginInsertCleanup called from vacuum could still miss tuples to be deleted
From
Peter Geoghegan
Date:
On Mon, Nov 13, 2017 at 12:25 AM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> Commit e2c79e14 prevented multiple cleanup process for pending list in
> GIN index. But I think that there is still possibility that vacuum
> could miss tuples to be deleted if someone else is cleaning up the
> pending list.

I've been suspicious of that commit (and related commits) for a while now [1]. I think that it explains a couple of different problem reports that we have seen.

> In ginInsertCleanup(), we lock the GIN meta page by LockPage and could
> wait for the concurrent cleaning up process if stats == NULL. And the
> source code comment says that this happen is when ginINsertCleanup is
> called by [auto]vacuum/analyze or gin_clean_pending_list(). I agree
> with this behavior. However, looking at the callers the stats is NULL
> only either if pending list exceeds to threshold during insertions or
> if only analyzing is performed by an autovacum worker or ANALYZE
> command. So I think we should inVacuum = (stats != NULL) instead.
> Also, we might want autoanalyze and ANALYZE command to wait for
> concurrent process as well. Attached patch fixes these two issue. If
> this is a bug we should back-patch to 9.6.

How did you figure this out? Did you just notice that the code wasn't doing what it claimed to do, or was there a problem that you saw in production?

[1] https://postgr.es/m/CAH2-WzmtLXbs8+c19t1T=Rj0KyP7vK9q8hQJULgDLdVMuEeeUw@mail.gmail.com
--
Peter Geoghegan
Re: [HACKERS] ginInsertCleanup called from vacuum could still miss tuples to be deleted
From
Masahiko Sawada
Date:
On Tue, Nov 14, 2017 at 3:01 AM, Peter Geoghegan <pg@bowt.ie> wrote:
> On Mon, Nov 13, 2017 at 12:25 AM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>> Commit e2c79e14 prevented multiple cleanup process for pending list in
>> GIN index. But I think that there is still possibility that vacuum
>> could miss tuples to be deleted if someone else is cleaning up the
>> pending list.
>
> I've been suspicious of that commit (and related commits) for a while
> now [1]. I think that it explains a couple of different problem
> reports that we have seen.

Yeah. The problem here is that vacuum and analyze don't acquire the heavyweight lock on the meta page using the proper function; it seems unrelated to that problem.

>> In ginInsertCleanup(), we lock the GIN meta page by LockPage and could
>> wait for the concurrent cleaning up process if stats == NULL. And the
>> source code comment says that this happen is when ginINsertCleanup is
>> called by [auto]vacuum/analyze or gin_clean_pending_list(). I agree
>> with this behavior. However, looking at the callers the stats is NULL
>> only either if pending list exceeds to threshold during insertions or
>> if only analyzing is performed by an autovacum worker or ANALYZE
>> command. So I think we should inVacuum = (stats != NULL) instead.
>> Also, we might want autoanalyze and ANALYZE command to wait for
>> concurrent process as well. Attached patch fixes these two issue. If
>> this is a bug we should back-patch to 9.6.
>
> How did you figure this out? Did you just notice that the code wasn't
> doing what it claimed to do, or was there a problem that you saw in
> production?

I just noticed it while surveying the GIN code for another patch [1].

[1] https://commitfest.postgresql.org/15/1133/

Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center
Re: [HACKERS] ginInsertCleanup called from vacuum could still miss tuples to be deleted
From
Peter Geoghegan
Date:
On Mon, Nov 13, 2017 at 5:07 PM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>> I've been suspicious of that commit (and related commits) for a while
>> now [1]. I think that it explains a couple of different problem
>> reports that we have seen.
>
> Yeah, the problem here is that vacuum and analyze don't acquire a
> heavy weight lock for meta page using properly function. it seems not
> relevant with that problem.

One thing that really bothers me about commit e2c79e14 is that LockPage() is called, not LockBuffer(). GIN had no LockPage() calls before that commit, and is now the only code in the entire system that calls LockPage()/ConditionalLockPage() (the hash am no longer uses page heavyweight locks following recent work there).

Historically, the only reason that an AM would ever call LockPage() instead of LockBuffer() (at least after LWLocks were introduced many years ago) was that there was a small though acceptable risk of deadlock for concurrent inserters. It would hardly ever happen, but the possibility could not be ruled out, so deadlock detection was required. This was definitely true of hash, which is one reason why it was a second-class index am for so long. I think this was even true of nbtree, up until about 15 years ago.

The high-level question that I have to ask about e2c79e14 is: can it deadlock? If not, why was LockPage() used at all? It seems like a bad sign that none of this is explained in the code.

My guess is that bugs in this area have caused data corruption (not just undetectable deadlock issues), which was reported and commented on elsewhere [1]; maybe this proposed fix of yours could have prevented that. But we clearly still need to do a careful analysis of e2c79e14, because it seems like it probably has fundamental design problems (it's not *just* buggy).

[1] https://postgr.es/m/CAH2-WznBt2+7qc65btjxNNwa9BW+jKEBgVjb=F+26iUQHMy+6A@mail.gmail.com
--
Peter Geoghegan
Re: [HACKERS] ginInsertCleanup called from vacuum could still miss tuples to be deleted
From
Eric Lam
Date:
Hi,
Please unsubscribe me.
thanks
Eric
Re: [HACKERS] ginInsertCleanup called from vacuum could still miss tuples to be deleted
From
Michael Paquier
Date:
On Tue, Nov 14, 2017 at 10:52 AM, Eric Lam <cpegeric@yahoo.com> wrote:
> Please unsubscribe me.

(Please do not hijack the threads)

Help yourself:
https://www.postgresql.org/community/lists/subscribe/
--
Michael
Re: [HACKERS] ginInsertCleanup called from vacuum could still miss tuples to be deleted
From
Masahiko Sawada
Date:
On Tue, Nov 14, 2017 at 10:48 AM, Peter Geoghegan <pg@bowt.ie> wrote:
> On Mon, Nov 13, 2017 at 5:07 PM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>>> I've been suspicious of that commit (and related commits) for a while
>>> now [1]. I think that it explains a couple of different problem
>>> reports that we have seen.
>>
>> Yeah, the problem here is that vacuum and analyze don't acquire a
>> heavy weight lock for meta page using properly function. it seems not
>> relevant with that problem.
>
> One thing that really bothers me about commit e2c79e14 is that
> LockPage() is called, not LockBuffer(). GIN had no LockPage() calls
> before that commit, and is now the only code in the entire system that
> calls LockPage()/ConditionalLockPage() (the hash am no longer uses
> page heavyweight locks following recent work there).
>
> Historically, the only reason that an AM would ever call LockPage()
> instead of LockBuffer() (at least after LWLocks were introduced many
> years ago) was that there was a small though acceptable risk of
> deadlock for concurrent inserters. It would hardly ever happen, but
> the possibility could not be ruled out, so deadlock detection was
> required. This was definitely true of hash, which is one reason why it
> was a second class index am for so long. I think this was even true of
> nbtree, up until about 15 years ago.
>
> The high level question that I have to ask about e2c79e14 is: can it
> deadlock? If not, why was LockPage() used at all? It seems like a bad
> sign that none of this is explained in the code.

I'm not sure of the exact reason, but I guess that because we keep holding the lock on the meta page while cleaning up the pending list, we would eliminate the possibility of concurrent insertion into the pending list if we acquired the meta page lock using LockBuffer() instead. This is explained in ginInsertCleanup():

    /*
     * We would like to prevent concurrent cleanup process. For that we will
     * lock metapage in exclusive mode using LockPage() call. Nobody other
     * will use that lock for metapage, so we keep possibility of concurrent
     * insertion into pending list
     */

So I conjecture that this was introduced not because we need to detect deadlock, but because we need a lock different from the one used by insertion into the pending list.

> My guess is that bugs in this area have caused data corruption (not
> just undetectable deadlock issues), which was reported and commenting
> on elsewhere [1]; maybe this proposed fix of yours could have
> prevented that. But we clearly still need to do a careful analysis of
> e2c79e14, because it seems like it probably has fundamental design
> problems (it's not *just* buggy).

Yeah, I agree. I'll manage to get the time to look carefully at the GIN code related to fast insertion and cleanup of the pending list.

Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center
Re: [HACKERS] ginInsertCleanup called from vacuum could still miss tuples to be deleted
From
Peter Geoghegan
Date:
On Mon, Nov 13, 2017 at 6:56 PM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> /*
>  * We would like to prevent concurrent cleanup process. For that we will
>  * lock metapage in exclusive mode using LockPage() call. Nobody other
>  * will use that lock for metapage, so we keep possibility of concurrent
>  * insertion into pending list
>  */
>
> So I conjecture that this has been introduced for not the reason why
> we need to detect deadlock but the reason why we need to a different
> lock from the lock used by insertion into pending list.

I understood that much, but I think that we need to detect problems and recover from them (something like _bt_page_recyclable()), rather than preventing them with pessimistic locking -- or, at least, there is no reason I know to think that the HW lock is sufficient, and I am tempted to go that way to fix this. Commit e9568083, which added the feature that led to commit e2c79e14, may itself be the basic problem here.

Another complication is that 218f5158 changed things here again. It's possible that the report about 10 [1] (not the undetected deadlock report) involved that commit, but either way we have big problems here. GIN doesn't have something like nbtree's RecentGlobalXmin/_bt_page_recyclable() interlock.

I don't know that much about GIN (I would like to learn more), but I will try to break this down, for myself, and for others. We know:

* There are some cases in which we simply don't ever delete pages (in the entry tree), so that's obviously why there is no need for a RecentGlobalXmin interlock there. Easy.

* According to the GIN README, the pending list cleanup by VACUUM has a super-exclusive lock on the root, to block out concurrent inserters (that hold a pin on the root throughout). That's why it was/is okay that VACUUM recycled pending list pages without a RecentGlobalXmin interlock. Not so easy, but still not hard.

* Commit e9568083 and follow-ups let inserters free space from deleted pages. Why should inserters not need a super-exclusive lock, just like VACUUM? The README still says that they need one! This is the first time we ever called RecordFreeIndexPage() from anywhere that is not strictly VACUUM code. That feels wrong to me. The obvious solution is to not recycle in inserters' calls to ginInsertCleanup(), but I have no idea if that will work, in part because I don't understand GIN very well.

> Yeah, I agree. I'll manage to get the time to look carefully at the
> GIN code related to fast insertion and cleanup pending list.

Cool. I'll try to spare some more time for this, too.

[1] https://www.postgresql.org/message-id/A5C88C48-4573-453B-BF52-3B34A97D7A0A%40xmission.com
--
Peter Geoghegan
Re: [HACKERS] ginInsertCleanup called from vacuum could still miss tuples to be deleted
From
Robert Haas
Date:
On Mon, Nov 13, 2017 at 3:25 AM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> In ginInsertCleanup(), we lock the GIN meta page by LockPage and could
> wait for the concurrent cleaning up process if stats == NULL. And the
> source code comment says that this happen is when ginINsertCleanup is
> called by [auto]vacuum/analyze or gin_clean_pending_list(). I agree
> with this behavior. However, looking at the callers the stats is NULL
> only either if pending list exceeds to threshold during insertions or
> if only analyzing is performed by an autovacum worker or ANALYZE
> command. So I think we should inVacuum = (stats != NULL) instead.

Argh. Yeah, that looks wrong.

Instead of relying on this indirect method, how about passing an explicit inVacuum argument to that function? And while we're at it, how about renaming inVacuum to forceCleanup?
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Re: [HACKERS] ginInsertCleanup called from vacuum could still miss tuples to be deleted
From
Masahiko Sawada
Date:
On Thu, Nov 16, 2017 at 11:24 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Mon, Nov 13, 2017 at 3:25 AM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>> In ginInsertCleanup(), we lock the GIN meta page by LockPage and could
>> wait for the concurrent cleaning up process if stats == NULL. And the
>> source code comment says that this happen is when ginINsertCleanup is
>> called by [auto]vacuum/analyze or gin_clean_pending_list(). I agree
>> with this behavior. However, looking at the callers the stats is NULL
>> only either if pending list exceeds to threshold during insertions or
>> if only analyzing is performed by an autovacum worker or ANALYZE
>> command. So I think we should inVacuum = (stats != NULL) instead.
>
> Argh. Yeah, that looks wrong.
>
> Instead of relying on this indirect method, how about passing an
> explicit inVacuum argument to that function? And while we're at it,
> how about renaming inVacuum to forceCleanup?

Agreed, that's better. Attached updated patch. Also I've added this to the next CF so as not to forget.

Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center
Attachment
Re: [HACKERS] ginInsertCleanup called from vacuum could still miss tuples to be deleted
From
Robert Haas
Date:
On Thu, Nov 16, 2017 at 7:08 AM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> Agreed, that's better. Attached updated patch.
> Also I've added this to the next CF so as not to forget.

Committed and back-patched. While I'm fairly sure this is a correct fix, post-commit review from someone who knows GIN better than I do would be a great idea.

I am also not clear on what the consequences of this bug are. It seems like it could harm insert performance by making us wait when we shouldn't, but can it cause corruption? That I'm not sure about. If there's already a cleanup of the pending list in progress, how do things go wrong?
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Re: [HACKERS] ginInsertCleanup called from vacuum could still miss tuples to be deleted
From
Peter Geoghegan
Date:
On Thu, Nov 16, 2017 at 12:29 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> I am also not clear on what the consequences of this bug are. It seems
> like it could harm insert performance by making us wait when we
> shouldn't, but can it cause corruption? That I'm not sure about. If
> there's already a cleanup of the pending list in progress, how do
> things go wrong?

While it was probably the right thing for you to go ahead and commit the patch, I think that the bug undermines whatever confidence there was in the correctness of the GIN pending list cleanup code. I have far more questions than answers right now, though.
--
Peter Geoghegan
Re: [HACKERS] ginInsertCleanup called from vacuum could still miss tuples to be deleted
From
Robert Haas
Date:
On Mon, Nov 13, 2017 at 8:48 PM, Peter Geoghegan <pg@bowt.ie> wrote:
> One thing that really bothers me about commit e2c79e14 is that
> LockPage() is called, not LockBuffer(). GIN had no LockPage() calls
> before that commit, and is now the only code in the entire system that
> calls LockPage()/ConditionalLockPage() (the hash am no longer uses
> page heavyweight locks following recent work there).

I would like to get rid of that LockPage() call, for sure, because it's problematic in terms of allowing writes in parallel mode. However, I think the reason here is the same as why the hash AM used to use them. If you use a buffer lock, you really can't -- or shouldn't, at least -- hold it across a whole series of operations, because anyone waiting for that lock is waiting *uninterruptibly*. The hash AM wanted to iterate through all of the pages in a bucket chain and do something to each one while preventing concurrent scans; the need here is similar. Aside from the uninterruptible-wait problem, such coding patterns are extremely prone to deadlock. If there's any chance that a process waiting for the buffer lock you hold might be holding a buffer lock you try to acquire, you have got a problem.

I don't view the use of LockPage() here as wrong or scary. I find that it's impeding some of my own development goals, but that's not to say it wasn't a good choice for the problem that they were trying to solve. Had they solved it some other way, parallel writes might be easier, but you can't win 'em all.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Re: [HACKERS] ginInsertCleanup called from vacuum could still miss tuples to be deleted
From
Peter Geoghegan
Date:
On Thu, Nov 16, 2017 at 12:44 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> I would like to get rid of that LockPage() call, for sure, because
> it's problematic in terms of allowing writes in parallel mode.
> However, I think the reason here is the same as why the hash AM used
> to use them. If you use a buffer lock, you really can't -- or
> shouldn't, at least -- hold it across a whole series of operations,
> because anyone waiting for that lock is waiting *uninterruptibly*.

I'm well aware of that.

> The hash AM wanted to iterate through all of the pages in a bucket
> chain and do something to each one while preventing concurrent scans;
> the need here is similar. Aside from the uninterruptible-wait
> problem, such coding patterns are extremely prone to deadlock. If
> there's any chance that a process waiting for the buffer lock you hold
> might be holding a buffer lock you try to acquire, you have got a
> problem.

But there is only ever one page locked, the meta-page. And, it's always an ExclusiveLock. I don't see much use for deadlock avoidance. In any case, it's unusual to have a patch that uses LockPage() without explaining the choice.
--
Peter Geoghegan
Re: [HACKERS] ginInsertCleanup called from vacuum could still miss tuples to be deleted
From
Peter Geoghegan
Date:
On Thu, Nov 16, 2017 at 12:49 PM, Peter Geoghegan <pg@bowt.ie> wrote:
> But there is only ever one page locked, the meta-page. And, it's
> always an ExclusiveLock. I don't see much use for deadlock avoidance.

I meant deadlock detection.
--
Peter Geoghegan
Re: [HACKERS] ginInsertCleanup called from vacuum could still miss tuples to be deleted
From
Jeff Janes
Date:
On Mon, Nov 13, 2017 at 8:08 PM, Peter Geoghegan <pg@bowt.ie> wrote:
> On Mon, Nov 13, 2017 at 6:56 PM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>> /*
>>  * We would like to prevent concurrent cleanup process. For that we will
>>  * lock metapage in exclusive mode using LockPage() call. Nobody other
>>  * will use that lock for metapage, so we keep possibility of concurrent
>>  * insertion into pending list
>>  */
>>
>> So I conjecture that this has been introduced for not the reason why
>> we need to detect deadlock but the reason why we need to a different
>> lock from the lock used by insertion into pending list.
>
> I understood that much, but I think that we need to detect problems
> and recover from them (something like _bt_page_recyclable()), rather
> than preventing them with pessimistic locking -- or, at least, there
> is no reason I know to think that the HW lock is sufficient, and I am
> tempted to go that way to fix this. Commit e9568083, which added the
> feature that led to commit e2c79e14, may itself be the basic problem
> here.

e2c79e14 was to fix a pre-existing bug, but probably e9568083 made that bug easier to hit than it was before. (Which is not to say that e9568083 can't contain bugs of its own, of course)

> * According to the GIN README, the pending list cleanup by VACUUM has
> a super-exclusive lock on the root, to block out concurrent inserters
> (that hold a pin on the root throughout). That's why it was/is okay
> that VACUUM recycled pending list pages without a RecentGlobalXmin
> interlock. Not so easy, but still not hard.

The only reference to super-exclusive lock in src/backend/access/gin/README, that I can find, is about posting trees, not pending lists. Can you quote or give line numbers of the section you are referring to?
Cheers,
Jeff
Re: [HACKERS] ginInsertCleanup called from vacuum could still miss tuples to be deleted
From
Peter Geoghegan
Date:
On Thu, Nov 16, 2017 at 2:16 PM, Jeff Janes <jeff.janes@gmail.com> wrote:
> The only reference to super-exclusive lock in src/backend/access/gin/README,
> that I can find, is about posting trees, not pending lists. Can you quote
> or give line numbers of the section you are referring to?

That's the section I was referring to. I apologize for being unclear.

I would like to see a *general* explanation of why it's okay that the pending list can have its pages recycled early. How can we be sure that somebody looking through the pending list won't land on a just-recycled page and somehow get confused?

We see this for the entry tree (no deletion is possible in the first place), and we also see it for the posting tree (the dance with inserters having a pin on the root, and so on). Not mentioning why pending list recycling is safe in terms of locking protocols seems like an omission that should be addressed. I'm no GIN expert, but I would expect this because the pending list seems like a kind of extension of the posting tree.

As I've said, it feels slightly off to me that a user backend that is performing opportunistic cleanup during insertion can call RecordFreeIndexPage() in GIN.
--
Peter Geoghegan
Re: [HACKERS] ginInsertCleanup called from vacuum could still miss tuples to be deleted
From
Peter Geoghegan
Date:
On Thu, Nov 16, 2017 at 2:16 PM, Jeff Janes <jeff.janes@gmail.com> wrote:
> e2c79e14 was to fix a pre-existing bug, but probably e9568083 made that bug
> easier to hit than it was before. (Which is not to say that e9568083 can't
> contain bugs of its own, of course)

I get that. I think that the same thing may have happened again, in fact. A pending list recycling interlock bug may have merely been unmasked by your commit e9568083; I'm speculating that your commit made it more likely to hit in practice, just as with the earlier bug you mentioned. As you know, there is a significant history of VACUUM page recycling bugs in GIN. I wouldn't be surprised if we found one more in the pending list logic.

Consider commit ac4ab97e, a bugfix from late 2013 for posting list page deletion, where the current posting list locking protocol was more or less established. The commit message says:

"""
... If a page is deleted, and reused for something else, just as a search is following a rightlink to it from its left sibling, the search would continue scanning whatever the new contents of the page are. That could lead to incorrect query results, or even something more curious if the page is reused for a different kind of a page.

To fix, modify the search algorithm to lock the next page before releasing the previous one, and refrain from deleting pages from the leftmost branch of the [posting] tree. ...
"""

I believe that this commit had GIN avoid deletion of the leftmost branch of posting trees because that makes it safe to get to the posting tree root from a raw root block number (a raw block number from the B-tree index constructed over key values). The buffer lock pairing/crabbing this commit adds to posting lists (similar to the pairing/crabbing that pending lists had from day one, when first added in 2011) can prevent a concurrent deletion once you reach the posting tree root. But, as I said, you still need to reliably get to the root in the first place, which the "don't delete leftmost" rule ensures (if you can never delete it, you can never recycle it, and no reader gets confused).

It's not clear why it's safe to recycle the pending list pages at all (though deletion alone might well be okay, because readers can notice a page is deleted if it isn't yet recycled). Why is it okay that we can jump to the first block from the meta page in the face of concurrent recycling?

Looking at scanPendingInsert(), it does appear that readers do buffer lock crabbing/coupling for the meta page and the first pending page. So that's an encouraging sign. However, ginInsertCleanup() *also* uses shared buffer locks for both meta page and list head page (never exclusive buffer locks) in exactly the same way when first establishing what block is at the head of the list to be zapped. It's acting as if it is a reader, but it's actually a writer. I don't follow why this is correct. It looks like scanPendingInsert() can decide that it knows where the head of the pending list is *after* ginInsertCleanup() has already decided that that same head of the list block needs to possibly be deleted/recycled.

Have I missed something?

We really ought to make the asserts on gin page type into "can't happen" elog errors, as a defensive measure, in both scanPendingInsert() and ginInsertCleanup(). The 2013 bug fix actually did this for the posting tree, but didn't touch similar pending list code.
--
Peter Geoghegan
Re: [HACKERS] ginInsertCleanup called from vacuum could still miss tuples to be deleted
From
Jeff Janes
Date:
On Thu, Nov 16, 2017 at 12:29 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Thu, Nov 16, 2017 at 7:08 AM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>> Agreed, that's better. Attached updated patch.
>> Also I've added this to the next CF so as not to forget.
>
> Committed and back-patched. While I'm fairly sure this is a correct
> fix, post-commit review from someone who knows GIN better than I do
> would be a great idea.
>
> I am also not clear on what the consequences of this bug are. It seems
> like it could harm insert performance by making us wait when we
> shouldn't, but can it cause corruption? That I'm not sure about. If
> there's already a cleanup of the pending list in progress, how do
> things go wrong?

It can cause corruption. Attached is a test case to show it. It is based on my usual crash-recovery test, but to get corruption I had to turn off the crash-injection (which means the test can be run on an unpatched server) and increase gin_pending_list_limit.
On 8 CPU, it took anywhere from 15 minutes to 8 hours to see corruption, which presents as something like this in the script output:
child abnormal exit update did not update 1 row: key 8117 updated 2 at count.pl line 193.\n at count.pl line 201.
Your commit has apparently fixed it, but I will run some more extensive tests.
The source of the corruption is this:
A tuple is inserted into the pending list. It is quickly updated again, or deleted, and then passes the dead-to-all horizon, so is eligible to vacuum.
A user back end takes the lock to start cleaning the pending list.
A vacuum back end attempts the lock, but due to the bug, it does so conditionally and then decides it is done when that fails, and starts the index vacuum proper.
The user back end reaches the dead tuple in the pending list, and inserts into the main part of the index, in a part that the vacuum has already passed over.
The vacuum goes back to the table and marks the tid slot as re-usable, even though there is still an index tuple pointing to it.
Cheers,
Jeff
Attachment
Re: [HACKERS] ginInsertCleanup called from vacuum could still miss tuples to be deleted
From
Jeff Janes
Date:
On Thu, Nov 16, 2017 at 2:43 PM, Peter Geoghegan wrote:
> On Thu, Nov 16, 2017 at 2:16 PM, Jeff Janes wrote:
> > The only reference to super-exclusive lock in
> > src/backend/access/gin/README, that I can find, is about posting trees,
> > not pending lists. Can you quote or give line numbers of the section
> > you are referring to?
> >
> That's the section I was referring to. I apologize for being unclear.
>
> I would like to see a *general* explanation of why it's okay that the
> pending list can have its pages recycled early. How can we be sure
> that somebody looking through the pending list won't land on a
> just-recycled page and somehow get confused?
>
> We see this for the entry tree (no deletion is possible in the first
> place), and we also see it for the posting tree (the dance with
> inserters having a pin on the root, and so on). Not mentioning why
> pending list recycling is safe in terms of locking protocols seems
> like an omission that should be addressed. I'm no GIN expert, but I
> would expect this because the pending list seems like a kind of
> extension of the posting tree.
>
The pending list follows the same rule as the right-links on the posting
trees:
"To prevent a backend from reaching a deleted page via a right-link, when
following a right-link the lock on the previous page is not released until
the lock on next page has been acquired."
This lock-chaining rule is also followed when stepping from the meta page
to the first page in the pending list.
(Yes, neither of these is mentioned in the README.)
Since pending lists don't have downlinks, the dance with the
super-exclusive locks is not needed.
I think what is true for deleted pages should also be sufficient for
deleted-and-freed-and-recycled pages. Since you are only allowed to follow
a link from a locked buffer and not from private memory, you can't follow a
stale link as long as the links are updated before the page is deleted and
then recycled.
Is this convincing?
Also missing is an argument for the deadlock-freeness. I think what is
needed for that is that you never lock a page earlier in the pending list,
nor the metapage, while holding a lock on a pending list page. (Which as
far as I can tell is true, but I don't know how to exhaustively verify it)
> As I've said, it feels slightly off to me that a user backend that is
> performing opportunistic cleanup during insertion can call
> RecordFreeIndexPage() in GIN.
I don't know how to address that, because I don't know why it seems off.
No one else does this because no one else generates large numbers of free
pages during insertions. If we ever implement something like fractal
indexes for BTrees, then we would need to do this from the BTree insertion
(or clean-up-from-insertion) code as well. There are no warnings in the .h
or .c for RecordFreeIndexPage() about where one is allowed to call it
from. I don't see any hazards of this beyond those associated with
deleting the page in the first place. Any deleted page unpinned can be
recycled at any time by a vacuum that comes along while you are swapped
out, no?
Cheers,
Jeff
Re: [HACKERS] ginInsertCleanup called from vacuum could still miss tuples to be deleted
From
Peter Geoghegan
Date:
On Fri, Nov 24, 2017 at 6:09 PM, Jeff Janes <jeff.janes@gmail.com> wrote:
>> We see this for the entry tree (no deletion is possible in the first
>> place), and we also see it for the posting tree (the dance with
>> inserters having a pin on the root, and so on). Not mentioning why
>> pending list recycling is safe in terms of locking protocols seems
>> like an omission that should be addressed. I'm no GIN expert, but I
>> would expect this because the pending list seems like a kind of
>> extension of the posting tree.
>
> The pending list follows the same rule as the right-links on the posting
> trees:
>
> "To prevent a backend from reaching a deleted page via a right-link, when
> following a right-link the lock on the previous page is not released until
> the lock on next page has been acquired."
>
> This lock-chaining rule is also followed when stepping from the meta page to
> the first page in the pending list.
>
> (Yes, neither of these are mentioned in the README).

But, as I said, we lock-chain with shared buffer locks in
ginInsertCleanup(). Not exclusive buffer locks. (We also take the HW
lock on the metapage so that ginInsertCleanup() callers block each
other out.)

With B-Tree index structures that don't use Lehman & Yao's technique,
in general, we must do lock coupling. There is, in general, no point
in lock-chaining at all if there is no exclusive buffer locker
(writer) involved at some point. Writers must lock with exclusive
buffer locks from the beginning of the chain (e.g. root) for
traditional coupling to be correct.

However, the only exclusive buffer lock in ginInsertCleanup() is taken
on the metapage at quite a late point in cleanup, just before
insertion of pending items is performed. Crucially, this comes after
we've decided what the head and tail of the pending list are. The
metapage HW lock is sufficient for writers to block writers, but
AFAICT it is not sufficient for writers to block readers.
> Since pending lists don't have downlinks, the dance with the super-exclusive
> locks is not needed.

Isn't the pending list's head block number in the metapage very much
like a downlink?

I think I would be convinced by what you say here if we held an
exclusive buffer lock on the metapage from the beginning within
ginInsertCleanup() (we'd then remove the LockPage() HW lock calls
entirely). We don't, though.

I'm not proposing that we actually change the code in that way. My
point is that that looks like it would make the code correct from the
angle I'm looking at it, which may make my thinking clearer to you.

> I think what is true for deleted pages should also be sufficient for
> deleted-and-freed-and-recycled pages. Since you are only allowed to follow
> a link from a locked buffer and not from private memory, you can't follow a
> stale link as long as the links are updated before the page is deleted and
> then recycled.
>
> Is this convincing?

In general, yes, but the devil is in the details. I very much like the
"pending list is analogous to posting list" framing. I just doubt that
we get all the details right in practice.

> Also missing is an argument for the deadlock-freeness. I think what is
> needed for that is that you never lock a page earlier in the pending list,
> nor the metapage, while holding a lock on a pending list page. (Which as
> far as I can tell is true, but I don't know how to exhaustively verify it)

Defensively checking the page type (pending, posting, etc) within
scanPendingInsert() would be a good start. That's already something
that we do for posting trees. If we're following the same rules as
posting trees (which sounds like a good idea), then we should have the
same defenses. Same applies to using elogs within ginInsertCleanup()
-- we should promote those Assert()s to elog()s.
I also suggest that you add a random, artificial sleep within
ginInsertCleanup(), like this:

    /*
     * Read and lock head of pending list
     */
    blkno = metadata->head;
    // Random sleep goes here
    buffer = ReadBuffer(index, blkno);
    LockBuffer(buffer, GIN_SHARE);
    page = BufferGetPage(buffer);

If there is a race here, then this should make it more likely to hit.

I should remind you that deadlocks might be caused by recycling races.
While we should explain why pending list cleaning is deadlock-free, as
you suggest, reports of deadlocks might not be due to a flaw in
locking as such.

>> As I've said, it feels slightly off to me that a user backend that is
>> performing opportunistic cleanup during insertion can call
>> RecordFreeIndexPage() in GIN.
>
> I don't know how to address that, because I don't know why it seems off. No
> one else does this because no one else generates large number of free pages
> during insertions.

My point is that it seems like having little distinction between
deletion and recycling is problematic. If you don't have very strong
locking to block out all possible index scans (locking at some central
choke point), then you risk concurrent recycling races.

I am wearing my nbtree hat in assessing the GIN code, which admittedly
might not always be the best way of looking at it. I'm explaining my
perspective because I think it might be useful for you to understand
my thought process. I hope I don't come off as peevish.

--
Peter Geoghegan