Thread: [HACKERS] ginInsertCleanup called from vacuum could still miss tuples to be deleted

From: Masahiko Sawada
Date:
Hi,

Commit e2c79e14 prevented multiple cleanup processes from running on the
pending list of a GIN index at the same time. But I think there is still a
possibility that vacuum could miss tuples to be deleted if someone else is
cleaning up the pending list.

In ginInsertCleanup(), we lock the GIN meta page with LockPage() and can
wait for a concurrent cleanup process if stats == NULL. The source code
comment says that this case arises when ginInsertCleanup() is called by
[auto]vacuum/analyze or gin_clean_pending_list(). I agree with this
behavior. However, looking at the callers, stats is NULL only if the
pending list exceeds the threshold during insertion, or if only analyzing
is performed by an autovacuum worker or an ANALYZE command. So I think we
should use inVacuum = (stats != NULL) instead. Also, we might want
autoanalyze and the ANALYZE command to wait for a concurrent cleanup as
well. The attached patch fixes these two issues. If this is a bug, we
should back-patch it to 9.6.

void
ginInsertCleanup(GinState *ginstate, bool full_clean,
                 bool fill_fsm, IndexBulkDeleteResult *stats)
{

(snip)

    bool        inVacuum = (stats == NULL);

    /*
     * We would like to prevent concurrent cleanup process. For that we will
     * lock metapage in exclusive mode using LockPage() call. Nobody other
     * will use that lock for metapage, so we keep possibility of concurrent
     * insertion into pending list
     */

    if (inVacuum)
    {
        /*
         * We are called from [auto]vacuum/analyze or gin_clean_pending_list()
         * and we would like to wait concurrent cleanup to finish.
         */
        LockPage(index, GIN_METAPAGE_BLKNO, ExclusiveLock);
        workMemory =
            (IsAutoVacuumWorkerProcess() && autovacuum_work_mem != -1) ?
            autovacuum_work_mem : maintenance_work_mem;
    }
    else
    {
        /*
         * We are called from regular insert and if we see concurrent cleanup
         * just exit in hope that concurrent process will clean up pending
         * list.
         */
        if (!ConditionalLockPage(index, GIN_METAPAGE_BLKNO, ExclusiveLock))
            return;
        workMemory = work_mem;
    }
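
To be concrete, the first of the two fixes is just to flip that test (a
sketch only; please see the attached patch for the full change):

    bool        inVacuum = (stats != NULL); /* wait for a concurrent cleanup
                                             * only when called on behalf of
                                             * [auto]vacuum */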

Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center


Attachment
On Mon, Nov 13, 2017 at 12:25 AM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> Commit e2c79e14 prevented multiple cleanup processes from running on the
> pending list of a GIN index at the same time. But I think there is still a
> possibility that vacuum could miss tuples to be deleted if someone else is
> cleaning up the pending list.

I've been suspicious of that commit (and related commits) for a while
now [1]. I think that it explains a couple of different problem
reports that we have seen.

> In ginInsertCleanup(), we lock the GIN meta page with LockPage() and can
> wait for a concurrent cleanup process if stats == NULL. The source code
> comment says that this case arises when ginInsertCleanup() is called by
> [auto]vacuum/analyze or gin_clean_pending_list(). I agree with this
> behavior. However, looking at the callers, stats is NULL only if the
> pending list exceeds the threshold during insertion, or if only analyzing
> is performed by an autovacuum worker or an ANALYZE command. So I think we
> should use inVacuum = (stats != NULL) instead. Also, we might want
> autoanalyze and the ANALYZE command to wait for a concurrent cleanup as
> well. The attached patch fixes these two issues. If this is a bug, we
> should back-patch it to 9.6.

How did you figure this out? Did you just notice that the code wasn't
doing what it claimed to do, or was there a problem that you saw in
production?

[1] https://postgr.es/m/CAH2-WzmtLXbs8+c19t1T=Rj0KyP7vK9q8hQJULgDLdVMuEeeUw@mail.gmail.com
-- 
Peter Geoghegan


On Tue, Nov 14, 2017 at 3:01 AM, Peter Geoghegan <pg@bowt.ie> wrote:
> On Mon, Nov 13, 2017 at 12:25 AM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>> Commit e2c79e14 prevented multiple cleanup processes from running on the
>> pending list of a GIN index at the same time. But I think there is still a
>> possibility that vacuum could miss tuples to be deleted if someone else is
>> cleaning up the pending list.
>
> I've been suspicious of that commit (and related commits) for a while
> now [1]. I think that it explains a couple of different problem
> reports that we have seen.

Yeah, the problem here is that vacuum and analyze don't acquire the
heavyweight lock on the meta page using the proper function. It seems
unrelated to that problem, though.

>
>> In ginInsertCleanup(), we lock the GIN meta page with LockPage() and can
>> wait for a concurrent cleanup process if stats == NULL. The source code
>> comment says that this case arises when ginInsertCleanup() is called by
>> [auto]vacuum/analyze or gin_clean_pending_list(). I agree with this
>> behavior. However, looking at the callers, stats is NULL only if the
>> pending list exceeds the threshold during insertion, or if only analyzing
>> is performed by an autovacuum worker or an ANALYZE command. So I think we
>> should use inVacuum = (stats != NULL) instead. Also, we might want
>> autoanalyze and the ANALYZE command to wait for a concurrent cleanup as
>> well. The attached patch fixes these two issues. If this is a bug, we
>> should back-patch it to 9.6.
>
> How did you figure this out? Did you just notice that the code wasn't
> doing what it claimed to do, or was there a problem that you saw in
> production?
>

I just noticed it while surveying the GIN code for another patch [1].

[1] https://commitfest.postgresql.org/15/1133/

Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center


On Mon, Nov 13, 2017 at 5:07 PM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>> I've been suspicious of that commit (and related commits) for a while
>> now [1]. I think that it explains a couple of different problem
>> reports that we have seen.
>
> Yeah, the problem here is that vacuum and analyze don't acquire the
> heavyweight lock on the meta page using the proper function. It seems
> unrelated to that problem, though.

One thing that really bothers me about commit e2c79e14 is that
LockPage() is called, not LockBuffer(). GIN had no LockPage() calls
before that commit, and is now the only code in the entire system that
calls LockPage()/ConditionalLockPage() (the hash am no longer uses
page heavyweight locks following recent work there).

Historically, the only reason that an AM would ever call LockPage()
instead of LockBuffer() (at least after LWLocks were introduced many
years ago) was that there was a small though acceptable risk of
deadlock for concurrent inserters. It would hardly ever happen, but
the possibility could not be ruled out, so deadlock detection was
required. This was definitely true of hash, which is one reason why it
was a second class index am for so long. I think this was even true of
nbtree, up until about 15 years ago.

The high level question that I have to ask about e2c79e14 is: can it
deadlock? If not, why was LockPage() used at all? It seems like a bad
sign that none of this is explained in the code.

My guess is that bugs in this area have caused data corruption (not
just undetectable deadlock issues), which was reported and commenting
on elsewhere [1]; maybe this proposed fix of yours could have
prevented that. But we clearly still need to do a careful analysis of
e2c79e14, because it seems like it probably has fundamental design
problems (it's not *just* buggy).

[1] https://postgr.es/m/CAH2-WznBt2+7qc65btjxNNwa9BW+jKEBgVjb=F+26iUQHMy+6A@mail.gmail.com
-- 
Peter Geoghegan




On Tue, Nov 14, 2017 at 10:48 AM, Peter Geoghegan <pg@bowt.ie> wrote:
> On Mon, Nov 13, 2017 at 5:07 PM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>>> I've been suspicious of that commit (and related commits) for a while
>>> now [1]. I think that it explains a couple of different problem
>>> reports that we have seen.
>>
>> Yeah, the problem here is that vacuum and analyze don't acquire the
>> heavyweight lock on the meta page using the proper function. It seems
>> unrelated to that problem, though.
>
> One thing that really bothers me about commit e2c79e14 is that
> LockPage() is called, not LockBuffer(). GIN had no LockPage() calls
> before that commit, and is now the only code in the entire system that
> calls LockPage()/ConditionalLockPage() (the hash am no longer uses
> page heavyweight locks following recent work there).
>
> Historically, the only reason that an AM would ever call LockPage()
> instead of LockBuffer() (at least after LWLocks were introduced many
> years ago) was that there was a small though acceptable risk of
> deadlock for concurrent inserters. It would hardly ever happen, but
> the possibility could not be ruled out, so deadlock detection was
> required. This was definitely true of hash, which is one reason why it
> was a second class index am for so long. I think this was even true of
> nbtree, up until about 15 years ago.
>
> The high level question that I have to ask about e2c79e14 is: can it
> deadlock? If not, why was LockPage() used at all? It seems like a bad
> sign that none of this is explained in the code.

I'm not sure of the exact reason, but my guess is that, because we keep
holding the meta page lock while cleaning up the pending list, acquiring
that lock with LockBuffer() instead would eliminate the possibility of
concurrent insertion into the pending list. This is explained in
ginInsertCleanup():
    /*
     * We would like to prevent concurrent cleanup process. For that we will
     * lock metapage in exclusive mode using LockPage() call. Nobody other
     * will use that lock for metapage, so we keep possibility of concurrent
     * insertion into pending list
     */

So I conjecture that this was introduced not because we need deadlock
detection, but because we need a lock that is different from the one used
for insertion into the pending list.

>
> My guess is that bugs in this area have caused data corruption (not
> just undetectable deadlock issues), which was reported and commenting
> on elsewhere [1]; maybe this proposed fix of yours could have
> prevented that. But we clearly still need to do a careful analysis of
> e2c79e14, because it seems like it probably has fundamental design
> problems (it's not *just* buggy).
>

Yeah, I agree. I'll try to find the time to look carefully at the GIN
code related to fast insertion and pending list cleanup.

Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center


On Mon, Nov 13, 2017 at 6:56 PM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>    /*
>     * We would like to prevent concurrent cleanup process. For that we will
>     * lock metapage in exclusive mode using LockPage() call. Nobody other
>     * will use that lock for metapage, so we keep possibility of concurrent
>     * insertion into pending list
>     */
>
> So I conjecture that this was introduced not because we need deadlock
> detection, but because we need a lock that is different from the one used
> for insertion into the pending list.

I understood that much, but I think that we need to detect problems
and recover from them (something like _bt_page_recyclable()), rather
than preventing them with pessimistic locking -- or, at least, there
is no reason I know to think that the HW lock is sufficient, and I am
tempted to go that way to fix this. Commit e9568083, which added the
feature that led to commit e2c79e14, may itself be the basic problem
here. Another complication is that 218f5158 changed things here again.
It's possible that the report about 10 [1] (not the undetected
deadlock report) involved that commit, but either way we have big
problems here.

GIN doesn't have something like nbtree's
RecentGlobalXmin/_bt_page_recyclable() interlock. I don't know that
much about GIN (I would like to learn more), but I will try to break
this down, for myself, and for others. We know:

* There are some cases in which we simply don't ever delete pages (in
the entry tree), so that's obviously why there is no need for
RecentGlobalXmin interlock there. Easy.

* According to the GIN README, the pending list cleanup by VACUUM has
a super-exclusive lock on the root, to block out concurrent inserters
(that hold a pin on the root throughout). That's why it was/is okay
that VACUUM recycled pending list pages without a RecentGlobalXmin
interlock. Not so easy, but still not hard.

* Commit e9568083 and follow-ups let inserters free space from deleted
pages. Why should inserters not need a super-exclusive lock, just like
VACUUM? The README still says that they need one!

This is the first time we ever called RecordFreeIndexPage() from
anywhere that is not strictly VACUUM code. That feels wrong to me. The
obvious solution is to not recycle in inserter's calls to
ginInsertCleanup(), but I have no idea if that will work, in part
because I don't understand GIN very well.
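
In code terms, I imagine something roughly like this (a sketch only -- the
variable names are my guesses at the ginfast.c internals, and I haven't
verified that this is actually sufficient):

    /*
     * Only report freed pending-list pages to the FSM when cleanup runs on
     * behalf of VACUUM; an inserter doing opportunistic cleanup would leave
     * the deleted pages for a later VACUUM to record.
     */
    if (fill_fsm && inVacuum)
        RecordFreeIndexPage(index, freespace[i]);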

> Yeah, I agree. I'll try to find the time to look carefully at the GIN
> code related to fast insertion and pending list cleanup.

Cool. I'll try to spare some more time for this, too.

[1] https://www.postgresql.org/message-id/A5C88C48-4573-453B-BF52-3B34A97D7A0A%40xmission.com
-- 
Peter Geoghegan


On Mon, Nov 13, 2017 at 3:25 AM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> In ginInsertCleanup(), we lock the GIN meta page with LockPage() and can
> wait for a concurrent cleanup process if stats == NULL. The source code
> comment says that this case arises when ginInsertCleanup() is called by
> [auto]vacuum/analyze or gin_clean_pending_list(). I agree with this
> behavior. However, looking at the callers, stats is NULL only if the
> pending list exceeds the threshold during insertion, or if only analyzing
> is performed by an autovacuum worker or an ANALYZE command. So I think we
> should use inVacuum = (stats != NULL) instead.

Argh.  Yeah, that looks wrong.

Instead of relying on this indirect method, how about passing an
explicit inVacuum argument to that function?  And while we're at it,
how about renaming inVacuum to forceCleanup?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


On Thu, Nov 16, 2017 at 11:24 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Mon, Nov 13, 2017 at 3:25 AM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>> In ginInsertCleanup(), we lock the GIN meta page with LockPage() and can
>> wait for a concurrent cleanup process if stats == NULL. The source code
>> comment says that this case arises when ginInsertCleanup() is called by
>> [auto]vacuum/analyze or gin_clean_pending_list(). I agree with this
>> behavior. However, looking at the callers, stats is NULL only if the
>> pending list exceeds the threshold during insertion, or if only analyzing
>> is performed by an autovacuum worker or an ANALYZE command. So I think we
>> should use inVacuum = (stats != NULL) instead.
>
> Argh.  Yeah, that looks wrong.
>
> Instead of relying on this indirect method, how about passing an
> explicit inVacuum argument to that function?  And while we're at it,
> how about renaming inVacuum to forceCleanup?
>

Agreed, that's better. Attached updated patch.
Also I've added this to the next CF so as not to forget.
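
For reference, the revised interface looks roughly like this (a sketch;
the attached patch has the actual change):

void
ginInsertCleanup(GinState *ginstate, bool full_clean,
                 bool fill_fsm, bool forceCleanup,
                 IndexBulkDeleteResult *stats)
{

(snip)

    if (forceCleanup)
    {
        /* [auto]vacuum/analyze and gin_clean_pending_list() wait here */
        LockPage(index, GIN_METAPAGE_BLKNO, ExclusiveLock);
    }
    else
    {
        /* regular inserters just skip cleanup if someone else is at it */
        if (!ConditionalLockPage(index, GIN_METAPAGE_BLKNO, ExclusiveLock))
            return;
    }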

Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center

Attachment
On Thu, Nov 16, 2017 at 7:08 AM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> Agreed, that's better. Attached updated patch.
> Also I've added this to the next CF so as not to forget.

Committed and back-patched.  While I'm fairly sure this is a correct
fix, post-commit review from someone who knows GIN better than I do
would be a great idea.

I am also not clear on what the consequences of this bug are.  It seems
like it could harm insert performance by making us wait when we
shouldn't, but can it cause corruption?  That I'm not sure about.  If
there's already a cleanup of the pending list in progress, how do
things go wrong?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


On Thu, Nov 16, 2017 at 12:29 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> I am also not clear on what the consequences of this bug are.  It seems
> like it could harm insert performance by making us wait when we
> shouldn't, but can it cause corruption?  That I'm not sure about.  If
> there's already a cleanup of the pending list in progress, how do
> things go wrong?

While it was probably the right thing for you to go ahead and commit
the patch, I think that the bug undermines whatever confidence there
was in the correctness of the GIN pending list cleanup code. I have
far more questions than answers right now, though.


-- 
Peter Geoghegan


On Mon, Nov 13, 2017 at 8:48 PM, Peter Geoghegan <pg@bowt.ie> wrote:
> One thing that really bothers me about commit e2c79e14 is that
> LockPage() is called, not LockBuffer(). GIN had no LockPage() calls
> before that commit, and is now the only code in the entire system that
> calls LockPage()/ConditionalLockPage() (the hash am no longer uses
> page heavyweight locks following recent work there).

I would like to get rid of that LockPage() call, for sure, because
it's problematic in terms of allowing writes in parallel mode.
However, I think the reason here is the same as why the hash AM used
to use them.  If you use a buffer lock, you really can't -- or
shouldn't, at least -- hold it across a whole series of operations,
because anyone waiting for that lock is waiting *uninterruptibly*.
The hash AM wanted to iterate through all of the pages in a bucket
chain and do something to each one while preventing concurrent scans;
the need here is similar.  Aside from the uninterruptible-wait
problem, such coding patterns are extremely prone to deadlock.  If
there's any chance that a process waiting for the buffer lock you hold
might be holding a buffer lock you try to acquire, you have got a
problem.
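
To spell out the difference (a rough sketch; metabuffer and index are just
illustrative variable names):

    /*
     * Heavyweight page lock: the wait is interruptible and the deadlock
     * detector sees it, so it can be held across a long series of buffer
     * operations.
     */
    LockPage(index, GIN_METAPAGE_BLKNO, ExclusiveLock);
    /* ... arbitrarily long cleanup work ... */
    UnlockPage(index, GIN_METAPAGE_BLKNO, ExclusiveLock);

    /*
     * Buffer content lock: lightweight; anyone blocked on it waits
     * uninterruptibly, so it must only be held briefly.
     */
    LockBuffer(metabuffer, GIN_EXCLUSIVE);
    /* ... short critical section ... */
    LockBuffer(metabuffer, GIN_UNLOCK);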

I don't view the use of LockPage() here as wrong or scary.  I find
that it's impeding some of my own development goals, but that's not to
say it wasn't a good choice for the problem that they were trying to
solve.  Had they solved it some other way, parallel writes might be
easier, but you can't win 'em all.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


On Thu, Nov 16, 2017 at 12:44 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> I would like to get rid of that LockPage() call, for sure, because
> it's problematic in terms of allowing writes in parallel mode.
> However, I think the reason here is the same as why the hash AM used
> to use them.  If you use a buffer lock, you really can't -- or
> shouldn't, at least -- hold it across a whole series of operations,
> because anyone waiting for that lock is waiting *uninterruptibly*.

I'm well aware of that.

> The hash AM wanted to iterate through all of the pages in a bucket
> chain and do something to each one while preventing concurrent scans;
> the need here is similar.  Aside from the uninterruptible-wait
> problem, such coding patterns are extremely prone to deadlock.  If
> there's any chance that a process waiting for the buffer lock you hold
> might be holding a buffer lock you try to acquire, you have got a
> problem.

But there is only ever one page locked, the meta-page. And, it's
always an ExclusiveLock. I don't see much use for deadlock avoidance.

In any case, it's unusual to have a patch that uses LockPage() without
explaining the choice.

-- 
Peter Geoghegan


On Thu, Nov 16, 2017 at 12:49 PM, Peter Geoghegan <pg@bowt.ie> wrote:
> But there is only ever one page locked, the meta-page. And, it's
> always an ExclusiveLock. I don't see much use for deadlock avoidance.

I meant deadlock detection.


-- 
Peter Geoghegan


On Mon, Nov 13, 2017 at 8:08 PM, Peter Geoghegan <pg@bowt.ie> wrote:
> On Mon, Nov 13, 2017 at 6:56 PM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>>    /*
>>     * We would like to prevent concurrent cleanup process. For that we will
>>     * lock metapage in exclusive mode using LockPage() call. Nobody other
>>     * will use that lock for metapage, so we keep possibility of concurrent
>>     * insertion into pending list
>>     */
>>
>> So I conjecture that this was introduced not because we need deadlock
>> detection, but because we need a lock that is different from the one used
>> for insertion into the pending list.
>
> I understood that much, but I think that we need to detect problems
> and recover from them (something like _bt_page_recyclable()), rather
> than preventing them with pessimistic locking -- or, at least, there
> is no reason I know to think that the HW lock is sufficient, and I am
> tempted to go that way to fix this. Commit e9568083, which added the
> feature that led to commit e2c79e14, may itself be the basic problem
> here.

e2c79e14 was to fix a pre-existing bug, but probably e9568083 made that bug easier to hit than it was before.  (Which is not to say that  e9568083 can't contain bugs of its own, of course)


> * According to the GIN README, the pending list cleanup by VACUUM has
> a super-exclusive lock on the root, to block out concurrent inserters
> (that hold a pin on the root throughout). That's why it was/is okay
> that VACUUM recycled pending list pages without a RecentGlobalXmin
> interlock. Not so easy, but still not hard.

The only reference to super-exclusive lock in src/backend/access/gin/README, that I can find, is about posting trees, not pending lists.  Can you quote or give line numbers of the section you are referring to?

Cheers,

Jeff
On Thu, Nov 16, 2017 at 2:16 PM, Jeff Janes <jeff.janes@gmail.com> wrote:
> The only reference to super-exclusive lock in src/backend/access/gin/README,
> that I can find, is about posting trees, not pending lists.  Can you quote
> or give line numbers of the section you are referring to?

That's the section I was referring to. I apologize for being unclear.

I would like to see a *general* explanation of why it's okay that the
pending list can have its pages recycled early. How can we be sure
that somebody looking through the pending list won't land on a
just-recycled page and somehow get confused?

We see this for the entry tree (no deletion is possible in the first
place), and we also see it for the posting tree (the dance with
inserters having a pin on the root, and so on). Not mentioning why
pending list recycling is safe in terms of locking protocols seems
like an omission that should be addressed. I'm no GIN expert, but I
would expect this because the pending list seems like a kind of
extension of the posting tree.

As I've said, it feels slightly off to me that a user backend that is
performing opportunistic cleanup during insertion can call
RecordFreeIndexPage() in GIN.

-- 
Peter Geoghegan


On Thu, Nov 16, 2017 at 2:16 PM, Jeff Janes <jeff.janes@gmail.com> wrote:
> e2c79e14 was to fix a pre-existing bug, but probably e9568083 made that bug
> easier to hit than it was before.  (Which is not to say that  e9568083 can't
> contain bugs of its own, of course)

I get that. I think that the same thing may have happened again, in
fact. A pending list recycling interlock bug may have merely been
unmasked by your commit e9568083; I'm speculating that your commit
made it more likely to hit in practice, just as with the earlier bug
you mentioned.

As you know, there is a significant history of VACUUM page recycling
bugs in GIN. I wouldn't be surprised if we found one more in the
pending list logic. Consider commit ac4ab97e, a bugfix from late 2013
for posting list page deletion, where the current posting list locking
protocol was more or less established. The commit message says:

"""
...
If a page is deleted, and reused for something else, just as a search is
following a rightlink to it from its left sibling, the search would continue
scanning whatever the new contents of the page are. That could lead to
incorrect query results, or even something more curious if the page is
reused for a different kind of a page.

To fix, modify the search algorithm to lock the next page before releasing
the previous one, and refrain from deleting pages from the leftmost branch
of the [posting] tree.
...
"""

I believe that this commit had GIN avoid deletion of the leftmost
branch of posting trees because that makes it safe to get to the
posting tree root from a raw root block number (a raw block number
from the B-tree index constructed over key values). The buffer lock
pairing/crabbing this commit adds to posting lists (similar to the
pairing/crabbing that pending lists had from day one, when first added
in 2011) can prevent a concurrent deletion once you reach the posting
tree root. But, as I said, you still need to reliably get to the root
in the first place, which the "don't delete leftmost" rule ensures (if
you can never delete it, you can never recycle it, and no reader gets
confused).

It's not clear why it's safe to recycle the pending list pages at all
(though deletion alone might well be okay, because readers can notice
a page is deleted if it isn't yet recycled). Why is it okay that we
can jump to the first block from the meta page in the face of
concurrent recycling?

Looking at scanPendingInsert(), it does appear that readers do buffer
lock crabbing/coupling for the meta page and the first pending page.
So that's an encouraging sign. However, ginInsertCleanup() *also* uses
shared buffer locks for both meta page and list head page (never
exclusive buffer locks) in exactly the same way when first
establishing what block is at the head of the list to be zapped. It's
acting as if it is a reader, but it's actually a writer. I don't
follow why this is correct. It looks like scanPendingInsert() can
decide that it knows where the head of the pending list is *after*
ginInsertCleanup() has already decided that that same head of the list
block needs to possibly be deleted/recycled. Have I missed something?
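
To make that concrete, here is roughly how ginInsertCleanup() reads the
list head (a paraphrase from memory, not an exact quote of the source):

    metabuffer = ReadBuffer(index, GIN_METAPAGE_BLKNO);
    LockBuffer(metabuffer, GIN_SHARE);          /* shared, as a reader would */
    metadata = GinPageGetMeta(BufferGetPage(metabuffer));
    blkno = metadata->head;                     /* head of the pending list */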

We really ought to make the asserts on gin page type into "can't
happen" elog errors, as a defensive measure, in both
scanPendingInsert() and ginInsertCleanup(). The 2013 bug fix actually
did this for the posting tree, but didn't touch similar pending list
code.

-- 
Peter Geoghegan


On Thu, Nov 16, 2017 at 12:29 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Thu, Nov 16, 2017 at 7:08 AM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>> Agreed, that's better. Attached updated patch.
>> Also I've added this to the next CF so as not to forget.
>
> Committed and back-patched.  While I'm fairly sure this is a correct
> fix, post-commit review from someone who knows GIN better than I do
> would be a great idea.
>
> I am also not clear on what the consequences of this bug are.  It seems
> like it could harm insert performance by making us wait when we
> shouldn't, but can it cause corruption?  That I'm not sure about.  If
> there's already a cleanup of the pending list in progress, how do
> things go wrong?


 
It can cause corruption.  Attached is a test case to show it.  It is based on my usual crash-recovery test, but to get corruption I had to turn off the crash-injection (which means the test can be run on an unpatched server) and increase gin_pending_list_limit.

On 8 CPUs, it took anywhere from 15 minutes to 8 hours to see corruption, which presents as something like this in the script output:

child abnormal exit update did not update 1 row: key 8117 updated 2 at count.pl line 193.\n  at count.pl line 201.

Your commit has apparently fixed it, but I will run some more extensive tests.

The source of the corruption is this:

A tuple is inserted into the pending list.  It is quickly updated again, or deleted, and then passes the dead-to-all horizon, so it is eligible to be vacuumed.

A user backend takes the lock to start cleaning the pending list.

A vacuum backend attempts the lock, but due to the bug it does so conditionally, decides it is done when that fails, and starts the index vacuum proper.

The user backend reaches the dead tuple in the pending list, and inserts it into the main part of the index, in a part that the vacuum has already passed over.

The vacuum goes back to the table and marks the TID slot as reusable, even though there is still an index tuple pointing to it.

Cheers,

Jeff

Attachment
On Thu, Nov 16, 2017 at 2:43 PM, Peter Geoghegan wrote:
> On Thu, Nov 16, 2017 at 2:16 PM, Jeff Janes wrote:
>> The only reference to super-exclusive lock in src/backend/access/gin/README,
>> that I can find, is about posting trees, not pending lists.  Can you quote
>> or give line numbers of the section you are referring to?
>
> That's the section I was referring to. I apologize for being unclear.
>
> I would like to see a *general* explanation of why it's okay that the
> pending list can have its pages recycled early. How can we be sure
> that somebody looking through the pending list won't land on a
> just-recycled page and somehow get confused?
>
> We see this for the entry tree (no deletion is possible in the first
> place), and we also see it for the posting tree (the dance with
> inserters having a pin on the root, and so on). Not mentioning why
> pending list recycling is safe in terms of locking protocols seems
> like an omission that should be addressed. I'm no GIN expert, but I
> would expect this because the pending list seems like a kind of
> extension of the posting tree.

The pending list follows the same rule as the right-links on the posting
trees:

"To prevent a backend from reaching a deleted page via a right-link, when
following a right-link the lock on the previous page is not released until
the lock on next page has been acquired."

This lock-chaining rule is also followed when stepping from the meta page
to the first page in the pending list.

(Yes, neither of these are mentioned in the README).

Since pending lists don't have downlinks, the dance with the
super-exclusive locks is not needed.

I think what is true for deleted pages should also be sufficient for
deleted-and-freed-and-recycled pages.  Since you are only allowed to
follow a link from a locked buffer and not from private memory, you can't
follow a stale link as long as the links are updated before the page is
deleted and then recycled.

Is this convincing?

Also missing is an argument for the deadlock-freeness.  I think what is
needed for that is that you never lock a page earlier in the pending list,
nor the metapage, while holding a lock on a pending list page.  (Which as
far as I can tell is true, but I don't know how to exhaustively verify it.)

> As I've said, it feels slightly off to me that a user backend that is
> performing opportunistic cleanup during insertion can call
> RecordFreeIndexPage() in GIN.

I don't know how to address that, because I don't know why it seems off.
No one else does this because no one else generates large numbers of free
pages during insertions.  If we ever implement something like fractal
indexes for BTrees, then we would need to do this from the BTree insertion
(or clean-up-from-insertion) code as well.

There are no warnings in the .h or .c for RecordFreeIndexPage() about
where one is allowed to call it from.  I don't see any hazards of this
beyond those associated with deleting the page in the first place.  Any
deleted page unpinned can be recycled at any time by a vacuum that comes
along while you are swapped out, no?

Cheers,

Jeff
On Fri, Nov 24, 2017 at 6:09 PM, Jeff Janes <jeff.janes@gmail.com> wrote:
>> We see this for the entry tree (no deletion is possible in the first
>> place), and we also see it for the posting tree (the dance with
>> inserters having a pin on the root, and so on). Not mentioning why
>> pending list recycling is safe in terms of locking protocols seems
>> like an omission that should be addressed. I'm no GIN expert, but I
>> would expect this because the pending list seems like a kind of
>> extension of the posting tree.
>
> The pending list follows the same rule as the right-links on the posting
> trees:
>
> "To prevent a backend from reaching a deleted page via a right-link, when
> following a right-link the lock on the previous page is not released until
> the lock on next page has been acquired."
>
> This lock-chaining rule is also followed when stepping from the meta page to
> the first page in the pending list.
>
> (Yes, neither of these are mentioned in the README).

But, as I said, we lock-chain with shared buffer locks in
ginInsertCleanup(). Not exclusive buffer locks. (We also take the HW
lock on the metapage so that ginInsertCleanup() callers block each
other out).
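
The chaining step itself looks roughly like this (a paraphrase, not the
exact ginfast.c code; variable names are illustrative):

    /*
     * Step to the next pending-list page while still holding the lock on
     * the current one, so the link being followed cannot lead to a page
     * that has already been recycled.
     */
    nextBlkno = GinPageGetOpaque(page)->rightlink;
    nextBuffer = ReadBuffer(index, nextBlkno);
    LockBuffer(nextBuffer, GIN_SHARE);
    UnlockReleaseBuffer(buffer);        /* only now release the previous page */
    buffer = nextBuffer;
    page = BufferGetPage(buffer);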

With B-Tree index structures that don't use Lehman & Yao's technique,
in general, we must do lock coupling. There is, in general, no point
in lock-chaining at all if there is no exclusive buffer locker
(writer) involved at some point. Writers must lock with exclusive
buffer locks from the beginning of the chain (e.g. root) for
traditional coupling to be correct. However, the only exclusive buffer
lock in ginInsertCleanup() is taken on the metapage at quite a late
point in cleanup, just before insertion of pending items is performed.
Crucially, this comes after we've decided what the head and tail of
the pending list are.

The metapage HW lock is sufficient for writers to block writers, but
AFAICT it is not sufficient for writers to block readers.

> Since pending lists don't have downlinks, the dance with the super-exclusive
> locks is not needed.

Isn't the pending list's head block number in the metapage very much
like a downlink? I think I would be convinced by what you say here if
we held an exclusive buffer lock on the metapage from the beginning
within ginInsertCleanup() (we'd then remove the LockPage() HW lock
calls entirely). We don't, though. I'm not proposing that we actually
change the code in that way. My point is that that looks like it would
make the code correct from the angle I'm looking at it, which may make
my thinking clearer to you.

> I think what is true for deleted pages should also be sufficient for
> deleted-and-freed-and-recycled pages.  Since you are only allowed to follow
> a link from a locked buffer and not from private memory, you can't follow a
> stale link as long as the links are updated before the page is deleted and
> then recycled.
>
> Is this convincing?

In general, yes, but the devil is in the details. I very much like the
"pending list is analogous to posting list" framing. i just doubt that
we get all the details right in practice.

> Also missing is an argument for the deadlock-freeness.  I think what is
> needed for that is that you never lock a page earlier in the pending list,
> nor the metapage, while holding a lock on a pending list page.  (Which as
> far as I can tell is true, but I don't know how to exhaustively verify it)

Defensively checking the page type (pending, posting, etc) within
scanPendingInsert() would be a good start. That's already something
that we do for posting trees. If we're following the same rules as
posting trees (which sounds like a good idea), then we should have the
same defenses. Same applies to using elogs within ginInsertCleanup()
-- we should promote those Assert()s to elog()s.
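
Something like this, for example (an illustrative sketch, assuming the
existing GinPageIsList() flag test is the right check there):

    /* instead of just Assert(GinPageIsList(page)) */
    if (!GinPageIsList(page))
        elog(ERROR, "unexpected page type in GIN pending list of \"%s\": block %u",
             RelationGetRelationName(index), blkno);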

I also suggest that you add a random, artificial sleep within
ginInsertCleanup(), like this:

    /*
     * Read and lock head of pending list
     */
    blkno = metadata->head;
    /* Random sleep goes here */
    buffer = ReadBuffer(index, blkno);
    LockBuffer(buffer, GIN_SHARE);
    page = BufferGetPage(buffer);
If there is a race here, then this should make it more likely to hit.
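
Something along these lines would do (an illustrative sketch only,
obviously not for commit):

    /*
     * Artificial delay: occasionally sleep between reading the head block
     * number from the metapage and locking that page, to widen any race
     * window against concurrent deletion/recycling of the head page.
     */
    if (random() % 10 == 0)
        pg_usleep(1000L * (random() % 100));    /* up to ~100 ms */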

I should remind you that deadlocks might be caused by recycling races.
While we should explain why pending list cleaning is deadlock-free, as
you suggest, reports of deadlocks might not be due to a flaw in
locking as such.

>> As I've said, it feels slightly off to me that a user backend that is
>> performing opportunistic cleanup during insertion can call
>> RecordFreeIndexPage() in GIN.
>
>
> I don't know how to address that, because I don't know why it seems off.  No
> one else does this because no one else generates large number of free pages
> during insertions.

My point is that it seems like having little distinction between
deletion and recycling is problematic. If you don't have very strong
locking to block out all possible index scans (locking at some central
choke point), then you risk concurrent recycling races. I am wearing
my nbtree hat in assessing the GIN code, which admittedly might not
always be the best way of looking at it. I'm explaining my perspective
because I think it might be useful for you to understand my thought
process. I hope I don't come off as peevish.

-- 
Peter Geoghegan