Thread: Multiple FPI_FOR_HINT for the same block during killing btree index items

Multiple FPI_FOR_HINT for the same block during killing btree index items

From: Masahiko Sawada

Hi all,

While investigating an issue one of our customers had, I realized that
_bt_killitems() can record an FPI for the same page multiple times.
This can happen when several concurrent index scans are processing a
page that contains killable tuples: because killing index items is done
while holding the buffer lock in shared mode, concurrent processes can
each record an FPI_FOR_HINT for the same block.

Here is the reproducer:

cat <<EOF | psql -d postgres
drop table if exists tbl;
create table tbl (c int primary key) with (autovacuum_enabled = off);
insert into tbl select generate_series(1,300);
update tbl set c = c * -1 where c = 100;
checkpoint;
EOF

for n in `seq 1 4`
do
    psql -d postgres -c "select from tbl where c = 100" &
done

The server needs wal_log_hints enabled, and the script might need to be
run several times. After running it, the issue is visible with
pg_waldump:

rmgr: XLOG len (rec/tot): 49/ 8209, tx: 0, top: 0, lsn: 1/8FD1C3D8,
prev 1/8FD1C368, desc: FPI_FOR_HINT , blkref #0: rel 1663/12643/16767
blk 0 FPW
rmgr: XLOG len (rec/tot): 49/ 8209, tx: 0, top: 0, lsn: 1/8FD1E408,
prev 1/8FD1C3D8, desc: FPI_FOR_HINT , blkref #0: rel 1663/12643/16767
blk 0 FPW

This is an excerpt from _bt_killitems() in version 12.2. Due to recent
changes the code on HEAD looks quite different, but the part in question
is essentially unchanged; the issue is reproducible on HEAD as well.

    for (i = 0; i < numKilled; i++)
    {
        int         itemIndex = so->killedItems[i];
        BTScanPosItem *kitem = &so->currPos.items[itemIndex];
        OffsetNumber offnum = kitem->indexOffset;

        Assert(itemIndex >= so->currPos.firstItem &&
               itemIndex <= so->currPos.lastItem);
        if (offnum < minoff)
            continue;           /* pure paranoia */
        while (offnum <= maxoff)
        {
            ItemId      iid = PageGetItemId(page, offnum);
            IndexTuple  ituple = (IndexTuple) PageGetItem(page, iid);

            if (ItemPointerEquals(&ituple->t_tid, &kitem->heapTid))
            {
                /* found the item */
                ItemIdMarkDead(iid);
                killedsomething = true;
                break;          /* out of inner search loop */
            }
            offnum = OffsetNumberNext(offnum);
        }
    }

    /*
     * Since this can be redone later if needed, mark as dirty hint.
     *
     * Whenever we mark anything LP_DEAD, we also set the page's
     * BTP_HAS_GARBAGE flag, which is likewise just a hint.
     */
    if (killedsomething)
    {
        opaque->btpo_flags |= BTP_HAS_GARBAGE;
        MarkBufferDirtyHint(so->currPos.buf, true);
    }

The inner test at the comment "found the item" never checks whether the
item is already dead. So maybe we can add !ItemIdIsDead(iid) to that
condition. But even then, a race condition where multiple FPIs get
recorded can still happen. Maybe a better solution is to change the lock
to exclusive, at least when wal_log_hints = on, so that only one process
can run this code -- the reduction in concurrency might be won back by
the fact that we don't WAL-log the page multiple times.
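
For illustration, the extra check could look roughly like this in the
inner loop (just a sketch of the idea, not a final patch):

            if (ItemPointerEquals(&ituple->t_tid, &kitem->heapTid) &&
                !ItemIdIsDead(iid))
            {
                /*
                 * Found the item and nobody else has marked it dead yet,
                 * so it is still worth dirtying the page for.
                 */
                ItemIdMarkDead(iid);
                killedsomething = true;
                break;          /* out of inner search loop */
            }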

I understand that we may call MarkBufferDirtyHint() while holding the
buffer lock in share mode, as the comment of MarkBufferDirtyHint() says,
but I'd like to improve this behavior so that we can avoid multiple
FPI_FOR_HINT records for the same block.

Regards,

--
Masahiko Sawada            http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Multiple FPI_FOR_HINT for the same block during killing btree index items

From: Alvaro Herrera

On 2020-Apr-09, Masahiko Sawada wrote:

> The inner test at the comment "found the item" never checks whether the
> item is already dead. So maybe we can add !ItemIdIsDead(iid) to that
> condition. But even then, a race condition where multiple FPIs get
> recorded can still happen. Maybe a better solution is to change the lock
> to exclusive, at least when wal_log_hints = on, so that only one process
> can run this code -- the reduction in concurrency might be won back by
> the fact that we don't WAL-log the page multiple times.

I agree.

It seems worth pointing out that when this code was written, these hint
bit changes were not logged, so this consideration did not apply then.
But we added data checksums and wal_log_hints, which changed the
equation.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Multiple FPI_FOR_HINT for the same block during killing btree index items

From: Peter Geoghegan

On Wed, Apr 8, 2020 at 10:56 PM Masahiko Sawada
<masahiko.sawada@2ndquadrant.com> wrote:
> Here is the reproducer:

What version of Postgres did you notice the actual customer issue on?
I ask because I wonder if the work on B-Tree indexes in Postgres 12
affects the precise behavior you get here with real world workloads.
It probably makes _bt_killitems() more effective with some workloads,
which naturally increases the likelihood of having multiple FPI issued
in the manner that you describe. OTOH, it might make it less likely
with low cardinality indexes, since large groups of garbage duplicate
tuples tend to get concentrated on just a few leaf pages.

> The inner test at the comment "found the item" never checks whether the
> item is already dead. So maybe we can add !ItemIdIsDead(iid) to that
> condition. But even then, a race condition where multiple FPIs get
> recorded can still happen. Maybe a better solution is to change the lock
> to exclusive, at least when wal_log_hints = on, so that only one process
> can run this code -- the reduction in concurrency might be won back by
> the fact that we don't WAL-log the page multiple times.

I like the idea of checking !ItemIdIsDead(iid) as a further condition
of killing the item -- there is clearly no point in doing work to kill
an item that is already dead. I don't like the idea of using an
exclusive buffer lock (even if it's just with wal_log_hints = on),
though.

-- 
Peter Geoghegan



Re: Multiple FPI_FOR_HINT for the same block during killing btree index items

From: James Coleman

On Thu, Apr 9, 2020 at 3:05 PM Peter Geoghegan <pg@bowt.ie> wrote:
>
> On Wed, Apr 8, 2020 at 10:56 PM Masahiko Sawada
> <masahiko.sawada@2ndquadrant.com> wrote:
> > Here is the reproducer:
>
> What version of Postgres did you notice the actual customer issue on?
> I ask because I wonder if the work on B-Tree indexes in Postgres 12
> affects the precise behavior you get here with real world workloads.
> It probably makes _bt_killitems() more effective with some workloads,
> which naturally increases the likelihood of having multiple FPI issued
> in the manner that you describe. OTOH, it might make it less likely
> with low cardinality indexes, since large groups of garbage duplicate
> tuples tend to get concentrated on just a few leaf pages.

We saw the issue on our PG11 clusters. The specific index we noticed
in the WAL dump (I don't think we confirmed whether there were others)
was one on a `created_at` column, to give you an idea of cardinality.

> > The inner test at the comment "found the item" never checks whether the
> > item is already dead. So maybe we can add !ItemIdIsDead(iid) to that
> > condition. But even then, a race condition where multiple FPIs get
> > recorded can still happen. Maybe a better solution is to change the lock
> > to exclusive, at least when wal_log_hints = on, so that only one process
> > can run this code -- the reduction in concurrency might be won back by
> > the fact that we don't WAL-log the page multiple times.
>
> I like the idea of checking !ItemIdIsDead(iid) as a further condition
> of killing the item -- there is clearly no point in doing work to kill
> an item that is already dead. I don't like the idea of using an
> exclusive buffer lock (even if it's just with wal_log_hints = on),
> though.

I don't have a strong opinion on the lock.

James



Re: Multiple FPI_FOR_HINT for the same block during killing btree index items

From: Peter Geoghegan

On Thu, Apr 9, 2020 at 1:37 PM James Coleman <jtc331@gmail.com> wrote:
> We saw the issue on our PG11 clusters. The specific index we noticed
> in the WAL dump (I don't think we confirmed whether there were others)
> was one on a `created_at` column, to give you an idea of cardinality.

You tend to get a lot of problems with indexes like that when there
are consistent updates (actually, that's more of a thing with an
updated_at index). But non-HOT updates alone might result in what you
could describe as "updates" to the index.

With Postgres 11, a low cardinality index could place new/successor
duplicate index tuples (those needed for non-HOT updates) on a more or
less random leaf page (you'll recall that this is determined by the
old "getting tired" logic). This is the kind of thing I had in mind
when I asked Sawada-san about it.

Was this a low cardinality index in the way I describe? If it was,
then we can hope (and maybe even verify) that the Postgres 12 work
noticeably ameliorates the problem.

-- 
Peter Geoghegan



Re: Multiple FPI_FOR_HINT for the same block during killing btree index items

From: Peter Geoghegan

On Thu, Apr 9, 2020 at 5:25 PM Peter Geoghegan <pg@bowt.ie> wrote:
> Was this a low cardinality index in the way I describe? If it was,
> then we can hope (and maybe even verify) that the Postgres 12 work
> noticeably ameliorates the problem.

What I really meant was an index where hundreds or even thousands of
rows for each distinct timestamp value are expected. Not an index
where almost every row has a distinct timestamp value. Both timestamp
index patterns are common, obviously.

-- 
Peter Geoghegan



Re: Multiple FPI_FOR_HINT for the same block during killing btree index items

From: James Coleman

On Thu, Apr 9, 2020 at 8:32 PM Peter Geoghegan <pg@bowt.ie> wrote:
>
> On Thu, Apr 9, 2020 at 5:25 PM Peter Geoghegan <pg@bowt.ie> wrote:
> > Was this a low cardinality index in the way I describe? If it was,
> > then we can hope (and maybe even verify) that the Postgres 12 work
> > noticeably ameliorates the problem.
>
> What I really meant was an index where hundreds or even thousands of
> rows for each distinct timestamp value are expected. Not an index
> where almost every row has a distinct timestamp value. Both timestamp
> index patterns are common, obviously.

I'll try to run some numbers tomorrow to confirm, but I believe that
the created_at value is almost (if not completely) unique. So, no,
it's not a low cardinality case like that.

I believe the write pattern to this table likely looks like:
- INSERT
- UPDATE
- DELETE
for every row. But tomorrow I can do some more digging if needed.

James



Re: Multiple FPI_FOR_HINT for the same block during killing btree index items

From: Peter Geoghegan

On Thu, Apr 9, 2020 at 6:47 PM James Coleman <jtc331@gmail.com> wrote:
> I believe the write pattern to this table likely looks like:
> - INSERT
> - UPDATE
> - DELETE
> for every row. But tomorrow I can do some more digging if needed.

The pg_stats.null_frac for the column/index might be interesting here. I
believe that Active Record will sometimes generate created_at columns
that sometimes end up containing NULL values. Not sure why.


--
Peter Geoghegan



Re: Multiple FPI_FOR_HINT for the same block during killing btree index items

From: Masahiko Sawada

On Fri, 10 Apr 2020 at 04:05, Peter Geoghegan <pg@bowt.ie> wrote:
>
> On Wed, Apr 8, 2020 at 10:56 PM Masahiko Sawada
> <masahiko.sawada@2ndquadrant.com> wrote:
> > Here is the reproducer:
>
> What version of Postgres did you notice the actual customer issue on?
> I ask because I wonder if the work on B-Tree indexes in Postgres 12
> affects the precise behavior you get here with real world workloads.
> It probably makes _bt_killitems() more effective with some workloads,
> which naturally increases the likelihood of having multiple FPI issued
> in the manner that you describe. OTOH, it might make it less likely
> with low cardinality indexes, since large groups of garbage duplicate
> tuples tend to get concentrated on just a few leaf pages.
>
> > The inner test at the comment "found the item" never checks whether the
> > item is already dead. So maybe we can add !ItemIdIsDead(iid) to that
> > condition. But even then, a race condition where multiple FPIs get
> > recorded can still happen. Maybe a better solution is to change the lock
> > to exclusive, at least when wal_log_hints = on, so that only one process
> > can run this code -- the reduction in concurrency might be won back by
> > the fact that we don't WAL-log the page multiple times.
>
> I like the idea of checking !ItemIdIsDead(iid) as a further condition
> of killing the item -- there is clearly no point in doing work to kill
> an item that is already dead. I don't like the idea of using an
> exclusive buffer lock (even if it's just with wal_log_hints = on),
> though.
>

Okay. I think just adding the check would also help reduce the
likelihood. How about the changes for current HEAD that I've attached?

Related to this behavior on btree indexes, the same thing can happen on
heap pages while fetching heap tuples. To reduce the likelihood of that
more generally, I wonder if we can acquire the buffer header lock right
before XLogSaveBufferForHint() and set a flag in the buffer descriptor
indicating that we're about to log an FPI for hint bits, so that
concurrent processes can be aware of it.
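
Purely to illustrate the idea (BM_LOG_HINT_FPI is a hypothetical flag
that does not exist today), MarkBufferDirtyHint() could do something
along these lines before logging:

    buf_state = LockBufHdr(bufHdr);
    if (!(buf_state & BM_LOG_HINT_FPI))
    {
        /* nobody is logging this page yet; claim it and emit the FPI */
        buf_state |= BM_LOG_HINT_FPI;
        UnlockBufHdr(bufHdr, buf_state);

        lsn = XLogSaveBufferForHint(buffer, buffer_std);
    }
    else
    {
        /* another backend is already logging an FPI for this page */
        UnlockBufHdr(bufHdr, buf_state);
    }
    /* (clearing the flag once the page is marked dirty is omitted here) */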

Regards,

-- 
Masahiko Sawada            http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Attachment

Re: Multiple FPI_FOR_HINT for the same block during killing btree index items

From: Kyotaro Horiguchi

At Fri, 10 Apr 2020 12:32:31 +0900, Masahiko Sawada <masahiko.sawada@2ndquadrant.com> wrote in 
> On Fri, 10 Apr 2020 at 04:05, Peter Geoghegan <pg@bowt.ie> wrote:
> >
> > On Wed, Apr 8, 2020 at 10:56 PM Masahiko Sawada
> > <masahiko.sawada@2ndquadrant.com> wrote:
> > > Here is the reproducer:
> >
> > What version of Postgres did you notice the actual customer issue on?
> > I ask because I wonder if the work on B-Tree indexes in Postgres 12
> > affects the precise behavior you get here with real world workloads.
> > It probably makes _bt_killitems() more effective with some workloads,
> > which naturally increases the likelihood of having multiple FPI issued
> > in the manner that you describe. OTOH, it might make it less likely
> > with low cardinality indexes, since large groups of garbage duplicate
> > tuples tend to get concentrated on just a few leaf pages.
> >
> > > The inner test at the comment "found the item" never checks whether the
> > > item is already dead. So maybe we can add !ItemIdIsDead(iid) to that
> > > condition. But even then, a race condition where multiple FPIs get
> > > recorded can still happen. Maybe a better solution is to change the lock
> > > to exclusive, at least when wal_log_hints = on, so that only one process
> > > can run this code -- the reduction in concurrency might be won back by
> > > the fact that we don't WAL-log the page multiple times.
> >
> > I like the idea of checking !ItemIdIsDead(iid) as a further condition
> > of killing the item -- there is clearly no point in doing work to kill
> > an item that is already dead. I don't like the idea of using an
> > exclusive buffer lock (even if it's just with wal_log_hints = on),
> > though.
> >
> 
> Okay. I think just adding the check would also help reduce the
> likelihood. How about the changes for current HEAD that I've attached?

FWIW, looks good to me.

> Related to this behavior on btree indexes, the same thing can happen on
> heap pages while fetching heap tuples. To reduce the likelihood of that
> more generally, I wonder if we can acquire the buffer header lock right
> before XLogSaveBufferForHint() and set a flag in the buffer descriptor
> indicating that we're about to log an FPI for hint bits, so that
> concurrent processes can be aware of it.

Makes sense if the lock is acquired just before the (BM_DIRTY |
BM_JUST_DIRTIED) check.  Could we use double-checking, similar to the
patch for ItemIdIsDead()?

> if ((pg_atomic_read_u32(&bufHdr->state) & (BM_DIRTY | BM_JUST_DIRTIED)) !=
>     (BM_DIRTY | BM_JUST_DIRTIED))
> {
> ...
>       * essential that CreateCheckpoint waits for virtual transactions
>       * rather than full transactionids.
>       */
>      /* recheck the flags while holding the buffer header lock */
>      buf_state = LockBufHdr(bufHdr);
>
>      if ((buf_state & (BM_DIRTY | BM_JUST_DIRTIED)) !=
>          (BM_DIRTY | BM_JUST_DIRTIED))
>      {
>        MyProc->delayChkpt = delayChkpt = true;
>        lsn = XLogSaveBufferForHint(buffer, buffer_std);
>      }
>    }
>    else
>      buf_state = LockBufHdr(bufHdr);

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center



Re: Multiple FPI_FOR_HINT for the same block during killing btree index items

From: James Coleman

On Thu, Apr 9, 2020 at 10:08 PM Peter Geoghegan <pg@bowt.ie> wrote:
>
> On Thu, Apr 9, 2020 at 6:47 PM James Coleman <jtc331@gmail.com> wrote:
> > I believe the write pattern to this table likely looks like:
> > - INSERT
> > - UPDATE
> > - DELETE
> > for every row. But tomorrow I can do some more digging if needed.
>
> The pg_stats.null_frac for the column/index might be interesting here. I
> believe that Active Record will sometimes generate created_at columns
> that sometimes end up containing NULL values. Not sure why.

null_frac is 0 for created_at (as I expected). Also (under current
data) all created_at values are unique except for a single duplicate row.

That being said, remember the write pattern above: every row gets
deleted eventually, so there would be a lot of dead tuples overall.

James



Re: Multiple FPI_FOR_HINT for the same block during killing btree index items

From: Alvaro Herrera

On 2020-Apr-10, Masahiko Sawada wrote:

> Okay. I think just adding the check would also help reduce the
> likelihood. How about the changes for current HEAD that I've attached?

Pushed this to all branches.  (Branches 12 and older obviously needed an
adjustment.)  Thanks!

> Related to this behavior on btree indexes, the same thing can happen on
> heap pages while fetching heap tuples. To reduce the likelihood of that
> more generally, I wonder if we can acquire the buffer header lock right
> before XLogSaveBufferForHint() and set a flag in the buffer descriptor
> indicating that we're about to log an FPI for hint bits, so that
> concurrent processes can be aware of it.

I'm not sure how that helps; the other process would have to go back and
redo their whole operation from scratch in order to find out whether
there's still something alive that needs killing.

I think you need to acquire the exclusive lock sooner: if, when scanning
the page, you find a killable item, *then* upgrade the lock to exclusive
and restart the scan.  This means that we'll have to wait for any other
process that's doing the scan, and they will all give up their share
lock to wait for the exclusive lock they need.  So the one that gets it
first will do all the killing, log the page, then release the lock.  At
that point the other processes will wake up and see that items have been
killed, so they will return having done nothing.

Like the attached.  I didn't verify that it works well or that it
actually improves performance ...
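
Roughly, the flow would be something like this (a sketch of the idea
only, not necessarily what the attached patch does; the page-local
variables would have to be recomputed after re-taking the lock):

    bool        have_exclusive = false;

restart:
    for (i = 0; i < numKilled; i++)
    {
        /* ... walk the page looking for kitem->heapTid, as today ... */

        if (ItemPointerEquals(&ituple->t_tid, &kitem->heapTid) &&
            !ItemIdIsDead(iid))
        {
            if (!have_exclusive)
            {
                /* trade the share lock for an exclusive one, then rescan */
                LockBuffer(so->currPos.buf, BUFFER_LOCK_UNLOCK);
                LockBuffer(so->currPos.buf, BUFFER_LOCK_EXCLUSIVE);
                have_exclusive = true;
                goto restart;
            }
            ItemIdMarkDead(iid);
            killedsomething = true;
            break;
        }
    }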

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Attachment

Re: Multiple FPI_FOR_HINT for the same block during killing btree index items

From: Ranier Vilela

On Fri, May 15, 2020 at 18:53, Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
> On 2020-Apr-10, Masahiko Sawada wrote:
>
> > Okay. I think just adding the check would also help reduce the
> > likelihood. How about the changes for current HEAD that I've attached?
>
> Pushed this to all branches.  (Branches 12 and older obviously needed an
> adjustment.)  Thanks!
>
> > Related to this behavior on btree indexes, the same thing can happen on
> > heap pages while fetching heap tuples. To reduce the likelihood of that
> > more generally, I wonder if we can acquire the buffer header lock right
> > before XLogSaveBufferForHint() and set a flag in the buffer descriptor
> > indicating that we're about to log an FPI for hint bits, so that
> > concurrent processes can be aware of it.
>
> I'm not sure how that helps; the other process would have to go back and
> redo their whole operation from scratch in order to find out whether
> there's still something alive that needs killing.
>
> I think you need to acquire the exclusive lock sooner: if, when scanning
> the page, you find a killable item, *then* upgrade the lock to exclusive
> and restart the scan.  This means that we'll have to wait for any other
> process that's doing the scan, and they will all give up their share
> lock to wait for the exclusive lock they need.  So the one that gets it
> first will do all the killing, log the page, then release the lock.  At
> that point the other processes will wake up and see that items have been
> killed, so they will return having done nothing.
>
> Like the attached.  I didn't verify that it works well or that it
> actually improves performance ...
This is not related to your latest patch, but I believe the performance
can be improved.

So:
1. killedsomething is false.
2. Some killtuple is true, but (!ItemIdIsDead(iid)) is false.
3. Nothing to be done.

So why do all the work and then discard it? We can eliminate the current
item much earlier by testing whether it is already dead.

regards,
Ranier Vilela
Attachment

Re: Multiple FPI_FOR_HINT for the same block during killing btree index items

From: Ranier Vilela

On Fri, May 15, 2020 at 18:53, Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
> On 2020-Apr-10, Masahiko Sawada wrote:
>
> > Okay. I think just adding the check would also help reduce the
> > likelihood. How about the changes for current HEAD that I've attached?
>
> Pushed this to all branches.  (Branches 12 and older obviously needed an
> adjustment.)  Thanks!
>
> > Related to this behavior on btree indexes, the same thing can happen on
> > heap pages while fetching heap tuples. To reduce the likelihood of that
> > more generally, I wonder if we can acquire the buffer header lock right
> > before XLogSaveBufferForHint() and set a flag in the buffer descriptor
> > indicating that we're about to log an FPI for hint bits, so that
> > concurrent processes can be aware of it.
>
> I'm not sure how that helps; the other process would have to go back and
> redo their whole operation from scratch in order to find out whether
> there's still something alive that needs killing.
>
> I think you need to acquire the exclusive lock sooner: if, when scanning
> the page, you find a killable item, *then* upgrade the lock to exclusive
> and restart the scan.  This means that we'll have to wait for any other
> process that's doing the scan, and they will all give up their share
> lock to wait for the exclusive lock they need.  So the one that gets it
> first will do all the killing, log the page, then release the lock.  At
> that point the other processes will wake up and see that items have been
> killed, so they will return having done nothing.
Regarding the lock, I disagree in part, because in the worst case the
exclusive lock would only be requested for the last item analyzed,
which means redoing all the work from the beginning. If we are in
_bt_killitems() it is because there is a high probability that there are
items to be killed, so why not request the exclusive lock right away
when the conditions are met (a sketch follows below)?

1. XLogHintBitIsNeeded()
2. !AutoVacuumingActive()
3. A new configuration variable option to activate the exclusive lock?

Masahiko reported that it occurs only when (autovacuum_enabled = off);
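
Something along these lines (illustrative only; condition 3 would be an
additional GUC test):

    /* decide the lock level before scanning the page */
    int         lockmode = BUFFER_LOCK_SHARE;

    if (XLogHintBitIsNeeded() && !AutoVacuumingActive())
        lockmode = BUFFER_LOCK_EXCLUSIVE;

    LockBuffer(so->currPos.buf, lockmode);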

regards,
Ranier Vilela
Attachment