Thread: Multiple FPI_FOR_HINT for the same block during killing btree index items
Multiple FPI_FOR_HINT for the same block during killing btree index items
From
Masahiko Sawada
Date:
Hi all,

While investigating an issue one of our customers hit, I realized that _bt_killitems() can record FPIs of the same page multiple times. This can happen when several concurrent index scans are processing pages that contain killable tuples. Because killing index items is performed while holding the buffer lock only in shared mode, concurrent processes can each record an FPI_FOR_HINT for the same block.

Here is the reproducer:

cat <<EOF | psql -d postgres
drop table if exists tbl;
create table tbl (c int primary key) with (autovacuum_enabled = off);
insert into tbl select generate_series(1,300);
update tbl set c = c * -1 where c = 100;
checkpoint;
EOF

for n in `seq 1 4`
do
    psql -d postgres -c "select from tbl where c = 100" &
done

The server needs wal_log_hints enabled, and the script might need to be run several times. After running it we can see this issue with pg_waldump:

rmgr: XLOG  len (rec/tot): 49/ 8209, tx: 0, top: 0, lsn: 1/8FD1C3D8, prev 1/8FD1C368, desc: FPI_FOR_HINT , blkref #0: rel 1663/12643/16767 blk 0 FPW
rmgr: XLOG  len (rec/tot): 49/ 8209, tx: 0, top: 0, lsn: 1/8FD1E408, prev 1/8FD1C3D8, desc: FPI_FOR_HINT , blkref #0: rel 1663/12643/16767 blk 0 FPW

This is an excerpt from _bt_killitems() in version 12.2. Due to recent changes the code on HEAD looks quite different, but the part in question is essentially unchanged, so the problem is reproducible on HEAD as well.

    for (i = 0; i < numKilled; i++)
    {
        int         itemIndex = so->killedItems[i];
        BTScanPosItem *kitem = &so->currPos.items[itemIndex];
        OffsetNumber offnum = kitem->indexOffset;

        Assert(itemIndex >= so->currPos.firstItem &&
               itemIndex <= so->currPos.lastItem);

        if (offnum < minoff)
            continue;           /* pure paranoia */

        while (offnum <= maxoff)
        {
            ItemId      iid = PageGetItemId(page, offnum);
            IndexTuple  ituple = (IndexTuple) PageGetItem(page, iid);

            if (ItemPointerEquals(&ituple->t_tid, &kitem->heapTid))
            {
                /* found the item */
                ItemIdMarkDead(iid);
                killedsomething = true;
                break;          /* out of inner search loop */
            }
            offnum = OffsetNumberNext(offnum);
        }
    }

    /*
     * Since this can be redone later if needed, mark as dirty hint.
     *
     * Whenever we mark anything LP_DEAD, we also set the page's
     * BTP_HAS_GARBAGE flag, which is likewise just a hint.
     */
    if (killedsomething)
    {
        opaque->btpo_flags |= BTP_HAS_GARBAGE;
        MarkBufferDirtyHint(so->currPos.buf, true);
    }

The test at the "found the item" comment never checks whether the item is already dead. So we could add !ItemIdIsDead(iid) to that condition. But even then, a race condition where multiple FPIs are recorded can still happen. Maybe a better solution is to change the lock to exclusive, at least when wal_log_hints = on, so that only one process can run this code; the reduction in concurrency might be won back by the fact that we don't WAL-log the page multiple times.

I understand that we can call MarkBufferDirtyHint() while holding the buffer lock in share mode, as the comment of MarkBufferDirtyHint() says, but I'd like to improve this behavior so that we can avoid multiple FPI_FOR_HINT records for the same block.

Regards,

--
Masahiko Sawada
http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
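A minimal sketch of the suggested condition change against the excerpt above (illustrative only; not the committed patch, which may differ in detail):

    while (offnum <= maxoff)
    {
        ItemId      iid = PageGetItemId(page, offnum);
        IndexTuple  ituple = (IndexTuple) PageGetItem(page, iid);

        /*
         * Skip items that are already dead: re-marking them would only
         * re-dirty the page and can trigger another FPI_FOR_HINT.
         */
        if (!ItemIdIsDead(iid) &&
            ItemPointerEquals(&ituple->t_tid, &kitem->heapTid))
        {
            /* found the item */
            ItemIdMarkDead(iid);
            killedsomething = true;
            break;              /* out of inner search loop */
        }
        offnum = OffsetNumberNext(offnum);
    }

With this change, a backend that arrives after another backend has already marked the item dead no longer sets killedsomething, so it never reaches MarkBufferDirtyHint() and emits no redundant FPI (unless it also killed some other, still-live item).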
Re: Multiple FPI_FOR_HINT for the same block during killing btree index items
From
Alvaro Herrera
Date:
On 2020-Apr-09, Masahiko Sawada wrote:

> The test at the "found the item" comment never checks whether the item
> is already dead. So we could add !ItemIdIsDead(iid) to that condition.
> But even then, a race condition where multiple FPIs are recorded can
> still happen. Maybe a better solution is to change the lock to
> exclusive, at least when wal_log_hints = on, so that only one process
> can run this code; the reduction in concurrency might be won back by
> the fact that we don't WAL-log the page multiple times.

I agree. It seems worth pointing out that when this code was written, these hint bit changes were not logged, so this consideration did not apply then. But we added data checksums and wal_log_hints, which changed the equation.

--
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: Multiple FPI_FOR_HINT for the same block during killing btree index items
From
Peter Geoghegan
Date:
On Wed, Apr 8, 2020 at 10:56 PM Masahiko Sawada <masahiko.sawada@2ndquadrant.com> wrote:
> Here is the reproducer:

What version of Postgres did you notice the actual customer issue on? I ask because I wonder if the work on B-Tree indexes in Postgres 12 affects the precise behavior you get here with real-world workloads. It probably makes _bt_killitems() more effective with some workloads, which naturally increases the likelihood of having multiple FPIs issued in the manner that you describe. OTOH, it might make it less likely with low cardinality indexes, since large groups of garbage duplicate tuples tend to get concentrated on just a few leaf pages.

> The test at the "found the item" comment never checks whether the item
> is already dead. So we could add !ItemIdIsDead(iid) to that condition.
> But even then, a race condition where multiple FPIs are recorded can
> still happen. Maybe a better solution is to change the lock to
> exclusive, at least when wal_log_hints = on, so that only one process
> can run this code; the reduction in concurrency might be won back by
> the fact that we don't WAL-log the page multiple times.

I like the idea of checking !ItemIdIsDead(iid) as a further condition of killing the item -- there is clearly no point in doing work to kill an item that is already dead. I don't like the idea of using an exclusive buffer lock (even if it's just with wal_log_hints = on), though.

--
Peter Geoghegan
Re: Multiple FPI_FOR_HINT for the same block during killing btree index items
From
James Coleman
Date:
On Thu, Apr 9, 2020 at 3:05 PM Peter Geoghegan <pg@bowt.ie> wrote:
> On Wed, Apr 8, 2020 at 10:56 PM Masahiko Sawada
> <masahiko.sawada@2ndquadrant.com> wrote:
> > Here is the reproducer:
>
> What version of Postgres did you notice the actual customer issue on?
> I ask because I wonder if the work on B-Tree indexes in Postgres 12
> affects the precise behavior you get here with real-world workloads.
> It probably makes _bt_killitems() more effective with some workloads,
> which naturally increases the likelihood of having multiple FPIs
> issued in the manner that you describe. OTOH, it might make it less
> likely with low cardinality indexes, since large groups of garbage
> duplicate tuples tend to get concentrated on just a few leaf pages.

We saw the issue on our PG11 clusters. The specific index we noticed in the WAL dump (I don't think we confirmed whether there were others) was one on a `created_at` column, to give you an idea of cardinality.

> > The test at the "found the item" comment never checks whether the
> > item is already dead. So we could add !ItemIdIsDead(iid) to that
> > condition. But even then, a race condition where multiple FPIs are
> > recorded can still happen. Maybe a better solution is to change the
> > lock to exclusive, at least when wal_log_hints = on, so that only
> > one process can run this code; the reduction in concurrency might be
> > won back by the fact that we don't WAL-log the page multiple times.
>
> I like the idea of checking !ItemIdIsDead(iid) as a further condition
> of killing the item -- there is clearly no point in doing work to kill
> an item that is already dead. I don't like the idea of using an
> exclusive buffer lock (even if it's just with wal_log_hints = on),
> though.

I don't have a strong opinion on the lock.

James
Re: Multiple FPI_FOR_HINT for the same block during killing btree index items
From
Peter Geoghegan
Date:
On Thu, Apr 9, 2020 at 1:37 PM James Coleman <jtc331@gmail.com> wrote:
> We saw the issue on our PG11 clusters. The specific index we noticed
> in the WAL dump (I don't think we confirmed whether there were others)
> was one on a `created_at` column, to give you an idea of cardinality.

You tend to get a lot of problems with indexes like that when there are consistent updates (actually, that's more of a thing with an updated_at index). But non-HOT updates alone might result in what you could describe as "updates" to the index.

With Postgres 11, a low cardinality index could place new/successor duplicate index tuples (those needed for non-HOT updates) on a more or less random leaf page (you'll recall that this is determined by the old "getting tired" logic). This is the kind of thing I had in mind when I asked Sawada-san about it.

Was this a low cardinality index in the way I describe? If it was, then we can hope (and maybe even verify) that the Postgres 12 work noticeably ameliorates the problem.

--
Peter Geoghegan
Re: Multiple FPI_FOR_HINT for the same block during killing btree index items
From
Peter Geoghegan
Date:
On Thu, Apr 9, 2020 at 5:25 PM Peter Geoghegan <pg@bowt.ie> wrote:
> Was this a low cardinality index in the way I describe? If it was,
> then we can hope (and maybe even verify) that the Postgres 12 work
> noticeably ameliorates the problem.

What I really meant was an index where hundreds or even thousands of rows for each distinct timestamp value are expected. Not an index where almost every row has a distinct timestamp value. Both timestamp index patterns are common, obviously.

--
Peter Geoghegan
Re: Multiple FPI_FOR_HINT for the same block during killing btree index items
From
James Coleman
Date:
On Thu, Apr 9, 2020 at 8:32 PM Peter Geoghegan <pg@bowt.ie> wrote:
> On Thu, Apr 9, 2020 at 5:25 PM Peter Geoghegan <pg@bowt.ie> wrote:
> > Was this a low cardinality index in the way I describe? If it was,
> > then we can hope (and maybe even verify) that the Postgres 12 work
> > noticeably ameliorates the problem.
>
> What I really meant was an index where hundreds or even thousands of
> rows for each distinct timestamp value are expected. Not an index
> where almost every row has a distinct timestamp value. Both timestamp
> index patterns are common, obviously.

I'll try to run some numbers tomorrow to confirm, but I believe that the created_at value is almost (if not completely) unique. So, no, it's not a low cardinality case like that.

I believe the write pattern to this table likely looks like:

- INSERT
- UPDATE
- DELETE

for every row. But tomorrow I can do some more digging if needed.

James
Re: Multiple FPI_FOR_HINT for the same block during killing btree index items
From
Peter Geoghegan
Date:
On Thu, Apr 9, 2020 at 6:47 PM James Coleman <jtc331@gmail.com> wrote:
> I believe the write pattern to this table likely looks like:
>
> - INSERT
> - UPDATE
> - DELETE
>
> for every row. But tomorrow I can do some more digging if needed.

The pg_stats.null_frac for the column/index might be interesting here. I believe that Active Record will sometimes generate created_at columns that end up containing NULL values. Not sure why.

--
Peter Geoghegan
Re: Multiple FPI_FOR_HINT for the same block during killing btree index items
From
Masahiko Sawada
Date:
On Fri, 10 Apr 2020 at 04:05, Peter Geoghegan <pg@bowt.ie> wrote:
> On Wed, Apr 8, 2020 at 10:56 PM Masahiko Sawada
> <masahiko.sawada@2ndquadrant.com> wrote:
> > Here is the reproducer:
>
> What version of Postgres did you notice the actual customer issue on?
> I ask because I wonder if the work on B-Tree indexes in Postgres 12
> affects the precise behavior you get here with real-world workloads.
> It probably makes _bt_killitems() more effective with some workloads,
> which naturally increases the likelihood of having multiple FPIs
> issued in the manner that you describe. OTOH, it might make it less
> likely with low cardinality indexes, since large groups of garbage
> duplicate tuples tend to get concentrated on just a few leaf pages.
>
> > The test at the "found the item" comment never checks whether the
> > item is already dead. So we could add !ItemIdIsDead(iid) to that
> > condition. But even then, a race condition where multiple FPIs are
> > recorded can still happen. Maybe a better solution is to change the
> > lock to exclusive, at least when wal_log_hints = on, so that only
> > one process can run this code; the reduction in concurrency might be
> > won back by the fact that we don't WAL-log the page multiple times.
>
> I like the idea of checking !ItemIdIsDead(iid) as a further condition
> of killing the item -- there is clearly no point in doing work to kill
> an item that is already dead. I don't like the idea of using an
> exclusive buffer lock (even if it's just with wal_log_hints = on),
> though.

Okay. I think only adding the check would also help with reducing the likelihood. How about the changes for the current HEAD I've attached?

Related to this behavior on btree indexes, this can happen even on heaps while searching for heap tuples. To reduce the likelihood of that more generally, I wonder if we can acquire a lock on the buffer descriptor right before XLogSaveBufferForHint() and set a flag in the buffer descriptor indicating that we're about to log an FPI for hint bits, so that concurrent processes can be aware of it.

Regards,

--
Masahiko Sawada
http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Attachment
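A rough sketch of that buffer-descriptor-flag idea, against the shape of MarkBufferDirtyHint() (hypothetical: BM_HINT_FPI_IN_PROGRESS does not exist in bufmgr today, a real patch would need a free flag bit in buf_internals.h, and clearing the flag on error paths is omitted):

    /* Hypothetical new buffer-state flag bit; the value is illustrative */
    #define BM_HINT_FPI_IN_PROGRESS  (1U << 21)

    if (XLogHintBitIsNeeded() &&
        (pg_atomic_read_u32(&bufHdr->state) & BM_PERMANENT))
    {
        uint32      buf_state = LockBufHdr(bufHdr);

        if (buf_state & BM_HINT_FPI_IN_PROGRESS)
        {
            /* Another backend is already logging an FPI for this page */
            UnlockBufHdr(bufHdr, buf_state);
        }
        else
        {
            /* Announce our FPI so concurrent backends can skip theirs */
            UnlockBufHdr(bufHdr, buf_state | BM_HINT_FPI_IN_PROGRESS);

            MyProc->delayChkpt = delayChkpt = true;
            lsn = XLogSaveBufferForHint(buffer, buffer_std);

            /* ... mark the buffer dirty, set the page LSN, clear the flag */
        }
    }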
Re: Multiple FPI_FOR_HINT for the same block during killing btree index items
From
Kyotaro Horiguchi
Date:
At Fri, 10 Apr 2020 12:32:31 +0900, Masahiko Sawada <masahiko.sawada@2ndquadrant.com> wrote in
> On Fri, 10 Apr 2020 at 04:05, Peter Geoghegan <pg@bowt.ie> wrote:
> > On Wed, Apr 8, 2020 at 10:56 PM Masahiko Sawada
> > <masahiko.sawada@2ndquadrant.com> wrote:
> > > Here is the reproducer:
> >
> > What version of Postgres did you notice the actual customer issue
> > on? I ask because I wonder if the work on B-Tree indexes in Postgres
> > 12 affects the precise behavior you get here with real-world
> > workloads. It probably makes _bt_killitems() more effective with
> > some workloads, which naturally increases the likelihood of having
> > multiple FPIs issued in the manner that you describe. OTOH, it might
> > make it less likely with low cardinality indexes, since large groups
> > of garbage duplicate tuples tend to get concentrated on just a few
> > leaf pages.
> >
> > > The test at the "found the item" comment never checks whether the
> > > item is already dead. So we could add !ItemIdIsDead(iid) to that
> > > condition. But even then, a race condition where multiple FPIs are
> > > recorded can still happen. Maybe a better solution is to change
> > > the lock to exclusive, at least when wal_log_hints = on, so that
> > > only one process can run this code; the reduction in concurrency
> > > might be won back by the fact that we don't WAL-log the page
> > > multiple times.
> >
> > I like the idea of checking !ItemIdIsDead(iid) as a further
> > condition of killing the item -- there is clearly no point in doing
> > work to kill an item that is already dead. I don't like the idea of
> > using an exclusive buffer lock (even if it's just with wal_log_hints
> > = on), though.
>
> Okay. I think only adding the check would also help with reducing the
> likelihood. How about the changes for the current HEAD I've attached?

FWIW, looks good to me.

> Related to this behavior on btree indexes, this can happen even on
> heaps while searching for heap tuples. To reduce the likelihood of
> that more generally, I wonder if we can acquire a lock on the buffer
> descriptor right before XLogSaveBufferForHint() and set a flag in the
> buffer descriptor indicating that we're about to log an FPI for hint
> bits, so that concurrent processes can be aware of it.

Makes sense if the lock were acquired just before the (BM_DIRTY | BM_JUST_DIRTIED) check. Could we use double-checking, similar to the patch for ItemIdIsDead()?

> if ((pg_atomic_read_u32(&bufHdr->state) & (BM_DIRTY | BM_JUST_DIRTIED)) !=
>     (BM_DIRTY | BM_JUST_DIRTIED))
> {
>     ...
>      * essential that CreateCheckpoint waits for virtual transactions
>      * rather than full transactionids.
>      */
>     /* blah, blah */
>     buf_state = LockBufHdr(bufHdr);
>
>     if (buf_state & (BM_ | BM_JUST) != (..))
>     {
>         MyProc->delayChkpt = delayChkpt = true;
>         lsn = XLogSaveBufferForHint(buffer, buffer_std);
>     }
> }
> else
>     buf_state = LockBuffer(bufHdr);

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center
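Spelled out, the double-checked pattern suggested here might look like the following (a sketch only; note that performing the WAL insertion while still holding the buffer header spinlock, as literally written above, would itself need rethinking, since spinlocks are meant to protect only short, straight-line code):

    /* Unlocked pre-check: only proceed if the page is not fully dirtied */
    if ((pg_atomic_read_u32(&bufHdr->state) & (BM_DIRTY | BM_JUST_DIRTIED)) !=
        (BM_DIRTY | BM_JUST_DIRTIED))
    {
        uint32      buf_state = LockBufHdr(bufHdr);

        /* Re-check under the buffer header lock */
        if ((buf_state & (BM_DIRTY | BM_JUST_DIRTIED)) !=
            (BM_DIRTY | BM_JUST_DIRTIED))
        {
            /* We won the race; emit the FPI_FOR_HINT ourselves */
            MyProc->delayChkpt = delayChkpt = true;
            lsn = XLogSaveBufferForHint(buffer, buffer_std);
        }

        UnlockBufHdr(bufHdr, buf_state);
    }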
Re: Multiple FPI_FOR_HINT for the same block during killing btree index items
From
James Coleman
Date:
On Thu, Apr 9, 2020 at 10:08 PM Peter Geoghegan <pg@bowt.ie> wrote:
> On Thu, Apr 9, 2020 at 6:47 PM James Coleman <jtc331@gmail.com> wrote:
> > I believe the write pattern to this table likely looks like:
> >
> > - INSERT
> > - UPDATE
> > - DELETE
> >
> > for every row. But tomorrow I can do some more digging if needed.
>
> The pg_stats.null_frac for the column/index might be interesting here.
> I believe that Active Record will sometimes generate created_at
> columns that end up containing NULL values. Not sure why.

null_frac is 0 for created_at (what I expected). Also (under current data) all created_at values are unique except a single duplicated row.

That being said, remember the write pattern above: every row gets deleted eventually, so there'd be lots of dead tuples overall.

James
Re: Multiple FPI_FOR_HINT for the same block during killing btree index items
From
Alvaro Herrera
Date:
On 2020-Apr-10, Masahiko Sawada wrote:

> Okay. I think only adding the check would also help with reducing the
> likelihood. How about the changes for the current HEAD I've attached?

Pushed this to all branches. (Branches 12 and older obviously needed an adjustment.) Thanks!

> Related to this behavior on btree indexes, this can happen even on
> heaps while searching for heap tuples. To reduce the likelihood of
> that more generally, I wonder if we can acquire a lock on the buffer
> descriptor right before XLogSaveBufferForHint() and set a flag in the
> buffer descriptor indicating that we're about to log an FPI for hint
> bits, so that concurrent processes can be aware of it.

I'm not sure how that helps; the other process would have to go back and redo their whole operation from scratch in order to find out whether there's still something alive that needs killing.

I think you need to acquire the exclusive lock sooner: if, when scanning the page, you find a killable item, *then* upgrade the lock to exclusive and restart the scan. This means that we'll have to wait for any other process that's doing the scan, and they will all give up their share lock to wait for the exclusive lock they need. So the one that gets it first will do all the killing, log the page, then release the lock. At that point the other processes will wake up and see that items have been killed, so they will return having done nothing.

Like the attached. I didn't verify that it works well or that it actually improves performance ...

--
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Attachment
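The shape of that lock-upgrade approach, as a rough sketch over the _bt_killitems() scan loop (illustrative only; the patch actually attached may differ, and a real implementation must also re-validate the page, e.g. via its LSN, after reacquiring the lock):

    bool        haveExclusive = false;

restart:
    for (i = 0; i < numKilled; i++)
    {
        /* ... locate iid/ituple for the item to kill, as before ... */

        if (!ItemIdIsDead(iid) &&
            ItemPointerEquals(&ituple->t_tid, &kitem->heapTid))
        {
            if (!haveExclusive && XLogHintBitIsNeeded())
            {
                /*
                 * Found something killable: trade our share lock for an
                 * exclusive one, then rescan, since another backend may
                 * have marked items dead while we waited for the lock.
                 */
                LockBuffer(so->currPos.buf, BUFFER_LOCK_UNLOCK);
                LockBuffer(so->currPos.buf, BUFFER_LOCK_EXCLUSIVE);
                haveExclusive = true;
                goto restart;
            }
            ItemIdMarkDead(iid);
            killedsomething = true;
        }
    }

If every killable item turns out to be dead already after the rescan, killedsomething stays false and no redundant FPI is emitted; that is exactly what the waiting backends end up doing.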
Re: Multiple FPI_FOR_HINT for the same block during killing btree index items
From
Ranier Vilela
Date:
On Fri, May 15, 2020 at 6:53 PM Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
> On 2020-Apr-10, Masahiko Sawada wrote:
>
> > Okay. I think only adding the check would also help with reducing the
> > likelihood. How about the changes for the current HEAD I've attached?
>
> Pushed this to all branches. (Branches 12 and older obviously needed
> an adjustment.) Thanks!
>
> > Related to this behavior on btree indexes, this can happen even on
> > heaps while searching for heap tuples. To reduce the likelihood of
> > that more generally, I wonder if we can acquire a lock on the buffer
> > descriptor right before XLogSaveBufferForHint() and set a flag in
> > the buffer descriptor indicating that we're about to log an FPI for
> > hint bits, so that concurrent processes can be aware of it.
>
> I'm not sure how that helps; the other process would have to go back
> and redo their whole operation from scratch in order to find out
> whether there's still something alive that needs killing.
>
> I think you need to acquire the exclusive lock sooner: if, when
> scanning the page, you find a killable item, *then* upgrade the lock
> to exclusive and restart the scan. This means that we'll have to wait
> for any other process that's doing the scan, and they will all give up
> their share lock to wait for the exclusive lock they need. So the one
> that gets it first will do all the killing, log the page, then release
> the lock. At that point the other processes will wake up and see that
> items have been killed, so they will return having done nothing.
>
> Like the attached. I didn't verify that it works well or that it
> actually improves performance ...
This is not related to your latest patch, but I believe I can improve the performance. So:

1. If killedsomething is false,
2. and for every killable tuple the (!ItemIdIsDead(iid)) test fails,
3. there is nothing to be done.

So why do all the work and then discard it? We can eliminate the current item much earlier, by testing whether it is already dead.

regards,
Ranier Vilela
Attachment
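Concretely, hoisting the dead-item test would mean skipping already-dead items before fetching the tuple at all, along these lines (a sketch against the loop quoted at the start of the thread):

    while (offnum <= maxoff)
    {
        ItemId      iid = PageGetItemId(page, offnum);
        IndexTuple  ituple;

        /* Already dead: no need to fetch or compare the tuple at all */
        if (ItemIdIsDead(iid))
        {
            offnum = OffsetNumberNext(offnum);
            continue;
        }

        ituple = (IndexTuple) PageGetItem(page, iid);
        if (ItemPointerEquals(&ituple->t_tid, &kitem->heapTid))
        {
            /* found the item */
            ItemIdMarkDead(iid);
            killedsomething = true;
            break;              /* out of inner search loop */
        }
        offnum = OffsetNumberNext(offnum);
    }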
Re: Multiple FPI_FOR_HINT for the same block during killing btree index items
From
Ranier Vilela
Date:
On Fri, May 15, 2020 at 6:53 PM Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
> On 2020-Apr-10, Masahiko Sawada wrote:
>
> > Okay. I think only adding the check would also help with reducing the
> > likelihood. How about the changes for the current HEAD I've attached?
>
> Pushed this to all branches. (Branches 12 and older obviously needed
> an adjustment.) Thanks!
>
> > Related to this behavior on btree indexes, this can happen even on
> > heaps while searching for heap tuples. To reduce the likelihood of
> > that more generally, I wonder if we can acquire a lock on the buffer
> > descriptor right before XLogSaveBufferForHint() and set a flag in
> > the buffer descriptor indicating that we're about to log an FPI for
> > hint bits, so that concurrent processes can be aware of it.
>
> I'm not sure how that helps; the other process would have to go back
> and redo their whole operation from scratch in order to find out
> whether there's still something alive that needs killing.
>
> I think you need to acquire the exclusive lock sooner: if, when
> scanning the page, you find a killable item, *then* upgrade the lock
> to exclusive and restart the scan. This means that we'll have to wait
> for any other process that's doing the scan, and they will all give up
> their share lock to wait for the exclusive lock they need. So the one
> that gets it first will do all the killing, log the page, then release
> the lock. At that point the other processes will wake up and see that
> items have been killed, so they will return having done nothing.
Regarding the lock, I disagree in part, because in the worst case the lock can be requested on the last item analyzed, leading to redoing all the work from the beginning. If we are in _bt_killitems, it is because there is a high probability that there will be items to delete, so why not request the lock right away, if these conditions are met?

1. XLogHintBitIsNeeded()
2. !AutoVacuumingActive()
3. A new configuration variable to opt into the exclusive lock?

Masahiko reported that it occurs only with (autovacuum_enabled = off).

regards,
Ranier Vilela