lazy_scan_heap() should release lock on buffer before vacuuming FSM

From: Melanie Plageman
Date: 2023-11-13 17:13:32 -0500
Hi,

I noticed that in lazy_scan_heap(), when there are no indexes on the
table being vacuumed, we don't release the lock on the heap page buffer
before vacuuming the freespace map. Other call sites of
FreeSpaceMapVacuumRange() hold no such lock. It seems like a waste to
hold a lock we don't need.

ISTM the fix (attached) is just to move down the call to
FreeSpaceMapVacuumRange() to after we've released the lock and recorded
the space we just freed.

- Melanie


Re: lazy_scan_heap() should release lock on buffer before vacuuming FSM

From: Andres Freund
Date: 2023-11-13 20:26 -0500
Hi,

On 2023-11-13 17:13:32 -0500, Melanie Plageman wrote:
> I noticed that in lazy_scan_heap(), when there are no indexes on the
> table being vacuumed, we don't release the lock on the heap page buffer
> before vacuuming the freespace map. Other call sites of
> FreeSpaceMapVacuumRange() hold no such lock. It seems like a waste to
> hold a lock we don't need.

I think this undersells the situation a bit. We right now do
FreeSpaceMapVacuumRange() for 8GB of data (VACUUM_FSM_EVERY_PAGES) in the main
fork, while holding an exclusive page level lock.  There's no guarantee (or
even high likelihood) that those pages are currently in the page cache, given
that we have processed up to 8GB of heap data since. 8GB of data is roughly
2MB of FSM with the default compilation options.

Of course processing 2MB of FSM doesn't take that long, but still, it seems
worse than just reading a page or two.
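
To make the arithmetic concrete, a back-of-the-envelope sketch (assuming
the default BLCKSZ of 8192 and roughly 4000 leaf slots per FSM page; the
exact SlotsPerFSMPage value depends on compile-time options):

    #include <stdio.h>

    int main(void)
    {
        long heap_pages = (8L << 30) / 8192;          /* 8GB of 8kB heap pages: 1048576 */
        long fsm_leaf_pages = heap_pages / 4000 + 1;  /* ~263 FSM leaf pages */

        printf("%ld heap pages -> %ld FSM leaf pages (~%ld kB)\n",
               heap_pages, fsm_leaf_pages, fsm_leaf_pages * 8192 / 1024);
        return 0;
    }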



> diff --git a/src/backend/access/heap/vacuumlazy.c b/src/backend/access/heap/vacuumlazy.c
> index 6985d299b2..8b729828ce 100644
> --- a/src/backend/access/heap/vacuumlazy.c
> +++ b/src/backend/access/heap/vacuumlazy.c
> @@ -1046,18 +1046,6 @@ lazy_scan_heap(LVRelState *vacrel)
>                  /* Forget the LP_DEAD items that we just vacuumed */
>                  dead_items->num_items = 0;
>  
> -                /*
> -                 * Periodically perform FSM vacuuming to make newly-freed
> -                 * space visible on upper FSM pages.  Note we have not yet
> -                 * performed FSM processing for blkno.
> -                 */
> -                if (blkno - next_fsm_block_to_vacuum >= VACUUM_FSM_EVERY_PAGES)
> -                {
> -                    FreeSpaceMapVacuumRange(vacrel->rel, next_fsm_block_to_vacuum,
> -                                            blkno);
> -                    next_fsm_block_to_vacuum = blkno;
> -                }
> -
>                  /*
>                   * Now perform FSM processing for blkno, and move on to next
>                   * page.
> @@ -1071,6 +1059,18 @@ lazy_scan_heap(LVRelState *vacrel)
>  
>                  UnlockReleaseBuffer(buf);
>                  RecordPageWithFreeSpace(vacrel->rel, blkno, freespace);
> +
> +                /*
> +                 * Periodically perform FSM vacuuming to make newly-freed
> +                 * space visible on upper FSM pages.
> +                 */
> +                if (blkno - next_fsm_block_to_vacuum >= VACUUM_FSM_EVERY_PAGES)
> +                {
> +                    FreeSpaceMapVacuumRange(vacrel->rel, next_fsm_block_to_vacuum,
> +                                            blkno);
> +                    next_fsm_block_to_vacuum = blkno;
> +                }
> +
>                  continue;
>              }

Previously there was this comment about "not yet", hinting at that being
important for FreeSpaceMapVacuumRange's API. I think as-is we now don't vacuum
the actually freshly updated page contents.

FreeSpaceMapVacuumRange()'s comment says:
 * As above, but assume that only heap pages between start and end-1 inclusive
 * have new free-space information, so update only the upper-level slots
 * covering that block range.  end == InvalidBlockNumber is equivalent to
 * "all the rest of the relation".

So FreeSpaceMapVacuumRange(..., blkno) will not actually process the "effects"
of the RecordPageWithFreeSpace() above it - which seems confusing.
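
To spell out the off-by-one (an illustrative sketch, not a proposed hunk):

    /*
     * "end" is exclusive: this updates upper-level slots only for heap
     * pages [next_fsm_block_to_vacuum, blkno - 1].  blkno itself - the
     * page whose free space RecordPageWithFreeSpace() just recorded -
     * is left out of the upper-level update.
     */
    FreeSpaceMapVacuumRange(vacrel->rel, next_fsm_block_to_vacuum, blkno);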



Aside:

I just tried to reach the path and noticed something odd:

=# show autovacuum;
┌────────────┐
│ autovacuum │
├────────────┤
│ off        │
└────────────┘
(1 row)

=# \dt+ copytest_0
                                      List of relations
┌────────┬────────────┬───────┬────────┬─────────────┬───────────────┬─────────┬─────────────┐
│ Schema │    Name    │ Type  │ Owner  │ Persistence │ Access method │  Size   │ Description │
├────────┼────────────┼───────┼────────┼─────────────┼───────────────┼─────────┼─────────────┤
│ public │ copytest_0 │ table │ andres │ permanent   │ heap          │ 1143 MB │             │
└────────┴────────────┴───────┴────────┴─────────────┴───────────────┴─────────┴─────────────┘

=# DELETE FROM copytest_0;
=# VACUUM (VERBOSE) copytest_0;
...
INFO:  00000: table "copytest_0": truncated 146264 to 110934 pages
...
tuples missed: 5848 dead from 89 pages not removed due to cleanup lock contention
...

A bit of debugging later I figured out that this is due to the background
writer. If I SIGSTOP bgwriter, the whole relation is truncated...

Greetings,

Andres Freund



Re: lazy_scan_heap() should release lock on buffer before vacuuming FSM

From: Melanie Plageman
Date: 2023-11-14 07:46:10 -0500
On Mon, Nov 13, 2023 at 8:26 PM Andres Freund <andres@anarazel.de> wrote:
> On 2023-11-13 17:13:32 -0500, Melanie Plageman wrote:
> > diff --git a/src/backend/access/heap/vacuumlazy.c b/src/backend/access/heap/vacuumlazy.c
> > index 6985d299b2..8b729828ce 100644
> > --- a/src/backend/access/heap/vacuumlazy.c
> > +++ b/src/backend/access/heap/vacuumlazy.c
> > @@ -1046,18 +1046,6 @@ lazy_scan_heap(LVRelState *vacrel)
> >                               /* Forget the LP_DEAD items that we just vacuumed */
> >                               dead_items->num_items = 0;
> >
> > -                             /*
> > -                              * Periodically perform FSM vacuuming to make newly-freed
> > -                              * space visible on upper FSM pages.  Note we have not yet
> > -                              * performed FSM processing for blkno.
> > -                              */
> > -                             if (blkno - next_fsm_block_to_vacuum >= VACUUM_FSM_EVERY_PAGES)
> > -                             {
> > -                                     FreeSpaceMapVacuumRange(vacrel->rel, next_fsm_block_to_vacuum,
> > -                                                                                     blkno);
> > -                                     next_fsm_block_to_vacuum = blkno;
> > -                             }
> > -
> >                               /*
> >                                * Now perform FSM processing for blkno, and move on to next
> >                                * page.
> > @@ -1071,6 +1059,18 @@ lazy_scan_heap(LVRelState *vacrel)
> >
> >                               UnlockReleaseBuffer(buf);
> >                               RecordPageWithFreeSpace(vacrel->rel, blkno, freespace);
> > +
> > +                             /*
> > +                              * Periodically perform FSM vacuuming to make newly-freed
> > +                              * space visible on upper FSM pages.
> > +                              */
> > +                             if (blkno - next_fsm_block_to_vacuum >= VACUUM_FSM_EVERY_PAGES)
> > +                             {
> > +                                     FreeSpaceMapVacuumRange(vacrel->rel, next_fsm_block_to_vacuum,
> > +                                                                                     blkno);
> > +                                     next_fsm_block_to_vacuum = blkno;
> > +                             }
> > +
> >                               continue;
> >                       }
>
> Previously there was this comment about "not yet", hinting at that being
> important for FreeSpaceMapVacuumRange's API. I think as-is we now don't vacuum
> the actually freshly updated page contents.
>
> FreeSpaceMapVacuumRange()'s comment says:
>  * As above, but assume that only heap pages between start and end-1 inclusive
>  * have new free-space information, so update only the upper-level slots
>  * covering that block range.  end == InvalidBlockNumber is equivalent to
>  * "all the rest of the relation".
>
> So FreeSpaceMapVacuumRange(..., blkno) will not actually process the "effects"
> of the RecordPageWithFreeSpace() above it - which seems confusing.

Ah, so shall I pass blkno + 1 as end?

> Aside:
>
> I just tried to reach the path and noticed something odd:
>
> =# show autovacuum;
> ┌────────────┐
> │ autovacuum │
> ├────────────┤
> │ off        │
> └────────────┘
> (1 row)
>
> =# \dt+ copytest_0
>                                       List of relations
> ┌────────┬────────────┬───────┬────────┬─────────────┬───────────────┬─────────┬─────────────┐
> │ Schema │    Name    │ Type  │ Owner  │ Persistence │ Access method │  Size   │ Description │
> ├────────┼────────────┼───────┼────────┼─────────────┼───────────────┼─────────┼─────────────┤
> │ public │ copytest_0 │ table │ andres │ permanent   │ heap          │ 1143 MB │             │
> └────────┴────────────┴───────┴────────┴─────────────┴───────────────┴─────────┴─────────────┘
>
> =# DELETE FROM copytest_0;
> =# VACUUM (VERBOSE) copytest_0;
> ...
> INFO:  00000: table "copytest_0": truncated 146264 to 110934 pages
> ...
> tuples missed: 5848 dead from 89 pages not removed due to cleanup lock contention
> ...
>
> A bit of debugging later I figured out that this is due to the background
> writer. If I SIGSTOP bgwriter, the whole relation is truncated...

That's a bit sad. But isn't that what you would expect bgwriter to do?
Write out dirty buffers? It doesn't know that those pages consist of
only dead tuples and that you will soon truncate them. Or are you
suggesting that bgwriter should do something like use the FSM to avoid
flushing pages which have a lot of free space? Would the FSM even be
updated in this case to reflect that those pages have free space?

- Melanie

Re: lazy_scan_heap() should release lock on buffer before vacuuming FSM

From: Andres Freund
Date: 2023-11-14 19:15 -0500
Hi,

On 2023-11-14 07:46:10 -0500, Melanie Plageman wrote:
> > FreeSpaceMapVacuumRange()'s comment says:
> >  * As above, but assume that only heap pages between start and end-1 inclusive
> >  * have new free-space information, so update only the upper-level slots
> >  * covering that block range.  end == InvalidBlockNumber is equivalent to
> >  * "all the rest of the relation".
> >
> > So FreeSpaceMapVacuumRange(..., blkno) will not actually process the "effects"
> > of the RecordPageWithFreeSpace() above it - which seems confusing.
>
> Ah, so shall I pass blkno + 1 as end?

I think there's no actual overflow danger, because MaxBlockNumber + 1 is
InvalidBlockNumber, which scans the rest of the relation (i.e. exactly the
intended block). Perhaps worth noting?
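
Concretely, something like the following sketch (the final shape of the
patch may differ in detail):

    /*
     * Passing blkno + 1 as "end" makes the range cover blkno itself.  In
     * the boundary case blkno == MaxBlockNumber, blkno + 1 wraps to
     * InvalidBlockNumber, which FreeSpaceMapVacuumRange() treats as "all
     * the rest of the relation" - i.e. exactly the intended block.
     */
    FreeSpaceMapVacuumRange(vacrel->rel, next_fsm_block_to_vacuum, blkno + 1);
    next_fsm_block_to_vacuum = blkno;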



> > =# DELETE FROM copytest_0;
> > =# VACUUM (VERBOSE) copytest_0;
> > ...
> > INFO:  00000: table "copytest_0": truncated 146264 to 110934 pages
> > ...
> > tuples missed: 5848 dead from 89 pages not removed due to cleanup lock contention
> > ...
> >
> > A bit of debugging later I figured out that this is due to the background
> > writer. If I SIGSTOP bgwriter, the whole relation is truncated...
>
> That's a bit sad. But isn't that what you would expect bgwriter to do?

I mainly noted this so that if somebody else tries this they don't also spend
30 minutes being confused. I'm not quite sure what a good solution here would
be.


> Write out dirty buffers? It doesn't know that those pages consist of
> only dead tuples and that you will soon truncate them.

I think the issue is more that it feels wrong that a pin by bgwriter blocks
vacuum from cleaning up. I think the same happens with on-access pruning - where
it's more likely for bgwriter to focus on those pages. Checkpointer likely
causes the same. Even normal backends can cause this while writing out the
page.

It's not like bgwriter/checkpointer would actually have a problem if the page
were pruned - the page is locked while being written out and neither keeps
pointers into the page after unlocking it again.


At this point I started to wonder if we should invent a separate type of pin
for a buffer undergoing IO. We basically have the necessary state already:
BM_IO_IN_PROGRESS. We'd need to look at that in some places, e.g. in
InvalidateBuffer(), instead of BUF_STATE_GET_REFCOUNT(), we'd need to also
look at BM_IO_IN_PROGRESS.


Except of course that that'd not change anything -
ConditionalLockBufferForCleanup() locks the buffer conditionally, *before*
even looking at the refcount, and returns false if it can't get the lock.
And writing out a
buffer takes a content lock.  Which made me realize that
  "tuples missed: 5848 dead from 89 pages not removed due to cleanup lock contention"

is often kinda wrong - the contention is *not* cleanup lock specific. It's
often just plain contention on the lwlock.


Perhaps we should just treat IO_IN_PROGRESS buffers differently in
lazy_scan_heap() and heap_page_prune_opt() and wait?
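
Very roughly, and with invented helper names - neither
BufferIsWriteInProgress() nor WaitForBufferIO() exists in this form - the
idea would be something like:

    bool got_cleanup_lock = ConditionalLockBufferForCleanup(buf);

    /*
     * Hypothetical: if the cleanup lock is unavailable only because the
     * buffer is being written out (BM_IO_IN_PROGRESS), wait for the I/O
     * to complete and retry, instead of counting the page as skipped due
     * to cleanup lock contention.
     */
    if (!got_cleanup_lock && BufferIsWriteInProgress(buf))
    {
        WaitForBufferIO(buf);
        got_cleanup_lock = ConditionalLockBufferForCleanup(buf);
    }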


Greetings,

Andres Freund



Re: lazy_scan_heap() should release lock on buffer before vacuuming FSM

From: Melanie Plageman
Date: 2023-11-15 16:21:45 -0500
On Tue, Nov 14, 2023 at 7:15 PM Andres Freund <andres@anarazel.de> wrote:
>
> Hi,
>
> On 2023-11-14 07:46:10 -0500, Melanie Plageman wrote:
> > > FreeSpaceMapVacuumRange()'s comment says:
> > >  * As above, but assume that only heap pages between start and end-1 inclusive
> > >  * have new free-space information, so update only the upper-level slots
> > >  * covering that block range.  end == InvalidBlockNumber is equivalent to
> > >  * "all the rest of the relation".
> > >
> > > So FreeSpaceMapVacuumRange(..., blkno) will not actually process the "effects"
> > > of the RecordPageWithFreeSpace() above it - which seems confusing.
> >
> > Ah, so shall I pass blkno + 1 as end?
>
> I think there's no actual overflow danger, because MaxBlockNumber + 1 is
> InvalidBlockNumber, which scans the rest of the relation (i.e. exactly the
> intended block). Perhaps worth noting?

Attached

> > > =# DELETE FROM copytest_0;
> > > =# VACUUM (VERBOSE) copytest_0;
> > > ...
> > > INFO:  00000: table "copytest_0": truncated 146264 to 110934 pages
> > > ...
> > > tuples missed: 5848 dead from 89 pages not removed due to cleanup lock contention
> > > ...
> > >
> > > A bit of debugging later I figured out that this is due to the background
> > > writer. If I SIGSTOP bgwriter, the whole relation is truncated...
> >
> > That's a bit sad. But isn't that what you would expect bgwriter to do?
> > Write out dirty buffers? It doesn't know that those pages consist of
> > only dead tuples and that you will soon truncate them.
>
> I think the issue is more that it feels wrong that a pin by bgwriter blocks
> vacuum from cleaning up. I think the same happens with on-access pruning - where
> it's more likely for bgwriter to focus on those pages. Checkpointer likely
> causes the same. Even normal backends can cause this while writing out the
> page.
>
> It's not like bgwriter/checkpointer would actually have a problem if the page
> were pruned - the page is locked while being written out and neither keeps
> pointers into the page after unlocking it again.
>
>
> At this point I started to wonder if we should invent a separate type of pin
> for a buffer undergoing IO. We basically have the necessary state already:
> BM_IO_IN_PROGRESS. We'd need to look at that in some places, e.g. in
> InvalidateBuffer(), instead of BUF_STATE_GET_REFCOUNT(), we'd need to also
> look at BM_IO_IN_PROGRESS.
>
>
> Except of course that that'd not change anything -
> ConditionalLockBufferForCleanup() locks the buffer conditionally, *before*
> even looking at the refcount, and returns false if it can't get the lock.
> And writing out a
> buffer takes a content lock.  Which made me realize that
>   "tuples missed: 5848 dead from 89 pages not removed due to cleanup lock contention"
>
> is often kinda wrong - the contention is *not* cleanup lock specific. It's
> often just plain contention on the lwlock.

Do you think we should change the error message?

> Perhaps we should just treat IO_IN_PROGRESS buffers differently in
> lazy_scan_heap() and heap_page_prune_opt() and wait?

Hmm. This is an interesting idea. lazy_scan_heap() waiting for
checkpointer to write out a buffer could have an interesting property
of shifting who writes out the FPI. I wonder if it would matter.

- Melanie


Re: lazy_scan_heap() should release lock on buffer before vacuuming FSM

From: Robert Haas
Date: 2023-11-15 16:32:48 -0500
On Mon, Nov 13, 2023 at 8:26 PM Andres Freund <andres@anarazel.de> wrote:
> I think this undersells the situation a bit. We right now do
> FreeSpaceMapVacuumRange() for 8GB of data (VACUUM_FSM_EVERY_PAGES) in the main
> fork, while holding an exclusive page level lock.

That sounds fairly horrific?

--
Robert Haas
EDB: http://www.enterprisedb.com



Re: lazy_scan_heap() should release lock on buffer before vacuuming FSM

From: Andres Freund
Date: 2023-11-15 18:17 -0500
Hi,

On 2023-11-15 16:32:48 -0500, Robert Haas wrote:
> On Mon, Nov 13, 2023 at 8:26 PM Andres Freund <andres@anarazel.de> wrote:
> > I think this undersells the situation a bit. We right now do
> > FreeSpaceMapVacuumRange() for 8GB of data (VACUUM_FSM_EVERY_PAGES) in the main
> > fork, while holding an exclusive page level lock.
> 
> That sounds fairly horrific?

It's decidedly not great, indeed. I couldn't come up with a clear risk of
deadlock, but I wouldn't want to bet against there being a deadlock risk.

I think the rarity of it does ameliorate the performance issues to some
degree.

Thoughts on whether to backpatch? It'd probably be at least a bit painful,
there have been a lot of changes in the surrounding code in the last 5 years.

- Andres



Re: lazy_scan_heap() should release lock on buffer before vacuuming FSM

From: Robert Haas
Date: 2023-11-16 15:29:38 -0500
On Wed, Nov 15, 2023 at 6:17 PM Andres Freund <andres@anarazel.de> wrote:
> Thoughts on whether to backpatch? It'd probably be at least a bit painful,
> there have been a lot of changes in the surrounding code in the last 5 years.

I guess the main point that I'd make here is that we shouldn't
back-patch because it's wrong, but because of whatever consequences it
being wrong has. If somebody demonstrates that a deadlock occurs, or
that a painfully long stall can be constructed on a somewhat realistic
test case, then I think we should back-patch. If it's just something
that we look at and by visual inspection say "wow, that looks awful,"
that is not a sufficient reason to back-patch in my view. Such a
back-patch would still have known risk, but no known reward.

Until just now, I hadn't quite absorbed the fact that this only
affected the indexes == 0 case; that case is probably extremely rare
in real life. It's possible that accounts for why this hasn't caused
more trouble. But there could also be reasons why, even if you do have
that use case, this tends not to be too serious.

--
Robert Haas
EDB: http://www.enterprisedb.com



Re: lazy_scan_heap() should release lock on buffer before vacuuming FSM

From: Andres Freund
Date: 2023-11-16 15:49 -0500
Hi,

On 2023-11-16 15:29:38 -0500, Robert Haas wrote:
> On Wed, Nov 15, 2023 at 6:17 PM Andres Freund <andres@anarazel.de> wrote:
> > Thoughts on whether to backpatch? It'd probably be at least a bit painful,
> > there have been a lot of changes in the surrounding code in the last 5 years.
> 
> I guess the main point that I'd make here is that we shouldn't
> back-patch because it's wrong, but because of whatever consequences it
> being wrong has. If somebody demonstrates that a deadlock occurs, or
> that a painfully long stall can be constructed on a somewhat realistic
> test case, then I think we should back-patch.

Yea, that'd make it easy :)


> If it's just something that we look at and by visual inspection say "wow,
> that looks awful," that is not a sufficient reason to back-patch in my
> view. Such a back-patch would still have known risk, but no known reward.


> Until just now, I hadn't quite absorbed the fact that this only
> affected the indexes == 0 case; that case is probably extremely rare
> in real life. It's possible that accounts for why this hasn't caused
> more trouble. But there could also be reasons why, even if you do have
> that use case, this tends not to be too serious.

I think the main reason it's not all that bad, even when hitting this path, is
that one stall every 8GB just isn't that much and that the stalls aren't that
long - the leaf page fsm updates don't use the strategy, so they're still
somewhat likely to be in s_b, and there's "just" ~2MB of FSM to read. I tried
to reproduce it here, and it was a few ms, even though I dropped filesystem
caches in a loop.

Greetings,

Andres Freund



Re: lazy_scan_heap() should release lock on buffer before vacuuming FSM

From: Robert Haas
Date:
On Thu, Nov 16, 2023 at 3:49 PM Andres Freund <andres@anarazel.de> wrote:
> I think the main reason it's not all that bad, even when hitting this path, is
> that one stall every 8GB just isn't that much and that the stalls aren't that
> long - the leaf page fsm updates don't use the strategy, so they're still
> somewhat likely to be in s_b, and there's "just" ~2MB of FSM to read. I tried
> to reproduce it here, and it was a few ms, even though I dropped filesystem
> caches in a loop.

So just fix it in master then. If it turns out later there are worse
scenarios, you can back-patch then.

--
Robert Haas
EDB: http://www.enterprisedb.com



Re: lazy_scan_heap() should release lock on buffer before vacuuming FSM

From: Melanie Plageman
Date:
On Thu, Nov 16, 2023 at 3:29 PM Robert Haas <robertmhaas@gmail.com> wrote:
>
> On Wed, Nov 15, 2023 at 6:17 PM Andres Freund <andres@anarazel.de> wrote:
> > Thoughts on whether to backpatch? It'd probably be at least a bit painful,
> > there have been a lot of changes in the surrounding code in the last 5 years.
>
> I guess the main point that I'd make here is that we shouldn't
> back-patch because it's wrong, but because of whatever consequences it
> being wrong has. If somebody demonstrates that a deadlock occurs, or
> that a painfully long stall can be constructed on a somewhat realistic
> test case, then I think we should back-patch. If it's just something
> that we look at and by visual inspection say "wow, that looks awful,"
> that is not a sufficient reason to back-patch in my view. Such a
> back-patch would still have known risk, but no known reward.

This reasoning makes sense, and I hadn't really thought of it that way.
I was going to offer to take a stab at producing some back patch sets,
but, given this rationale and Andres' downthread agreement and
analysis, it sounds like there is no reason to do so. Thanks for
thinking about my bug report!

- Melanie



Re: lazy_scan_heap() should release lock on buffer before vacuuming FSM

From: Andres Freund
Date:
Hi,

On 2023-11-15 16:21:45 -0500, Melanie Plageman wrote:
> On Tue, Nov 14, 2023 at 7:15 PM Andres Freund <andres@anarazel.de> wrote:
> > On 2023-11-14 07:46:10 -0500, Melanie Plageman wrote:
> > > > FreeSpaceMapVacuumRange()'s comment says:
> > > >  * As above, but assume that only heap pages between start and end-1 inclusive
> > > >  * have new free-space information, so update only the upper-level slots
> > > >  * covering that block range.  end == InvalidBlockNumber is equivalent to
> > > >  * "all the rest of the relation".
> > > >
> > > > So FreeSpaceMapVacuumRange(..., blkno) will not actually process the "effects"
> > > > of the RecordPageWithFreeSpace() above it - which seems confusing.
> > >
> > > Ah, so shall I pass blkno + 1 as end?
> >
> > I think there's no actual overflow danger, because MaxBlockNumber + 1 is
> > InvalidBlockNumber, which scans the rest of the relation (i.e. exactly the
> > intended block). Perhaps worth noting?
> 
> Attached

And pushed! Thanks for the report and fix!

Greetings,

Andres Freund