Thread: Avoiding pin scan during btree vacuum
During VACUUM of btrees, we need to pin all pages, even those where tuples are not removed, which I am calling here the "pin scan". This is especially a problem during redo, where removing one tuple from a 100GB btree can take a minute or more. That causes replication lags. Bad Thing.
Previously I have suggested ways of optimizing that, and the code comments reflect those suggestions. Here I would like to look at removing the pin scan entirely, on a standby only.
In www.postgresql.org/message-id/flat/721615179.3351449.1423959585771.JavaMail.yahoo@mail.yahoo.com, the focus was on removing pins from btrees.
In the commit message for that thread/patch, http://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=2ed5b87f96d473962ec5230fd820abfeaccb2069
we see that there are some preconditions to altering the locking.
"This patch leaves behavior unchanged for some cases, which can be
addressed separately so that each case can be evaluated on its own
merits. These unchanged cases are when a scan uses a non-MVCC
snapshot, an index-only scan, and a scan of a btree index for which
modifications are not WAL-logged. If later patches allow all of
these cases to drop the buffer pin after reading a leaf page, then
the btree vacuum process can be simplified; it will no longer need
the "super-exclusive" lock to delete tuples from a page."
The cases for master and standby are different. The standby case is simpler, yet more important to change, since there the pin scan happens in the foreground.
Looking just at standbys, we see there are three cases:
* non-WAL logged indexes - does not apply on a standby, so ignore
* non-MVCC snapshot - looks like only TOAST scans are a problem on standbys
* index only scans (IOS) - the analysis of which looks wrong to me...
An IOS always checks visibility before using a tuple. If a tuple is about to be removed by VACUUM then it is already dead, and the visibility map bit for its heap page will already be clear (not all-visible), so the scan must recheck the tuple against the heap. Hence any tuple being removed by vacuum can never cause a problem for an IOS, and the locking interactions are not required, at least on standbys, for normal tables.
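To spell that argument out, here is a minimal standalone sketch of the decision an index-only scan makes. This is a simplified model rather than the actual executor code, and the helper names are hypothetical:

    #include <stdbool.h>

    typedef unsigned int BlockNumber;

    /* Hypothetical stand-ins for the visibility map and heap visibility checks. */
    extern bool vm_page_all_visible(BlockNumber heap_block);
    extern bool heap_tuple_visible_to_snapshot(BlockNumber heap_block, int offset);

    /*
     * Simplified model: may an index-only scan use the index tuple without
     * visiting the heap?  A tuple that VACUUM is about to remove is dead, so
     * its heap page cannot be marked all-visible; the scan therefore falls
     * through to the heap visibility check and never returns that tuple.
     */
    static bool
    ios_tuple_usable(BlockNumber heap_block, int offset)
    {
        if (vm_page_all_visible(heap_block))
            return true;    /* whole page visible: heap not touched */

        /* Page not all-visible: must recheck the heap tuple itself. */
        return heap_tuple_visible_to_snapshot(heap_block, offset);
    }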
So it looks like we can skip the "pin scan" during redo unless we are vacuuming a toast index.
Patch attached.
Notice that the patch does not slacken the requirement to super-exclusive-lock the block from which tuples are being removed. The only thing it does is skip past the requirement to pin each of the intervening blocks where nothing has happened.
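For illustration, here is a minimal standalone sketch of that redo-side behaviour, assuming the mechanism of logging InvalidBlockNumber as lastBlockVacuumed whenever the pin scan can be skipped. This is a simplified model with hypothetical helper names, not the actual btree_xlog_vacuum code:

    typedef unsigned int BlockNumber;
    #define InvalidBlockNumber    ((BlockNumber) 0xFFFFFFFF)
    #define BlockNumberIsValid(b) ((b) != InvalidBlockNumber)

    /* Hypothetical stand-ins for the redo-time buffer accesses. */
    extern void cleanup_lock_block(BlockNumber blk);  /* pin + super-exclusive lock */
    extern void remove_dead_tuples(BlockNumber blk);  /* apply the logged deletions */

    /*
     * Simplified model of replaying one btree VACUUM record on a standby.
     * Only when the record carries a valid lastBlockVacuumed (the toast-index
     * case) do we still walk and pin every intervening block; otherwise we go
     * straight to the block actually being changed, which keeps its
     * super-exclusive lock requirement either way.
     */
    void
    replay_btree_vacuum(BlockNumber lastBlockVacuumed, BlockNumber target)
    {
        BlockNumber blk;

        if (BlockNumberIsValid(lastBlockVacuumed))
        {
            /* the "pin scan": cleanup-lock every intervening block */
            for (blk = lastBlockVacuumed + 1; blk < target; blk++)
                cleanup_lock_block(blk);
        }

        cleanup_lock_block(target);
        remove_dead_tuples(target);
    }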
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Simon Riggs wrote:

> During VACUUM of btrees, we need to pin all pages, even those where tuples
> are not removed, which I am calling here the "pin scan". This is especially
> a problem during redo, where removing one tuple from a 100GB btree can take
> a minute or more. That causes replication lags. Bad Thing.

Agreed on there being a problem here.

As a reminder, this "pin scan" is implemented in the
HotStandbyActiveInReplay() block in btree_xlog_vacuum; this routine is in
charge of replaying an action recorded by _bt_delitems_vacuum. That replay
works by acquiring cleanup lock on all btree blocks from lastBlockVacuumed
to "this one" (i.e. the one in which elements are being removed). In
essence, this means pinning *all* the blocks in the index, which is bad.

The new code sets lastBlockVacuumed to Invalid; a new test in the replay
code makes that value not pin anything. This is supposed to optimize the
case in question.

One thing that this patch should update is the comment above struct
xl_btree_vacuum, which currently states that EnsureBlockUnpinned() is one
of the options to apply to each block, but that routine doesn't exist.

One possible naive optimization is to avoid locking pages that aren't leaf
pages, but that would still require fetching the block from disk, so it
wouldn't actually optimize anything other than the cleanup lock
acquisition. (And probably not too useful, since leaf pages are said to be
95% - 99% of index pages anyway.)

Also of interest is the comment above the call to _bt_delitems_vacuum in
btvacuumpage(); that needs an update too.

It appears to me that your patch changes the call in btvacuumscan() but not
the one in btvacuumpage(). I assume you should be changing both.

I think the new comment that talks about Toast Index should explain *why*
we can skip the pinning in all cases except that one, instead of just
saying we can do it.

--
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Alvaro Herrera <alvherre@2ndquadrant.com> writes:

> I think the new comment that talks about Toast Index should explain
> *why* we can skip the pinning in all cases except that one, instead of
> just saying we can do it.

I've not looked at that code in a long while, but my recollection is that
it's designed that way to protect non-MVCC catalog scans, which are gone
now ... except for SnapshotToast.

Maybe the right way to approach this is to adjust HeapTupleSatisfiesToast
(or maybe just convince ourselves that no one could be dereferencing a
stale toast pointer in the first place?) and then redesign btree vacuuming
without the requirement to support non-MVCC scans, period.

			regards, tom lane
On 21 December 2015 at 21:28, Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
> Simon Riggs wrote:
>> During VACUUM of btrees, we need to pin all pages, even those where tuples
>> are not removed, which I am calling here the "pin scan". This is especially
>> a problem during redo, where removing one tuple from a 100GB btree can take
>> a minute or more. That causes replication lags. Bad Thing.
> Agreed on there being a problem here.
>
> As a reminder, this "pin scan" is implemented in the
> HotStandbyActiveInReplay() block in btree_xlog_vacuum; this routine is
> in charge of replaying an action recorded by _bt_delitems_vacuum. That
> replay works by acquiring cleanup lock on all btree blocks from
> lastBlockVacuumed to "this one" (i.e. the one in which elements are
> being removed). In essence, this means pinning *all* the blocks in the
> index, which is bad.
>
> The new code sets lastBlockVacuumed to Invalid; a new test in the replay
> code makes that value not pin anything. This is supposed to optimize
> the case in question.
Nice summary, thanks.
> One thing that this patch should update is the comment above struct
> xl_btree_vacuum, which currently states that EnsureBlockUnpinned() is one
> of the options to apply to each block, but that routine doesn't exist.
Updated
> One possible naive optimization is to avoid locking pages that aren't
> leaf pages, but that would still require fetching the block from disk,
> so it wouldn't actually optimize anything other than the cleanup lock
> acquisition. (And probably not too useful, since leaf pages are said to
> be 95% - 99% of index pages anyway.)
Agreed, not worth it.
> Also of interest is the comment above the call to _bt_delitems_vacuum in
> btvacuumpage(); that needs an update too.
Updated
> It appears to me that your patch changes the call in btvacuumscan() but
> not the one in btvacuumpage(). I assume you should be changing both.
Yes, that was correct. Updated.
> I think the new comment that talks about Toast Index should explain
> *why* we can skip the pinning in all cases except that one, instead of
> just saying we can do it.
Greatly expanded comments.
Thanks for the review.
New patch attached.
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On 21 December 2015 at 21:36, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Alvaro Herrera <alvherre@2ndquadrant.com> writes:
>> I think the new comment that talks about Toast Index should explain
>> *why* we can skip the pinning in all cases except that one, instead of
>> just saying we can do it.
> I've not looked at that code in a long while, but my recollection is
> that it's designed that way to protect non-MVCC catalog scans, which
> are gone now ... except for SnapshotToast.
Yes, that's the principle in use here. Which means we can optimize away the scan on a Hot Standby in all cases except for Toast Index vacuums.
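To illustrate, here is a minimal sketch of the master-side decision that follows from that principle. The function and parameter names are hypothetical (in the real patch the distinction is made where the WAL record for _bt_delitems_vacuum is built), but the idea is that only a toast index keeps logging a real lastBlockVacuumed:

    #include <stdbool.h>

    typedef unsigned int BlockNumber;
    #define InvalidBlockNumber   ((BlockNumber) 0xFFFFFFFF)

    /*
     * Simplified model: pick the lastBlockVacuumed value to record in WAL.
     * Toast indexes are read under a non-MVCC snapshot, so they are the one
     * case where the standby must still perform the pin scan; every other
     * index logs InvalidBlockNumber, which the redo code treats as "skip it".
     */
    BlockNumber
    choose_last_block_vacuumed(bool is_toast_index, BlockNumber last_block_scanned)
    {
        if (is_toast_index)
            return last_block_scanned;   /* keep the old behaviour on replay */

        return InvalidBlockNumber;       /* standby skips the pin scan */
    }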
> Maybe the right way to
> approach this is to adjust HeapTupleSatisfiesToast (or maybe just
> convince ourselves that no one could be dereferencing a stale toast
> pointer in the first place?) and then redesign btree vacuuming without
> the requirement to support non-MVCC scans, period.
We could, but it's likely to be a much more complex patch.
I'm happy with this being a simple patch now, not least because I would like to backpatch this to 9.4 where catalog scans became MVCC.
A backpatch is warranted because it is a severe performance issue with replication and we can fix that before 9.5 hits the streets.
I'll be doing some more testing and checking, so not in a rush.
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On 2016-01-03 15:40:01 +0000, Simon Riggs wrote:

> I'm happy with this being a simple patch now, not least because I would
> like to backpatch this to 9.4 where catalog scans became MVCC.
>
> A backpatch is warranted because it is a severe performance issue with
> replication and we can fix that before 9.5 hits the streets.
>
> I'll be doing some more testing and checking, so not in a rush.

This seems like a mighty subtle thing to backpatch. If we really want to
go there, ISTM that the relevant code should stew in an unreleased branch
for a while, before being backpatched.

Andres
Andres Freund <andres@anarazel.de> writes:

> On 2016-01-03 15:40:01 +0000, Simon Riggs wrote:
>> I'm happy with this being a simple patch now, not least because I would
>> like to backpatch this to 9.4 where catalog scans became MVCC.
>>
>> A backpatch is warranted because it is a severe performance issue with
>> replication and we can fix that before 9.5 hits the streets.
>>
>> I'll be doing some more testing and checking, so not in a rush.

> This seems like a mighty subtle thing to backpatch. If we really want to
> go there, ISTM that the relevant code should stew in an unreleased
> branch for a while, before being backpatched.

I'm definitely -1 on back-patching such a thing. Put it in HEAD for
awhile. If it survives six months or so then we could discuss it again.

			regards, tom lane
On Mon, Jan 4, 2016 at 10:30 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

>> This seems like a mighty subtle thing to backpatch. If we really want to
>> go there, ISTM that the relevant code should stew in an unreleased
>> branch for a while, before being backpatched.
>
> I'm definitely -1 on back-patching such a thing. Put it in HEAD for
> awhile. If it survives six months or so then we could discuss it again.

I agree with Tom.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Robert Haas wrote:

> On Mon, Jan 4, 2016 at 10:30 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> >> This seems like a mighty subtle thing to backpatch. If we really want to
> >> go there, ISTM that the relevant code should stew in an unreleased
> >> branch for a while, before being backpatched.
> >
> > I'm definitely -1 on back-patching such a thing. Put it in HEAD for
> > awhile. If it survives six months or so then we could discuss it again.
>
> I agree with Tom.

Okay, several months have passed with this in the development branch and
now seems a good time to backpatch this all the way back to 9.4.

Any objections?

--
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Thu, Oct 20, 2016 at 4:00 AM, Alvaro Herrera <alvherre@2ndquadrant.com> wrote:

> Robert Haas wrote:
>> On Mon, Jan 4, 2016 at 10:30 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> >> This seems like a mighty subtle thing to backpatch. If we really want to
>> >> go there, ISTM that the relevant code should stew in an unreleased
>> >> branch for a while, before being backpatched.
>> >
>> > I'm definitely -1 on back-patching such a thing. Put it in HEAD for
>> > awhile. If it survives six months or so then we could discuss it again.
>>
>> I agree with Tom.
>
> Okay, several months have passed with this in the development branch and
> now seems a good time to backpatch this all the way back to 9.4.

Are you talking about commit 3e4b7d87988f0835f137f15f5c1a40598dd21f3d? If
yes, do we want to retain this code in its current form under define
UNUSED, and is there any advantage to keeping it?

Another point: won't this commit make the information in xl_btree_vacuum
redundant, and if so shouldn't we remove its usage during WAL writing as
well?

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
On Wed, Oct 19, 2016 at 6:30 PM, Alvaro Herrera <alvherre@2ndquadrant.com> wrote:

> Robert Haas wrote:
>> On Mon, Jan 4, 2016 at 10:30 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> >> This seems like a mighty subtle thing to backpatch. If we really want to
>> >> go there, ISTM that the relevant code should stew in an unreleased
>> >> branch for a while, before being backpatched.
>> >
>> > I'm definitely -1 on back-patching such a thing. Put it in HEAD for
>> > awhile. If it survives six months or so then we could discuss it again.
>>
>> I agree with Tom.
>
> Okay, several months have passed with this in the development branch and
> now seems a good time to backpatch this all the way back to 9.4.
>
> Any objections?

Although the code has now been in the tree for six months, it's only been
in a released branch for three weeks, which is something about which to
think.

I guess what's really bothering me is that I don't think this is a bug
fix. It seems like a performance optimization. And I think that we
generally have a policy that we don't back-patch performance
optimizations. Of course, there is some fuzziness because when the
performance gets sufficiently bad, we sometimes decide that it amounts to
a bug, as in 73cc7d3b240e1d46b1996382e5735a820f8bc3f7. Maybe this case
qualifies for similar treatment, but I'm not sure.

Why are you talking about back-patching this to 9.4 rather than all
supported branches?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Robert Haas wrote:

> On Wed, Oct 19, 2016 at 6:30 PM, Alvaro Herrera
> <alvherre@2ndquadrant.com> wrote:
> > Robert Haas wrote:
> >> On Mon, Jan 4, 2016 at 10:30 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> >> >> This seems like a mighty subtle thing to backpatch. If we really want to
> >> >> go there, ISTM that the relevant code should stew in an unreleased
> >> >> branch for a while, before being backpatched.
> >> >
> >> > I'm definitely -1 on back-patching such a thing. Put it in HEAD for
> >> > awhile. If it survives six months or so then we could discuss it again.
> >>
> >> I agree with Tom.
> >
> > Okay, several months have passed with this in the development branch and
> > now seems a good time to backpatch this all the way back to 9.4.
> >
> > Any objections?
>
> Although the code has now been in the tree for six months, it's only
> been in a released branch for three weeks, which is something about
> which to think.

The objection above was "stew in an unreleased branch", which it has.

> I guess what's really bothering me is that I don't think this is a bug
> fix. It seems like a performance optimization. And I think that we
> generally have a policy that we don't back-patch performance
> optimizations. Of course, there is some fuzziness because when the
> performance gets sufficiently bad, we sometimes decide that it amounts
> to a bug, as in 73cc7d3b240e1d46b1996382e5735a820f8bc3f7. Maybe this
> case qualifies for similar treatment, but I'm not sure.

I have seen a number of situations in which the standby strangely lags
behind, seemingly at random, sometimes for very long periods of time,
without any apparent cause and without any possible remedy, and it makes
the standby unusable. If the user happens to monitor the lag they may
notice it, and some may decide not to run certain queries. In other cases
the I/O load is so bad that queries that otherwise run without a hitch are
stuck for a long time. In my opinion, this is a serious problem.

> Why are you talking about back-patching this to 9.4 rather than all
> supported branches?

As far as I understand this is dependent on catalog scans being MVCC, so
it cannot be backpatched any further than that.

--
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
I am ready now to backpatch this to 9.4 and 9.5; here are the patches.
They are pretty similar, but some adjustments were needed due to XLog
format changes in 9.5. (I kept most of Simon's original commit message.)

This patch has survived in master for a long time, and in released 9.6 for
a couple of months now. We have a couple of customers running with this
patch installed also (who make heavy use of their standbys), without
reported problems.

Moreover, the next minors for 9.4 and 9.5 are scheduled for February 2017,
so unless some major security problem pops up that prompts a more urgent
update, we have three months to go before this is released to the masses
running 9.4 and 9.5; if a bug crops up in the meantime, we have plenty of
time to get it fixed.

While reviewing this I noticed a small thinko in the README, which I'm
going to patch in 9.6 and master thusly (with the same commit message):

diff --git a/src/backend/access/nbtree/README b/src/backend/access/nbtree/README
index 067d15c..a3f11da 100644
--- a/src/backend/access/nbtree/README
+++ b/src/backend/access/nbtree/README
@@ -521,11 +521,12 @@ because it allows running applications to continue while the standby
 changes state into a normally running server.
 
 The interlocking required to avoid returning incorrect results from
-MVCC scans is not required on standby nodes. That is because
+non-MVCC scans is not required on standby nodes. That is because
 HeapTupleSatisfiesUpdate(), HeapTupleSatisfiesSelf(),
 HeapTupleSatisfiesDirty() and HeapTupleSatisfiesVacuum() are only
 ever used during write transactions, which cannot exist on the standby.
-This leaves HeapTupleSatisfiesMVCC() and HeapTupleSatisfiesToast().
+MVCC scans are already protected by definition, so HeapTupleSatisfiesMVCC()
+is not a problem. That leaves concern only for HeapTupleSatisfiesToast().
 HeapTupleSatisfiesToast() doesn't use MVCC semantics, though that's
 because it doesn't need to - if the main heap row is visible then the
 toast rows will also be visible. So as long as we follow a toast

--
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Alvaro Herrera wrote:

> I am ready now to backpatch this to 9.4 and 9.5; here are the patches.
> They are pretty similar, but some adjustments were needed due to XLog
> format changes in 9.5. (I kept most of Simon's original commit
> message.)

Finally done.

--
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services