Thread: Avoiding pin scan during btree vacuum
During VACUUM of btrees, we need to pin all pages, even those where tuples are not removed, which I am calling here the "pin scan". This is especially a problem during redo, where removing one tuple from a 100GB btree can take a minute or more. That causes replication lags. Bad Thing.
Previously I have suggested ways of optimizing that, and the code comments reflect those suggestions. Here I would like to look at removing the pin scan entirely, on a standby only.
In www.postgresql.org/message-id/flat/721615179.3351449.1423959585771.JavaMail.yahoo@mail.yahoo.com, the focus was on removing pins from btrees.
In the commit message for that thread/patch, http://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=2ed5b87f96d473962ec5230fd820abfeaccb2069
we see that there are some preconditions to altering the locking.
"This patch leaves behavior unchanged for some cases, which can be
addressed separately so that each case can be evaluated on its own
merits. These unchanged cases are when a scan uses a non-MVCC
snapshot, an index-only scan, and a scan of a btree index for which
modifications are not WAL-logged. If later patches allow all of
these cases to drop the buffer pin after reading a leaf page, then
the btree vacuum process can be simplified; it will no longer need
the "super-exclusive" lock to delete tuples from a page."
The cases for master and standby are different. The standby case is simpler, yet more important to change, since there the pin scan happens in the foreground.
Looking just at standbys, we see there are three cases:
* non-WAL logged indexes - does not apply on a standby, so ignore
* non-MVCC snapshot - looks like only TOAST scans are a problem on standbys
* index only scans (IOS) - the analysis of which looks wrong to me...
An IOS always checks visibility before using a tuple. If a tuple is about to be removed by VACUUM then it is already dead, and the visibility map bit for its heap page will already be clear (not all-visible), so the scan must recheck the tuple against the heap. Hence any tuple being removed by vacuum can never cause a problem for an IOS, and the locking interactions are not required, at least on standbys, for normal tables.
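To spell that argument out, here is a minimal standalone sketch of the decision an index-only scan makes. This is a simplified model rather than the actual executor code, and the helper names are hypothetical:

    #include <stdbool.h>

    typedef unsigned int BlockNumber;

    /* Hypothetical stand-ins for the visibility map and heap visibility checks. */
    extern bool vm_page_all_visible(BlockNumber heap_block);
    extern bool heap_tuple_visible_to_snapshot(BlockNumber heap_block, int offset);

    /*
     * Simplified model: may an index-only scan use the index tuple without
     * visiting the heap?  A tuple that VACUUM is about to remove is dead, so
     * its heap page cannot be marked all-visible; the scan therefore falls
     * through to the heap visibility check and never returns that tuple.
     */
    static bool
    ios_tuple_usable(BlockNumber heap_block, int offset)
    {
        if (vm_page_all_visible(heap_block))
            return true;    /* whole page visible: heap not touched */

        /* Page not all-visible: must recheck the heap tuple itself. */
        return heap_tuple_visible_to_snapshot(heap_block, offset);
    }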
So it looks like we can skip the "pin scan" during redo unless we are vacuuming a toast index.
Patch attached.
Notice that the patch does not slacken the requirement to super-exclusive-lock the block from which tuples are being removed. The only thing it does is skip past the requirement to pin each of the intervening blocks where nothing has happened.
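For illustration, here is a minimal standalone sketch of that redo-side behaviour, assuming the mechanism of logging InvalidBlockNumber as lastBlockVacuumed whenever the pin scan can be skipped. This is a simplified model with hypothetical helper names, not the actual btree_xlog_vacuum code:

    typedef unsigned int BlockNumber;
    #define InvalidBlockNumber    ((BlockNumber) 0xFFFFFFFF)
    #define BlockNumberIsValid(b) ((b) != InvalidBlockNumber)

    /* Hypothetical stand-ins for the redo-time buffer accesses. */
    extern void cleanup_lock_block(BlockNumber blk);  /* pin + super-exclusive lock */
    extern void remove_dead_tuples(BlockNumber blk);  /* apply the logged deletions */

    /*
     * Simplified model of replaying one btree VACUUM record on a standby.
     * Only when the record carries a valid lastBlockVacuumed (the toast-index
     * case) do we still walk and pin every intervening block; otherwise we go
     * straight to the block actually being changed, which keeps its
     * super-exclusive lock requirement either way.
     */
    void
    replay_btree_vacuum(BlockNumber lastBlockVacuumed, BlockNumber target)
    {
        BlockNumber blk;

        if (BlockNumberIsValid(lastBlockVacuumed))
        {
            /* the "pin scan": cleanup-lock every intervening block */
            for (blk = lastBlockVacuumed + 1; blk < target; blk++)
                cleanup_lock_block(blk);
        }

        cleanup_lock_block(target);
        remove_dead_tuples(target);
    }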
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Simon Riggs wrote:

> During VACUUM of btrees, we need to pin all pages, even those where tuples
> are not removed, which I am calling here the "pin scan". This is especially
> a problem during redo, where removing one tuple from a 100GB btree can take
> a minute or more. That causes replication lags. Bad Thing.

Agreed on there being a problem here.

As a reminder, this "pin scan" is implemented in the
HotStandbyActiveInReplay() block in btree_xlog_vacuum; this routine is in
charge of replaying an action recorded by _bt_delitems_vacuum. That replay
works by acquiring cleanup lock on all btree blocks from lastBlockVacuumed
to "this one" (i.e. the one in which elements are being removed). In
essence, this means pinning *all* the blocks in the index, which is bad.

The new code sets lastBlockVacuumed to Invalid; a new test in the replay
code makes that value not pin anything. This is supposed to optimize the
case in question.

One thing that this patch should update is the comment above struct
xl_btree_vacuum, which currently states that EnsureBlockUnpinned() is one
of the options to apply to each block, but that routine doesn't exist.

One possible naive optimization is to avoid locking pages that aren't leaf
pages, but that would still require fetching the block from disk, so it
wouldn't actually optimize anything other than the cleanup lock
acquisition. (And probably not too useful, since leaf pages are said to be
95% - 99% of index pages anyway.)

Also of interest is the comment above the call to _bt_delitems_vacuum in
btvacuumpage(); that needs an update too.

It appears to me that your patch changes the call in btvacuumscan() but not
the one in btvacuumpage(). I assume you should be changing both.

I think the new comment that talks about Toast Index should explain *why*
we can skip the pinning in all cases except that one, instead of just
saying we can do it.

--
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Alvaro Herrera <alvherre@2ndquadrant.com> writes:

> I think the new comment that talks about Toast Index should explain
> *why* we can skip the pinning in all cases except that one, instead of
> just saying we can do it.

I've not looked at that code in a long while, but my recollection is that
it's designed that way to protect non-MVCC catalog scans, which are gone
now ... except for SnapshotToast.

Maybe the right way to approach this is to adjust HeapTupleSatisfiesToast
(or maybe just convince ourselves that no one could be dereferencing a
stale toast pointer in the first place?) and then redesign btree vacuuming
without the requirement to support non-MVCC scans, period.

			regards, tom lane
On 21 December 2015 at 21:28, Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
> Simon Riggs wrote:
>> During VACUUM of btrees, we need to pin all pages, even those where tuples
>> are not removed, which I am calling here the "pin scan". This is especially
>> a problem during redo, where removing one tuple from a 100GB btree can take
>> a minute or more. That causes replication lags. Bad Thing.
> Agreed on there being a problem here.
>
> As a reminder, this "pin scan" is implemented in the
> HotStandbyActiveInReplay() block in btree_xlog_vacuum; this routine is
> in charge of replaying an action recorded by _bt_delitems_vacuum. That
> replay works by acquiring cleanup lock on all btree blocks from
> lastBlockVacuumed to "this one" (i.e. the one in which elements are
> being removed). In essence, this means pinning *all* the blocks in the
> index, which is bad.
>
> The new code sets lastBlockVacuumed to Invalid; a new test in the replay
> code makes that value not pin anything. This is supposed to optimize
> the case in question.
Nice summary, thanks.
> One thing that this patch should update is the comment above struct
> xl_btree_vacuum, which currently states that EnsureBlockUnpinned() is one
> of the options to apply to each block, but that routine doesn't exist.
Updated
> One possible naive optimization is to avoid locking pages that aren't
> leaf pages, but that would still require fetching the block from disk,
> so it wouldn't actually optimize anything other than the cleanup lock
> acquisition. (And probably not too useful, since leaf pages are said to
> be 95% - 99% of index pages anyway.)
Agreed, not worth it.
> Also of interest is the comment above the call to _bt_delitems_vacuum in
> btvacuumpage(); that needs an update too.
Updated
> It appears to me that your patch changes the call in btvacuumscan() but
> not the one in btvacuumpage(). I assume you should be changing both.
Yes, that was correct. Updated.
> I think the new comment that talks about Toast Index should explain
> *why* we can skip the pinning in all cases except that one, instead of
> just saying we can do it.
Greatly expanded comments.
Thanks for the review.
New patch attached.
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On 21 December 2015 at 21:36, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Alvaro Herrera <alvherre@2ndquadrant.com> writes:
>> I think the new comment that talks about Toast Index should explain
>> *why* we can skip the pinning in all cases except that one, instead of
>> just saying we can do it.
> I've not looked at that code in a long while, but my recollection is
> that it's designed that way to protect non-MVCC catalog scans, which
> are gone now ... except for SnapshotToast.
Yes, that's the principle in use here. Which means we can optimize away the scan on a Hot Standby in all cases except for Toast Index vacuums.
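To illustrate, here is a minimal sketch of the master-side decision that follows from that principle. The function and parameter names are hypothetical (in the real patch the distinction is made where the WAL record for _bt_delitems_vacuum is built), but the idea is that only a toast index keeps logging a real lastBlockVacuumed:

    #include <stdbool.h>

    typedef unsigned int BlockNumber;
    #define InvalidBlockNumber   ((BlockNumber) 0xFFFFFFFF)

    /*
     * Simplified model: pick the lastBlockVacuumed value to record in WAL.
     * Toast indexes are read under a non-MVCC snapshot, so they are the one
     * case where the standby must still perform the pin scan; every other
     * index logs InvalidBlockNumber, which the redo code treats as "skip it".
     */
    BlockNumber
    choose_last_block_vacuumed(bool is_toast_index, BlockNumber last_block_scanned)
    {
        if (is_toast_index)
            return last_block_scanned;   /* keep the old behaviour on replay */

        return InvalidBlockNumber;       /* standby skips the pin scan */
    }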
> Maybe the right way to
> approach this is to adjust HeapTupleSatisfiesToast (or maybe just
> convince ourselves that no one could be dereferencing a stale toast
> pointer in the first place?) and then redesign btree vacuuming without
> the requirement to support non-MVCC scans, period.
We could, but it's likely to be a much more complex patch.
I'm happy with this being a simple patch now, not least because I would like to backpatch this to 9.4 where catalog scans became MVCC.
A backpatch is warranted because it is a severe performance issue with replication and we can fix that before 9.5 hits the streets.
I'll be doing some more testing and checking, so not in a rush.
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On 2016-01-03 15:40:01 +0000, Simon Riggs wrote:

> I'm happy with this being a simple patch now, not least because I would
> like to backpatch this to 9.4 where catalog scans became MVCC.
>
> A backpatch is warranted because it is a severe performance issue with
> replication and we can fix that before 9.5 hits the streets.
>
> I'll be doing some more testing and checking, so not in a rush.

This seems like a mighty subtle thing to backpatch. If we really want to
go there, ISTM that the relevant code should stew in an unreleased branch
for a while, before being backpatched.

Andres
Andres Freund <andres@anarazel.de> writes:

> On 2016-01-03 15:40:01 +0000, Simon Riggs wrote:
>> I'm happy with this being a simple patch now, not least because I would
>> like to backpatch this to 9.4 where catalog scans became MVCC.
>>
>> A backpatch is warranted because it is a severe performance issue with
>> replication and we can fix that before 9.5 hits the streets.
>>
>> I'll be doing some more testing and checking, so not in a rush.

> This seems like a mighty subtle thing to backpatch. If we really want to
> go there, ISTM that the relevant code should stew in an unreleased
> branch for a while, before being backpatched.

I'm definitely -1 on back-patching such a thing. Put it in HEAD for
awhile. If it survives six months or so then we could discuss it again.

			regards, tom lane
On Mon, Jan 4, 2016 at 10:30 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

>> This seems like a mighty subtle thing to backpatch. If we really want to
>> go there, ISTM that the relevant code should stew in an unreleased
>> branch for a while, before being backpatched.
>
> I'm definitely -1 on back-patching such a thing. Put it in HEAD for
> awhile. If it survives six months or so then we could discuss it again.

I agree with Tom.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Robert Haas wrote:

> On Mon, Jan 4, 2016 at 10:30 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> >> This seems like a mighty subtle thing to backpatch. If we really want to
> >> go there, ISTM that the relevant code should stew in an unreleased
> >> branch for a while, before being backpatched.
> >
> > I'm definitely -1 on back-patching such a thing. Put it in HEAD for
> > awhile. If it survives six months or so then we could discuss it again.
>
> I agree with Tom.

Okay, several months have passed with this in the development branch and
now seems a good time to backpatch this all the way back to 9.4.

Any objections?

--
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Thu, Oct 20, 2016 at 4:00 AM, Alvaro Herrera <alvherre@2ndquadrant.com> wrote:

> Robert Haas wrote:
>> On Mon, Jan 4, 2016 at 10:30 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> >> This seems like a mighty subtle thing to backpatch. If we really want to
>> >> go there, ISTM that the relevant code should stew in an unreleased
>> >> branch for a while, before being backpatched.
>> >
>> > I'm definitely -1 on back-patching such a thing. Put it in HEAD for
>> > awhile. If it survives six months or so then we could discuss it again.
>>
>> I agree with Tom.
>
> Okay, several months have passed with this in the development branch and
> now seems a good time to backpatch this all the way back to 9.4.

Are you talking about commit 3e4b7d87988f0835f137f15f5c1a40598dd21f3d? If
yes, do we want to retain this code in its current form under define
UNUSED, and is there any advantage to keeping it?

Another point: won't this commit make the information in xl_btree_vacuum
redundant, and if so shouldn't we remove its usage during WAL writing as
well?

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
On Wed, Oct 19, 2016 at 6:30 PM, Alvaro Herrera <alvherre@2ndquadrant.com> wrote:

> Robert Haas wrote:
>> On Mon, Jan 4, 2016 at 10:30 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> >> This seems like a mighty subtle thing to backpatch. If we really want to
>> >> go there, ISTM that the relevant code should stew in an unreleased
>> >> branch for a while, before being backpatched.
>> >
>> > I'm definitely -1 on back-patching such a thing. Put it in HEAD for
>> > awhile. If it survives six months or so then we could discuss it again.
>>
>> I agree with Tom.
>
> Okay, several months have passed with this in the development branch and
> now seems a good time to backpatch this all the way back to 9.4.
>
> Any objections?

Although the code has now been in the tree for six months, it's only been
in a released branch for three weeks, which is something about which to
think.

I guess what's really bothering me is that I don't think this is a bug
fix. It seems like a performance optimization. And I think that we
generally have a policy that we don't back-patch performance
optimizations. Of course, there is some fuzziness because when the
performance gets sufficiently bad, we sometimes decide that it amounts to
a bug, as in 73cc7d3b240e1d46b1996382e5735a820f8bc3f7. Maybe this case
qualifies for similar treatment, but I'm not sure.

Why are you talking about back-patching this to 9.4 rather than all
supported branches?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Robert Haas wrote:

> On Wed, Oct 19, 2016 at 6:30 PM, Alvaro Herrera
> <alvherre@2ndquadrant.com> wrote:
> > Robert Haas wrote:
> >> On Mon, Jan 4, 2016 at 10:30 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> >> >> This seems like a mighty subtle thing to backpatch. If we really want to
> >> >> go there, ISTM that the relevant code should stew in an unreleased
> >> >> branch for a while, before being backpatched.
> >> >
> >> > I'm definitely -1 on back-patching such a thing. Put it in HEAD for
> >> > awhile. If it survives six months or so then we could discuss it again.
> >>
> >> I agree with Tom.
> >
> > Okay, several months have passed with this in the development branch and
> > now seems a good time to backpatch this all the way back to 9.4.
> >
> > Any objections?
>
> Although the code has now been in the tree for six months, it's only
> been in a released branch for three weeks, which is something about
> which to think.

The objection above was "stew in an unreleased branch", which it has.

> I guess what's really bothering me is that I don't think this is a bug
> fix. It seems like a performance optimization. And I think that we
> generally have a policy that we don't back-patch performance
> optimizations. Of course, there is some fuzziness because when the
> performance gets sufficiently bad, we sometimes decide that it amounts
> to a bug, as in 73cc7d3b240e1d46b1996382e5735a820f8bc3f7. Maybe this
> case qualifies for similar treatment, but I'm not sure.

I have seen a number of situations in which the standby strangely lags
behind, seemingly at random, sometimes for very long periods of time,
without any apparent cause and without any possible remedy, and it makes
the standby unusable. If the user happens to monitor the lag they may
notice it, and some may decide not to run certain queries. In other cases
the I/O load is so bad that queries that otherwise run without a hitch are
stuck for a long time. In my opinion, this is a serious problem.

> Why are you talking about back-patching this to 9.4 rather than all
> supported branches?

As far as I understand this is dependent on catalog scans being MVCC, so
it cannot be backpatched any further than that.

--
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
I am ready now to backpatch this to 9.4 and 9.5; here are the patches.
They are pretty similar, but some adjustments were needed due to XLog
format changes in 9.5. (I kept most of Simon's original commit message.)

This patch has survived in master for a long time, and in released 9.6 for
a couple of months now. We have a couple of customers running with this
patch installed also (who make heavy use of their standbys), without
reported problems.

Moreover, the next minors for 9.4 and 9.5 are scheduled for February 2017,
so unless some major security problem pops up that prompts a more urgent
update, we have three months to go before this is released to the masses
running 9.4 and 9.5; if a bug crops up in the meantime, we have plenty of
time to get it fixed.

While reviewing this I noticed a small thinko in the README, which I'm
going to patch in 9.6 and master thusly (with the same commit message):

diff --git a/src/backend/access/nbtree/README b/src/backend/access/nbtree/README
index 067d15c..a3f11da 100644
--- a/src/backend/access/nbtree/README
+++ b/src/backend/access/nbtree/README
@@ -521,11 +521,12 @@ because it allows running applications to continue while the standby
 changes state into a normally running server.
 
 The interlocking required to avoid returning incorrect results from
-MVCC scans is not required on standby nodes. That is because
+non-MVCC scans is not required on standby nodes. That is because
 HeapTupleSatisfiesUpdate(), HeapTupleSatisfiesSelf(),
 HeapTupleSatisfiesDirty() and HeapTupleSatisfiesVacuum() are only
 ever used during write transactions, which cannot exist on the standby.
-This leaves HeapTupleSatisfiesMVCC() and HeapTupleSatisfiesToast().
+MVCC scans are already protected by definition, so HeapTupleSatisfiesMVCC()
+is not a problem. That leaves concern only for HeapTupleSatisfiesToast().
 HeapTupleSatisfiesToast() doesn't use MVCC semantics, though that's
 because it doesn't need to - if the main heap row is visible then the
 toast rows will also be visible. So as long as we follow a toast

--
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Alvaro Herrera wrote:

> I am ready now to backpatch this to 9.4 and 9.5; here are the patches.
> They are pretty similar, but some adjustments were needed due to XLog
> format changes in 9.5. (I kept most of Simon's original commit
> message.)

Finally done.

--
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services