Re: Avoiding pin scan during btree vacuum - Mailing list pgsql-hackers

From Simon Riggs
Subject Re: Avoiding pin scan during btree vacuum
Date
Msg-id CANP8+jKsmSLuPssH6ivuznDVnSotJxn0LMz+e8PsphsC0xyMtw@mail.gmail.com
Whole thread Raw
In response to Re: Avoiding pin scan during btree vacuum  (Alvaro Herrera <alvherre@2ndquadrant.com>)
List pgsql-hackers
On 21 December 2015 at 21:28, Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
Simon Riggs wrote:
> During VACUUM of btrees, we need to pin all pages, even those where tuples
> are not removed, which I am calling here the "pin scan". This is especially
> a problem during redo, where removing one tuple from a 100GB btree can take
> a minute or more. That causes replication lags. Bad Thing.

Agreed on there being a problem here.

As a reminder, this "pin scan" is implemented in the
HotStandbyActiveInReplay() block in btree_xlog_vacuum; this routine is
in charge of replaying an action recorded by _bt_delitems_vacuum.  That
replay works by acquiring cleanup lock on all btree blocks from
lastBlockVacuumed to "this one" (i.e. the one in which elements are
being removed).  In essence, this means pinning *all* the blocks in the
index, which is bad.
The new code sets lastBlockVacuumed to Invalid; a new test in the replay
code makes that value not pin anything.  This is supposed to optimize
the case in question.

Nice summary, thanks.
 
One thing that this patch should update is the comment above struct
xl_btree_vacuum, which currently state that EnsureBlockUnpinned() is one
of the options to apply to each block, but that routine doesn't exist.

Updated
 
One possible naive optimization is to avoid locking pages that aren't
leaf pages, but that would still require fetching the block from disk,
so it wouldn't actually optimize anything other than the cleanup lock
acquisition.  (And probably not too useful, since leaf pages are said to
be 95% - 99% of index pages anyway.)

Agreed, not worth it.
 
Also of interest is the comment above the call to _bt_delitems_vacuum in
btvacuumpage(); that needs an update too.

Updated
 
It appears to me that your patch changes the call in btvacuumscan() but
not the one in btvacuumpage().  I assume you should be changing both.

Yes, that was correct. Updated.
 
I think the new comment that talks about Toast Index should explain
*why* we can skip the pinning in all cases except that one, instead of
just saying we can do it.

Greatly expanded comments.

Thanks for the review.

New patch attached.

--
Simon Riggs                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Attachment

pgsql-hackers by date:

Previous
From: Robert Haas
Date:
Subject: Re: dynloader.h missing in prebuilt package for Windows?
Next
From: Simon Riggs
Date:
Subject: Re: Avoiding pin scan during btree vacuum