Re: Avoiding pin scan during btree vacuum - Mailing list pgsql-hackers

From Alvaro Herrera
Subject Re: Avoiding pin scan during btree vacuum
Date
Msg-id 20151221212816.GY2618@alvherre.pgsql
Whole thread Raw
In response to Avoiding pin scan during btree vacuum  (Simon Riggs <simon@2ndQuadrant.com>)
Responses Re: Avoiding pin scan during btree vacuum
Re: Avoiding pin scan during btree vacuum
List pgsql-hackers
Simon Riggs wrote:
> During VACUUM of btrees, we need to pin all pages, even those where tuples
> are not removed, which I am calling here the "pin scan". This is especially
> a problem during redo, where removing one tuple from a 100GB btree can take
> a minute or more. That causes replication lags. Bad Thing.

Agreed on there being a problem here.

As a reminder, this "pin scan" is implemented in the
HotStandbyActiveInReplay() block in btree_xlog_vacuum; this routine is
in charge of replaying an action recorded by _bt_delitems_vacuum.  That
replay works by acquiring cleanup lock on all btree blocks from
lastBlockVacuumed to "this one" (i.e. the one in which elements are
being removed).  In essence, this means pinning *all* the blocks in the
index, which is bad.
The new code sets lastBlockVacuumed to Invalid; a new test in the replay
code makes that value not pin anything.  This is supposed to optimize
the case in question.

One thing that this patch should update is the comment above struct
xl_btree_vacuum, which currently state that EnsureBlockUnpinned() is one
of the options to apply to each block, but that routine doesn't exist.

One possible naive optimization is to avoid locking pages that aren't
leaf pages, but that would still require fetching the block from disk,
so it wouldn't actually optimize anything other than the cleanup lock
acquisition.  (And probably not too useful, since leaf pages are said to
be 95% - 99% of index pages anyway.)

Also of interest is the comment above the call to _bt_delitems_vacuum in
btvacuumpage(); that needs an update too.

It appears to me that your patch changes the call in btvacuumscan() but
not the one in btvacuumpage().  I assume you should be changing both.

I think the new comment that talks about Toast Index should explain
*why* we can skip the pinning in all cases except that one, instead of
just saying we can do it.

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



pgsql-hackers by date:

Previous
From: Tom Lane
Date:
Subject: Coding note: truncating with strlcpy() is not such a hot idea
Next
From: Tom Lane
Date:
Subject: Re: Avoiding pin scan during btree vacuum