Thread: Reduce pinning in btree indexes
We have a customer who was unable to migrate from a well-known commercial database product to pg because they have a very large software base that holds cursors open for very long periods of time, preventing GlobalXmin values from advancing, leading to bloat. They have a standard test that exercises hundreds of client connections for days at a time, and saw disk space and CPU usage climbing to unacceptable levels[1]. The other database product had no such problems. We suggested the obvious solutions that we always suggest on the community lists, and they had perfectly good reasons why those were not viable for them -- we either needed to prevent bloat under their current software or they could not migrate.

What they wanted was what happened in the other database product -- if a snapshot got sufficiently old that cleaning up the MVCC data was a problem *and* the snapshot was used again *and* it read a page which had been modified far enough back that it was not possible to return correct data, then they wanted to receive a "snapshot too old" error. I wrote a patch to do that, but they didn't see much improvement, because all (auto)vacuum processes blocked indefinitely on the btree index pages where a cursor was parked. So they still got massive bloat and consequent problems.

It seemed odd to me that a btree implementation based on the Lehman & Yao techniques would block anything, because the concept is that it reads everything it needs off the index page and pauses "between" pages. We sorta do that, but the "interlock" between vacuum processes and other index usages basically involves holding a pin on the page we just read until we are done using the values we read off that page, and treating the pin as a lighter lock than a shared (or READ) lock -- one which conflicts only with a "super exclusive" lock, which consists of getting an exclusive lock only once there are no other processes holding a pin on the page. This ensures that at the end of a vacuum pass over a btree index there are no TIDs in process-local memory from before the start of the pass, and consequently no scan can read a new tuple assigned to the same TID value after the rest of the vacuum phases run. So a pin on a btree page blocks a vacuum process indefinitely.

Interestingly, the btree README points out that using the old TID with a new tuple poses no hazard for a scan using an MVCC snapshot, because the new tuple would not be visible to a snapshot created that long ago. The customer's cursors which were causing the problems all use MVCC snapshots, so I went looking to see whether we couldn't take advantage of this fact. For the most part it seemed we were OK if we dropped pins along with the READ lock for a btree scan which used an MVCC snapshot. I found that the LP_DEAD hinting would be a problem with an old TID, but I figured we could work around that by storing the page LSN into the scan position structure when the page contents were read, and only doing hinting if that matched when we were ready to do the hinting. That wouldn't work for an index which was not WAL-logged, so the patch still holds pins for those. Robert pointed out that the visibility information for an index-only scan wasn't checked while the index page READ lock was held, so those scans also still hold the pins.

There was also a problem with the fact that the code conflated the notion of having a valid scan position with holding a pin on a block in the index.
Those two concepts needed to be teased apart, which I did using some new macros and figuring out which one to use where. Without a pin on the block, we also needed to remember what block we had been processing to be able to do the LP_DEAD hinting; for that the block number was added to the structure which tracks scan position.

Finally, there was an "optimization" for marking buffer position for possible restore that was incompatible with releasing the pin. I use quotes because the optimization adds overhead to every move to the next page in order to set some variables in a structure when a mark is requested instead of running two fairly small memcpy() statements. The two-day benchmark of the customer showed no performance hit, and looking at the code I would be amazed if the optimization yielded a measurable benefit. In general, optimization by adding overhead to moving through a scan to save time in a mark operation seems dubious.

At some point we could consider building on this patch to recheck index conditions for heap access when a non-MVCC snapshot is used, check the visibility map for referenced heap pages when the TIDs are read for an index-only scan, and/or skip LP_DEAD hinting for non-WAL-logged indexes. But all those are speculative future work; this is a conservative implementation that just didn't modify pinning where there were any confounding factors.

In combination with the snapshot-too-old patch the 2-day customer benchmark ran without problem levels of bloat, controlling disk space growth and keeping CPU usage level. Without the other patch, this patch would at least allow autovacuum to maintain statistics, which probably makes it worthwhile on its own, but it will probably not have a very big impact on bloat without both.

At least two README files and a lot of comments need to be modified before commit to reflect the change in approach if this is accepted by the community. Since this work has been very recent I haven't had time to do that yet.

git diff --stat tells me this patch has:
 4 files changed, 208 insertions(+), 128 deletions(-)
... and the related "snapshot too old" patch has:
 16 files changed, 145 insertions(+), 19 deletions(-)

Given that these are going into the last CF and the related patch is really still at "proof of concept" phase, it may be too late to consider them for PostgreSQL 9.5, but I would still like a serious review now in case people feel strongly about controlling bloat in the face of old snapshots, and to know whether to work on this just for the Postgres Plus Advanced Server fork or for the community.

-- Kevin Grittner EDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company

[1] This test maintains a certain transaction rate, with user think times on the connections, so performance isn't measured by a tps rate but by how much CPU is consumed to maintain that rate.
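To illustrate the LSN-based workaround for LP_DEAD hinting described above, here is a minimal sketch of the idea (not the patch code itself; the lsn and currPage members of the scan-position structure are assumed names):

    /*
     * While still holding the read lock on the index page, remember where
     * the page was in the WAL stream and which block it is.
     */
    so->currPos.lsn = BufferGetLSNAtomic(so->currPos.buf);
    so->currPos.currPage = BufferGetBlockNumber(so->currPos.buf);

    /* ... lock and pin released; the scan works from its local copies ... */

    /*
     * Later, when we want to set LP_DEAD hints without having kept the pin:
     * re-read the page and only hint if nothing can have moved under us.
     */
    buf = _bt_getbuf(rel, so->currPos.currPage, BT_READ);
    page = BufferGetPage(buf);
    if (PageGetLSN(page) != so->currPos.lsn)
    {
        /*
         * The page changed while we held neither lock nor pin; an old TID
         * could now point at a different tuple, so skip the hinting.
         */
        _bt_relbuf(rel, buf);
        return;
    }
    /* ... safe to flag the remembered items as LP_DEAD ... */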
On 02/15/2015 02:19 AM, Kevin Grittner wrote: > Interestingly, the btree README points out that using the old TID > with a new tuple poses no hazard for a scan using an MVCC snapshot, > because the new tuple would not be visible to a snapshot created > that long ago. The first question is: Do we really need that interlock for the non-MVCC snapshots either? If we do: For non-MVCC snapshots, we need to ensure that all index scans that started before the VACUUM did complete before the VACUUM does. I wonder if we could find a different mechanism to enforce that. Using the pin-interlock for that has always seemed a bit silly to me. Perhaps grab a new heavy-weight lock on the table whenever a non-MVCC index scan on the table begins, and have VACUUM wait on it. > I found that the LP_DEAD hinting > would be a problem with an old TID, but I figured we could work > around that by storing the page LSN into the scan position structure > when the page contents were read, and only doing hinting if that > matched when we were ready to do the hinting. That wouldn't work > for an index which was not WAL-logged, so the patch still holds > pins for those. Or you could use GetFakeLSNForUnloggedRel(). > Robert pointed out that the visibility information > for an index-only scan wasn't checked while the index page READ > lock was held, so those scans also still hold the pins. Why does an index-only scan need to hold the pin? > Finally, there was an "optimization" for marking buffer position > for possible restore that was incompatible with releasing the pin. > I use quotes because the optimization adds overhead to every move > to the next page in order set some variables in a structure when a > mark is requested instead of running two fairly small memcpy() > statements. The two-day benchmark of the customer showed no > performance hit, and looking at the code I would be amazed if the > optimization yielded a measurable benefit. In general, optimization > by adding overhead to moving through a scan to save time in a mark > operation seems dubious. Hmm. Did your test case actually exercise mark/restore? The memcpy()s are not that small. Especially if it's an index-only scan, you're copying a large portion of the page. Some scans call markpos on every tuple, so that could add up. > At some point we could consider building on this patch to recheck > index conditions for heap access when a non-MVCC snapshot is used, > check the visibility map for referenced heap pages when the TIDs > are read for an index-only scan, and/or skip LP_DEAD hinting for > non-WAL-logged indexes. But all those are speculative future work; > this is a conservative implementation that just didn't modify > pinning where there were any confounding factors. Understood. Still, I'd like to see if we can easily get rid of the pinning altogether. - Heikki
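The GetFakeLSNForUnloggedRel() idea Heikki mentions is the approach GiST already uses for unlogged indexes. Adapted to this case, it would look roughly like the following when a page of a non-WAL-logged index is modified (a sketch only, not code from any posted patch):

    /*
     * Stamp the modified page with a fake LSN when the index is not
     * WAL-logged, so that a later "has this page changed?" LSN comparison
     * still works and the pin could be dropped for unlogged indexes too.
     */
    if (!RelationNeedsWAL(rel))
        PageSetLSN(page, GetFakeLSNForUnloggedRel());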
On Mon, Feb 23, 2015 at 2:48 PM, Heikki Linnakangas <hlinnakangas@vmware.com> wrote: >> Robert pointed out that the visibility information >> for an index-only scan wasn't checked while the index page READ >> lock was held, so those scans also still hold the pins. > > Why does an index-only scan need to hold the pin? Suppose there's a dead tuple in the heap someplace, and an index pointer pointing to that dead tuple. An index scan reaches the relevant page and copies all of the index pointers. VACUUM removes the index pointer and marks the heap page all-visible. The scan now uses the index pointers copied to backend-local memory and notes that the heap-page is all-visible, so the scan sees the tuple even though that tuple is totally gone from the heap by that point. Holding the pin prevents this, because we can't reach the second heap pass until the scan is done with the page, and therefore the dead line pointer in the heap page can't be marked unused, and therefore the page can't be marked all-visible. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Heikki Linnakangas <hlinnakangas@vmware.com> wrote:
> On 02/15/2015 02:19 AM, Kevin Grittner wrote:
>> Interestingly, the btree README points out that using the old TID with a new tuple poses no hazard for a scan using an MVCC snapshot, because the new tuple would not be visible to a snapshot created that long ago.
>
> The first question is: Do we really need that interlock for the non-MVCC snapshots either?

We don't need the interlock for non-MVCC snapshots if the conditions we limit or filter on at the index level are rechecked when we get to the heap tuple; otherwise we do.

> If we do: For non-MVCC snapshots, we need to ensure that all index scans that started before the VACUUM did complete before the VACUUM does.

Isn't it enough to be sure that if we're not going to recheck any index tests against the heap tuple that any process-local copies of index entries which exist at the start of the index scan phase of the vacuum are not used after the vacuum enters the phase of freeing line pointers -- that is, any scan which has read a page when the vacuum starts to scan the index moves on to another page before vacuum finishes scanning that index?

> I wonder if we could find a different mechanism to enforce that. Using the pin-interlock for that has always seemed a bit silly to me.

It certainly is abusing the semantics of the pin to treat it as a type of lock on the contents.

> Perhaps grab a new heavy-weight lock on the table whenever a non-MVCC index scan on the table begins, and have VACUUM wait on it.

-1

The problem I'm most interested in fixing based on user feedback is a vacuum blocking when a cursor which is using an index scan is sitting idle. Not only does the vacuum of that table stall, but if it is an autovacuum worker, that worker is prevented from cleaning up any other tables. I have seen all autovacuum workers blocked in exactly this way, leading to massive bloat. What you suggest is just using a less awkward way to keep the problem.

>> I found that the LP_DEAD hinting would be a problem with an old TID, but I figured we could work around that by storing the page LSN into the scan position structure when the page contents were read, and only doing hinting if that matched when we were ready to do the hinting. That wouldn't work for an index which was not WAL-logged, so the patch still holds pins for those.
>
> Or you could use GetFakeLSNForUnloggedRel().

I'm unfamiliar with that, but will take a look. (I will be back at my usual development machine late tomorrow.)

>> Robert pointed out that the visibility information for an index-only scan wasn't checked while the index page READ lock was held, so those scans also still hold the pins.
>
> Why does an index-only scan need to hold the pin?

Robert already answered this one, but there is a possible solution -- provide some way to check the heap's visibility map for the TIDs read from an index page before releasing the read lock on that page.

>> Finally, there was an "optimization" for marking buffer position for possible restore that was incompatible with releasing the pin. I use quotes because the optimization adds overhead to every move to the next page in order set some variables in a structure when a mark is requested instead of running two fairly small memcpy() statements. The two-day benchmark of the customer showed no performance hit, and looking at the code I would be amazed if the optimization yielded a measurable benefit.
>> In general, optimization by adding overhead to moving through a scan to save time in a mark operation seems dubious.
>
> Hmm. Did your test case actually exercise mark/restore? The memcpy()s are not that small. Especially if it's an index-only scan, you're copying a large portion of the page. Some scans call markpos on every tuple, so that could add up.

I have results from the `make world` regression tests and a 48-hour customer test. Unfortunately I don't know how heavily either of those exercise this code. Do you have a suggestion for a way to test whether there is a serious regression for something approaching a "worst case"?

>> At some point we could consider building on this patch to recheck index conditions for heap access when a non-MVCC snapshot is used, check the visibility map for referenced heap pages when the TIDs are read for an index-only scan, and/or skip LP_DEAD hinting for non-WAL-logged indexes. But all those are speculative future work; this is a conservative implementation that just didn't modify pinning where there were any confounding factors.
>
> Understood. Still, I'd like to see if we can easily get rid of the pinning altogether.

That would be great, but I felt that taking care of the easy cases in one patch and following with patches to handle the trickier cases as separate follow-on patches made more sense than trying to do everything at once. Assuming we sort out the mark/restore issues for the initial patch, it provides infrastructure to keep the other cases simpler.

-- Kevin Grittner EDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
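Regarding the possible solution mentioned above -- checking the heap's visibility map for the TIDs read from an index page before releasing the read lock -- a rough sketch of that check follows (heapRel and itup are placeholder names; visibilitymap_test() is the API as of the 9.5 development cycle):

    /*
     * While still holding the read lock on the index page, note whether the
     * heap page referenced by this index tuple is all-visible.  Only TIDs
     * whose heap page is not all-visible would need a later visit to the
     * heap, so an index-only scan could then drop the pin for the rest.
     */
    Buffer      vmbuffer = InvalidBuffer;
    BlockNumber heapblk = ItemPointerGetBlockNumber(&itup->t_tid);
    bool        all_visible;

    all_visible = visibilitymap_test(heapRel, heapblk, &vmbuffer);
    if (vmbuffer != InvalidBuffer)
        ReleaseBuffer(vmbuffer);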
Hello, I measured the performance of this patch considering markpos/restorepos. The loss seems to be up to about 10% unfortunately.

At Thu, 26 Feb 2015 17:49:23 +0000 (UTC), Kevin Grittner <kgrittn@ymail.com> wrote in <440831854.629116.1424972963735.JavaMail.yahoo@mail.yahoo.com>
> Heikki Linnakangas <hlinnakangas@vmware.com> wrote:
> On 02/15/2015 02:19 AM, Kevin Grittner wrote:
> > Hmm. Did your test case actually exercise mark/restore? The memcpy()s
> > are not that small. Especially if it's an index-only scan, you're
> > copying a large portion of the page. Some scans call markpos on every
> > tuple, so that could add up.
>
> I have results from the `make world` regression tests and a 48-hour
> customer test. Unfortunately I don't know how heavily either of those
> exercise this code. Do you have a suggestion for a way to test whether
> there is a serious regression for something approaching a "worst case"?

ammarkpos/amrestrpos are called in merge joins. By the steps shown below, I had 1M times of markpos and no restorepos for 1M result rows, and had 500k times of markpos and the same number of times of restorepos for 2M rows result by a bit different configuration. I suppose we can say that they are the worst case considering markpos/restrpos. The call counts can be taken using the attached patch.

Both binaries were compiled with -O2. shared_buffers=1GB and all shared pages used in the query were hit when measuring. The numbers were taken 12 times for each configuration; I took averages and stddevs of 10 numbers, excluding the best and worst.

Case 1. 500k markpos/restorepos for 2M result rows.

Index scan: The patched version loses about 2% (1.98%).
 master: 6166 ms (std dev = 3.2 ms)
 Patched: 6288 ms (std dev = 3.7 ms)

Index-only scan: The patched version loses about 2% (2.14%).
 master: 5946 ms (std dev = 5.0 ms)
 Patched: 6073 ms (std dev = 3.3 ms)

The patched version is slower by about 2%. Of course not all of it is the effect of the memcpy(), but I don't know the distribution.

Case 2. 1M markpos, no restorepos for 1M result rows.

Index-only scan: The patched version loses about 10%.
 master: 3655 ms (std dev = 2.5 ms)
 Patched: 4038 ms (std dev = 2.6 ms)

The loss is about 10%. The case looks a bit impractical, but unfortunately the number might not be ignorable. The distribution of the loss is unknown, too.
regards,

===========
CREATE TABLE t1 (a int);
CREATE TABLE t2 (a int);
delete from t1;
delete from t2;
-- This causes 1M times of markpos and no restorepos
INSERT INTO t1 (SELECT a FROM generate_series(0, 999999) a);
INSERT INTO t2 (SELECT a FROM generate_series(0, 999999) a);
-- This causes 500k times of markpos and the same number of restorepos
-- INSERT INTO t1 (SELECT a/2 FROM generate_series(0, 999999) a);
-- INSERT INTO t2 (SELECT a/2 FROM generate_series(0, 999999) a);
CREATE INDEX it1 ON t1 (a);
CREATE INDEX it2 ON t2 (a);
ANALYZE t1;
ANALYZE t2;
SET enable_seqscan to false;
SET enable_material to false;
SET enable_hashjoin to false;
SET enable_nestloop to false;
SET enable_indexonlyscan to false; -- omit this to do indexonly scan
EXPLAIN (ANALYZE) SELECT t1.a, t2.a FROM t1 JOIN t2 on (t1.a = t2.a);

                                  QUERY PLAN
--------------------------------------------------------------------------
 Merge Join  (cost=2.83..322381.82 rows=3031231 width=8) (actual time=0.013..5193.566 rows=2000000 loops=1)
   Merge Cond: (t1.a = t2.a)
   ->  Index Scan using it1 on t1  (cost=0.43..65681.87 rows=1666667 width=4) (actual time=0.004..727.557 rows=1000000 loops=1)
   ->  Index Scan using it2 on t2  (cost=0.43..214642.89 rows=3031231 width=4) (actual time=0.004..1478.361 rows=19999 loops=1)
 Planning time: 25.585 ms
 Execution time: 6299.521 ms
(6 rows)

-- Kyotaro Horiguchi NTT Open Source Software Center

diff --git a/src/backend/access/nbtree/nbtree.c b/src/backend/access/nbtree/nbtree.c
index 3af462b..04d6cec 100644
--- a/src/backend/access/nbtree/nbtree.c
+++ b/src/backend/access/nbtree/nbtree.c
@@ -543,15 +543,19 @@ btendscan(PG_FUNCTION_ARGS)
 	PG_RETURN_VOID();
 }
 
+
 /*
  * btmarkpos() -- save current scan position
  */
+int hoge_markpos_count = 0;
+int hoge_restrpos_count = 0;
 Datum
 btmarkpos(PG_FUNCTION_ARGS)
 {
 	IndexScanDesc scan = (IndexScanDesc) PG_GETARG_POINTER(0);
 	BTScanOpaque so = (BTScanOpaque) scan->opaque;
 
+	hoge_markpos_count++;
 	/* we aren't holding any read locks, but gotta drop the pin */
 	if (BTScanPosIsValid(so->markPos))
 	{
@@ -585,7 +589,7 @@ btrestrpos(PG_FUNCTION_ARGS)
 {
 	IndexScanDesc scan = (IndexScanDesc) PG_GETARG_POINTER(0);
 	BTScanOpaque so = (BTScanOpaque) scan->opaque;
-
+	hoge_restrpos_count++;
 	/* Restore the marked positions of any array keys */
 	if (so->numArrayKeys)
 		_bt_restore_array_keys(scan);
diff --git a/src/backend/executor/execMain.c b/src/backend/executor/execMain.c
index 33b172b..e7c1b99 100644
--- a/src/backend/executor/execMain.c
+++ b/src/backend/executor/execMain.c
@@ -258,14 +258,19 @@ standard_ExecutorStart(QueryDesc *queryDesc, int eflags)
  *
  * ----------------------------------------------------------------
  */
+extern int hoge_markpos_count;
+extern int hoge_restrpos_count;
 void
 ExecutorRun(QueryDesc *queryDesc, ScanDirection direction, long count)
 {
+	hoge_markpos_count = 0;
+	hoge_restrpos_count = 0;
 	if (ExecutorRun_hook)
 		(*ExecutorRun_hook) (queryDesc, direction, count);
 	else
 		standard_ExecutorRun(queryDesc, direction, count);
+	elog(LOG, "markpos=%d, restrpos=%d", hoge_markpos_count, hoge_restrpos_count);
 }
 
 void
>>>>> "Kyotaro" == Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> writes:

 Kyotaro> ammarkpos/amrestrpos are called in merge joins. By the steps
 Kyotaro> shown below, I had 1M times of markpos and no restorepos for
 Kyotaro> 1M result rows, and had 500k times of markpos and the same
 Kyotaro> number of times of restorepos for 2M rows result by a bit
 Kyotaro> different configuration. I suppose we can say that they are
 Kyotaro> the worst case considering markpos/restrpos. The call counts
 Kyotaro> can be taken using the attached patch.

You might want to try running the same test, but after patching ExecSupportsMarkRestore to return false for index scans. This will cause the planner to insert a Materialize node to handle the mark/restore.

This way, you can get an idea of how much gain (if any) the direct support of mark/restore in the scan is actually providing.

-- Andrew (irc:RhodiumToad)
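For anyone wanting to reproduce that experiment, the change Andrew describes amounts to something like the fragment below in ExecSupportsMarkRestore() (src/backend/executor/execAmi.c); this is a sketch of the idea, not an exact diff against any particular version:

    switch (plantype)
    {
        /* ... cases for other node types left exactly as they are ... */
        case T_IndexScan:
        case T_IndexOnlyScan:
            return false;   /* experiment: force a Materialize node above the scan */
        /* ... */
    }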
Hi,

At Fri, 27 Feb 2015 05:56:18 +0000, Andrew Gierth <andrew@tao11.riddles.org.uk> wrote in <874mq77vuu.fsf@news-spur.riddles.org.uk>
> >>>>> "Kyotaro" == Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> writes:
>
> Kyotaro> ammarkpos/amrestrpos are called in merge joins. By the steps
> Kyotaro> shown below, I had 1M times of markpos and no restorepos for
> Kyotaro> 1M result rows, and had 500k times of markpos and the same
> Kyotaro> number of times of restorepos for 2M rows result by a bit
> Kyotaro> different configuration. I suppose we can say that they are
> Kyotaro> the worst case considering markpos/restrpos. The call counts
> Kyotaro> can be taken using the attached patch.
>
> You might want to try running the same test, but after patching
> ExecSupportsMarkRestore to return false for index scans. This will cause
> the planner to insert a Materialize node to handle the mark/restore.

Mmm? The patch bt-nopin-v1.patch does not seem to contain a change to ExecSupportsMarkRestore, and that very simple function still appears to return true for T_Index(Only)Scan after the patch is applied. EXPLAIN shows that the plan does not involve a materializing step, and in addition the counters I put into both btmarkpos() and btrestrpos() show that they are actually called during execution, like this, for both unpatched and patched:

| LOG: markpos=1000000, restrpos=0
| STATEMENT: EXPLAIN (ANALYZE,BUFFERS) SELECT t1.a, t2.a FROM t1 JOIN t2 on (t1.a = t2.a);

To make sure, I looked into btmarkpos and btrestrpos to confirm that they really do the memcpy() we're talking about, then recompiled the whole source tree.

> This way, you can get an idea of how much gain (if any) the direct
> support of mark/restore in the scan is actually providing.

Does "the scan" mean the T_Material node? But there is no Material node in the plan, and the counters in bt*pos were in fact incremented, as mentioned above. Could you point out any other possible mistake in my thinking?

regards,

-- Kyotaro Horiguchi NTT Open Source Software Center
>>>>> "Kyotaro" == Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> writes:

 >> You might want to try running the same test, but after patching
 >> ExecSupportsMarkRestore to return false for index scans. This will
 >> cause the planner to insert a Materialize node to handle the
 >> mark/restore.

 Kyotaro> Mmm? The patch bt-nopin-v1.patch does not seem to contain a
 Kyotaro> change to ExecSupportsMarkRestore, and that very simple
 Kyotaro> function still appears to return true for T_Index(Only)Scan
 Kyotaro> after the patch is applied.

Right. I'm suggesting you change that, in order to determine what performance cost, if any, would result from abandoning the idea of doing mark/restore entirely.

-- Andrew (irc:RhodiumToad)
Just a reminder, but I am not the author of this patch:)

At Fri, 27 Feb 2015 07:26:38 +0000, Andrew Gierth <andrew@tao11.riddles.org.uk> wrote in <87zj7z6ckc.fsf@news-spur.riddles.org.uk>
> >>>>> "Kyotaro" == Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> writes:
>
> >> You might want to try running the same test, but after patching
> >> ExecSupportsMarkRestore to return false for index scans. This will
> >> cause the planner to insert a Materialize node to handle the
> >> mark/restore.
>
> Kyotaro> Mmm? The patch bt-nopin-v1.patch does not seem to contain a
> Kyotaro> change to ExecSupportsMarkRestore, and that very simple
> Kyotaro> function still appears to return true for T_Index(Only)Scan
> Kyotaro> after the patch is applied.
>
> Right. I'm suggesting you change that, in order to determine what
> performance cost, if any, would result from abandoning the idea of doing
> mark/restore entirely.

I understand that you'd like to see the net drag of performance by the memcpy(), right? That doesn't seem to be possible without breaking (a part of) the patch's function, but, concerning the btmarkpos()-only case, the mark struct can contain garbage, so only changing the refcount would be viable to see the pure loss from the memcpy(). The attached patch is the destruction I did, and the result was like below.

> Case 2. 1M markpos, no restorepos for 1M result rows.
>
> Index-only scan: The patched version loses about 10%.
> master: 3655 ms (std dev = 2.5 ms)
> Patched: 4038 ms (std dev = 2.6 ms)

Patched: 4036 ms (std dev = 3.5 ms)
Broken:  3718 ms (std dev = 1.6 ms)

The "Patched" just above is the retaken number. It's a little under a 9% loss from the "broken" version. That is the pure drag of the memcpy(). This seems quite big, as Heikki said. It's an extreme case, though. The remaining 1.7% should be the share of all the other stuff.

regards,

-- Kyotaro Horiguchi NTT Open Source Software Center

diff --git a/src/backend/access/nbtree/nbtree.c b/src/backend/access/nbtree/nbtree.c
index 65daf27..4973b63 100644
--- a/src/backend/access/nbtree/nbtree.c
+++ b/src/backend/access/nbtree/nbtree.c
@@ -547,12 +547,17 @@ btmarkpos(PG_FUNCTION_ARGS)
 	BTScanOpaque so = (BTScanOpaque) scan->opaque;
 
 	hoge_markpos_count++;
-	memcpy(&so->markPos, &so->currPos,
-		   offsetof(BTScanPosData, items[1]) +
-		   so->currPos.lastItem * sizeof(BTScanPosItem));
-	if (so->markTuples)
-		memcpy(so->markTuples, so->currTuples,
-			   so->currPos.nextTupleOffset);
+//	memcpy(&so->markPos, &so->currPos,
+//		   offsetof(BTScanPosData, items[1]) +
+//		   so->currPos.lastItem * sizeof(BTScanPosItem));
+//	if (so->markTuples)
+//		memcpy(so->markTuples, so->currTuples,
+//			   so->currPos.nextTupleOffset);
+	if (BTScanPosIsValid(so->markPos))
+	{
+		ReleaseBuffer(so->markPos.buf);
+		so->markPos.buf = InvalidBuffer;
+	}
 
 	/* We don't take out an extra pin for the mark position. */
 	so->markPos.buf = InvalidBuffer;
@@ -599,12 +604,13 @@ btrestrpos(PG_FUNCTION_ARGS)
 
 	if (BTScanPosIsValid(so->markPos))
 	{
-		memcpy(&so->currPos, &so->markPos,
-			   offsetof(BTScanPosData, items[1]) +
-			   so->markPos.lastItem * sizeof(BTScanPosItem));
-		if (so->currTuples)
-			memcpy(so->currTuples, so->markTuples,
-				   so->markPos.nextTupleOffset);
+//		memcpy(&so->currPos, &so->markPos,
+//			   offsetof(BTScanPosData, items[1]) +
+//			   so->markPos.lastItem * sizeof(BTScanPosItem));
+//		if (so->currTuples)
+//			memcpy(so->currTuples, so->markTuples,
+//				   so->markPos.nextTupleOffset);
+		IncrBufferRefCount(so->markPos.buf);
 	}
 	else
 		BTScanPosInvalidate(so->currPos);
>>>>> "Kyotaro" == Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> writes:

 Kyotaro> Just a reminder, but I am not the author of this patch:)

Yes, I understand that.

 Kyotaro> Mmm? The patch bt-nopin-v1.patch does not seem to contain a
 Kyotaro> change to ExecSupportsMarkRestore, and that very simple
 Kyotaro> function still appears to return true for T_Index(Only)Scan
 Kyotaro> after the patch is applied.

 >> Right. I'm suggesting you change that, in order to determine what
 >> performance cost, if any, would result from abandoning the idea of
 >> doing mark/restore entirely.

 Kyotaro> I understand that you'd like to see the net drag of
 Kyotaro> performance by the memcpy(), right?

No.

What I am suggesting is this: if mark/restore is a performance issue, then it would be useful to know how much gain we're getting (if any) from supporting it _at all_.

Let me try and explain it another way. If you change ExecSupportsMarkRestore to return false for index scans, then btmarkpos/btrestrpos will no longer be called. What is the performance of this case compared to the original and patched versions?

-- Andrew (irc:RhodiumToad)
Hello.

Mmm... We might be focusing on slightly different things..

2015/02/27 17:27 "Andrew Gierth":
> Kyotaro> I understand that you'd like to see the net drag of
> Kyotaro> performance by the memcpy(), right?
>
> No.
>
> What I am suggesting is this: if mark/restore is a performance issue,
> then it would be useful to know how much gain we're getting (if any)
> from supporting it _at all_.

The mark/restore works as before with this patch, except for the stuff for early dropping of buffer pins. As far as I understand it so far, all the stuff in the patch is for that purpose and is not intended to boost btree itself. That is, it reduces the chance of blocking vacuum at the cost of a possible burden on general performance. And I think the current issue in focus is that the burden might be a bit too heavy in specific situations. Therefore I tried to measure how heavy or light the burden is.

Am I correct so far?

> Let me try and explain it another way. If you change
> ExecSupportsMarkRestore to return false for index scans, then
> btmarkpos/btrestrpos will no longer be called. What is the performance
> of this case compared to the original and patched versions?

As you mentioned before, such a change will make the planner turn to, perhaps, a materialize plan or rescan, or some other scan which is of no use for comparison with an index scan concerning the issue in focus: the performance drag when a btree scan does extremely frequent mark/restore, in comparison to the unpatched, smaller-copy version. Your suggestion looks like it intends something different from this.

Anyway I'm sorry but I have left my dev env and cannot have time till next week.

regards,

-- Kyotaro Horiguchi
>>>>> "Kyotaro" == Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> writes:

 Kyotaro> Anyway I'm sorry but I have left my dev env and cannot have
 Kyotaro> time till next week.

The point is moot; rather than try and explain it further I did the test myself, and the performance penalty of disabling mark/restore is substantial, on the order of 30%, so that's a non-starter. (I was a bit surprised that it was so bad...)

-- Andrew (irc:RhodiumToad)
Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote: > Hello, I measured the performance of this patch considering > markpos/restorepos. The loss seems to be up to about 10% > unfortunately. Thanks for the test case! I took another look at this optimization, and found that it didn't really depend on the pin (as I had first thought), so I put it back (keeping the rest of the patch unchanged). I saw a 1.4% increase in run time with the patch (compared to master) for the mark-only test, and a 1.6% increase in run time for the mark/restore test. I'll look into what might be causing that. I have seen bigger differences just due to changing where executable code crossed cache boundaries, but this is big enough to worry about; it needs to be checked. New patch attached. -- Kevin Grittner EDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Kevin Grittner <kgrittn@ymail.com> wrote:
> [need to check performance more]

It looks like the remaining performance regression was indeed a result of code alignment. I found two "paranoia" assignments I had accidentally failed to put back with the rest of the mark/restore optimization; after that trivial change the patched version is ever-so-slightly faster than master on all tests.

Average of 3 `make check-world` runs:
 master: 336.288 seconds
 patch: 332.237 seconds

Average of 6 `make check` runs:
 master: 25.409 seconds
 patch: 25.270 seconds

The following were all run 12 times, the worst 2 dropped, the rest averaged:

Kyotaro-san's 1m mark "worst case" benchmark:
 master: 962.581 ms, 6.765 stdev
 patch: 947.518 ms, 3.584 stdev

Kyotaro-san's 500k mark, 500k restore "worst case" benchmark:
 master: 1366.639 ms, 5.314 stdev
 patch: 1363.844 ms, 5.896 stdev

pgbench -i -s 16 pgbench / pgbench -c 16 -j 4 -T 500 pgbench
 master: 265.011 tps, 4.952 stdev
 patch: 267.164 tps, 9.094 stdev

How do people feel about the idea of me pushing this for 9.5 (after I clean up all the affected comments and README files)? I know this first appeared in the last CF, but the footprint is fairly small and the only user-visible behavior change is that a btree index scan of a WAL-logged table using an MVCC snapshot no longer blocks a vacuum indefinitely. (If there are objections I will move this to the first CF for the next release.)

 src/backend/access/nbtree/nbtree.c | 50 +++++-------
 src/backend/access/nbtree/nbtsearch.c | 141 +++++++++++++++++++++++++---------
 src/backend/access/nbtree/nbtutils.c | 58 ++++++++----
 src/include/access/nbtree.h | 36 ++++++++-
 4 files changed, 201 insertions(+), 84 deletions(-)

-- Kevin Grittner EDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Hello,

> It looks like the remaining performance regression was indeed a
> result of code alignment. I found two "paranoia" assignments I had
> accidentally failed to put back with the rest of the mark/restore
> optimization; after that trivial change the patched version is
> ever-so-slightly faster than master on all tests.

The performance which your test shows looks great. The gain might comes from removing of buffer locking on _bt_next. I also would like to see this to come along with 9.5 but it is the matter for committers.

Apart from it, I looked into the patch itself more closely. Please reconcile as you like among contradicting comments below:)

In nbtutils.c, _bt_killitems:

- This is not in charge of this patch, but the comment for _bt_killitems looks to have a trivial typo.

> * _bt_killitems - set LP_DEAD state for items an indexscan caller has
> * told us were killed
> *
> * scan->so contains information about the current page and killed tuples
> * thereon (generally, this should only be called if so->numKilled > 0).

I suppose "scan->so" should be "scan->opaque" and "so->numKilled" be "opaque->numKilled".

- The following comment looks somewhat wrong.

> /* Since the else condition needs page, get it here, too. */

"the else condition needs page" means "the page of the buffer is needed later"? But, I suppose the comment itself is not necessary.

- If (BTScanPosIsPinned(so->currPos)).

As I mention below for nbtsearch.c, the page aquired in the if-then block may be vacuumed so the LSN check done in the if-else block is need also in the if-then block. It will be accomplished only by changing the position of the end of the if-else block.

- _bt_killitems is called without pin when rescanning from before, so I think the previous code has already considered the unpinned case ("if (ItemPointerEquals(&ituple..." does that). Vacuum rearranges line pointers after deleting LP_DEAD items so the previous check seems enough for the purpose. The previous code is more effeicient becuase the mathing items will be processed even after vacuum.

In nbtsearch.c

- _bt_first(): It now releases the share lock on the page before the items to be killed is processed. This allows other backends to insert items into the page simultaneously. It seems to be dangerous alone, but _bt_killitems already considers the case. Anyway I think It'll be better to add a comment mentioning unlocking is not dangerous.

... Sorry, time's up for now.

regards,

> Average of 3 `make check-world` runs:
> master: 336.288 seconds
> patch: 332.237 seconds
>
> Average of 6 `make check` runs:
> master: 25.409 seconds
> patch: 25.270 seconds
>
> The following were all run 12 times, the worst 2 dropped, the rest
> averaged:
>
> Kyotaro-san's 1m mark "worst case" benchmark:
> master: 962.581 ms, 6.765 stdev
> patch: 947.518 ms, 3.584 stdev
>
> Kyotaro-san's 500k mark, 500k restore "worst case" benchmark:
> master: 1366.639 ms, 5.314 stdev
> patch: 1363.844 ms, 5.896 stdev
>
> pgbench -i -s 16 pgbench / pgbench -c 16 -j 4 -T 500 pgbench
> master: 265.011 tps, 4.952 stdev
> patch: 267.164 tps, 9.094 stdev
>
> How do people feel about the idea of me pushing this for 9.5 (after
> I clean up all the affected comments and README files)? I know
> this first appeared in the last CF, but the footprint is fairly
> small and the only user-visible behavior change is that a btree
> index scan of a WAL-logged table using an MVCC snapshot no longer
> blocks a vacuum indefinitely. (If there are objections I will move
> this to the first CF for the next release.)
>
> src/backend/access/nbtree/nbtree.c | 50 +++++-------
> src/backend/access/nbtree/nbtsearch.c | 141 +++++++++++++++++++++++++---------
> src/backend/access/nbtree/nbtutils.c | 58 ++++++++----
> src/include/access/nbtree.h | 36 ++++++++-
> 4 files changed, 201 insertions(+), 84 deletions(-)

-- Kyotaro Horiguchi NTT Open Source Software Center
Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote: > The performance which your test shows looks great. The gain might > comes from removing of buffer locking on _bt_next. Yeah, I had been hoping that removing some buffer pins and locks from the common path of scanning forward from one page to the next might have some direct performance benefit; it's possible that this will show up more dramatically at high concurrency levels. The main goal, of course, was to improve performance more indirectly, by not having autovacuum worker processes blocked for extended periods in some situations where that can now happen. > In nbtutils.c, _bt_killitems: > > - This is not in charge of this patch, but the comment for > _bt_killitems looks to have a trivial typo. As you say, that problem was already there, but no reason not to fix it in the patch as long as we're here. Reworded. > - The following comment looks somewhat wrong. > > > /* Since the else condition needs page, get it here, too. */ > > "the else condition needs page" means "the page of the buffer > is needed later"? But, I suppose the comment itself is not > necessary. I guess that comment isn't really worth the space it takes up; what it says is pretty obvious from the code. Removed. > - If (BTScanPosIsPinned(so->currPos)). > > As I mention below for nbtsearch.c, the page aquired in the > if-then block may be vacuumed so the LSN check done in the > if-else block is need also in the if-then block. It will be > accomplished only by changing the position of the end of the > if-else block. I'm not sure I agree on this. If the page is pinned it should have been pinned continuously since we initially read it, so the line pointers we have could not have been removed by any vacuum process. The tuples may have been pruned away in the heap, but that doesn't matter. Index entries may have been added and the index page may have been split, but that was true before this patch and _bt_killitems will deal with those things the same as it always has. I don't see any benefit to doing the LSN compare in this case; if we've paid the costs of holding the pin through to this point, we might as well flag any dead entries we can. > - _bt_killitems is called without pin when rescanning from > before, so I think the previous code has already considered the > unpinned case ("if (ItemPointerEquals(&ituple..." does > that). Vacuum rearranges line pointers after deleting LP_DEAD > items so the previous check seems enough for the purpose. The > previous code is more effeicient becuase the mathing items will > be processed even after vacuum. I'm not following you on this one; could you rephrase it? > In nbtsearch.c > > - _bt_first(): It now releases the share lock on the page before > the items to be killed is processed. This allows other backends > to insert items into the page simultaneously. It seems to be > dangerous alone, but _bt_killitems already considers the > case. Anyway I think It'll be better to add a comment > mentioning unlocking is not dangerous. Well, _bt_killitems does acquire a shared (read) lock to flag index tuples as LP_DEAD, and it has always been OK for the index page to accept inserts and allow page splits before calling _bt_killitems. I don't really see anything new with the patch in this regard. I do agree, though, that a lot of comments and at least two README files refer to the pins blocking vacuum, and these must be found and changed. 
It doesn't seem worth posting to the list for the small changes since the last version; I'll wait until I update the comments and README files. If you want review or test the latest, you can peek at: https://github.com/kgrittn/postgres/tree/btnopin -- Kevin Grittner EDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Kevin Grittner <kgrittn@ymail.com> wrote: > It doesn't seem worth posting to the list for the small changes > since the last version; I'll wait until I update the comments and > README files. If you want review or test the latest, you can > peek at: > > https://github.com/kgrittn/postgres/tree/btnopin Here is v3, with the promised README and code comment changes. In the process of documenting the mark/restore area I noticed a subtlety that I had missed (in the case that there was a mark, advance to the next page, restore, advance within the page, and restore). I fixed that, and in the process gave the patched code an additional direct performance edge over unpatched code. For the 1000k marks, average timings are now: master: 970.999 ms, stdev: 4.043 patched: 940.460 ms, stdev: 4.874 So, in the micro-benchmark showing the biggest benefit the direct improvement is now just over 3%. It remains the case that the primary motivation for the patch is to reduce blocking of vacuum processes; but that's a nice side-benefit. -- Kevin Grittner EDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Hello, I found no other problem including the performance issue in the patch attached to the last mail as far as I can understand. So I'll mark this as ready for commit after a few days with no objection after this comments is addressed.

> > - If (BTScanPosIsPinned(so->currPos)).
> >
> > As I mention below for nbtsearch.c, the page aquired in the
> > if-then block may be vacuumed so the LSN check done in the
> > if-else block is need also in the if-then block. It will be
> > accomplished only by changing the position of the end of the
> > if-else block.
>
> I'm not sure I agree on this.

Sorry, it is largely because of my poor composition.

> If the page is pinned it should have
> been pinned continuously since we initially read it, so the line
> pointers we have could not have been removed by any vacuum process.
> The tuples may have been pruned away in the heap, but that doesn't
> matter. Index entries may have been added and the index page may
> have been split, but that was true before this patch and
> _bt_killitems will deal with those things the same as it always
> has.

Yes. I think so.

> I don't see any benefit to doing the LSN compare in this
> case; if we've paid the costs of holding the pin through to this
> point, we might as well flag any dead entries we can.

I thought of the case that the buffer has been pinned by another backend after this backend unpinned it. I looked again the definition of BTScanPosIsPinned and BTScanPosUnpinIfPinned, and understood that the pin should be mine if BTScanPosIsPinned.

Would you mind rewriting the comment there like this?

- /* The buffer is still pinned, but not locked. Lock it now. */
+ /* I still hold the pin on the buffer, but not locked. Lock it now. */
| LockBuffer(so->currPos.buf, BT_READ);

Or would you mind renaming the macro as "BTScanPosIsPinnedByMe" or something like, or anyway to notify the reader that the pin should be mine there?

> > - _bt_killitems is called without pin when rescanning from
> > before, so I think the previous code has already considered the
> > unpinned case ("if (ItemPointerEquals(&ituple..." does
> > that). Vacuum rearranges line pointers after deleting LP_DEAD
> > items so the previous check seems enough for the purpose. The
> > previous code is more effeicient becuase the mathing items will
> > be processed even after vacuum.
>
> I'm not following you on this one; could you rephrase it?

Sorry, I read btrescan incorrectly that it calls _bt_killitems() *after* releasing the buffer. Please forget it.

Finally, I'd like to timidly comment on this..

+ To handle the cases where it is safe to release the pin after
+ reading the index page, the LSN of the index page is read along
+ with the index entries before the shared lock is released, and we
+ do not attempt to flag index tuples as dead if the page is not
+ pinned and the LSN does not match.

I suppose that the sentence like following is more directly describing about the mechanism and easier to find the correnponsing code, if it is correct.

> To handle the cases where a index page has unpinned when
> trying to mark the unused index tuples on the page as dead,
> the LSN of the index page is remembered when reading the index
> page for index tuples, and we do not attempt to flag index
> tuples as dead if the page is not pinned and the LSN does not
> match.

regards,

-- Kyotaro Horiguchi NTT Open Source Software Center
Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote: > I found no other problem including the performance issue in the > patch attached to the last mail as far as I can understand. So > I'll mark this as ready for commit after a few days with no > objection after this comments is addressed. Thanks for the reviews! >> I don't see any benefit to doing the LSN compare in this >> case; if we've paid the costs of holding the pin through to this >> point, we might as well flag any dead entries we can. > > I thought of the case that the buffer has been pinned by another > backend after this backend unpinned it. I looked again the > definition of BTScanPosIsPinned and BTScanPosUnpinIfPinned, and > understood that the pin should be mine if BTScanPosIsPinned. > > Would you mind rewriting the comment there like this? > > - /* The buffer is still pinned, but not locked. Lock it now. */ > + /* I still hold the pin on the buffer, but not locked. Lock it now. */ > | LockBuffer(so->currPos.buf, BT_READ); > > Or would you mind renaming the macro as "BTScanPosIsPinnedByMe" > or something like, or anyway to notify the reader that the pin > should be mine there? I see your point, although those first person singular pronouns used like that make me a little uncomfortable; I'll change the comment and/or macro name, but I'll work on the name some more. > Finally, I'd like to timidly comment on this.. > > + To handle the cases where it is safe to release the pin after > + reading the index page, the LSN of the index page is read along > + with the index entries before the shared lock is released, and we > + do not attempt to flag index tuples as dead if the page is not > + pinned and the LSN does not match. > > I suppose that the sentence like following is more directly > describing about the mechanism and easier to find the > correnponsing code, if it is correct. > >> To handle the cases where a index page has unpinned when >> trying to mark the unused index tuples on the page as dead, >> the LSN of the index page is remembered when reading the index >> page for index tuples, and we do not attempt to flag index >> tuples as dead if the page is not pinned and the LSN does not >> match. Will reword that part to try to make it clearer. Thanks! -- Kevin Grittner EDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Kevin Grittner <kgrittn@ymail.com> wrote:
> Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote:
>> Would you mind rewriting the comment there like this?
>>
>> - /* The buffer is still pinned, but not locked. Lock it now. */
>> + /* I still hold the pin on the buffer, but not locked. Lock it now. */
>>
>> Or would you mind renaming the macro as "BTScanPosIsPinnedByMe"
>> or something like, or anyway to notify the reader that the pin
>> should be mine there?
>
> I see your point, although those first person singular pronouns
> used like that make me a little uncomfortable; I'll change the
> comment and/or macro name, but I'll work on the name some more.

After thinking it over, I think that the "BTScanPos" part of the macro name is enough of a hint that it is concerned with the actions of this scan; it is the comment that needs the change. I went with:

/*
 * We have held the pin on this page since we read the index tuples,
 * so all we need to do is lock it. The pin will have prevented
 * re-use of any TID on the page, so there is no need to check the
 * LSN.
 */

>> + To handle the cases where it is safe to release the pin after
>> + reading the index page, the LSN of the index page is read along
>> + with the index entries before the shared lock is released, and we
>> + do not attempt to flag index tuples as dead if the page is not
>> + pinned and the LSN does not match.
>>
>> I suppose that the sentence like following is more directly
>> describing about the mechanism and easier to find the
>> correnponsing code, if it is correct.
>>
>>> To handle the cases where a index page has unpinned when
>>> trying to mark the unused index tuples on the page as dead,
>>> the LSN of the index page is remembered when reading the index
>>> page for index tuples, and we do not attempt to flag index
>>> tuples as dead if the page is not pinned and the LSN does not
>>> match.
>
> Will reword that part to try to make it clearer.

That sentence was full of "passive voice", which didn't help any. I changed it to:

| So that we can handle the cases where we attempt LP_DEAD flagging
| for a page after we have released its pin, we remember the LSN of
| the index page when we read the index tuples from it; we do not
| attempt to flag index tuples as dead if we didn't hold the
| pin the entire time and the LSN has changed.

Do these changes seem clear?

Because these changes are just to a comment and a README file, I'm posting a patch-on-patch v3a (to be applied on top of v3).

-- Kevin Grittner EDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Kevin Grittner <kgrittn@ymail.com> wrote: > Because these changes are just to a comment and a README file, > I'm posting a patch-on-patch v3a (to be applied on top of v3). Here is some additional comment work that I hope will make things easier to understand for whoever next visits this code. There is also a minor reorganization of some code that should not have any impact except to skip the restore memcpy() calls where they are not needed in a corner case that I'm not sure we can currently even reach. That reorganization is intended more for code clarity than for the marginal performance it could provide. I'm posting this as v3b to be applied on top of v3 and v3a, so that these changes can be reviewed and commented on separately. -- Kevin Grittner EDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Sat, Feb 14, 2015 at 4:19 PM, Kevin Grittner <kgrittn@ymail.com> wrote: > At some point we could consider building on this patch to recheck > index conditions for heap access when a non-MVCC snapshot is used, > check the visibility map for referenced heap pages when the TIDs > are read for an index-only scan, and/or skip LP_DEAD hinting for > non-WAL-logged indexes. But all those are speculative future work; > this is a conservative implementation that just didn't modify > pinning where there were any confounding factors. I don't have the bandwidth for a full review, but I took a quick look at this. I think you should call out those "confounding factors" in the README. It's not hard to piece them together from _bt_drop_lock_and_maybe_pin(), but I think you should anyway. Don't use BUFFER_LOCK_SHARE -- use BT_READ, as the existing nbtree LockBuffer() callers do. You're inconsistent about that in V3. -- Peter Geoghegan
Hello,

At Thu, 12 Mar 2015 15:27:37 -0700, Peter Geoghegan <pg@heroku.com> wrote in <CAM3SWZQ5nwTB-y4ZOj=5ckMLce5GAEUnjKJ2=M1vMHfX_aYmCg@mail.gmail.com>
> On Sat, Feb 14, 2015 at 4:19 PM, Kevin Grittner <kgrittn@ymail.com> wrote:
> > At some point we could consider building on this patch to recheck
> > index conditions for heap access when a non-MVCC snapshot is used,
> > check the visibility map for referenced heap pages when the TIDs
> > are read for an index-only scan, and/or skip LP_DEAD hinting for
> > non-WAL-logged indexes. But all those are speculative future work;
> > this is a conservative implementation that just didn't modify
> > pinning where there were any confounding factors.
>
> I don't have the bandwidth for a full review, but I took a quick look at this.
>
> I think you should call out those "confounding factors" in the README.
> It's not hard to piece them together from
> _bt_drop_lock_and_maybe_pin(), but I think you should anyway.
>
> Don't use BUFFER_LOCK_SHARE -- use BT_READ, as the existing nbtree
> LockBuffer() callers do. You're inconsistent about that in V3.

I agree with you. It looks like the only point where it is used.

In addition to that, the comment just above the point mentioned above puzzles me.

> /* XXX: Can walking left be lighter on the locking and pins? */
> if (BTScanPosIsPinned(so->currPos))
>     LockBuffer(so->currPos.buf, BUFFER_LOCK_SHARE);
> else
>     so->currPos.buf = _bt_getbuf(rel, so->currPos.currPage, BT_READ);

I would be happier if I could read the meaning of the comment more clearly. I understand it to say that you want to remove the locking (and pinning), but can't do so now because simultaneous splitting of the left page would break something. I'd like to see it stated more clearly, whether or not my reading is correct.

regards,

-- Kyotaro Horiguchi NTT Open Source Software Center
On 4 March 2015 at 03:16, Kevin Grittner <kgrittn@ymail.com> wrote:
> How do people feel about the idea of me pushing this for 9.5 (after
> I clean up all the affected comments and README files)? I know
> this first appeared in the last CF, but the footprint is fairly
> small and the only user-visible behavior change is that a btree
> index scan of a WAL-logged table using an MVCC snapshot no longer
> blocks a vacuum indefinitely. (If there are objections I will move
> this to the first CF for the next release.)

It helps Hot Standby also, BTW. I proposed this previously, but it was shot down, so I'm glad to see it happening.

I'm not sure why you have proposed only half the solution though? Hopefully we aren't just submitting the difficult half that you needed feedback on? Let's see the whole "snapshot too old" patch as well, since that affects core PostgreSQL also.

-- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, RemoteDBA, Training & Services
On 15 February 2015 at 00:19, Kevin Grittner <kgrittn@ymail.com> wrote:
> What they wanted was what happened in the other database product --
> if a snapshot got sufficiently old that cleaning up the MVCC data
> was a problem *and* the snapshot was used again *and* it read a
> page which had been modified far enough back that it was not
> possible to return correct data, then they wanted to receive a
> "snapshot too old" error. I wrote a patch to do that...

So, please let's see the patch. It seems useful for core Postgres.

-- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, RemoteDBA, Training & Services
On Fri, Mar 13, 2015 at 8:52 AM, Simon Riggs <simon@2ndquadrant.com> wrote: > On 15 February 2015 at 00:19, Kevin Grittner <kgrittn@ymail.com> wrote: >> What they wanted was what happened in the other database product -- >> if a snapshot got sufficiently old that cleaning up the MVCC data >> was a problem *and* the snapshot was used again *and* it read a >> page which had been modified far enough back that it was not >> possible to return correct data, then they wanted to receive a >> "snapshot too old" error. I wrote a patch to do that... > > So, please lets see the patch. It seems useful for core Postgres. It was submitted here: http://www.postgresql.org/message-id/136937748.3364317.1423964815320.JavaMail.yahoo@mail.yahoo.com -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote: > At Thu, 12 Mar 2015 15:27:37 -0700, Peter Geoghegan <pg@heroku.com> wrote: >> On Sat, Feb 14, 2015 at 4:19 PM, Kevin Grittner <kgrittn@ymail.com> wrote: >>> At some point we could consider building on this patch to >>> recheck index conditions for heap access when a non-MVCC >>> snapshot is used, check the visibility map for referenced heap >>> pages when the TIDs are read for an index-only scan, and/or >>> skip LP_DEAD hinting for non-WAL-logged indexes. But all those >>> are speculative future work; this is a conservative >>> implementation that just didn't modify pinning where there were >>> any confounding factors. >> I think you should call out those "confounding factors" in the >> README. It's not hard to piece them together from >> _bt_drop_lock_and_maybe_pin(), but I think you should anyway. OK: https://github.com/kgrittn/postgres/commit/f5f59ded30b114ac83b90a00ba1fa5ef490b994e >> Don't use BUFFER_LOCK_SHARE -- use BT_READ, as the existing >> nbtree LockBuffer() callers do. You're inconsistent about that >> in V3. > > I agree with you. It looks the only point where it is used. OK: https://github.com/kgrittn/postgres/commit/76118b58be819ed5e68569c926d0222bc41640ea > Addition to that, the commnet just above the point methioned > above quizes me. > >> /* XXX: Can walking left be lighter on the locking and pins? */ >> if (BTScanPosIsPinned(so->currPos)) >> LockBuffer(so->currPos.buf, BUFFER_LOCK_SHARE); >> else >> so->currPos.buf = _bt_getbuf(rel, so->currPos.currPage, BT_READ); > > I'm happy if I could read the meaming of the comment more > clearly. I understand that it says that you want to remove the > locking (and pinning), but can't now because the simultaneous > splitting of the left page would break something. I'd like to see > it clearer even for me either I am correct or not.. Does this clear it up?: https://github.com/kgrittn/postgres/commit/22066fc161a092e800e4c1e853136c4513f8771b Since there are no changes that would affect the compiled code here, I'm not posting a new patch yet. I'll do that once things seem to have settled down a bit more. -- Kevin Grittner EDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Robert Haas <robertmhaas@gmail.com> wrote: > On Fri, Mar 13, 2015 at 8:52 AM, Simon Riggs <simon@2ndquadrant.com> wrote: >> On 15 February 2015 at 00:19, Kevin Grittner <kgrittn@ymail.com> wrote: >>> What they wanted was what happened in the other database >>> product -- if a snapshot got sufficiently old that cleaning up >>> the MVCC data was a problem *and* the snapshot was used again >>> *and* it read a page which had been modified far enough back >>> that it was not possible to return correct data, then they >>> wanted to receive a "snapshot too old" error. I wrote a patch >>> to do that... >> >> So, please lets see the patch. It seems useful for core >> Postgres. > > It was submitted here: > > http://www.postgresql.org/message-id/136937748.3364317.1423964815320.JavaMail.yahoo@mail.yahoo.com The feedback was generally fairly positive except for the fact that snapshot "age" (for purposes of being too old) was measured in transaction IDs assigned. There seemed to be a pretty universal feeling that this needed to be changed to a time-based setting. I've been working on that, although my initial efforts proved to be a dead-end. I've had some off-list discussions with Andrew Dunstan and we have agreed on something that looks like it should work well, but it's still being coded. The other thing that is needed is to sprinkle the other index AMs with TestForOldSnapshot() calls. In the btree code the right spots for those were all following BufferGetPage() calls, so the work was in seeing where each of those could be called from, to determine whether it was a spot where the check was needed. Of course, docs and comment changes are also needed, but that's probably only a day or two's effort. Further discussion of the "snapshot too old" patch should probably go on its thread. The spillover is largely my fault for referencing one from the other. -- Kevin Grittner EDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
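For readers who want to picture what that "sprinkling" looks like: in the btree code each check sits immediately after a BufferGetPage() call, before any data from the page is used. Below is a minimal sketch of one plausible shape for such a call site. The body of TestForOldSnapshot(), the snapshot fields it reads, the SnapshotIsPastThreshold() helper, and the error code are assumptions made here purely for illustration; the authoritative definition is whatever the "snapshot too old" patch on its own thread ends up with.

    /*
     * Illustrative sketch only -- not the submitted patch.  It lives in
     * backend code (postgres.h, storage/bufpage.h, utils/rel.h and
     * utils/tqual.h assumed included) and assumes the snapshot remembers
     * the insert LSN, plus some notion of age, from when it was taken.
     */
    static inline void
    TestForOldSnapshot(Snapshot snapshot, Relation rel, Page page)
    {
        if (IsMVCCSnapshot(snapshot) &&
            RelationNeedsWAL(rel) &&
            PageGetLSN(page) > snapshot->lsn &&      /* assumed field */
            SnapshotIsPastThreshold(snapshot))       /* hypothetical age test */
            ereport(ERROR,
                    (errcode(ERRCODE_SNAPSHOT_TOO_OLD),  /* assumed error code */
                     errmsg("snapshot too old")));
    }

    /* The pattern at each page read in an index AM would then be: */
    page = BufferGetPage(so->currPos.buf);
    TestForOldSnapshot(scan->xs_snapshot, scan->indexRelation, page);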
On 13 March 2015 at 13:17, Robert Haas <robertmhaas@gmail.com> wrote: > On Fri, Mar 13, 2015 at 8:52 AM, Simon Riggs <simon@2ndquadrant.com> wrote: >> On 15 February 2015 at 00:19, Kevin Grittner <kgrittn@ymail.com> wrote: >>> What they wanted was what happened in the other database product -- >>> if a snapshot got sufficiently old that cleaning up the MVCC data >>> was a problem *and* the snapshot was used again *and* it read a >>> page which had been modified far enough back that it was not >>> possible to return correct data, then they wanted to receive a >>> "snapshot too old" error. I wrote a patch to do that... >> >> So, please lets see the patch. It seems useful for core Postgres. > > It was submitted here: > > http://www.postgresql.org/message-id/136937748.3364317.1423964815320.JavaMail.yahoo@mail.yahoo.com Thanks. I have +1'd that patch. -- Simon Riggs http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, RemoteDBA, Training & Services
On 13 March 2015 at 15:41, Kevin Grittner <kgrittn@ymail.com> wrote: > The feedback was generally fairly positive except for the fact that > snapshot "age" (for purposes of being too old) was measured in > transaction IDs assigned. There seemed to be a pretty universal > feeling that this needed to be changed to a time-based setting. -1 for a time based setting. After years of consideration, bloat is now controllable by altering the size of the undo tablespace. I think PostgreSQL needs something size-based also. It would need some estimation to get it to work like that, true, but it is actually the size of the bloat we care about, not the time. So we should be thinking in terms of limits that we actually care about. -- Simon Riggs http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, RemoteDBA, Training & Services
Thank you for rewriting. I am satisfied with the current patch (the head of the btnopin branch in your github repo). I have no further comments on the functions. Peter might have a comment on the README description, but I suppose I can pass this on to the committers. I have attached your latest patch to this mail as bt-nopin-v4.patch for now. Please check that there's no problem with it.
- With this patch, an index scan releases buffer pins while fetching index tuples from a page, so it should reduce the chance of a long-running index scan blocking vacuum, because vacuum can now easily overtake the current position of an index scan. I didn't actually measure how effective it is, though.
- It looks to work correctly for scans in both directions with simultaneous page splitting and vacuum.
- It causes no performance deterioration; on the contrary, it accelerates index scans. That seems to be because of the removal of the lock and unlock surrounding _bt_steppage in _bt_next.
- It applies cleanly on the current head of the master branch.
- It has sufficiently intelligible comments and README descriptions.
- An unrelated typo fix is made in storage/buffer/README by this patch, but it doesn't seem necessary to split it out.
regards, At Fri, 13 Mar 2015 15:08:25 +0000 (UTC), Kevin Grittner <kgrittn@ymail.com> wrote in <848652045.4044965.1426259305233.JavaMail.yahoo@mail.yahoo.com> > Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote: > > At Thu, 12 Mar 2015 15:27:37 -0700, Peter Geoghegan <pg@heroku.com> wrote: > >> On Sat, Feb 14, 2015 at 4:19 PM, Kevin Grittner <kgrittn@ymail.com> wrote: > >> I think you should call out those "confounding factors" in the > >> README. It's not hard to piece them together from > >> _bt_drop_lock_and_maybe_pin(), but I think you should anyway. > > OK: > > https://github.com/kgrittn/postgres/commit/f5f59ded30b114ac83b90a00ba1fa5ef490b994e It looks perfect to me. Do you have any further requests on this, Peter? > >> Don't use BUFFER_LOCK_SHARE -- use BT_READ, as the existing > >> nbtree LockBuffer() callers do. You're inconsistent about that > >> in V3. > > > > I agree with you. It looks the only point where it is used. > > OK: > > https://github.com/kgrittn/postgres/commit/76118b58be819ed5e68569c926d0222bc41640ea > > > Addition to that, the commnet just above the point methioned > > above quizes me. > > > >> /* XXX: Can walking left be lighter on the locking and pins? */ > >> if (BTScanPosIsPinned(so->currPos)) > >> LockBuffer(so->currPos.buf, BUFFER_LOCK_SHARE); > >> else > >> so->currPos.buf = _bt_getbuf(rel, so->currPos.currPage, BT_READ); > > > > I'm happy if I could read the meaming of the comment more > > clearly. I understand that it says that you want to remove the > > locking (and pinning), but can't now because the simultaneous > > splitting of the left page would break something. I'd like to see > > it clearer even for me either I am correct or not.. > > Does this clear it up?: > > https://github.com/kgrittn/postgres/commit/22066fc161a092e800e4c1e853136c4513f8771b Thank you for the effort. It also looks perfect. > Since there are no changes that would affect the compiled code > here, I'm not posting a new patch yet. I'll do that once things > seem to have settled down a bit more.
-- Kyotaro Horiguchi NTT Open Source Software Center diff --git a/src/backend/access/nbtree/README b/src/backend/access/nbtree/README index 213542c..c904d2f 100644 --- a/src/backend/access/nbtree/README +++ b/src/backend/access/nbtree/README @@ -92,17 +92,18 @@ To minimize lock/unlock traffic, an index scan always searches a leaf pageto identify all the matchingitems at once, copying their heap tuple IDsinto backend-local storage. The heap tuple IDs are then processed whilenotholding any page lock within the index. We do continue to hold a pin -on the leaf page, to protect against concurrent deletions (see below). -In this state the scan is effectively stopped "between" pages, either -before or after the page it has pinned. This is safe in the presence of -concurrent insertions and even page splits, because items are never moved -across pre-existing page boundaries --- so the scan cannot miss any items -it should have seen, nor accidentally return the same item twice. The scan -must remember the page's right-link at the time it was scanned, since that -is the page to move right to; if we move right to the current right-link -then we'd re-scan any items moved by a page split. We don't similarly -remember the left-link, since it's best to use the most up-to-date -left-link when trying to move left (see detailed move-left algorithm below). +on the leaf page in some circumstances, to protect against concurrent +deletions (see below). In this state the scan is effectively stopped +"between" pages, either before or after the page it has pinned. This is +safe in the presence of concurrent insertions and even page splits, because +items are never moved across pre-existing page boundaries --- so the scan +cannot miss any items it should have seen, nor accidentally return the same +item twice. The scan must remember the page's right-link at the time it +was scanned, since that is the page to move right to; if we move right to +the current right-link then we'd re-scan any items moved by a page split. +We don't similarly remember the left-link, since it's best to use the most +up-to-date left-link when trying to move left (see detailed move-left +algorithm below).In most cases we release our lock and pin on a page before attemptingto acquire pin and lock on the pagewe are moving to. In a few places @@ -154,25 +155,37 @@ starts. This is not necessary for correctness in terms of the btree indexoperations themselves; asexplained above, index scans logically stop"between" pages and so can't lose their place. The reason we do it is toprovidean interlock between non-full VACUUM and indexscans. Since VACUUM -deletes index entries before deleting tuples, the super-exclusive lock -guarantees that VACUUM can't delete any heap tuple that an indexscanning -process might be about to visit. (This guarantee works only for simple -indexscans that visit the heap in sync with the index scan, not for bitmap -scans. We only need the guarantee when using non-MVCC snapshot rules; in -an MVCC snapshot, it wouldn't matter if the heap tuple were replaced with -an unrelated tuple at the same TID, because the new tuple wouldn't be -visible to our scan anyway.) - -Because a page can be split even while someone holds a pin on it, it is -possible that an indexscan will return items that are no longer stored on -the page it has a pin on, but rather somewhere to the right of that page. 
-To ensure that VACUUM can't prematurely remove such heap tuples, we require -btbulkdelete to obtain super-exclusive lock on every leaf page in the index, -even pages that don't contain any deletable tuples. This guarantees that -the btbulkdelete call cannot return while any indexscan is still holding -a copy of a deleted index tuple. Note that this requirement does not say -that btbulkdelete must visit the pages in any particular order. (See also -on-the-fly deletion, below.) +deletes index entries before reclaiming heap tuple line pointers, the +super-exclusive lock guarantees that VACUUM can't reclaim for re-use a +line pointer that an indexscanning process might be about to visit. This +guarantee works only for simple indexscans that visit the heap in sync +with the index scan, not for bitmap scans. We only need the guarantee +when using non-MVCC snapshot rules; when using an MVCC snapshot, it +doesn't matter if the heap tuple is replaced with an unrelated tuple at +the same TID, because the new tuple won't be visible to our scan anyway. +Therefore, a scan using an MVCC snapshot which has no other confounding +factors will not hold the pin after the page contents are read. The +current reasons for exceptions, where a pin is still needed, are if the +index is not WAL-logged or if the scan is an index-only scan. If later +work allows the pin to be dropped for all cases we will be able to +simplify the vacuum code, since the concept of a super-exclusive lock +for btree indexes will no longer be needed. + +Because a pin is not always held, and a page can be split even while +someone does hold a pin on it, it is possible that an indexscan will +return items that are no longer stored on the page it has a pin on, but +rather somewhere to the right of that page. To ensure that VACUUM can't +prematurely remove such heap tuples, we require btbulkdelete to obtain a +super-exclusive lock on every leaf page in the index, even pages that +don't contain any deletable tuples. Any scan which could yield incorrect +results if the tuple at a TID matching the the scan's range and filter +conditions were replaced by a different tuple while the scan is in +progress must hold the pin on each index page until all index entries read +from the page have been processed. This guarantees that the btbulkdelete +call cannot return while any indexscan is still holding a copy of a +deleted index tuple if the scan could be confused by that. Note that this +requirement does not say that btbulkdelete must visit the pages in any +particular order. (See also on-the-fly deletion, below.)There is no such interlocking for deletion of items in internalpages,since backends keep no lock nor pin on a page they have descended past. @@ -396,8 +409,12 @@ that this breaks the interlock between VACUUM and indexscans, but that isnot so: as long as an indexscanningprocess has a pin on the page wherethe index item used to be, VACUUM cannot complete its btbulkdelete scanandso cannot remove the heap tuple. This is another reason why -btbulkdelete has to get super-exclusive lock on every leaf page, not only -the ones where it actually sees items to delete. +btbulkdelete has to get a super-exclusive lock on every leaf page, not +only the ones where it actually sees items to delete. 
So that we can +handle the cases where we attempt LP_DEAD flagging for a page after we +have released its pin, we remember the LSN of the index page when we read +the index tuples from it; we do not attempt to flag index tuples as dead +if the we didn't hold the pin the entire time and the LSN has changed.WAL Considerations------------------ @@ -462,7 +479,7 @@ metapage update (of the "fast root" link) is performed and logged as partof the insertion into the parentlevel. When splitting the root page, themetapage update is handled as part of the "new root" action. -Each step in page deletion are logged as separate WAL entries: marking the +Each step in page deletion is logged as a separate WAL entry: marking theleaf as half-dead and removing the downlink isone record, and unlinking apage is a second record. If vacuum is interrupted for some reason, or thesystem crashes, thetree is consistent for searches and insertions. The diff --git a/src/backend/access/nbtree/nbtinsert.c b/src/backend/access/nbtree/nbtinsert.c index 932c6f7..ef68a71 100644 --- a/src/backend/access/nbtree/nbtinsert.c +++ b/src/backend/access/nbtree/nbtinsert.c @@ -498,9 +498,9 @@ _bt_check_unique(Relation rel, IndexTuple itup, Relation heapRel, * If there's not enough roomin the space, we try to make room by * removing any LP_DEAD tuples. * - * On entry, *buf and *offsetptr point to the first legal position + * On entry, *bufptr and *offsetptr point to the first legal position * where the new tuple could be inserted. The caller should hold an - * exclusive lock on *buf. *offsetptr can also be set to + * exclusive lock on *bufptr. *offsetptr can also be set to * InvalidOffsetNumber, in which case the functionwill search for the * right location within the page if needed. On exit, they point to the * choseninsert location. If _bt_findinsertloc decides to move right, diff --git a/src/backend/access/nbtree/nbtree.c b/src/backend/access/nbtree/nbtree.c index 3af462b..9a6dcdd 100644 --- a/src/backend/access/nbtree/nbtree.c +++ b/src/backend/access/nbtree/nbtree.c @@ -404,7 +404,8 @@ btbeginscan(PG_FUNCTION_ARGS) /* allocate private workspace */ so = (BTScanOpaque) palloc(sizeof(BTScanOpaqueData)); - so->currPos.buf = so->markPos.buf = InvalidBuffer; + BTScanPosInvalidate(so->currPos); + BTScanPosInvalidate(so->markPos); if (scan->numberOfKeys > 0) so->keyData = (ScanKey) palloc(scan->numberOfKeys* sizeof(ScanKeyData)); else @@ -424,8 +425,6 @@ btbeginscan(PG_FUNCTION_ARGS) * scan->xs_itupdesc whether we'll need it or not, since that's so cheap. 
*/ so->currTuples = so->markTuples = NULL; - so->currPos.nextTupleOffset = 0; - so->markPos.nextTupleOffset = 0; scan->xs_itupdesc = RelationGetDescr(rel); @@ -451,17 +450,14 @@ btrescan(PG_FUNCTION_ARGS) { /* Before leaving current page, deal with any killed items */ if (so->numKilled > 0) - _bt_killitems(scan, false); - ReleaseBuffer(so->currPos.buf); - so->currPos.buf = InvalidBuffer; + _bt_killitems(scan); + BTScanPosUnpinIfPinned(so->currPos); + BTScanPosInvalidate(so->currPos); } - if (BTScanPosIsValid(so->markPos)) - { - ReleaseBuffer(so->markPos.buf); - so->markPos.buf = InvalidBuffer; - } so->markItemIndex = -1; + BTScanPosUnpinIfPinned(so->markPos); + BTScanPosInvalidate(so->markPos); /* * Allocate tuple workspace arrays, if needed for an index-only scan and @@ -515,17 +511,14 @@ btendscan(PG_FUNCTION_ARGS) { /* Before leaving current page, deal with any killed items*/ if (so->numKilled > 0) - _bt_killitems(scan, false); - ReleaseBuffer(so->currPos.buf); - so->currPos.buf = InvalidBuffer; + _bt_killitems(scan); + BTScanPosUnpinIfPinned(so->currPos); } - if (BTScanPosIsValid(so->markPos)) - { - ReleaseBuffer(so->markPos.buf); - so->markPos.buf = InvalidBuffer; - } so->markItemIndex = -1; + BTScanPosUnpinIfPinned(so->markPos); + + /* No need to invalidate positions, the RAM is about to be freed. */ /* Release storage */ if (so->keyData !=NULL) @@ -552,12 +545,8 @@ btmarkpos(PG_FUNCTION_ARGS) IndexScanDesc scan = (IndexScanDesc) PG_GETARG_POINTER(0); BTScanOpaqueso = (BTScanOpaque) scan->opaque; - /* we aren't holding any read locks, but gotta drop the pin */ - if (BTScanPosIsValid(so->markPos)) - { - ReleaseBuffer(so->markPos.buf); - so->markPos.buf = InvalidBuffer; - } + /* There may be an old mark with a pin (but no lock). */ + BTScanPosUnpinIfPinned(so->markPos); /* * Just record the current itemIndex. If we later step to next page @@ -568,7 +557,10 @@ btmarkpos(PG_FUNCTION_ARGS) if (BTScanPosIsValid(so->currPos)) so->markItemIndex = so->currPos.itemIndex; else + { + BTScanPosInvalidate(so->markPos); so->markItemIndex = -1; + } /* Also record the current positions of any array keys */ if (so->numArrayKeys) @@ -593,28 +585,54 @@ btrestrpos(PG_FUNCTION_ARGS) if (so->markItemIndex >= 0) { /* - * The mark position is on the same page we are currently on. Just + * The scan has never moved to a new page since the last mark. Just * restore the itemIndex. + * + * NB: In this case we can't count on anything in so->markPos to be + * accurate. */ so->currPos.itemIndex = so->markItemIndex; } + else if (so->currPos.currPage == so->markPos.currPage) + { + /* + * so->markItemIndex < 0 but mark and current positions are on the + * same page. This would be an unusual case, where the scan moved to + * a new index page after the mark, restored, and later restored again + * without moving off the marked page. It is not clear that this code + * can currently be reached, but it seems better to make this function + * robust for this case than to Assert() or elog() that it can't + * happen. + * + * We neither want to set so->markItemIndex >= 0 (because that could + * cause a later move to a new page to redo the memcpy() executions) + * nor re-execute the memcpy() functions for a restore within the same + * page. The previous restore to this page already set everything + * except markPos as it should be. 
+ */ + so->currPos.itemIndex = so->markPos.itemIndex; + } else { - /* we aren't holding any read locks, but gotta drop the pin */ + /* + * The scan moved to a new page after last mark or restore, and we are + * now restoring to the marked page. We aren't holding any read + * locks, but if we're still holding the pin for the current position, + * we must drop it. + */ if (BTScanPosIsValid(so->currPos)) { /* Before leaving current page, deal with anykilled items */ - if (so->numKilled > 0 && - so->currPos.buf != so->markPos.buf) - _bt_killitems(scan, false); - ReleaseBuffer(so->currPos.buf); - so->currPos.buf = InvalidBuffer; + if (so->numKilled > 0) + _bt_killitems(scan); + BTScanPosUnpinIfPinned(so->currPos); } if (BTScanPosIsValid(so->markPos)) { /* bump pin on mark buffer for assignment to current buffer */ - IncrBufferRefCount(so->markPos.buf); + if (BTScanPosIsPinned(so->markPos)) + IncrBufferRefCount(so->markPos.buf); memcpy(&so->currPos, &so->markPos, offsetof(BTScanPosData,items[1]) + so->markPos.lastItem * sizeof(BTScanPosItem)); @@ -622,6 +640,8 @@ btrestrpos(PG_FUNCTION_ARGS) memcpy(so->currTuples, so->markTuples, so->markPos.nextTupleOffset); } + else + BTScanPosInvalidate(so->currPos); } PG_RETURN_VOID(); diff --git a/src/backend/access/nbtree/nbtsearch.c b/src/backend/access/nbtree/nbtsearch.c index a681d6b..af4201a 100644 --- a/src/backend/access/nbtree/nbtsearch.c +++ b/src/backend/access/nbtree/nbtsearch.c @@ -22,6 +22,7 @@#include "storage/predicate.h"#include "utils/lsyscache.h"#include "utils/rel.h" +#include "utils/tqual.h"static bool _bt_readpage(IndexScanDesc scan, ScanDirection dir, @@ -31,6 +32,36 @@ static void _bt_saveitem(BTScanOpaque so, int itemIndex,static bool _bt_steppage(IndexScanDesc scan, ScanDirectiondir);static Buffer _bt_walk_left(Relation rel, Buffer buf);static bool _bt_endpoint(IndexScanDesc scan, ScanDirectiondir); +static void _bt_drop_lock_and_maybe_pin(IndexScanDesc scan, BTScanPos sp); + + +/* + * _bt_drop_lock_and_maybe_pin() + * + * Unlock the buffer; and if it is safe to release the pin, do that, too. It + * is safe if the scan is using an MVCC snapshot and the index is WAL-logged. + * This will prevent vacuum from stalling in a blocked state trying to read a + * page when a cursor is sitting on it -- at least in many important cases. + * + * Set the buffer to invalid if the pin is released, since the buffer may be + * re-used. If we need to go back to this block (for example, to apply + * LP_DEAD hints) we must get a fresh reference to the buffer. Hopefully it + * will remain in shared memory for as long as it takes to scan the index + * buffer page. + */ +static void +_bt_drop_lock_and_maybe_pin(IndexScanDesc scan, BTScanPos sp) +{ + LockBuffer(sp->buf, BUFFER_LOCK_UNLOCK); + + if (IsMVCCSnapshot(scan->xs_snapshot) && + RelationNeedsWAL(scan->indexRelation) && + !scan->xs_want_itup) + { + ReleaseBuffer(sp->buf); + sp->buf = InvalidBuffer; + } +}/* @@ -506,6 +537,8 @@ _bt_first(IndexScanDesc scan, ScanDirection dir) StrategyNumber strat_total; BTScanPosItem *currItem; + Assert(!BTScanPosIsValid(so->currPos)); + pgstat_count_index_scan(rel); /* @@ -942,9 +975,6 @@ _bt_first(IndexScanDesc scan, ScanDirection dir) /* don't need to keep the stack around... 
*/ _bt_freestack(stack); - /* remember which buffer we have pinned, if any */ - so->currPos.buf = buf; - if (!BufferIsValid(buf)) { /* @@ -1023,6 +1053,10 @@ _bt_first(IndexScanDesc scan, ScanDirection dir) break; } + /* remember which buffer we have pinned, if any */ + Assert(!BTScanPosIsValid(so->currPos)); + so->currPos.buf = buf; + /* * Now load data from the first page of the scan. */ @@ -1032,12 +1066,15 @@ _bt_first(IndexScanDesc scan, ScanDirection dir) * There's no actually-matching data on thispage. Try to advance to * the next page. Return false if there's no matching data at all. */ + LockBuffer(so->currPos.buf, BUFFER_LOCK_UNLOCK); if (!_bt_steppage(scan, dir)) return false; } - - /* Drop the lock, but not pin, on the current page */ - LockBuffer(so->currPos.buf, BUFFER_LOCK_UNLOCK); + else + { + /* Drop the lock, and maybe the pin, on the current page */ + _bt_drop_lock_and_maybe_pin(scan, &so->currPos); + } /* OK, itemIndex says what to return */ currItem = &so->currPos.items[so->currPos.itemIndex]; @@ -1051,8 +1088,8 @@ _bt_first(IndexScanDesc scan, ScanDirection dir)/* * _bt_next() -- Get the next item in a scan.* - * On entry, so->currPos describes the current page, which is pinned - * but not locked, and so->currPos.itemIndex identifies which item was + * On entry, so->currPos describes the current page, which may be pinned + * but is not locked, and so->currPos.itemIndex identifies which item was * previously returned. * * On successful exit, scan->xs_ctup.t_self is set to the TID of the @@ -1076,26 +1113,16 @@ _bt_next(IndexScanDesc scan, ScanDirection dir) { if (++so->currPos.itemIndex > so->currPos.lastItem) { - /* We must acquire lock before applying _bt_steppage */ - Assert(BufferIsValid(so->currPos.buf)); - LockBuffer(so->currPos.buf, BT_READ); if (!_bt_steppage(scan, dir)) return false; - /* Drop the lock, but not pin, on the new page */ - LockBuffer(so->currPos.buf, BUFFER_LOCK_UNLOCK); } } else { if (--so->currPos.itemIndex< so->currPos.firstItem) { - /* We must acquire lock before applying _bt_steppage */ - Assert(BufferIsValid(so->currPos.buf)); - LockBuffer(so->currPos.buf, BT_READ); if (!_bt_steppage(scan, dir)) return false; - /* Drop the lock, but not pin, on the new page */ - LockBuffer(so->currPos.buf, BUFFER_LOCK_UNLOCK); } } @@ -1135,7 +1162,10 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum) IndexTuple itup; bool continuescan; - /* we must have the buffer pinned and locked */ + /* + * We must have the buffer pinned and locked, but the usual macro can't be + * used here; this function is what makes it good for currPos. + */ Assert(BufferIsValid(so->currPos.buf)); page = BufferGetPage(so->currPos.buf); @@ -1144,6 +1174,19 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum) maxoff = PageGetMaxOffsetNumber(page); /* + * We note the buffer's block number so that we can release the pin later. + * This allows us to re-read the buffer if it is needed again for hinting. + */ + so->currPos.currPage = BufferGetBlockNumber(so->currPos.buf); + + /* + * We save the LSN of the page as we read it, so that we know whether it + * safe to apply LP_DEAD hints to the page later. This allows us to drop + * the pin for MVCC scans, which allows vacuum to avoid blocking. + */ + so->currPos.lsn = PageGetLSN(page); + + /* * we must save the page's right-link while scanning it; this tells us * where to step right to after we'redone with these items. 
There is no * corresponding need for the left-link, since splits always go right. @@ -1153,6 +1196,12 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum) /* initialize tuple workspaceto empty */ so->currPos.nextTupleOffset = 0; + /* + * Now that the current page has been made consistent, the macro should be + * good. + */ + Assert(BTScanPosIsPinned(so->currPos)); + if (ScanDirectionIsForward(dir)) { /* load items[] in ascending order */ @@ -1241,11 +1290,14 @@ _bt_saveitem(BTScanOpaque so, int itemIndex,/* * _bt_steppage() -- Step to next page containingvalid data for scan * - * On entry, so->currPos.buf must be pinned and read-locked. We'll drop - * the lock and pin before moving to next page. + * On entry, if so->currPos.buf is valid the buffer is pinned but not locked; + * if pinned, we'll drop the pin before moving to next page. The buffer is + * not locked on entry. * - * On success exit, we hold pin and read-lock on the next interesting page, - * and so->currPos is updated to contain data from that page. + * On success exit, so->currPos is updated to contain data from the next + * interesting page. For success on a scan using a non-MVCC snapshot we hold + * a pin, but not a read lock, on that page. If we do not hold the pin, we + * set so->currPos.buf to InvalidBuffer. We return TRUE to indicate success. * * If there are no more matching recordsin the given direction, we drop all * locks and pins, set so->currPos.buf to InvalidBuffer, and return FALSE. @@ -1258,12 +1310,11 @@ _bt_steppage(IndexScanDesc scan, ScanDirection dir) Page page; BTPageOpaque opaque; - /* we must have the buffer pinned and locked */ - Assert(BufferIsValid(so->currPos.buf)); + Assert(BTScanPosIsValid(so->currPos)); /* Before leaving current page, deal with any killed items */ if (so->numKilled> 0) - _bt_killitems(scan, true); + _bt_killitems(scan); /* * Before we modify currPos, make a copy of the page data if there was a @@ -1272,7 +1323,8 @@ _bt_steppage(IndexScanDesc scan, ScanDirection dir) if (so->markItemIndex >= 0) { /* bumppin on current buffer for assignment to mark buffer */ - IncrBufferRefCount(so->currPos.buf); + if (BTScanPosIsPinned(so->currPos)) + IncrBufferRefCount(so->currPos.buf); memcpy(&so->markPos, &so->currPos, offsetof(BTScanPosData,items[1]) + so->currPos.lastItem * sizeof(BTScanPosItem)); @@ -1294,14 +1346,17 @@ _bt_steppage(IndexScanDesc scan, ScanDirection dir) /* Remember we left a page with data */ so->currPos.moreLeft = true; + /* release the previous buffer, if pinned */ + BTScanPosUnpinIfPinned(so->currPos); + for (;;) { - /* release the previous buffer */ - _bt_relbuf(rel, so->currPos.buf); - so->currPos.buf = InvalidBuffer; /* if we're at end of scan, give up */ if (blkno == P_NONE|| !so->currPos.moreRight) + { + BTScanPosInvalidate(so->currPos); return false; + } /* check for interrupts while we're not holding any buffer lock */ CHECK_FOR_INTERRUPTS(); /* step right one page */ @@ -1317,8 +1372,10 @@ _bt_steppage(IndexScanDesc scan, ScanDirection dir) if (_bt_readpage(scan, dir, P_FIRSTDATAKEY(opaque))) break; } + /* nope, keep going */ blkno = opaque->btpo_next; + _bt_relbuf(rel, so->currPos.buf); } } else @@ -1332,14 +1389,27 @@ _bt_steppage(IndexScanDesc scan, ScanDirection dir) * to our left splits while we are inflight to it, plus the * possibility that the page we were on gets deleted after we leave * it. See nbtree/READMEfor details. 
+ * + * It might be possible to rearrange this code to have less overhead + * in pinning and locking, but that would require capturing the left + * pointer when the page is initially read, and using it here, along + * with big changes to _bt_walk_left() and the code below. It is not + * clear whether this would be a win, since if the page immediately to + * the left splits after we read this page and before we step left, we + * would need to visit more pages than with the current code. */ + if (BTScanPosIsPinned(so->currPos)) + LockBuffer(so->currPos.buf, BT_READ); + else + so->currPos.buf = _bt_getbuf(rel, so->currPos.currPage, BT_READ); + for (;;) { /* Done if we know there are no matching keys to the left */ if (!so->currPos.moreLeft) { _bt_relbuf(rel, so->currPos.buf); - so->currPos.buf = InvalidBuffer; + BTScanPosInvalidate(so->currPos); return false; } @@ -1348,7 +1418,10 @@ _bt_steppage(IndexScanDesc scan, ScanDirection dir) /* if we're physically at end of index,return failure */ if (so->currPos.buf == InvalidBuffer) + { + BTScanPosInvalidate(so->currPos); return false; + } /* * Okay, we managed to move left to a non-deleted page. Done if @@ -1368,6 +1441,9 @@ _bt_steppage(IndexScanDesc scan, ScanDirection dir) } } + /* Drop the lock, and maybe the pin, on the current page */ + _bt_drop_lock_and_maybe_pin(scan, &so->currPos); + return true;} @@ -1604,7 +1680,7 @@ _bt_endpoint(IndexScanDesc scan, ScanDirection dir) * exists. */ PredicateLockRelation(rel,scan->xs_snapshot); - so->currPos.buf = InvalidBuffer; + BTScanPosInvalidate(so->currPos); return false; } @@ -1658,12 +1734,15 @@ _bt_endpoint(IndexScanDesc scan, ScanDirection dir) * There's no actually-matching data onthis page. Try to advance to * the next page. Return false if there's no matching data at all. */ + LockBuffer(so->currPos.buf, BUFFER_LOCK_UNLOCK); if (!_bt_steppage(scan, dir)) return false; } - - /* Drop the lock, but not pin, on the current page */ - LockBuffer(so->currPos.buf, BUFFER_LOCK_UNLOCK); + else + { + /* Drop the lock, and maybe the pin, on the current page */ + _bt_drop_lock_and_maybe_pin(scan, &so->currPos); + } /* OK, itemIndex says what to return */ currItem = &so->currPos.items[so->currPos.itemIndex]; diff --git a/src/backend/access/nbtree/nbtutils.c b/src/backend/access/nbtree/nbtutils.c index 379dac9..d1589f0 100644 --- a/src/backend/access/nbtree/nbtutils.c +++ b/src/backend/access/nbtree/nbtutils.c @@ -1721,27 +1721,35 @@ _bt_check_rowcompare(ScanKey skey, IndexTuple tuple, TupleDesc tupdesc, * _bt_killitems - set LP_DEADstate for items an indexscan caller has * told us were killed * - * scan->so contains information about the current page and killed tuples - * thereon (generally, this should only be called if so->numKilled > 0). + * scan->opaque, referenced locally through so, contains information about the + * current page and killed tuples thereon (generally, this should only be + * called if so->numKilled > 0). * - * The caller must have pin on so->currPos.buf, but may or may not have - * read-lock, as indicated by haveLock. Note that we assume read-lock - * is sufficient for setting LP_DEAD status (which is only a hint). + * The caller does not have a lock on the page and may or may not have the + * page pinned in a buffer. Note that read-lock is sufficient for setting + * LP_DEAD status (which is only a hint). * * We match items by heap TID before assuming they are the right ones to * delete. We cope with cases where items have moved right due to insertions. 
* If an item has moved off the current page dueto a split, we'll fail to * find it and do nothing (this is not an error case --- we assume the item - * will eventually get marked in a future indexscan). Note that because we - * hold pin on the target page continuously from initially reading the items - * until applying this function, VACUUM cannot have deleted any items from - * the page, and so there is no need to search left from the recorded offset. - * (This observation also guarantees that the item is still the right one - * to delete, which might otherwise be questionable since heap TIDs can get - * recycled.) + * will eventually get marked in a future indexscan). + * + * Note that if we hold a pin on the target page continuously from initially + * reading the items until applying this function, VACUUM cannot have deleted + * any items from the page, and so there is no need to search left from the + * recorded offset. (This observation also guarantees that the item is still + * the right one to delete, which might otherwise be questionable since heap + * TIDs can get recycled.) This holds true even if the page has been modified + * by inserts and page splits, so there is no need to consult the LSN. + * + * If the pin was released after reading the page, then we re-read it. If it + * has been modified since we read it (as determined by the LSN), we dare not + * flag any entries because it is possible that the old entry was vacuumed + * away and the TID was re-used by a completely different heap tuple. */void -_bt_killitems(IndexScanDesc scan, bool haveLock) +_bt_killitems(IndexScanDesc scan){ BTScanOpaque so = (BTScanOpaque) scan->opaque; Page page; @@ -1749,19 +1757,56 @@ _bt_killitems(IndexScanDesc scan, bool haveLock) OffsetNumber minoff; OffsetNumber maxoff; int i; + int numKilled = so->numKilled; bool killedsomething = false; - Assert(BufferIsValid(so->currPos.buf)); + Assert(BTScanPosIsValid(so->currPos)); + + /* + * Always reset the scan state, so we don't look for same items on other + * pages. + */ + so->numKilled = 0; - if (!haveLock) + if (BTScanPosIsPinned(so->currPos)) + { + /* + * We have held the pin on this page since we read the index tuples, + * so all we need to do is lock it. The pin will have prevented + * re-use of any TID on the page, so there is no need to check the + * LSN. + */ LockBuffer(so->currPos.buf, BT_READ); - page = BufferGetPage(so->currPos.buf); + page = BufferGetPage(so->currPos.buf); + } + else + { + Buffer buf; + + /* Attempt to re-read the buffer, getting pin and lock. */ + buf = _bt_getbuf(scan->indexRelation, so->currPos.currPage, BT_READ); + + /* It might not exist anymore; in which case we can't hint it. */ + if (!BufferIsValid(buf)) + return; + + page = BufferGetPage(buf); + if (PageGetLSN(page) == so->currPos.lsn) + so->currPos.buf = buf; + else + { + /* Modified while not pinned means hinting is not safe. */ + _bt_relbuf(scan->indexRelation, buf); + return; + } + } + opaque = (BTPageOpaque) PageGetSpecialPointer(page); minoff = P_FIRSTDATAKEY(opaque); maxoff = PageGetMaxOffsetNumber(page); - for (i = 0; i < so->numKilled; i++) + for (i = 0; i < numKilled; i++) { int itemIndex = so->killedItems[i]; BTScanPosItem *kitem= &so->currPos.items[itemIndex]; @@ -1799,14 +1844,7 @@ _bt_killitems(IndexScanDesc scan, bool haveLock) MarkBufferDirtyHint(so->currPos.buf, true); } - if (!haveLock) - LockBuffer(so->currPos.buf, BUFFER_LOCK_UNLOCK); - - /* - * Always reset the scan state, so we don't look for same items on other - * pages. 
- */ - so->numKilled = 0; + LockBuffer(so->currPos.buf, BUFFER_LOCK_UNLOCK);} diff --git a/src/backend/access/nbtree/nbtxlog.c b/src/backend/access/nbtree/nbtxlog.c index e2ace5c..2f85867 100644 --- a/src/backend/access/nbtree/nbtxlog.c +++ b/src/backend/access/nbtree/nbtxlog.c @@ -395,7 +395,8 @@ btree_xlog_vacuum(XLogReaderState *record) * unpinned between the lastBlockVacuumed and the currentblock, if there * are any. This prevents replay of the VACUUM from reaching the stage of * removing heaptuples while there could still be indexscans "in flight" - * to those particular tuples (see nbtree/README). + * to those particular tuples for those scans which could be confused by + * finding new tuples at the old TID locations (see nbtree/README). * * It might be worth checking if thereare actually any backends running; * if not, we could just skip this. diff --git a/src/backend/storage/buffer/README b/src/backend/storage/buffer/README index a4ebbcc..45c5c83 100644 --- a/src/backend/storage/buffer/README +++ b/src/backend/storage/buffer/README @@ -18,8 +18,8 @@ backend to pin a page more than once concurrently; the buffer managerhandles this efficiently. It is consideredOK to hold a pin for longintervals --- for example, sequential scans hold a pin on the current pageuntil done processingall the tuples on the page, which could be quite a -while if the scan is the outer scan of a join. Similarly, btree index -scans hold a pin on the current index page. This is OK because normal +while if the scan is the outer scan of a join. Similarly, a btree index +scan may hold a pin on the current index page. This is OK because normaloperations never wait for a page's pin count todrop to zero. (Anythingthat might need to do such a wait is instead handled by waiting to obtainthe relation-level lock,which is why you'd better hold one first.) 
Pins diff --git a/src/include/access/nbtree.h b/src/include/access/nbtree.h index 990dab3..2ef349b 100644 --- a/src/include/access/nbtree.h +++ b/src/include/access/nbtree.h @@ -518,6 +518,8 @@ typedef struct BTScanPosData{ Buffer buf; /* if valid, the buffer is pinned */ + XLogRecPtr lsn; /* pos in the WAL stream when page was read */ + BlockNumber currPage; /* page we've referencd by items array */ BlockNumber nextPage; /* page's rightlink when we scanned it */ /* @@ -551,7 +553,37 @@ typedef struct BTScanPosDatatypedef BTScanPosData *BTScanPos; -#define BTScanPosIsValid(scanpos) BufferIsValid((scanpos).buf) +#define BTScanPosIsPinned(scanpos) \ +( \ + AssertMacro(BlockNumberIsValid((scanpos).currPage) || \ + !BufferIsValid((scanpos).buf)), \ + BufferIsValid((scanpos).buf) \ +) +#define BTScanPosUnpin(scanpos) \ + do { \ + ReleaseBuffer((scanpos).buf); \ + (scanpos).buf = InvalidBuffer; \ + } while (0) +#define BTScanPosUnpinIfPinned(scanpos) \ + do { \ + if (BTScanPosIsPinned(scanpos)) \ + BTScanPosUnpin(scanpos); \ + } while (0) + +#define BTScanPosIsValid(scanpos) \ +( \ + AssertMacro(BlockNumberIsValid((scanpos).currPage) || \ + !BufferIsValid((scanpos).buf)), \ + BlockNumberIsValid((scanpos).currPage) \ +) +#define BTScanPosInvalidate(scanpos) \ + do { \ + (scanpos).currPage = InvalidBlockNumber; \ + (scanpos).nextPage = InvalidBlockNumber; \ + (scanpos).buf = InvalidBuffer; \ + (scanpos).lsn = InvalidXLogRecPtr; \ + (scanpos).nextTupleOffset = 0; \ + } while (0);/* We need one of these for each equality-type SK_SEARCHARRAY scan key */typedef struct BTArrayKeyInfo @@ -697,7 +729,7 @@ extern void _bt_preprocess_keys(IndexScanDesc scan);extern IndexTuple _bt_checkkeys(IndexScanDesc scan, Page page, OffsetNumber offnum, ScanDirection dir, bool *continuescan); -extern void _bt_killitems(IndexScanDesc scan, bool haveLock); +extern void _bt_killitems(IndexScanDesc scan);extern BTCycleId _bt_vacuum_cycleid(Relation rel);extern BTCycleId _bt_start_vacuum(Relationrel);extern void _bt_end_vacuum(Relation rel);
Simon Riggs <simon@2ndQuadrant.com> wrote: > On 13 March 2015 at 15:41, Kevin Grittner <kgrittn@ymail.com> wrote: > >> The feedback was generally fairly positive except for the fact that >> snapshot "age" (for purposes of being too old) was measured in >> transaction IDs assigned. There seemed to be a pretty universal >> feeling that this needed to be changed to a time-based setting. > > -1 for a time based setting. > > After years of consideration, bloat is now controllable by altering > the size of the undo tablespace. > > I think PostgreSQL needs something size-based also. It would need some > estimation to get it to work like that, true, but it is actually the > size of the bloat we care about, not the time. So we should be > thinking in terms of limits that we actually care about. Are you thinking, then, that WAL volume generated (as determined by LSN) would be the appropriate unit of measure for this? (We would still need to map that back to transaction IDs for vacuuming, of course.) If we did that we could allow the "size" units of measure, like '5GB' and similar. Or are you thinking of something else? Given that there seems to be disagreement on what is the more useful metric, do we want to consider allowing more than one? If so, would it be when *all* conditions are met or when *any* conditions are met? -- Kevin Grittner EDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
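Whichever unit of measure wins, the "map that back to transaction IDs" step mentioned above can be cheap in any of them: keep a small ring of samples pairing recent progress in the chosen unit (minutes, MB of WAL, ...) with the newest xmin seen at that point, and let vacuum look up the slot that lies the configured threshold behind the present. The sketch below only illustrates that shape; every name in it is hypothetical and none of it comes from either patch, and a real version would live in shared memory with locking and wraparound handling.

    #include "postgres.h"
    #include "access/transam.h"

    /* Hypothetical sketch: map an age threshold back to an xid horizon. */
    #define AGE_MAP_BUCKETS 100             /* history kept, in threshold units */

    typedef struct AgeToXidMap
    {
        uint64          head_unit;          /* latest unit recorded */
        TransactionId   xid_by_unit[AGE_MAP_BUCKETS];
    } AgeToXidMap;

    /* Record progress once per unit (per minute, per N MB of WAL, etc.). */
    static void
    age_map_advance(AgeToXidMap *map, uint64 now_unit, TransactionId latest_xmin)
    {
        map->xid_by_unit[now_unit % AGE_MAP_BUCKETS] = latest_xmin;
        map->head_unit = now_unit;
    }

    /* Vacuum asks: which xid corresponds to "threshold units ago"? */
    static TransactionId
    age_map_horizon(const AgeToXidMap *map, uint64 threshold_units)
    {
        uint64      unit;

        /* A real version would clamp the threshold to the ring size. */
        if (threshold_units >= map->head_unit ||
            threshold_units >= AGE_MAP_BUCKETS)
            return InvalidTransactionId;    /* not enough history yet */

        unit = map->head_unit - threshold_units;
        return map->xid_by_unit[unit % AGE_MAP_BUCKETS];
    }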
On Mon, Mar 16, 2015 at 8:48 AM, Kevin Grittner <kgrittn@ymail.com> wrote: > Given that there seems to be disagreement on what is the more > useful metric, do we want to consider allowing more than one? If > so, would it be when *all* conditions are met or when *any* > conditions are met? Oh, gosh. Maybe we could invent a whole policy language for this. Or, if we want to do something less painful, we could fire nail-guns into our eyeballs. I'm not sure what the value of basing this on WAL volume is, and I think we should talk about that a bit more first. The value of doing this by rate of XID consumption is easy to understand: it's simple to implement. The value of doing this by time is also simple to understand: if you set the setting to 20 minutes, then you can be sure that queries will not get cancelled unless they run for more than 20 minutes. This clearly makes the system easier to reason about. I don't see what the benefit of doing this by WAL volume would be. What Simon actually proposed was by *bloat*; that is, if the overall amount of bloat in the system exceeds some metric, start whacking old queries in order to control it. The appeal of that is obvious, but I think it would be very hard to make work, because canceling queries won't get rid of the bloat you've already introduced, and so you'd tend to cancel nothing until you hit some threshold, and then start canceling everything. If we want this feature, let's try to keep the spec simple enough that we actually get it. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote: > Thank you for rewriting. Thank *you* for pointing out where more work was needed in the comments and README! > I attached the your latest patch to this mail as > bt-nopin-v4.patch for now. Please check that there's no problem > in it. I checked out master, applied the patch and checked it against my latest code (merged with master), and it matched. So, looks good. > - By this patch, index scan becomes to release buffer pins while > fetching index tuples in a page, so it should reduce the chance > of index scans with long duration to block vacuum, because > vacuums now can easily overtake the current position of an > index scan. I didn't actually measured how effective it is, > though. That part got pretty thorough testing on end-user software. The whole reason for writing the patch was that after applying the snapshot-too-old PoC patch they still saw massive bloat because all autovacuum workers blocked behind cursors which were left idle. The initial version of this patch fixed that, preventing (in combination with the other patch) uncontrolled bloat in a 48-hour test. > - It makes no performance deterioration, on the contrary it > accelerates index scans. It seems to be because of removal of > lock and unlock surrounding _bt_steppage in bt_next. I fear that the performance improvement may only show up on forward scans -- because the whole L&Y algorithm depends on splits only moving right, walking right is almost trivial to perform correctly with minimal locking and pinning. Our left pointers help with scanning backward, but there are a lot of conditions that can complicate that, leaving us with the choice between keeping some of the locking or potentially scanning more pages than we now do on a backward scan. Since it wasn't clear how many cases would benefit from a change and how many would lose, I pretty much left the backward scan locking alone in this patch. Changing the code to work the other way would not be outrageously difficult, but benchmarking to determine whether the alternative was a net win would be pretty tricky and very time-consuming. If that is to be done, I strongly feel it should be a separate patch. Because this patch makes a forward scan faster without having much effect on a backward scan, the performance difference between them (which has always existed) will get a little wider. I wonder whether this difference should perhaps be reflected in plan costing. -- Kevin Grittner EDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On 16 March 2015 at 12:48, Kevin Grittner <kgrittn@ymail.com> wrote: > Simon Riggs <simon@2ndQuadrant.com> wrote: >> On 13 March 2015 at 15:41, Kevin Grittner <kgrittn@ymail.com> wrote: >> >>> The feedback was generally fairly positive except for the fact that >>> snapshot "age" (for purposes of being too old) was measured in >>> transaction IDs assigned. There seemed to be a pretty universal >>> feeling that this needed to be changed to a time-based setting. >> >> -1 for a time based setting. >> >> After years of consideration, bloat is now controllable by altering >> the size of the undo tablespace. >> >> I think PostgreSQL needs something size-based also. It would need some >> estimation to get it to work like that, true, but it is actually the >> size of the bloat we care about, not the time. So we should be >> thinking in terms of limits that we actually care about. > > Are you thinking, then, that WAL volume generated (as determined by > LSN) would be the appropriate unit of measure for this? (We would > still need to map that back to transaction IDs for vacuuming, of > course.) If we did that we could allow the "size" units of > measure, like '5GB' and similar. Or are you thinking of something > else? It's probably the closest and easiest measure, and the most meaningful. We can easily accumulate that in a data structure in clog, like async commit LSN. For the next release though, since it will take a little bit of thought to interpret that. With commit timestamp enabled in 9.5, we can easily judge a time limit, but it is less useful because it's not a measure of bloat. As I've said, I'd be happy with just an xid limit for 9.5, if that was the only thing we had. But I think timestamp is just as easy. > Given that there seems to be disagreement on what is the more > useful metric, do we want to consider allowing more than one? If > so, would it be when *all* conditions are met or when *any* > conditions are met? Yours was the first reply to my idea, so I think it's too early to describe that as disagreement. -- Simon Riggs http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, RemoteDBA, Training & Services
On 16 March 2015 at 13:02, Robert Haas <robertmhaas@gmail.com> wrote: > I'm not sure what the value of basing this on WAL volume is, and I > think we should talk about that a bit more first. The value of doing > this by rate of XID consumption is easy to understand: it's simple to > implement. The value of doing this by time is also simple to > understand: if you set the setting to 20 minutes, then you can be sure > that queries will not get cancelled unless they run for more than 20 > minutes. This clearly makes the system easier to reason about. I > don't see what the benefit of doing this by WAL volume would be. I think we can assess that more clearly as the amount of now-dead space left by an action. So we can include Updates and Deletes, or in the case of aborted transactions, the total WAL written. I assumed that was what Kevin meant. > What Simon actually proposed was by *bloat*; that is, if the overall > amount of bloat in the system exceeds some metric, start whacking old > queries in order to control it. The appeal of that is obvious, Agreed > but I > think it would be very hard to make work, because canceling queries > won't get rid of the bloat you've already introduced, and so you'd > tend to cancel nothing until you hit some threshold, and then start > canceling everything. That would just be an unhelpful way of using that info. There are many possible designs that wouldn't need to work that way. We have many cases where we begin a cleanup action before we run out of space for other server resources. VACUUM is itself a good example, but there are many others. The main point is that if we blindly decide things based upon xid age or time then it will be wrong in many cases, since apps with uniform write rates are rare. We need to work out how to limit bloat itself. If we can't do that easily at a macro level, then we'll have to do that more precisely at the table level, for example having VACUUM decide where to put the limit if we find too much unremovable/recently dead data. > If we want this feature, let's try to keep the spec simple enough that > we actually get it. -- Simon Riggs http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, RemoteDBA, Training & Services
On Mon, Mar 16, 2015 at 12:07 PM, Simon Riggs <simon@2ndquadrant.com> wrote: >> What Simon actually proposed was by *bloat*; that is, if the overall >> amount of bloat in the system exceeds some metric, start whacking old >> queries in order to control it. The appeal of that is obvious, > > Agreed > >> but I >> think it would be very hard to make work, because canceling queries >> won't get rid of the bloat you've already introduced, and so you'd >> tend to cancel nothing until you hit some threshold, and then start >> canceling everything. > > That would just be an unhelpful way of using that info. There are many > possible designs that wouldn't need to work that way. > > We have many cases where we begin a cleanup action before we run out > of space for other server resources. VACUUM is itself a good example, > but there are many others. > > The main point is that if we blindly decide things based upon xid age > or time then it will be wrong in many cases, since apps with uniform > write rates are rare. We need to work out how to limit bloat itself. > If we can't do that easily at a macro level, then we'll have to do > that more precisely at the table level, for example having VACUUM > decide where to put the limit if we find too much unremovable/recently > dead data. I am sure there are more sophisticated things to be done here, but I guess my feeling is that time is a good way to go here for a first cut - lots of people have suggested it, and there's clearly a use case for it. If the setting turns out to be popular, we can fine-tune it more once we've got some experience with it. But I'm nervous about trying to design something more complicated than that right now, especially so late in the release cycle. We know that this setting, with time-based units, will meet the needs of the customer for whom it was developed, and that is a good-enough reason to think that time is a reasonable metric for this, even if we eventually have others. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On 3/16/15 11:47 AM, Robert Haas wrote: > I am sure there are more sophisticated things to be done here, but I > guess my feeling is that time is a good way to go here for a first cut > - lots of people have suggested it, and there's clearly a use case for > it. If the setting turns out to be popular, we can fine-tune it more > once we've got some experience with it. But I'm nervous about trying > to to design something more complicated than that right now, > especially so late in the release cycle. We know that this setting, > with time-based units, will meet the needs of the customer for whom it > was developed, and that is a good-enough reason to think that time is > a reasonable metric for this, even if we eventually have others. +1. As you said, time is easy for people to understand, and I think it will handle a large chunk of the use cases. I used to have this quote (or something close to it) on my whiteboard... I think it's appropriate here ;) "The perfect is the enemy of the good. -Simon Riggs" -- Jim Nasby, Data Architect, Blue Treble Consulting Data in Trouble? Get it in Treble! http://BlueTreble.com
On 18 March 2015 at 03:01, Jim Nasby <Jim.Nasby@bluetreble.com> wrote: > On 3/16/15 11:47 AM, Robert Haas wrote: >> >> I am sure there are more sophisticated things to be done here, but I >> guess my feeling is that time is a good way to go here for a first cut >> - lots of people have suggested it, and there's clearly a use case for >> it. If the setting turns out to be popular, we can fine-tune it more >> once we've got some experience with it. But I'm nervous about trying >> to to design something more complicated than that right now, >> especially so late in the release cycle. We know that this setting, >> with time-based units, will meet the needs of the customer for whom it >> was developed, and that is a good-enough reason to think that time is >> a reasonable metric for this, even if we eventually have others. > > > +1. As you said, time is easy for people to understand, and I think it will > handle a large chunk of the use cases. If it wasn't clear, I had already said that my idea was for next release, before Robert said that. > I used to have this quote (or something close to it) on my whiteboard... I > think it's appropriate here ;) > > "The perfect is the enemy of the good. -Simon Riggs" Interesting to hear I was anybody's pinup ;-) Not sure that is a saying of mine though, sounds more like Robert to me. I do believe in something now, something better later. I'd call that incremental benefit, as opposed to incremental development where the benefit may not be realisable until the last point. Which is why I am backing this patch for 9.5, even though the UI is less useful than I think is eventually necessary. -- Simon Riggs http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, RemoteDBA, Training & Services
On Sat, Mar 14, 2015 at 4:07 PM, Simon Riggs <simon@2ndquadrant.com> wrote:
> -1 for a time based setting.
> After years of consideration, bloat is now controllable by altering
> the size of the undo tablespace.
Hm. Well, fwiw the situation is rather more complicated than that. You're correct that you can create a fixed-size undo tablespace, in which case Oracle will tell you approximately how much retention you can rely on given that size. But alternatively you can set the undo tablespace to be autoextend and Oracle will tune it to ensure there's enough retention to cover the longest-running query in the system. You can also set a time-based parameter (undo_retention) to ensure there's always at least enough data retained to cover a fixed amount of time for flashback (Oracle's equivalent to our time travel feature in ancient versions of Postgres).
greg
Kevin Grittner <kgrittn@ymail.com> wrote: > Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote: >> I attached the your latest patch to this mail as >> bt-nopin-v4.patch for now. Please check that there's no problem >> in it. > > I checked out master, applied the patch and checked it against my > latest code (merged with master), and it matched. So, looks > good. Pushed with the addition of one more paragraph of comments, regarding something to watch out for in any follow-on patch to eliminate the pin for a scan using a non-MVCC snapshot. Thanks again to Kyotaro-san and Heikki for the reviews! -- Kevin Grittner EDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company