Thread: LP_DEAD hinting and not holding on to a buffer pin on leaf page (Was: [HACKERS] [WIP] Zipfian distribution in pgbench)

On Fri, Jul 14, 2017 at 5:06 PM, Peter Geoghegan <pg@bowt.ie> wrote:
> I think that what this probably comes down to, more than anything
> else, is that you have leftmost hot/bloated leaf pages like this:
>
>
>            idx           | level | l_item | blkno | btpo_prev | btpo_next | btpo_flags | type | live_items | dead_items | avg_item_size | page_size | free_size |         highkey
> -----------------------+-------+--------+-------+-----------+-----------+------------+------+------------+------------+---------------+-----------+-----------+-------------------------
> ...
>  pgbench_accounts_pkey |     0 |      1 |     1 |         0 |      2751 |         65 | l    |        100 |         41 |            16 |      8192 |      5328 | 11 00 00 00 00 00 00 00
>  pgbench_accounts_pkey |     0 |      2 |  2751 |         1 |      2746 |         65 | l    |         48 |         90 |            16 |      8192 |      5388 | 32 00 00 00 00 00 00 00
> ...
>
> The high key for the leftmost shows that only values below 0x11 belong
> on the first page. This is about 16 or 17 possible distinct values,
> and yet the page has 100 live items, and 41 dead items; in total,
> there is room for 367 items. It's only slightly better with other
> nearby pages. This is especially bad because once the keyspace gets
> split up this finely, it's *impossible* to reverse it -- it's more or
> less a permanent problem, at least until a REINDEX.

I've been thinking about this a lot, because this really does look
like a pathological case to me. I think that this workload is very
sensitive to how effective kill_prior_tuples/LP_DEAD hinting is. Or at
least, I can imagine that mechanism doing a lot better than it
actually manages to do here. I wonder if it's possible that commit
2ed5b87f9, which let MVCC snapshots not hold on to a pin on leaf
pages, should have considered workloads like this.

The whole point of that commit was to reduce blocking by VACUUM on
index scans, and I think that it was effective in doing that in many
important cases. My concern is that the commit added this, which could
make LP_DEAD hinting occur less frequently:

+ * If the pin was released after reading the page, then we re-read it.  If it
+ * has been modified since we read it (as determined by the LSN), we dare not
+ * flag any entries because it is possible that the old entry was vacuumed
+ * away and the TID was re-used by a completely different heap tuple.
+ */
 void
-_bt_killitems(IndexScanDesc scan, bool haveLock)
+_bt_killitems(IndexScanDesc scan)
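
To make the concern concrete, the check in question ends up looking
roughly like this (abbreviated sketch, paraphrased from nbtutils.c;
don't hold me to the details):

if (BTScanPosIsPinned(so->currPos))
{
    /*
     * We held the pin the whole time, so no TID on the page can have
     * been recycled by VACUUM; just lock the page and proceed.
     */
    LockBuffer(so->currPos.buf, BT_READ);
}
else
{
    /* Pin was dropped; re-read the page, and compare LSNs */
    Buffer      buf;

    buf = _bt_getbuf(scan->indexRelation, so->currPos.currPage, BT_READ);
    if (PageGetLSN(BufferGetPage(buf)) != so->currPos.lsn)
    {
        /*
         * Page changed since we read it -- some TID may have been
         * recycled, so we dare not set any LP_DEAD bits at all.
         */
        _bt_relbuf(scan->indexRelation, buf);
        return;
    }
    so->currPos.buf = buf;
}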

Even if I'm basically right about this making LP_DEAD hinting occur
less frequently in cases like the one from Alik, it might have
actually been worth it even from the point of view of index bloat,
because VACUUM operations complete more frequently, and so there is an
indirect boost. It's not as if we've been that great at assessing
problems like this before now, and so it bears considering if we could
do better here without introducing much additional complexity. I just
don't know.

I would like to hear opinions on:

How much of a difference might this have actually made?

If it's a significant factor for any important workload, what can be
done about it? Does anyone have any ideas about a less restrictive
strategy for avoiding spuriously killing an index tuple whose heap TID
was just recycled?

ISTM that by doing this LSN check, _bt_killitems() is least likely to
work when and where it is most needed.

The most obvious way to correctly proceed even when LSN has changed
would be to revisit the heap page when the leaf page LSN check doesn't
work out, and revalidate the safety of proceeding with killing tuples.
At least with unique indexes. It's probably extremely unlikely that
the problem that we must avoid here will actually be observed in the
real world, so it seems likely that an approach like this will be
worth it.
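
To sketch the idea (entirely hypothetical -- heap_tid_is_dead() is a
made-up helper, and the interlock details would need real thought):

/* Hypothetical slow path in _bt_killitems(), when the LSN check fails */
for (i = 0; i < numKilled; i++)
{
    BTScanPosItem *kitem = &so->currPos.items[so->killedItems[i]];

    /*
     * Revisit the heap page, and confirm that the TID still refers
     * to the same dead tuple, not to a new tuple that reused the
     * slot after VACUUM ran.
     */
    if (!heap_tid_is_dead(scan->heapRelation, &kitem->heapTid))
        continue;               /* TID may have been recycled */

    /* Still safe: set LP_DEAD on the matching index item, as today */
}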

-- 
Peter Geoghegan



On Tue, Jul 25, 2017 at 3:02 PM, Peter Geoghegan <pg@bowt.ie> wrote:
> I've been thinking about this a lot, because this really does look
> like a pathological case to me. I think that this workload is very
> sensitive to how effective kill_prior_tuples/LP_DEAD hinting is. Or at
> least, I can imagine that mechanism doing a lot better than it
> actually manages to do here. I wonder if it's possible that commit
> 2ed5b87f9, which let MVCC snapshots not hold on to a pin on leaf
> pages, should have considered workloads like this.

While the benchmark Alik came up with is non-trivial to reproduce, I
can show a consistent regression for a simple case with only one
active backend. I'm not sure whether or not this is considered an
acceptable trade-off -- I didn't look through the archives from around
March of 2015 just yet.

Setup:

Initialize pgbench (any scale factor).
create index on pgbench_accounts (aid);

The point of this example is to show a simple query that is never
quite HOT-safe, where we need to create a new index entry
for that reason alone. Theoretically, only one index needs to be
updated, not all indexes, because only one index covers attributes
that have truly changed. For example, if we had WARM, I think that
many of these theoretically unnecessary index tuple insertions would
not happen. When they do happen, because we don't have something like
WARM, it seems important that the kill_prior_tuples/LP_DEAD stuff does
its job. I don't think that it will consistently do that, though.

Steps:

(Set a debugger breakpoint within _bt_killitems().)

Test queries (run these from a single interactive psql session):

update pgbench_accounts set abalance = abalance + 1 where aid = 763;
update pgbench_accounts set abalance = abalance + 1 where aid = 763;
update pgbench_accounts set abalance = abalance + 1 where aid = 763;

We now see that no update ever kills items within _bt_killitems(),
because our own update to the index leaf page itself nullifies our
ability to kill anything, by changing the page LSN from the one
stashed in the index scan state variable. Fortunately, we are not
really "self-blocking" dead item cleanup here, because the
_bt_check_unique() logic for killing all_dead index entries saves the
day by not caring about LSNs. However, that only happens because the
index on "aid" is a unique index.

If we drop the primary key, and create a regular index on "aid" in its
place, then the same UPDATE queries really do self-block from killing
index tuples due to changes in 2ed5b87f9, which could be pretty bad if
there weren't some SELECTs to do the kill_prior_tuple stuff instead. I
verified that this regression against 9.4 exists, just to be sure that
the problem wasn't somehow there all along -- the regression is real. :-(

-- 
Peter Geoghegan



On Tue, Jul 25, 2017 at 8:02 PM, Peter Geoghegan <pg@bowt.ie> wrote:
> Setup:
>
> Initialize pgbench (any scale factor).
> create index on pgbench_accounts (aid);

That "create index" was meant to be on "abalance", to make the UPDATE
queries always HOT-unsafe. (You'll want to *also* create this index
once the PK is dropped, to show how killing items on recent Postgres
versions hinges upon this being a unique index).

-- 
Peter Geoghegan



On Tue, Jul 25, 2017 at 11:02 PM, Peter Geoghegan <pg@bowt.ie> wrote:
> While the benchmark Alik came up with is non-trivial to reproduce, I
> can show a consistent regression for a simple case with only one
> active backend.

That's not good.

> We now see that no update ever kills items within _bt_killitems(),
> because our own update to the index leaf page itself nullifies our
> ability to kill anything, by changing the page LSN from the one
> stashed in the index scan state variable. Fortunately, we are not
> really "self-blocking" dead item cleanup here, because the
> _bt_check_unique() logic for killing all_dead index entries saves the
> day by not caring about LSNs. However, that only happens because the
> index on "aid" is a unique index.

This seems like an oversight.  If we modify the page ourselves, could
we check whether the saved LSN is still current just before, and if
so, update the saved LSN just after?
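
Roughly like this, I mean (hypothetical and untested):

bool    stashed_lsn_was_current;

/* just before our own modification of the leaf page: */
stashed_lsn_was_current = BTScanPosIsValid(so->currPos) &&
    PageGetLSN(page) == so->currPos.lsn;

/* ... perform the insertion, which advances the page LSN ... */

/* just after, adopt our own new LSN if nobody else got in first: */
if (stashed_lsn_was_current)
    so->currPos.lsn = PageGetLSN(page);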

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Robert Haas <robertmhaas@gmail.com> wrote:
>> We now see that no update ever kills items within _bt_killitems(),
>> because our own update to the index leaf page itself nullifies our
>> ability to kill anything, by changing the page LSN from the one
>> stashed in the index scan state variable. Fortunately, we are not
>> really "self-blocking" dead item cleanup here, because the
>> _bt_check_unique() logic for killing all_dead index entries saves the
>> day by not caring about LSNs. However, that only happens because the
>> index on "aid" is a unique index.
>
>This seems like an oversight.  If we modify the page ourselves, could
>we check whether the saved LSN is still current just before, and if
>so, update the saved LSN just after?

I really don't know if that would be worthwhile. It would certainly fix
the regression shown in my test case, but that might not go far enough.
I strongly suspect that there are more complicated workloads where
LP_DEAD cleanup from SELECT statements matters, which is prevented by
the LSN check thing, just because there are always other sessions that
modify the page concurrently. This might be true of Alik's Zipfian test
case, for example.

In Alik's workload, there are two queries: One UPDATE, one SELECT.  Even
though the bloated index was a unique index, and so still gets
_bt_check_unique() item killing, the regression is still going to block
LP_DEAD cleanup by the SELECTs, which seems like it might be quite
important there.  After all, _bt_check_unique() cleanup has to happen
with an exclusive buffer lock held, whereas the kill_prior_tuple stuff
happens with only a shared buffer lock held.  It's not hard to imagine
that there will be a whole lot less LP_DEAD setting, overall.

Alik's workload was one with a high degree of contention on just a few
leaf pages, pages that become very bloated. It's not fair to blame the
bloat we saw there on this regression, but I have to wonder how much it
may have contributed.

-- 
Peter Geoghegan



Peter Geoghegan <pg@bowt.ie> wrote:
>In Alik's workload, there are two queries: One UPDATE, one SELECT.  Even
>though the bloated index was a unique index, and so still gets
>_bt_check_unique() item killing, the regression is still going to block
>LP_DEAD cleanup by the SELECTs, which seems like it might be quite
>important there.  After all, _bt_check_unique() cleanup has to happen
>with an exclusive buffer lock held, whereas the kill_prior_tuple stuff
>happens with only a shared buffer lock held.  It's not hard to imagine
>that there will be a whole lot less LP_DEAD setting, overall.

Actually, there is a much bigger reason why SELECT statement LP_DEAD
killing matters more to Alik's workload than _bt_check_unique() LP_DEAD
killing: The UPDATE query itself does not affect indexed columns. Most
UPDATEs are actually HOT-safe (despite the degree of index bloat we
see), and so most UPDATEs will never reach _bt_check_unique().

-- 
Peter Geoghegan



On Wed, Jul 26, 2017 at 3:32 AM, Peter Geoghegan <pg@bowt.ie> wrote:
> On Fri, Jul 14, 2017 at 5:06 PM, Peter Geoghegan <pg@bowt.ie> wrote:
>> I think that what this probably comes down to, more than anything
>> else, is that you have leftmost hot/bloated leaf pages like this:
>>
>>
>>            idx           | level | l_item | blkno | btpo_prev | btpo_next | btpo_flags | type | live_items | dead_items | avg_item_size | page_size | free_size |         highkey
>> -----------------------+-------+--------+-------+-----------+-----------+------------+------+------------+------------+---------------+-----------+-----------+-------------------------
>> ...
>>  pgbench_accounts_pkey |     0 |      1 |     1 |         0 |      2751 |         65 | l    |        100 |         41 |            16 |      8192 |      5328 | 11 00 00 00 00 00 00 00
>>  pgbench_accounts_pkey |     0 |      2 |  2751 |         1 |      2746 |         65 | l    |         48 |         90 |            16 |      8192 |      5388 | 32 00 00 00 00 00 00 00
>> ...
>>
>> The high key for the leftmost shows that only values below 0x11 belong
>> on the first page. This is about 16 or 17 possible distinct values,
>> and yet the page has 100 live items, and 41 dead items; in total,
>> there is room for 367 items. It's only slightly better with other
>> nearby pages. This is especially bad because once the keyspace gets
>> split up this finely, it's *impossible* to reverse it -- it's more or
>> less a permanent problem, at least until a REINDEX.
>
> I've been thinking about this a lot, because this really does look
> like a pathological case to me. I think that this workload is very
> sensitive to how effective kill_prior_tuples/LP_DEAD hinting is. Or at
> least, I can imagine that mechanism doing a lot better than it
> actually manages to do here. I wonder if it's possible that commit
> 2ed5b87f9, which let MVCC snapshots not hold on to a pin on leaf
> pages, should have considered workloads like this.
>

Isn't it possible to confirm if the problem is due to commit
2ed5b87f9?  Basically, if we have unlogged tables, then it won't
release the pin.  So if the commit in question is the culprit, then
the same workload should not lead to bloat.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Amit Kapila <amit.kapila16@gmail.com> wrote:
>Isn't it possible to confirm if the problem is due to commit
>2ed5b87f9?  Basically, if we have unlogged tables, then it won't
>release the pin.  So if the commit in question is the culprit, then
>the same workload should not lead to bloat.

That's a great idea. Alik?

-- 
Peter Geoghegan



On Thu, Jul 27, 2017 at 10:05 PM, Peter Geoghegan <pg@bowt.ie> wrote:
> I really don't know if that would be worthwhile. It would certainly fix
> the regression shown in my test case, but that might not go far enough.
> I strongly suspect that there are more complicated workloads where
> LP_DEAD cleanup from SELECT statements matters, which is prevented by
> the LSN check thing, just because there are always other sessions that
> modify the page concurrently. This might be true of Alik's Zipfian test
> case, for example.

I haven't studied the test case, but I think as a general principle it
makes sense to be happy when someone comes up with an algorithm that
holds a lock for a shorter period of time (and buffer pins are a type
of lock).  There are a number of places (fast-path locking, for
example, or vacuum skipping pinned heap pages) where we have
fast-paths that get taken most of the time and slow paths that get
used when concurrent activity happens; empirically, such things often
work out to a win.  I think it's disturbing that this code seems to be
taking the slow-path (which, in context, means skipping LP_DEAD
cleanup) even when there is no concurrent activity.  That's hard to
justify.  But the fact that it is taking the slow-path when there *is*
concurrent activity is harder to complain about.  That might win or it
might lose; the non-concurrent case only loses.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Robert Haas <robertmhaas@gmail.com> wrote:
>On Thu, Jul 27, 2017 at 10:05 PM, Peter Geoghegan <pg@bowt.ie> wrote:
>> I really don't know if that would be worthwhile. It would certainly fix
>> the regression shown in my test case, but that might not go far enough.
>> I strongly suspect that there are more complicated workloads where
>> LP_DEAD cleanup from SELECT statements matters, which is prevented by
>> the LSN check thing, just because there are always other sessions that
>> modify the page concurrently. This might be true of Alik's Zipfian test
>> case, for example.
>
>I haven't studied the test case, but I think as a general principle it
>makes sense to be happy when someone comes up with an algorithm that
>holds a lock for a shorter period of time (and buffer pins are a type
>of lock).  There are a number of places (fast-path locking, for
>example, or vacuum skipping pinned heap pages) where we have
>fast-paths that get taken most of the time and slow paths that get
>used when concurrent activity happens; empirically, such things often
>work out to a win.  I think it's disturbing that this code seems to be
>taking the slow-path (which, in context, means skipping LP_DEAD
>cleanup) even when there is no concurrent activity.  That's hard to
>justify.

That is hard to justify. I don't think that failing to set LP_DEAD hints
is the cost that must be paid to realize a benefit elsewhere, though. I
don't see much problem with having both benefits consistently. It's
actually very unlikely that VACUUM will run, and a TID will be recycled
at exactly the wrong time. We could probably come up with a more
discriminating way of detecting that that may have happened, at least
for Postgres 11. We'd continue to use LSN; the slow path would be taken
when the LSN changed, but we would not give up on setting LP_DEAD bits. I
think we can justify going to the heap again in this slow path, if
that's what it takes.

>But the fact that it is taking the slow-path when there *is*
>concurrent activity is harder to complain about.  That might win or it
>might lose; the non-concurrent case only loses.

Let's wait to see what difference it makes if Alik's zipfian
distribution pgbench test case uses unlogged tables. That may give us a
good sense of the problem for cases with contention/concurrency.

-- 
Peter Geoghegan



On Mon, Jul 31, 2017 at 1:54 PM, Peter Geoghegan <pg@bowt.ie> wrote:
> That is hard to justify. I don't think that failing to set LP_DEAD hints
> is the cost that must be paid to realize a benefit elsewhere, though. I
> don't see much problem with having both benefits consistently. It's
> actually very unlikely that VACUUM will run, and a TID will be recycled
> at exactly the wrong time. We could probably come up with a more
> discriminating way of detecting that that may have happened, at least
> for Postgres 11. We'd continue to use LSN; the slow path would be taken
> when the LSN changed, but we do not give up on setting LP_DEAD bits. I
> think we can justify going to the heap again in this slow path, if
> that's what it takes.

Yeah, that might possibly be a good approach.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Robert Haas <robertmhaas@gmail.com> wrote:
>On Mon, Jul 31, 2017 at 1:54 PM, Peter Geoghegan <pg@bowt.ie> wrote:
>> That is hard to justify. I don't think that failing to set LP_DEAD hints
>> is the cost that must be paid to realize a benefit elsewhere, though. I
>> don't see much problem with having both benefits consistently. It's
>> actually very unlikely that VACUUM will run, and a TID will be recycled
>> at exactly the wrong time. We could probably come up with a more
>> discriminating way of detecting that that may have happened, at least
>> for Postgres 11. We'd continue to use LSN; the slow path would be taken
>> when the LSN changed, but we would not give up on setting LP_DEAD bits. I
>> think we can justify going to the heap again in this slow path, if
>> that's what it takes.
>
>Yeah, that might possibly be a good approach.

I also wonder if it's worth teaching lazy_scan_heap() to keep around a
list of TIDs that can at least have their LP_DEAD bit set within their
index page, for use during subsequent index vacuuming. Doing at least
this much for TIDs from heap pages that are skipped due to some other
session concurrently holding a pin on the heap page (the
"pinskipped_pages" count) could help some cases, and seems doable.
VACUUM does not need an
extra interlock against another VACUUM (such as a buffer pin) here, of
course.
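
Very roughly (hypothetical -- remember_lp_dead_only_tids() doesn't
exist, and safely collecting TIDs from a page we couldn't cleanup-lock
is hand-waved here):

/* in lazy_scan_heap(), when the cleanup lock cannot be acquired: */
if (!ConditionalLockBufferForCleanup(buf))
{
    vacrelstats->pinskipped_pages++;

    /*
     * Remember this page's dead TIDs separately, so that the index
     * vacuum pass can at least set their LP_DEAD bits in the index,
     * without treating them as removable from the heap.
     */
    remember_lp_dead_only_tids(vacrelstats, blkno);
    continue;
}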

I wouldn't expect this to help very much on many workloads, including
Alik's Zipfian workload, but it might be useful to have a real guarantee
about how long it can be, in VACUUM cycles, before a dead index tuple at
least has its LP_DEAD bit set. LP_DEAD will only be set when an index
scan goes to the heap, and it's not hard to imagine workloads where no
index scan ever does that with dead tuples whose heap TIDs were killed
only very recently.

Unlike with heap pruning, setting the LP_DEAD bit of all dead index
tuples on a leaf page is pretty much as good as a full VACUUM of the
page. The only thing that makes it not quite as good is that you cannot
assume that it's safe to kill the heap TIDs afterwards.

-- 
Peter Geoghegan



On Mon, Jul 31, 2017 at 10:54 AM, Peter Geoghegan <pg@bowt.ie> wrote:
> Let's wait to see what difference it makes if Alik's zipfian
> distribution pgbench test case uses unlogged tables. That may give us a
> good sense of the problem for cases with contention/concurrency.

Yura Sokolov of Postgres Pro performed this benchmark at my request.
He took the 9.5 commit immediately preceding 2ed5b87f9 as a baseline.
In all cases, logged tables were used. Note that this is not the most
effective benchmark for showing the regression, because he didn't
replace the PK with a non-unique index, though that is planned as
follow-up; I wanted to stick with Alik's Zipfian test case for the
time being.

(Using a unique index is not the most effective thing for showing the
regression because unique indexes have most LP_DEAD setting done in
_bt_check_unique(), so only SELECTs will do less LP_DEAD cleanup
there; SELECTs are 50% of all queries.)

His results with 10 minute pgbench runs:

Logged
clients | 8217fb14 | 2ed5b87f | master | hashsnap | hashsnap_lwlock
--------+----------+----------+--------+----------+----------------
     10 |   201569 |   204154 | 201095 |   201793 |          206111
     20 |   328180 |   333358 | 334858 |   336721 |          370769
     40 |   196352 |   194692 | 232413 |   231968 |          393947
     70 |   121083 |   116881 | 148066 |   148443 |          224162
    110 |    77989 |    73414 |  99305 |   111404 |          161900
    160 |    48958 |    45830 |  65830 |    82340 |          115678
    230 |    27699 |    25510 |  38186 |    57617 |           80575
    310 |    16369 |    15137 |  21571 |    39745 |           56819
    400 |    10327 |     9486 |  13157 |    27318 |           40859
    500 |     6920 |     6496 |   8638 |    18677 |           29429
    650 |     4315 |     3971 |   5196 |    11762 |           17883
    800 |     2850 |     2772 |   3523 |     7733 |           10977

Note that you also see numbers from various patches from Yura, and the
master branch mixed in here, but 8217fb14 (before) and 2ed5b87f
(after) are the interesting numbers as far as this regression goes.

There is an appreciable reduction in TPS here, though this workload is
not as bad by that measure as first thought. There is a roughly 5%
regression here past 40 clients or so. The regression in the
*consistency* of transaction *throughput* is far more interesting,
though. I've been doing analysis of this by drilling down to
individual test cases with vimdiff, as follows:

$ vimdiff test_8217fb14_logged_1_pgbench_40.out test_2ed5b87f_logged_1_pgbench_40.out

(I attach these two files as a single example. I can provide the full
results to those that ask for them privately; it's too much data to
attach to an e-mail to the list.)

You can see in this example that for most 5 second slices of the 10
minute benchmark, commit 2ed5b87f actually increases TPS somewhat,
which is good. But there are also occasional *big* drops in TPS,
sometimes by as much as 50% over a single 5 second period (when
ANALYZE runs, adding random I/O while holding an exclusive buffer
lock [1]?). When this slowdown happens, latency can be over 3 times
higher, too.

We see much more consistent performance without the B-Tree buffer pin
VACUUM optimization, where there is no discernible pattern of
performance dropping. The headline regression of 4% or 5% is not the
main problem here, it seems.

In summary, commit 2ed5b87f increases throughput for this workload
most of the time, but occasionally sharply reduces it, which averages
out to TPS being 4% or 5% lower overall. I
think we'll find that there are bigger problems TPS-wise with
non-unique indexes when that other test is performed by Yura; let's
wait for those results to come back.

Finally, I find it interesting that when Yura did the same benchmark,
but with 5% SELECTs + 95% UPDATEs, rather than 50% SELECTs + 50%
UPDATEs as above, the overall impact was surprisingly similar. His
results:

clients | 8217fb14 | 2ed5b87f | master | hashsnap | hashsnap_lwlock
--------+----------+----------+--------+----------+----------------
     20 |   187697 |   187335 | 217558 |   215059 |          266894
     50 |    81272 |    78784 |  97948 |    97659 |          157710
     85 |    49446 |    47683 |  64597 |    70814 |          107748
    130 |    32358 |    30393 |  42216 |    50531 |           75001
    190 |    19403 |    17569 |  25704 |    35506 |           51292
    270 |    10803 |     9878 |  14166 |    23851 |           35257
    370 |     6108 |     5645 |   7684 |    15390 |           23659
    500 |     3649 |     3297 |   4225 |     9172 |           14643
    650 |     2239 |     2049 |   2635 |     5887 |            8588
    800 |     1475 |     1424 |   1804 |     4035 |            5611

If nothing else, this shows how generally reliant these kinds of
workloads can be on LP_DEAD setting. And, there is one difference: The
regression is seen here at *all* client counts, even with only 20
clients. This is presumably because with only 5% SELECTs it's more
important that those few remaining SELECTs be able to perform LP_DEAD
setting.

[1] https://wiki.postgresql.org/wiki/Key_normalization#Avoiding_unnecessary_unique_index_enforcement
-- 
Peter Geoghegan

On Fri, Aug 4, 2017 at 3:30 PM, Peter Geoghegan <pg@bowt.ie> wrote:
> Yura Sokolov of Postgres Pro performed this benchmark at my request.
> He took the 9.5 commit immediately preceding 2ed5b87f9 as a baseline.

I attach a simple patch that comments out the release of the buffer
pin for logged tables where an MVCC snapshot is used for a scan that
is not an index-only scan. This is the simplest way of evaluating the
effects of disabling 2ed5b87f9. Yura or others may find this helpful.
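
The change is confined to _bt_drop_lock_and_maybe_pin(), in
nbtsearch.c; the patched body looks roughly like this:

LockBuffer(sp->buf, BUFFER_LOCK_UNLOCK);

if (IsMVCCSnapshot(scan->xs_snapshot) &&
    RelationNeedsWAL(scan->indexRelation) &&
    !scan->xs_want_itup)
{
    /*
     * Pin release disabled, to evaluate behavior without 2ed5b87f9:
     *
     * ReleaseBuffer(sp->buf);
     * sp->buf = InvalidBuffer;
     */
}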

To be clear, I am certainly not suggesting that we should revert
2ed5b87f9. I do think that we need to give serious thought to fixing
the regression that 2ed5b87f9 introduced for LP_DEAD setting by index
scans, though.

-- 
Peter Geoghegan
