Re: Confine vacuum skip logic to lazy_scan_skip - Mailing list pgsql-hackers

From Melanie Plageman
Subject Re: Confine vacuum skip logic to lazy_scan_skip
Date
Msg-id 20240306234733.nd4a636colxkgq2e@liskov
In response to Re: Confine vacuum skip logic to lazy_scan_skip  (Melanie Plageman <melanieplageman@gmail.com>)
Responses Re: Confine vacuum skip logic to lazy_scan_skip
List pgsql-hackers
On Tue, Feb 27, 2024 at 02:47:03PM -0500, Melanie Plageman wrote:
> On Mon, Jan 29, 2024 at 8:18 PM Melanie Plageman
> <melanieplageman@gmail.com> wrote:
> >
> > On Fri, Jan 26, 2024 at 8:28 AM vignesh C <vignesh21@gmail.com> wrote:
> > >
> > > CFBot shows that the patch does not apply anymore as in [1]:
> > > === applying patch
> > > ./v3-0002-Add-lazy_scan_skip-unskippable-state-to-LVRelStat.patch
> > > patching file src/backend/access/heap/vacuumlazy.c
> > > ...
> > > Hunk #10 FAILED at 1042.
> > > Hunk #11 FAILED at 1121.
> > > Hunk #12 FAILED at 1132.
> > > Hunk #13 FAILED at 1161.
> > > Hunk #14 FAILED at 1172.
> > > Hunk #15 FAILED at 1194.
> > > ...
> > > 6 out of 21 hunks FAILED -- saving rejects to file
> > > src/backend/access/heap/vacuumlazy.c.rej
> > >
> > > Please post an updated version for the same.
> > >
> > > [1] - http://cfbot.cputube.org/patch_46_4755.log
> >
> > Fixed in attached rebased v4
> 
> In light of Thomas' update to the streaming read API [1], I have
> rebased and updated this patch set.
> 
> The attached v5 has some simplifications when compared to v4 but takes
> largely the same approach.

Attached is a patch set (v5a) which updates the streaming read user for
vacuum to fix an issue Andrey Borodin pointed out to me off-list.

Note that I started writing this email before seeing Heikki's upthread
review [1], so I will respond to that in a bit. There are no changes in
v5a to any of the prelim refactoring patches which Heikki reviewed in
that email. I only changed the vacuum streaming read users (last two
patches in the set).

Back to this patch set:
Andrey pointed out that it was failing to compile on windows and the
reason is that I had accidentally left an undefined variable "index" in
these places

    Assert(index > 0);
...
    ereport(DEBUG2,
            (errmsg("table \"%s\": removed %lld dead item identifiers in %u pages",
                    vacrel->relname, (long long) index, vacuumed_pages)));

See https://cirrus-ci.com/task/6312305361682432

I don't understand how this didn't warn me (or fail to compile) for an
assert build on my own workstation. It seems to think "index" is a
function?

Anyway, thinking about what the correct assertion would be here:

    Assert(index > 0);
    Assert(vacrel->num_index_scans > 1 ||
           (rbstate->end_idx == vacrel->lpdead_items &&
            vacuumed_pages == vacrel->lpdead_item_pages));

I think I can just replace "index" with "rbstate->end_idx". At the end
of reaping, this should have the same value that "index" would have had.
The issue with this is if pg_streaming_read_buffer_get_next() somehow
never returned a valid buffer (there were no dead items), then rbstate
would potentially be uninitialized. The old assertion (index > 0) would
only have been true if there were some dead items, but there isn't an
explicit assertion in this function that there were some dead items.
Perhaps it is worth adding this? Even if we add this, perhaps it is
unacceptable from a programming standpoint to use rbstate in that scope?

In addition to fixing this slip-up, I have done some performance testing
for streaming read vacuum. Note that these tests are for both vacuum
passes (1 and 2) using streaming read.

Performance results:

The TL;DR of my performance results is that streaming read vacuum is
faster. However there is an issue with the interaction of the streaming
read code and the vacuum buffer access strategy which must be addressed.

Note that "master" in the results below is actually just a commit on my
branch [2] before the one adding the vacuum streaming read users. So it
includes all of my refactoring of the vacuum code from the preliminary
patches.

I tested two vacuum "data states". Both are relatively small tables
because the impact of streaming read can easily be seen even at small
table sizes. DDL for both data states is at the end of the email.

The first data state is a 28 MB table which has never been vacuumed and
has one or two dead tuples on every block. All of the blocks have dead
tuples, so all of the blocks must be vacuumed. We'll call this the
"sequential" data state.

The second data state is a 67 MB table which has been vacuumed and then
a small percentage of the blocks (non-consecutive blocks at irregular
intervals) are updated afterward. Because the visibility map has been
updated and only a few blocks have dead tuples, large ranges of blocks
do not need to be vacuumed. There is at least one run of blocks with
dead tuples larger than 1 block but most of the blocks with dead tuples
are a single block followed by many blocks with no dead tuples. We'll
call this the "few" data state.

I tested these data states with "master" and with streaming read vacuum
with three caching options:

- table data fully in shared buffers (via pg_prewarm)
- table data in the kernel buffer cache but not in shared buffers
- table data completely uncached

I tested the OS cached and uncached caching options with both the
default vacuum buffer access strategy and with BUFFER_USAGE_LIMIT 0
(which uses as many shared buffers as needed).

For the streaming read vacuum, I tested with maintenance_io_concurrency
10, 100, and 1000. 10 is the current default on master.
maintenance_io_concurrency is not used by vacuum on master AFAICT.

maintenance_io_concurrency is used by streaming read to determine how
many buffers it can pin at the same time (with the hope of combining
consecutive blocks into larger IOs) and, in the case of vacuum, it is
used to determine prefetch distance.

In the following results, I ran vacuum at least five times and averaged
the timing results.

Table data cached in shared buffers
===================================

Sequential data state
---------------------

The only noticeable difference in performance was that streaming read
vacuum took 2% longer than master (19 ms vs 18.6 ms). It was a bit more
noticeable at maintenance_io_concurrency 1000 than 10.

This may be resolved by a patch Thomas is working on to avoid pinning
too many buffers if larger IOs cannot be created (as in a fully shared
buffers-resident workload). We should revisit this when that patch is
available.

Few data state
--------------

There was no difference in timing for any of the scenarios.

Table data cached in OS buffer cache
====================================

Sequential data state
---------------------

With the buffer access strategy disabled, streaming read vacuum took 11%
less time regardless of maintenance_io_concurrency (26 ms vs 23 ms).

With the default vacuum buffer access strategy,
maintenance_io_concurrency had a large impact:

  Note that "mic" is maintenance_io_concurrency

| data state | code      | mic  | time (ms) |
+------------+-----------+------+-----------+
| sequential | master    | NA   | 99        |
| sequential | streaming | 10   | 122       |
| sequential | streaming | 100  | 28        |

The streaming read API calculates the maximum number of pinned buffers
as 4 * maintenance_io_concurrency. The default vacuum buffer access
strategy ring buffer is 256 kB -- which is 32 buffers.

With maintenance_io_concurrency 10, streaming read code wants to pin 40
buffers. There is likely an interaction between this and the buffer
access strategy which leads to the slowdown at
maintenance_io_concurrency 10.

We could change the default maintenance_io_concurrency, but a better
option is to take the buffer access strategy into account in the
streaming read code.

Few data state
--------------

There was no difference in timing for any of the scenarios.

Table data uncached
===================

Sequential data state
---------------------

When the buffer access strategy is disabled, streaming read vacuum takes
12% less time regardless of maintenance_io_concurrency (36 ms vs 41 ms).

With the default buffer access strategy (ring buffer 256 kB) and
maintenance_io_concurrency 10 (the default), the streaming read vacuum
takes 19% more time. But if we bump maintenance_io_concurrency up to
100+, streaming read vacuum takes 64% less time:

| data state | code      | mic  | time (ms) |
+------------+-----------+------+-----------+
| sequential | master    | NA   | 113       |
| sequential | streaming | 10   | 140       |
| sequential | streaming | 100  | 41        |

This is likely due to the same adverse interaction between streaming
reads' max pinned buffers and the buffer access strategy ring buffer
size.

Few data state
--------------

The buffer access strategy had no impact here, so all of these results
are with the default buffer access strategy. The streaming read vacuum
takes 20-25% less time than master vacuum.

| data state | code      | mic  | time (ms) |
+------------+-----------+------+-----------+
| few        | master    | NA   | 4.5       |
| few        | streaming | 10   | 3.4       |
| few        | streaming | 100  | 3.5       |

The improvement is likely due to prefetching and the one range of
consecutive blocks containing dead tuples which could be merged into a
larger IO.

Higher maintenance_io_concurrency only helps a little, probably because:

1) most of the blocks to vacuum are not consecutive, so we can't make
bigger IOs in most cases
2) we are not vacuuming enough blocks to want a prefetch distance of
more than 10 blocks.

This experiment should probably be redone with larger tables containing
more blocks needing vacuum. At 3-4 ms, a 20% performance difference
isn't really that interesting.

The next step (other than the preliminary refactoring patches) is to
decide how the streaming read API should use the buffer access strategy.

Sequential Data State DDL:
  drop table if exists foo;
  create table foo (a int) with (autovacuum_enabled=false, fillfactor=25);
  insert into foo select i % 3 from generate_series(1,200000)i;
  update foo set a = 5 where a = 1;

Few Data State DDL:
  drop table if exists foo;
  create table foo (a int) with (autovacuum_enabled=false, fillfactor=25);
  insert into foo select i from generate_series(2,20000)i;
  insert into foo select 1 from generate_series(1,200)i;
  insert into foo select i from generate_series(2,20000)i;
  insert into foo select 1 from generate_series(1,200)i;
  insert into foo select i from generate_series(2,200000)i;
  insert into foo select 1 from generate_series(1,200)i;
  insert into foo select i from generate_series(2,20000)i;
  insert into foo select 1 from generate_series(1,2000)i;
  insert into foo select i from generate_series(2,20000)i;
  insert into foo select 1 from generate_series(1,200)i;
  insert into foo select i from generate_series(2,200000)i;
  insert into foo select 1 from generate_series(1,200)i;
  vacuum (freeze) foo;
  update foo set a = 5 where a = 1;

- Melanie

[1] https://www.postgresql.org/message-id/1eeccf12-d5d1-4b7e-b88b-7342410129d7%40iki.fi
[2] https://github.com/melanieplageman/postgres/tree/vac_pgsr
