
From Masahiko Sawada
Subject Parallel heap vacuum
Msg-id CAD21AoAEfCNv-GgaDheDJ+s-p_Lv1H24AiJeNoPGCmZNSwL1YA@mail.gmail.com
Responses Re: Parallel heap vacuum  (Amit Kapila <amit.kapila16@gmail.com>)
List pgsql-hackers
Hi all,

The parallel vacuum we have today supports only index vacuuming.
Therefore, while multiple workers can work on different indexes in
parallel, the heap table is always processed by a single process.
I'd like to propose $subject, which enables us to have multiple
workers working on a single heap table. This would be helpful to
speed up vacuuming for tables without indexes or tables with
INDEX_CLEANUP = off.

I've attached a PoC patch for this feature. It implements only
parallel heap scans in lazy vacuum. We can extend this feature to
support parallel heap vacuum as well, either in the future or in the
same patch.

# Overall idea (for parallel heap scan in lazy vacuum)

At the beginning of vacuum, we determine how many workers to launch
based on the table size, as with other parallel operations. The
number of workers is capped by max_parallel_maintenance_workers. Once
we decide to use parallel heap scan, we prepare a DSM segment to share
data among the parallel workers and the leader. The shared information
includes at least the vacuum options such as aggressive, the counters
collected during lazy vacuum such as scanned_pages, the vacuum cutoffs
such as VacuumCutoffs and GlobalVisState, and the parallel scan
description.
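
To make that concrete, here is a minimal sketch of what such a
DSM-resident struct could look like. The struct name and its fields
(LVParallelScanShared, the atomic scanned_pages counter, etc.) are
hypothetical and not taken from the patch; only struct VacuumCutoffs
and ParallelBlockTableScanDescData are existing PostgreSQL types.

#include "access/relscan.h"
#include "commands/vacuum.h"
#include "port/atomics.h"

/* Hypothetical layout of the state shared through DSM. */
typedef struct LVParallelScanShared
{
    /* vacuum options shared with the workers */
    bool        aggressive;

    /* cutoffs computed once by the leader and reused by all workers */
    struct VacuumCutoffs cutoffs;

    /*
     * GlobalVisState is private to snapmgr.c, so sharing it needs new
     * export/import helpers (see the Discussions section); omitted here.
     */

    /* counters each participant accumulates; the leader sums them up */
    pg_atomic_uint64 scanned_pages;

    /* shared state for the table_block_parallelscan_XXX() family */
    ParallelBlockTableScanDescData pbscan;
} LVParallelScanShared;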

Before starting the heap scan in lazy vacuum, we launch the parallel
workers, and then each worker (and the leader) processes a different
set of blocks. Each worker does HOT-pruning on pages and collects dead
tuple TIDs. When adding dead tuple TIDs, workers need to hold an
exclusive lock on the TidStore. At the end of the heap scan phase, the
workers exit and the leader waits for all of them to finish. After
that, the leader gathers the counters collected by the parallel
workers and computes the oldest relfrozenxid (and relminmxid). Then,
if parallel index vacuum is also enabled, we launch other parallel
workers for parallel index vacuuming.
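
As a rough sketch of those two steps (not the patch's actual code): the
worker side assumes the TidStore locking functions in access/tidstore.h
(TidStoreLockExclusive, TidStoreSetBlockOffsets, TidStoreUnlock), and
the leader side assumes a hypothetical WorkerCounters struct holding
each worker's results gathered from DSM.

#include "access/multixact.h"
#include "access/tidstore.h"
#include "access/transam.h"
#include "storage/block.h"
#include "storage/off.h"

/* Worker side: publish one page's dead TIDs under the TidStore's lock. */
static void
worker_record_dead_items(TidStore *dead_items, BlockNumber blkno,
                         OffsetNumber *deadoffsets, int num_offsets)
{
    TidStoreLockExclusive(dead_items);
    TidStoreSetBlockOffsets(dead_items, blkno, deadoffsets, num_offsets);
    TidStoreUnlock(dead_items);
}

/* Hypothetical per-worker results, filled in by each worker before exit. */
typedef struct WorkerCounters
{
    BlockNumber scanned_pages;
    TransactionId NewRelfrozenXid;
    MultiXactId NewRelminMxid;
} WorkerCounters;

/* Leader side: sum the counters and keep the oldest cutoffs seen. */
static void
leader_gather_counters(WorkerCounters *results, int nworkers,
                       BlockNumber *scanned_pages,
                       TransactionId *new_relfrozenxid,
                       MultiXactId *new_relminmxid)
{
    for (int i = 0; i < nworkers; i++)
    {
        *scanned_pages += results[i].scanned_pages;

        /* the table's new relfrozenxid must be the oldest any worker saw */
        if (TransactionIdPrecedes(results[i].NewRelfrozenXid,
                                  *new_relfrozenxid))
            *new_relfrozenxid = results[i].NewRelfrozenXid;

        if (MultiXactIdPrecedes(results[i].NewRelminMxid, *new_relminmxid))
            *new_relminmxid = results[i].NewRelminMxid;
    }
}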

When it comes to the parallel heap scan in lazy vacuum, I think we can
use the table_block_parallelscan_XXX() family. One tricky thing we need
to deal with is that if the TidStore memory usage reaches the limit, we
stop the parallel scan, do index vacuuming and heap vacuuming, and then
resume the parallel scan from the previous state. In order to do that,
the patch stores ParallelBlockTableScanWorker, the per-worker parallel
scan state, in DSM so that parallel workers can resume the scan using
the same parallel scan state.
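
For illustration, here is a sketch of how a worker could consume blocks
from the shared scan and suspend at the memory limit, using the
existing table_block_parallelscan_*() functions and
TidStoreMemoryUsage(); the function itself and its first_pass handling
are assumptions, not the patch.

#include "access/relscan.h"
#include "access/tableam.h"
#include "access/tidstore.h"
#include "utils/rel.h"

static void
parallel_lazy_scan_worker(Relation rel,
                          ParallelBlockTableScanDesc pbscan,
                          ParallelBlockTableScanWorker pbscanwork,
                          TidStore *dead_items, size_t max_bytes,
                          bool first_pass)
{
    BlockNumber blkno;

    /*
     * pbscanwork points into DSM in the proposal, so it is initialized only
     * once; later passes resume from the position the previous pass left.
     */
    if (first_pass)
        table_block_parallelscan_startblock_init(rel, pbscanwork, pbscan);

    while ((blkno = table_block_parallelscan_nextpage(rel, pbscanwork,
                                                      pbscan)) != InvalidBlockNumber)
    {
        /* ... prune the page and add its dead TIDs to dead_items ... */

        /*
         * If the shared TidStore has reached its memory limit, suspend the
         * scan; index and heap vacuuming run next, and the scan resumes
         * afterwards from the state kept in DSM.
         */
        if (TidStoreMemoryUsage(dead_items) > max_bytes)
            break;
    }
}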

In addition, since we could end up launching fewer workers than
requested, some ParallelBlockTableScanWorker data could be used once
and then never again, even though it still has unprocessed blocks. To
handle this case, in the patch the leader process checks at the end of
the parallel scan whether any per-worker scan is incomplete. If so, the
leader completes the scan using that worker's
ParallelBlockTableScanWorker data on behalf of the workers, as sketched
below.
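
A sketch of that leader-side cleanup, assuming the per-worker scan
states live in a DSM array; the way the real patch detects incomplete
scans may differ.

#include "access/relscan.h"
#include "access/tableam.h"
#include "utils/rel.h"

static void
leader_complete_unfinished_scans(Relation rel,
                                 ParallelBlockTableScanDesc pbscan,
                                 ParallelBlockTableScanWorkerData *pbscanworks,
                                 int nslots)
{
    for (int i = 0; i < nslots; i++)
    {
        BlockNumber blkno;

        /*
         * Drain whatever blocks remain assigned to this slot's scan state.
         * By this point the shared block allocator is exhausted, so slots
         * with no leftover chunk return InvalidBlockNumber right away.
         */
        while ((blkno = table_block_parallelscan_nextpage(rel,
                                                          &pbscanworks[i],
                                                          pbscan)) != InvalidBlockNumber)
        {
            /* ... process the page exactly as a worker would ... */
        }
    }
}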

# Discussions

I'm fairly confident in the overall design of this feature, but there
are some implementation points we need to discuss.

In the patch, I extended vacuumparallel.c to support parallel table
scan (and vacuum in the future). This required adding some table AM
callbacks, such as DSM size estimation, DSM initialization, and the
actual table scan. We need to verify whether these APIs are
appropriate. Specifically, if we want to support both parallel heap
scan and parallel heap vacuum, do we want to add separate callbacks for
them? That could be overkill, since such a two-pass vacuum strategy is
specific to the heap AM.
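
Purely for illustration, one possible shape for such callbacks is
sketched below; none of these names exist in TableAmRoutine today, and
whether scan and vacuum get separate callbacks is exactly the open
question.

#include "access/tableam.h"

typedef struct ParallelVacuumTableAmCallbacks
{
    /* estimate the DSM space the AM needs for its parallel scan state */
    Size        (*parallel_vacuum_estimate) (Relation rel, int nworkers);

    /* initialize the AM-specific part of the DSM segment */
    void        (*parallel_vacuum_initialize) (Relation rel, void *shm,
                                               int nworkers);

    /* scan the table collecting dead TIDs (the first heap pass) */
    void        (*parallel_vacuum_scan) (Relation rel, void *shm);

    /* possibly a separate callback for the second heap pass */
    void        (*parallel_vacuum_vacuum) (Relation rel, void *shm);
} ParallelVacuumTableAmCallbacks;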

As another implementation idea, we might want to implement parallel
heap scan/vacuum in vacuumlazy.c while minimizing the changes to
vacuumparallel.c. That way, we would not need to add table AM
callbacks. However, we would end up with duplicate code related to
parallel operations in vacuum, such as vacuum delays.

Also, we might need to add some functions to share GlobalVisState
among parallel workers, since GlobalVisState is a private struct.
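
For example (purely hypothetical declarations, modeled on the existing
EstimateSnapshotSpace()/SerializeSnapshot()/RestoreSnapshot() pattern),
something along these lines would be needed:

#include "utils/snapmgr.h"

/*
 * Hypothetical additions: GlobalVisState is private to snapmgr.c, so
 * exporting it into DSM would need helpers along these lines.  These
 * functions do not exist today.
 */
extern Size EstimateGlobalVisStateSpace(GlobalVisState *vistest);
extern void SerializeGlobalVisState(GlobalVisState *vistest, char *start_address);
extern GlobalVisState *RestoreGlobalVisState(char *start_address);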

Other points that I'm somewhat uncomfortable with, or that need to be
discussed, are marked in the code with XXX comments.

# Benchmark results

* Test-1: parallel heap scan on the table without indexes

I created a 20GB table, generated garbage on it, and ran vacuum while
changing the parallel degree:

create unlogged table test (a int) with (autovacuum_enabled = off);
insert into test select generate_series(1, 600000000); --- 20GB table
delete from test where a % 5 = 0;
vacuum (verbose, parallel 0) test;

Here are the results (total time, which is essentially the heap scan
time since the table has no indexes):

PARALLEL 0: 21.99 s (single process)
PARALLEL 1: 11.39 s
PARALLEL 2:   8.36 s
PARALLEL 3:   6.14 s
PARALLEL 4:   5.08 s

* Test-2: parallel heap scan on the table with one index

I used a table similar to the one in test-1, but created one btree index on it:

create unlogged table test (a int) with (autovacuum_enabled = off);
insert into test select generate_series(1, 600000000); --- 20GB table
create index on test (a);
delete from test where a % 5 = 0;
vacuum (verbose, parallel 0) test;

I've measured the total execution time as well as the time of each
vacuum phase (from left to right: heap scan time, index vacuum time,
and heap vacuum time):

PARALLEL 0: 45.11 s (21.89, 16.74, 6.48)
PARALLEL 1: 42.13 s (12.75, 22.04, 7.23)
PARALLEL 2: 39.27 s (8.93, 22.78, 7.45)
PARALLEL 3: 36.53 s (6.76, 22.00, 7.65)
PARALLEL 4: 35.84 s (5.85, 22.04, 7.83)

Overall, the parallel heap scan in lazy vacuum shows decent
scalability; in both test-1 and test-2, the heap scan got ~4x faster
with 4 parallel workers. On the other hand, when it comes to the total
vacuum execution time, I could not see much improvement in test-2
(45.11 s vs. 35.84 s). Comparing PARALLEL 0 with PARALLEL 1 in test-2,
the heap scan got faster (21.89 vs. 12.75) whereas index vacuum got
slower (16.74 vs. 22.04), and the heap scan in test-2 was not as fast
as in test-1 with one parallel worker (12.75 vs. 11.39).

I think the reason is that the shared TidStore is not very scalable,
since there is a single lock on it. In test-1 we never use the shared
TidStore, because all dead tuples are removed during heap pruning, so
the scalability was better overall than in test-2. In the PARALLEL 0
case of test-2 we use the local TidStore, whereas from a parallel
degree of 1 onward we use the shared TidStore and the parallel workers
update it concurrently. Also, I guess that the lookup performance of
the local TidStore is better than that of the shared TidStore because
of the differences between a bump context and a DSA area. I think this
difference contributed to index vacuuming getting slower (16.74 vs.
22.04).

There are two obvious ideas for improving the overall vacuum execution
time: (1) improve the shared TidStore's scalability and (2) support
parallel heap vacuum. For (1), several ideas have been proposed by the
ART authors [1]. I've not tried them yet, but they might be applicable
to our ART implementation. Still, I prefer to start with (2) since it
would be easier. Feedback is very welcome.

Regards,

[1] https://db.in.tum.de/~leis/papers/artsync.pdf

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com

