Thread: Parallel heap vacuum

Parallel heap vacuum

From
Masahiko Sawada
Date:
Hi all,

The parallel vacuum we have today supports only index vacuuming.
Therefore, while multiple workers can work on different indexes in
parallel, the heap table is always processed by a single process.
I'd like to propose $subject, which enables us to have multiple
workers running on a single heap table. This would be helpful to
speed up vacuuming for tables without indexes or tables with
INDEX_CLEANUP = off.

I've attached a PoC patch for this feature. It implements only
parallel heap scans in lazy vacuum (vacuumlazy.c). We can extend this
feature to support parallel heap vacuum as well, either in the future
or in the same patch.

# Overall idea (for parallel heap scan in lazy vacuum)

At the beginning of vacuum, we determine how many workers to launch
based on the table size, as other parallel query operations do. The
number of workers is capped by max_parallel_maintenance_workers. Once
we decide to use parallel heap scan, we prepare a DSM segment to share
data among the parallel workers and the leader. The shared information
includes at least the vacuum options such as aggressive, the counters
collected during lazy vacuum such as scanned_pages, the vacuum cutoffs
such as VacuumCutoffs and GlobalVisState, and the parallel scan
description.
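
To illustrate, the shared area needs to carry roughly the following (a
simplified sketch; the field set of the patch's actual PHVShared struct
is larger and its layout differs):

```
/* Simplified sketch of the DSM-shared state (fields are illustrative) */
typedef struct PHVShared
{
    bool        aggressive;     /* is this an aggressive vacuum? */
    bool        skipwithvm;     /* can we skip pages using the VM? */

    /* cutoffs computed once by the leader, read-only for workers */
    struct VacuumCutoffs cutoffs;

    /*
     * The DSM area additionally holds a serialized GlobalVisState, the
     * per-worker counters (scanned_pages, tuples_deleted, NewRelfrozenXid,
     * ...) that the leader gathers after the scan phase, and the parallel
     * scan descriptor (ParallelBlockTableScanDescData).
     */
} PHVShared;
```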

Before starting the heap scan in lazy vacuum, we launch the parallel
workers, and then each worker (and the leader) processes different
blocks. Each worker does HOT-pruning on pages and collects dead tuple
TIDs. When adding dead tuple TIDs, workers need to hold an exclusive
lock on the TidStore. At the end of the heap scan phase, the workers
exit and the leader waits for all of them to exit. After that, the
leader gathers the counters collected by the parallel workers and
computes the oldest relfrozenxid (and relminmxid). Then, if parallel
index vacuum is also enabled, we launch other parallel workers for
parallel index vacuuming.
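
In pseudo-C, the leader side of the scan phase looks roughly like this
(a sketch; apart from the parallel-infrastructure calls and
parallel_heap_vacuum_gather_scan_stats(), which exists in the patch,
the function names are illustrative):

```
/* Sketch of the leader's scan phase (error handling omitted) */
LaunchParallelWorkers(pcxt);

/* the leader participates in the scan as well */
do_parallel_lazy_scan_heap(vacrel);     /* illustrative name */

/* wait for all workers to exit, then fold their results into the leader's */
WaitForParallelWorkersToFinish(pcxt);
parallel_heap_vacuum_gather_scan_stats(vacrel);

/*
 * Compute the new relfrozenxid/relminmxid from the gathered per-worker
 * values, then launch workers again for parallel index vacuuming if it
 * is enabled.
 */
```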

When it comes to the parallel heap scan in lazy vacuum, I think we can
use the table_block_parallelscan_XXX() family. One tricky thing we need
to deal with is that if the TidStore memory usage reaches the limit, we
stop the parallel scan, do index vacuum and table vacuum, and then
resume the parallel scan from the previous state. In order to do that,
the patch stores ParallelBlockTableScanWorker, the per-worker parallel
scan state, in DSM so that different parallel workers can resume the
scan using the same parallel scan state.

In addition, since we could end up launching fewer workers than
requested, it could happen that some ParallelBlockTableScanWorker data
is used once and then never picked up again, leaving its assigned
blocks unprocessed. To handle this case, the leader process checks at
the end of the parallel scan whether there is an uncompleted parallel
scan. If so, the leader completes the scan on behalf of the workers,
using their ParallelBlockTableScanWorker data.
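
Concretely, each worker's scan phase is built around a loop like the
following (a sketch on top of the existing table_block_parallelscan_*
and TidStore APIs; the real code handles skipping all-visible pages,
vacuum delays, and so on):

```
/* Sketch of one worker's share of the heap scan (details omitted) */
static void
parallel_lazy_scan_worker(Relation rel,
                          ParallelBlockTableScanDesc pbscan,
                          ParallelBlockTableScanWorker pbscanwork, /* in DSM */
                          TidStore *dead_items, size_t dead_items_max_bytes)
{
    BlockNumber blkno;

    while ((blkno = table_block_parallelscan_nextpage(rel, pbscanwork,
                                                      pbscan)) != InvalidBlockNumber)
    {
        /* read the block, HOT-prune it, and collect dead item offsets ... */

        /*
         * If the shared TidStore has reached its memory limit, suspend the
         * scan.  Because pbscanwork stays in DSM, the scan can be resumed
         * from this position after index vacuuming and heap vacuuming.
         */
        if (TidStoreMemoryUsage(dead_items) > dead_items_max_bytes)
            break;
    }
}
```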

# Discussions

I'm somewhat convinced of the overall design of this feature, but there
are some implementation points we need to discuss.

In the patch, I extended vacuumparallel.c to support parallel table
scan (and vacuum in the future). To do so, I had to add some table AM
callbacks such as DSM size estimation, DSM initialization, and the
actual table scan, etc. We need to verify that these APIs are
appropriate. Specifically, if we want to support both parallel heap
scan and parallel heap vacuum, do we want to add separate callbacks for
them? That could be overkill since such a two-pass vacuum strategy is
specific to the heap AM.
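
For reference, the additions to TableAmRoutine amount to something like
the following (parallel_vacuum_compute_workers is the callback name used
in the patch; the other names and signatures here are only illustrative):

```
/* Sketch of new TableAmRoutine members for parallel vacuum support */

/* how many workers to use for a parallel vacuum scan of this relation */
int         (*parallel_vacuum_compute_workers) (Relation rel, int requested);

/* estimate and initialize the AM-specific part of the vacuum DSM */
Size        (*parallel_vacuum_estimate) (Relation rel, ParallelContext *pcxt,
                                         int nworkers);
void        (*parallel_vacuum_initialize) (Relation rel, ParallelContext *pcxt,
                                           int nworkers);

/* perform one worker's portion of the parallel heap scan */
void        (*parallel_vacuum_scan_worker) (Relation rel,
                                            ParallelVacuumState *pvs);
```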

As another implementation idea, we might want to implement parallel
heap scan/vacuum in vacuumlazy.c while minimizing changes to
vacuumparallel.c. That way, we would not need to add table AM
callbacks. However, we would end up with duplicated code related to
parallel operation in vacuum, such as vacuum delays.

Also, we might need to add some functions to share GlobalVisState
among parallel workers, since GlobalVisState is a private struct.
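
For example, something along the lines of the Estimate/Serialize/Restore
pattern we already use for snapshots; none of these functions exist
today, so the signatures below are purely hypothetical:

```
/* Hypothetical helpers, modeled on EstimateSnapshotSpace(),
 * SerializeSnapshot() and RestoreSnapshot(); nothing like this exists
 * in procarray.c today. */
extern Size EstimateGlobalVisStateSpace(GlobalVisState *vistest);
extern void SerializeGlobalVisState(GlobalVisState *vistest, char *start_address);
extern GlobalVisState *RestoreGlobalVisState(char *start_address);
```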

Other points I'm somewhat uncomfortable with, or that need to be
discussed, remain in the code as XXX comments.

# Benchmark results

* Test-1: parallel heap scan on the table without indexes

I created a 20GB table, made garbage on it, and ran vacuum while
changing the parallel degree:

create unlogged table test (a int) with (autovacuum_enabled = off);
insert into test select generate_series(1, 600000000); --- 20GB table
delete from test where a % 5 = 0;
vacuum (verbose, parallel 0) test;

Here are the results (total time and heap scan time):

PARALLEL 0: 21.99 s (single process)
PARALLEL 1: 11.39 s
PARALLEL 2:   8.36 s
PARALLEL 3:   6.14 s
PARALLEL 4:   5.08 s

* Test-2: parallel heap scan on the table with one index

I used a table similar to the one in test 1 but created one btree index on it:

create unlogged table test (a int) with (autovacuum_enabled = off);
insert into test select generate_series(1, 600000000); --- 20GB table
create index on test (a);
delete from test where a % 5 = 0;
vacuum (verbose, parallel 0) test;

I've measured the total execution time as well as the time of each
vacuum phase (from left to right: heap scan time, index vacuum time,
and heap vacuum time):

PARALLEL 0: 45.11 s (21.89, 16.74, 6.48)
PARALLEL 1: 42.13 s (12.75, 22.04, 7.23)
PARALLEL 2: 39.27 s (8.93, 22.78, 7.45)
PARALLEL 3: 36.53 s (6.76, 22.00, 7.65)
PARALLEL 4: 35.84 s (5.85, 22.04, 7.83)

Overall, the parallel heap scan in lazy vacuum shows decent
scalability; in both test-1 and test-2, the execution time of the heap
scan got ~4x faster with 4 parallel workers. On the other hand, when it
comes to the total vacuum execution time, I could not see much
performance improvement in test-2 (45.11 vs. 35.84). Comparing
PARALLEL 0 and PARALLEL 1 in test-2, the heap scan got faster
(21.89 vs. 12.75) whereas index vacuum got slower (16.74 vs. 22.04),
and the heap scan in test-2 was not as fast as in test-1 with 1
parallel worker (12.75 vs. 11.39).

I think the reason is that the shared TidStore is not very scalable,
since we have a single lock on it. In all cases in test-1, we don't use
the shared TidStore because all dead tuples are removed during heap
pruning, so the scalability was better overall than in test-2. In the
PARALLEL 0 case in test-2 we use the local TidStore, and from a
parallel degree of 1 onward in test-2 we use the shared TidStore, with
parallel workers updating it concurrently. Also, I guess that the
lookup performance of the local TidStore is better than that of the
shared TidStore because of the differences between a bump context and a
DSA area. I think this difference contributed to index vacuuming
getting slower (16.74 vs. 22.04).
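
For reference, the contended path is the per-page insertion into the
shared TidStore, which has to be wrapped in a store-wide exclusive lock,
roughly like this:

```
/* Sketch: recording one page's dead items in the shared TidStore */
if (num_offsets > 0)
{
    TidStoreLockExclusive(dead_items);  /* single lock covering the whole store */
    TidStoreSetBlockOffsets(dead_items, blkno, deadoffsets, num_offsets);
    TidStoreUnlock(dead_items);
}
```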

There are two obvious ideas to improve the overall vacuum execution
time: (1) improve the shared TidStore scalability and (2) support
parallel heap vacuum. For (1), several ideas are proposed by the ART
authors[1]. I've not tried them, but they might be applicable to our
ART implementation. But I prefer to start with (2) since it would be
easier. Feedback is very welcome.

Regards,

[1] https://db.in.tum.de/~leis/papers/artsync.pdf

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com

Attachment

Re: Parallel heap vacuum

From
Amit Kapila
Date:
On Fri, Jun 28, 2024 at 9:44 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
> # Benchmark results
>
> * Test-1: parallel heap scan on the table without indexes
>
> I created 20GB table, made garbage on the table, and run vacuum while
> changing parallel degree:
>
> create unlogged table test (a int) with (autovacuum_enabled = off);
> insert into test select generate_series(1, 600000000); --- 20GB table
> delete from test where a % 5 = 0;
> vacuum (verbose, parallel 0) test;
>
> Here are the results (total time and heap scan time):
>
> PARALLEL 0: 21.99 s (single process)
> PARALLEL 1: 11.39 s
> PARALLEL 2:   8.36 s
> PARALLEL 3:   6.14 s
> PARALLEL 4:   5.08 s
>
> * Test-2: parallel heap scan on the table with one index
>
> I used a similar table to the test case 1 but created one btree index on it:
>
> create unlogged table test (a int) with (autovacuum_enabled = off);
> insert into test select generate_series(1, 600000000); --- 20GB table
> create index on test (a);
> delete from test where a % 5 = 0;
> vacuum (verbose, parallel 0) test;
>
> I've measured the total execution time as well as the time of each
> vacuum phase (from left heap scan time, index vacuum time, and heap
> vacuum time):
>
> PARALLEL 0: 45.11 s (21.89, 16.74, 6.48)
> PARALLEL 1: 42.13 s (12.75, 22.04, 7.23)
> PARALLEL 2: 39.27 s (8.93, 22.78, 7.45)
> PARALLEL 3: 36.53 s (6.76, 22.00, 7.65)
> PARALLEL 4: 35.84 s (5.85, 22.04, 7.83)
>
> Overall, I can see the parallel heap scan in lazy vacuum has a decent
> scalability; In both test-1 and test-2, the execution time of heap
> scan got ~4x faster with 4 parallel workers. On the other hand, when
> it comes to the total vacuum execution time, I could not see much
> performance improvement in test-2 (45.11 vs. 35.84). Looking at the
> results PARALLEL 0 vs. PARALLEL 1 in test-2, the heap scan got faster
> (21.89 vs. 12.75) whereas index vacuum got slower (16.74 vs. 22.04),
> and heap scan in case 2 was not as fast as in case 1 with 1 parallel
> worker (12.75 vs. 11.39).
>
> I think the reason is the shared TidStore is not very scalable since
> we have a single lock on it. In all cases in the test-1, we don't use
> the shared TidStore since all dead tuples are removed during heap
> pruning. So the scalability was better overall than in test-2. In
> parallel 0 case in test-2, we use the local TidStore, and from
> parallel degree of 1 in test-2, we use the shared TidStore and
> parallel worker concurrently update it. Also, I guess that the lookup
> performance of the local TidStore is better than the shared TidStore's
> lookup performance because of the differences between a bump context
> and an DSA area. I think that this difference contributed the fact
> that index vacuuming got slower (16.74 vs. 22.04).
>
> There are two obvious improvement ideas to improve overall vacuum
> execution time: (1) improve the shared TidStore scalability and (2)
> support parallel heap vacuum. For (1), several ideas are proposed by
> the ART authors[1]. I've not tried these ideas but it might be
> applicable to our ART implementation. But I prefer to start with (2)
> since it would be easier. Feedback is very welcome.
>

Starting with (2) sounds like a reasonable approach. We should study a
few more things, like (a) the performance results when there are 3-4
indexes, and (b) the reason for the performance improvement seen with
only heap scans. We normally get the benefits of parallelism from using
multiple CPUs, but parallelizing scans (I/O) shouldn't give much
benefit. Is it possible that you are seeing benefits because most of
the data is either in shared_buffers or in memory? We can probably try
vacuuming tables after restarting the nodes to ensure the data is not
in memory.

--
With Regards,
Amit Kapila.



RE: Parallel heap vacuum

From
"Hayato Kuroda (Fujitsu)"
Date:
Dear Sawada-san,

> The parallel vacuum we have today supports only for index vacuuming.
> Therefore, while multiple workers can work on different indexes in
> parallel, the heap table is always processed by the single process.
> I'd like to propose $subject, which enables us to have multiple
> workers running on the single heap table. This would be helpful to
> speedup vacuuming for tables without indexes or tables with
> INDEX_CLENAUP = off.

Sounds great. IIUC, vacuuming is still one of the main weak points of Postgres.

> I've attached a PoC patch for this feature. It implements only
> parallel heap scans in lazyvacum. We can extend this feature to
> support parallel heap vacuum as well in the future or in the same
> patch.

Before diving in deep, I tested your PoC but found an unclear point.
When vacuuming was requested with parallel > 0 and almost the same
workload as yours, only the first page was scanned and cleaned up.

When parallel was set to zero, I got:
```
INFO:  vacuuming "postgres.public.test"
INFO:  finished vacuuming "postgres.public.test": index scans: 0
pages: 0 removed, 2654868 remain, 2654868 scanned (100.00% of total)
tuples: 120000000 removed, 480000000 remain, 0 are dead but not yet removable
removable cutoff: 752, which was 0 XIDs old when operation ended
new relfrozenxid: 739, which is 1 XIDs ahead of previous value
frozen: 0 pages from table (0.00% of total) had 0 tuples frozen
index scan not needed: 0 pages from table (0.00% of total) had 0 dead item identifiers removed
avg read rate: 344.639 MB/s, avg write rate: 344.650 MB/s
buffer usage: 2655045 hits, 2655527 misses, 2655606 dirtied
WAL usage: 1 records, 1 full page images, 937 bytes
system usage: CPU: user: 39.45 s, system: 20.74 s, elapsed: 60.19 s
```

This means that all pages were scanned and the dead tuples were removed.
However, when parallel was set to one, I got a different result:

```
INFO:  vacuuming "postgres.public.test"
INFO:  launched 1 parallel vacuum worker for table scanning (planned: 1)
INFO:  finished vacuuming "postgres.public.test": index scans: 0
pages: 0 removed, 2654868 remain, 1 scanned (0.00% of total)
tuples: 12 removed, 0 remain, 0 are dead but not yet removable
removable cutoff: 752, which was 0 XIDs old when operation ended
frozen: 0 pages from table (0.00% of total) had 0 tuples frozen
index scan not needed: 0 pages from table (0.00% of total) had 0 dead item identifiers removed
avg read rate: 92.952 MB/s, avg write rate: 0.845 MB/s
buffer usage: 96 hits, 660 misses, 6 dirtied
WAL usage: 1 records, 1 full page images, 937 bytes
system usage: CPU: user: 0.05 s, system: 0.00 s, elapsed: 0.05 s
```

It looks like only one page was scanned and 12 tuples were removed.
This looks very strange to me...

The attached script emulates my test. IIUC it is almost the same as
yours, but the instance was restarted before vacuuming.

Can you reproduce this and see the reason? I can provide further
information if needed.

Best Regards,
Hayato Kuroda
FUJITSU LIMITED
https://www.fujitsu.com/ 


Attachment

Re: Parallel heap vacuum

From
Masahiko Sawada
Date:
On Fri, Jul 5, 2024 at 6:51 PM Hayato Kuroda (Fujitsu)
<kuroda.hayato@fujitsu.com> wrote:
>
> Dear Sawada-san,
>
> > The parallel vacuum we have today supports only for index vacuuming.
> > Therefore, while multiple workers can work on different indexes in
> > parallel, the heap table is always processed by the single process.
> > I'd like to propose $subject, which enables us to have multiple
> > workers running on the single heap table. This would be helpful to
> > speedup vacuuming for tables without indexes or tables with
> > INDEX_CLENAUP = off.
>
> Sounds great. IIUC, vacuuming is still one of the main weak point of postgres.
>
> > I've attached a PoC patch for this feature. It implements only
> > parallel heap scans in lazyvacum. We can extend this feature to
> > support parallel heap vacuum as well in the future or in the same
> > patch.
>
> Before diving into deep, I tested your PoC but found unclear point.
> When the vacuuming is requested with parallel > 0 with almost the same workload
> as yours, only the first page was scanned and cleaned up.
>
> When parallel was set to zero, I got:
> ```
> INFO:  vacuuming "postgres.public.test"
> INFO:  finished vacuuming "postgres.public.test": index scans: 0
> pages: 0 removed, 2654868 remain, 2654868 scanned (100.00% of total)
> tuples: 120000000 removed, 480000000 remain, 0 are dead but not yet removable
> removable cutoff: 752, which was 0 XIDs old when operation ended
> new relfrozenxid: 739, which is 1 XIDs ahead of previous value
> frozen: 0 pages from table (0.00% of total) had 0 tuples frozen
> index scan not needed: 0 pages from table (0.00% of total) had 0 dead item identifiers removed
> avg read rate: 344.639 MB/s, avg write rate: 344.650 MB/s
> buffer usage: 2655045 hits, 2655527 misses, 2655606 dirtied
> WAL usage: 1 records, 1 full page images, 937 bytes
> system usage: CPU: user: 39.45 s, system: 20.74 s, elapsed: 60.19 s
> ```
>
> This meant that all pages were surely scanned and dead tuples were removed.
> However, when parallel was set to one, I got another result:
>
> ```
> INFO:  vacuuming "postgres.public.test"
> INFO:  launched 1 parallel vacuum worker for table scanning (planned: 1)
> INFO:  finished vacuuming "postgres.public.test": index scans: 0
> pages: 0 removed, 2654868 remain, 1 scanned (0.00% of total)
> tuples: 12 removed, 0 remain, 0 are dead but not yet removable
> removable cutoff: 752, which was 0 XIDs old when operation ended
> frozen: 0 pages from table (0.00% of total) had 0 tuples frozen
> index scan not needed: 0 pages from table (0.00% of total) had 0 dead item identifiers removed
> avg read rate: 92.952 MB/s, avg write rate: 0.845 MB/s
> buffer usage: 96 hits, 660 misses, 6 dirtied
> WAL usage: 1 records, 1 full page images, 937 bytes
> system usage: CPU: user: 0.05 s, system: 0.00 s, elapsed: 0.05 s
> ```
>
> It looked like that only a page was scanned and 12 tuples were removed.
> It looks very strange for me...
>
> Attached script emulate my test. IIUC it was almost the same as yours, but
> the instance was restarted before vacuuming.
>
> Can you reproduce and see the reason? Based on the requirement I can
> provide further information.

Thank you for the test!

I could reproduce this issue and it's a bug; it skipped even
non-all-visible pages. I've attached a new version of the patch.

BTW, since we compute the number of parallel workers for the heap scan
based on the table size, it's possible that we launch multiple workers
even if most blocks are all-visible. It seems better to calculate it
based on (relpages - relallvisible).
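
A minimal sketch of that calculation (not what the current patch does)
could look like the following, mirroring the planner's heuristic of
adding a worker for every 3x increase beyond min_parallel_table_scan_size:

```
/* Sketch: choose the worker count from the pages we actually need to scan */
static int
heap_parallel_vacuum_compute_workers(Relation rel, int requested)
{
    BlockNumber relpages = rel->rd_rel->relpages;
    BlockNumber allvisible = rel->rd_rel->relallvisible;
    BlockNumber pages_to_scan = (relpages > allvisible) ? relpages - allvisible : 0;
    int         parallel_workers = 0;
    int         pages_per_worker = min_parallel_table_scan_size;

    if (requested > 0)
        parallel_workers = requested;
    else if (pages_to_scan >= (BlockNumber) pages_per_worker)
    {
        parallel_workers = 1;
        while (pages_to_scan >= 3 * (BlockNumber) pages_per_worker)
        {
            parallel_workers++;
            pages_per_worker *= 3;
        }
    }

    return Min(parallel_workers, max_parallel_maintenance_workers);
}
```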

Regards,


--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com

Attachment

Re: Parallel heap vacuum

From
Masahiko Sawada
Date:
On Fri, Jun 28, 2024 at 9:06 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Fri, Jun 28, 2024 at 9:44 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> >
> > # Benchmark results
> >
> > * Test-1: parallel heap scan on the table without indexes
> >
> > I created 20GB table, made garbage on the table, and run vacuum while
> > changing parallel degree:
> >
> > create unlogged table test (a int) with (autovacuum_enabled = off);
> > insert into test select generate_series(1, 600000000); --- 20GB table
> > delete from test where a % 5 = 0;
> > vacuum (verbose, parallel 0) test;
> >
> > Here are the results (total time and heap scan time):
> >
> > PARALLEL 0: 21.99 s (single process)
> > PARALLEL 1: 11.39 s
> > PARALLEL 2:   8.36 s
> > PARALLEL 3:   6.14 s
> > PARALLEL 4:   5.08 s
> >
> > * Test-2: parallel heap scan on the table with one index
> >
> > I used a similar table to the test case 1 but created one btree index on it:
> >
> > create unlogged table test (a int) with (autovacuum_enabled = off);
> > insert into test select generate_series(1, 600000000); --- 20GB table
> > create index on test (a);
> > delete from test where a % 5 = 0;
> > vacuum (verbose, parallel 0) test;
> >
> > I've measured the total execution time as well as the time of each
> > vacuum phase (from left heap scan time, index vacuum time, and heap
> > vacuum time):
> >
> > PARALLEL 0: 45.11 s (21.89, 16.74, 6.48)
> > PARALLEL 1: 42.13 s (12.75, 22.04, 7.23)
> > PARALLEL 2: 39.27 s (8.93, 22.78, 7.45)
> > PARALLEL 3: 36.53 s (6.76, 22.00, 7.65)
> > PARALLEL 4: 35.84 s (5.85, 22.04, 7.83)
> >
> > Overall, I can see the parallel heap scan in lazy vacuum has a decent
> > scalability; In both test-1 and test-2, the execution time of heap
> > scan got ~4x faster with 4 parallel workers. On the other hand, when
> > it comes to the total vacuum execution time, I could not see much
> > performance improvement in test-2 (45.11 vs. 35.84). Looking at the
> > results PARALLEL 0 vs. PARALLEL 1 in test-2, the heap scan got faster
> > (21.89 vs. 12.75) whereas index vacuum got slower (16.74 vs. 22.04),
> > and heap scan in case 2 was not as fast as in case 1 with 1 parallel
> > worker (12.75 vs. 11.39).
> >
> > I think the reason is the shared TidStore is not very scalable since
> > we have a single lock on it. In all cases in the test-1, we don't use
> > the shared TidStore since all dead tuples are removed during heap
> > pruning. So the scalability was better overall than in test-2. In
> > parallel 0 case in test-2, we use the local TidStore, and from
> > parallel degree of 1 in test-2, we use the shared TidStore and
> > parallel worker concurrently update it. Also, I guess that the lookup
> > performance of the local TidStore is better than the shared TidStore's
> > lookup performance because of the differences between a bump context
> > and an DSA area. I think that this difference contributed the fact
> > that index vacuuming got slower (16.74 vs. 22.04).
> >

Thank you for the comments!

> > There are two obvious improvement ideas to improve overall vacuum
> > execution time: (1) improve the shared TidStore scalability and (2)
> > support parallel heap vacuum. For (1), several ideas are proposed by
> > the ART authors[1]. I've not tried these ideas but it might be
> > applicable to our ART implementation. But I prefer to start with (2)
> > since it would be easier. Feedback is very welcome.
> >
>
> Starting with (2) sounds like a reasonable approach. We should study a
> few more things like (a) the performance results where there are 3-4
> indexes,

Here are the results with 4 indexes (and restarting the server before
the benchmark):

PARALLEL 0: 115.48 s (32.76, 64.46, 18.24)
PARALLEL 1:  74.88 s (17.11, 44.43, 13.25)
PARALLEL 2:  71.15 s (14.13, 44.82, 12.12)
PARALLEL 3:  46.78 s (10.74, 24.50, 11.43)
PARALLEL 4:  46.42 s (8.95, 24.96, 12.39) (launched 4 workers for heap
scan and 3 workers for index vacuum)

> (b) What is the reason for performance improvement seen with
> only heap scans. We normally get benefits of parallelism because of
> using multiple CPUs but parallelizing scans (I/O) shouldn't give much
> benefits. Is it possible that you are seeing benefits because most of
> the data is either in shared_buffers or in memory? We can probably try
> vacuuming tables by restarting the nodes to ensure the data is not in
> memory.

I think it depends on the storage performance. FYI I use an EC2
instance (m6id.metal).

I've run the same benchmark script (table with no index), restarting
the server before executing the vacuum, and here are the results:

PARALLEL 0:  32.75 s
PARALLEL 1:  17.46 s
PARALLEL 2:  13.41 s
PARALLEL 3:  10.31 s
PARALLEL 4:    8.48 s

With the above two tests, I used the updated patch that I just submitted[1].

Regards,

[1] https://www.postgresql.org/message-id/CAD21AoAWHHnCg9OvtoEJnnvCc-3isyOyAGn%2B2KYoSXEv%3DvXauw%40mail.gmail.com

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com



RE: Parallel heap vacuum

From
"Hayato Kuroda (Fujitsu)"
Date:
Dear Sawada-san,

> Thank you for the test!
> 
> I could reproduce this issue and it's a bug; it skipped even
> non-all-visible pages. I've attached the new version patch.
> 
> BTW since we compute the number of parallel workers for the heap scan
> based on the table size, it's possible that we launch multiple workers
> even if most blocks are all-visible. It seems to be better if we
> calculate it based on (relpages - relallvisible).

Thanks for updating the patch. I applied it and confirmed that all
pages are scanned. I used almost the same script (just changed
max_parallel_maintenance_workers) and got the results below. I think
the tendency is the same as yours.

```
parallel 0: 61114.369 ms
parallel 1: 34870.316 ms
parallel 2: 23598.115 ms
parallel 3: 17732.363 ms
parallel 4: 15203.271 ms
parallel 5: 13836.025 ms
```

I've started to read your code, but it is taking time because I've not
seen this area before...
The part below contains my initial comments.

1.
This patch cannot be built when debug mode is enabled. See [1].
IIUC, this is because NewRelminMxid was moved from struct LVRelState to PHVShared,
so it should be updated to something like "vacrel->counters->NewRelminMxid".

2.
I compared this with parallel heap scan and found that it does not have a compute_worker API.
Can you clarify the reason for this inconsistency?
(I feel it is intentional because the calculation logic seems to depend on the heap structure,
so should we add the API for table scan as well?)

[1]:
```
vacuumlazy.c: In function ‘lazy_scan_prune’:
vacuumlazy.c:1666:34: error: ‘LVRelState’ {aka ‘struct LVRelState’} has no member named ‘NewRelminMxid’
  Assert(MultiXactIdIsValid(vacrel->NewRelminMxid));
                                  ^~
....
```

Best regards,
Hayato Kuroda
FUJITSU LIMITED


Re: Parallel heap vacuum

From
Masahiko Sawada
Date:
On Thu, Jul 25, 2024 at 2:58 AM Hayato Kuroda (Fujitsu)
<kuroda.hayato@fujitsu.com> wrote:
>
> Dear Sawada-san,
>
> > Thank you for the test!
> >
> > I could reproduce this issue and it's a bug; it skipped even
> > non-all-visible pages. I've attached the new version patch.
> >
> > BTW since we compute the number of parallel workers for the heap scan
> > based on the table size, it's possible that we launch multiple workers
> > even if most blocks are all-visible. It seems to be better if we
> > calculate it based on (relpages - relallvisible).
>
> Thanks for updating the patch. I applied and confirmed all pages are scanned.
> I used almost the same script (just changed max_parallel_maintenance_workers)
> and got below result. I think the tendency was the same as yours.
>
> ```
> parallel 0: 61114.369 ms
> parallel 1: 34870.316 ms
> parallel 2: 23598.115 ms
> parallel 3: 17732.363 ms
> parallel 4: 15203.271 ms
> parallel 5: 13836.025 ms
> ```

Thank you for testing!

>
> I started to read your codes but takes much time because I've never seen before...
> Below part contains initial comments.
>
> 1.
> This patch cannot be built when debug mode is enabled. See [1].
> IIUC, this was because NewRelminMxid was moved from struct LVRelState to PHVShared.
> So you should update like " vacrel->counters->NewRelminMxid".

Right, will fix.

> 2.
> I compared parallel heap scan and found that it does not have compute_worker API.
> Can you clarify the reason why there is an inconsistency?
> (I feel it is intentional because the calculation logic seems to depend on the heap structure,
> so should we add the API for table scan as well?)

There is room to consider a better API design, but yes, the reason is
that the calculation logic depends on the table AM implementation. For
example, I thought it might make sense to take the number of
all-visible pages into account when calculating the number of parallel
workers, as we don't want to launch many workers on a table where most
pages are all-visible. That might not work for other table AMs.

I'm updating the patch to implement parallel heap vacuum and will
share the updated patch. It might take time as it requires implementing
shared iteration support in the radix tree.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com



RE: Parallel heap vacuum

From
"Hayato Kuroda (Fujitsu)"
Date:
Dear Sawada-san,

> Thank you for testing!

I tried profiling the vacuuming with a larger case (40 workers for the 20GB table),
and the attached FlameGraph shows the result. IIUC, I cannot find any bottlenecks.

> > 2.
> > I compared parallel heap scan and found that it does not have compute_worker
> API.
> > Can you clarify the reason why there is an inconsistency?
> > (I feel it is intentional because the calculation logic seems to depend on the
> heap structure,
> > so should we add the API for table scan as well?)
> 
> There is room to consider a better API design, but yes, the reason is
> that the calculation logic depends on table AM implementation. For
> example, I thought it might make sense to consider taking the number
> of all-visible pages into account for the calculation of the number of
> parallel workers as we don't want to launch many workers on the table
> where most pages are all-visible. Which might not work for other table
> AMs.
>

Okay, thanks for confirming. I wanted to ask others as well.


> I'm updating the patch to implement parallel heap vacuum and will
> share the updated patch. It might take time as it requires to
> implement shared iteration support in radx tree.

Here are some other preliminary comments for the v2 patch. They do not
include cosmetic ones.

1.
The shared data structure PHVShared does not contain a mutex lock. Is this intentional
because the fields are accessed by the leader only after the parallel workers exit?

2.
Per my understanding, the vacuuming goes through the steps below.

a. parallel workers are launched for scanning pages
b. the leader waits until the scans are done
c. the leader does the vacuum alone (you may extend here...)
d. parallel workers are launched again to clean up indexes

If so, can we reuse the parallel workers for the cleanup? Or is this
more painful engineering than the benefit is worth?

3.
According to LaunchParallelWorkers(), the bgw_name and bgw_type are hardcoded as
"parallel worker ...". Can we extend this to improve trackability in
pg_stat_activity?

4.
I'm not an expert on TidStore, but as you said, TidStoreLockExclusive() might be a
bottleneck when TIDs are added to the shared TidStore. Another primitive idea of mine
is to prepare per-worker TidStores (in the PHVScanWorkerState or LVRelCounters?)
and gather them after the heap scanning. If you extend the patch so that parallel
workers do the vacuuming, the gathering may not be needed: each worker can access
its own TidStore and clean up. One downside is that the memory consumption may be
quite large.


What do you think?

Best regards,
Hayato Kuroda
FUJITSU LIMITED 

Attachment

Re: Parallel heap vacuum

From
Masahiko Sawada
Date:
Sorry for the very late reply.

On Tue, Jul 30, 2024 at 8:54 PM Hayato Kuroda (Fujitsu)
<kuroda.hayato@fujitsu.com> wrote:
>
> Dear Sawada-san,
>
> > Thank you for testing!
>
> I tried to profile the vacuuming with the larger case (40 workers for the 20G table)
> and attached FlameGraph showed the result. IIUC, I cannot find bottlenecks.
>
> > > 2.
> > > I compared parallel heap scan and found that it does not have compute_worker
> > API.
> > > Can you clarify the reason why there is an inconsistency?
> > > (I feel it is intentional because the calculation logic seems to depend on the
> > heap structure,
> > > so should we add the API for table scan as well?)
> >
> > There is room to consider a better API design, but yes, the reason is
> > that the calculation logic depends on table AM implementation. For
> > example, I thought it might make sense to consider taking the number
> > of all-visible pages into account for the calculation of the number of
> > parallel workers as we don't want to launch many workers on the table
> > where most pages are all-visible. Which might not work for other table
> > AMs.
> >
>
> Okay, thanks for confirming. I wanted to ask others as well.
>
>
> > I'm updating the patch to implement parallel heap vacuum and will
> > share the updated patch. It might take time as it requires to
> > implement shared iteration support in radx tree.
>
> Here are other preliminary comments for v2 patch. This does not contain
> cosmetic ones.
>
> 1.
> Shared data structure PHVShared does not contain the mutex lock. Is it intentional
> because they are accessed by leader only after parallel workers exit?
>

Yes, the fields in PHVShared are read-only for workers. Since no
concurrent reads/writes happen on these fields we don't need to
protect them.

> 2.
> Per my understanding, the vacuuming goes like below steps.
>
> a. paralell workers are launched for scanning pages
> b. leader waits until scans are done
> c. leader does vacuum alone (you may extend here...)
> d. parallel workers are launched again to cleanup indeces
>
> If so, can we reuse parallel workers for the cleanup? Or, this is painful
> engineering than the benefit?

I've not thought of this idea, but I think it's possible from a
technical perspective. It saves the overhead of relaunching workers,
but I'm not sure how much it would help performance, and I'm concerned
it would make the code complex. For example, different numbers of
workers might be required for table vacuuming and index vacuuming, so
we would end up increasing or decreasing the number of workers.

> 3.
> According to LaunchParallelWorkers(), the bgw_name and bgw_type are hardcoded as
> "parallel worker ..." Can we extend this to improve the trackability in the
> pg_stat_activity?

It would be a good improvement for better trackability. But I think we
should do that in a separate patch as it's not just a problem for
parallel heap vacuum.

>
> 4.
> I'm not the expert TidStore, but as you said TidStoreLockExclusive() might be a
> bottleneck when tid is added to the shared TidStore. My another primitive idea
> is that to prepare per-worker TidStores (in the PHVScanWorkerState or LVRelCounters?)
> and gather after the heap scanning. If you extend like parallel workers do vacuuming,
> the gathering may not be needed: they can access own TidStore and clean up.
> One downside is that the memory consumption may be quite large.
>

Interesting idea. Supposing we support parallel heap vacuum as well,
we wouldn't need locks or shared-iteration support on TidStore. I think
each worker should use a fraction of maintenance_work_mem. However, one
downside would be that we need to check as many TidStores as there are
workers during index vacuuming.
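
For illustration, the per-TID check used while vacuuming an index would
then turn from a single TidStoreIsMember() call into a loop over every
worker's store, something like this (a sketch, not code from the patch):

```
/* Sketch: dead-TID lookup when each worker keeps its own TidStore */
static bool
tid_is_dead(TidStore **worker_dead_items, int nstores, ItemPointer itemptr)
{
    for (int i = 0; i < nstores; i++)
    {
        if (TidStoreIsMember(worker_dead_items[i], itemptr))
            return true;
    }

    return false;
}
```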

FYI I've implemented the parallel heap vacuum part and am doing some
benchmark tests. I'll share the updated patches along with test
results this week.

Regards,


--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com



RE: Parallel heap vacuum

From
"Hayato Kuroda (Fujitsu)"
Date:
Dear Sawada-san,

> 
> I've attached new version patches that fixes failures reported by
> cfbot. I hope these changes make cfbot happy.

Thanks for updating the patch, and sorry for the delayed reply. I confirmed that cfbot
for Linux/Windows says OK.
I'm still learning the feature, so I can post only one comment :-(.

I wanted to know whether TidStoreBeginIterateShared() is needed. IIUC, the pre-existing API,
TidStoreBeginIterate(), already accepts a shared TidStore. The only difference
is whether the elog(ERROR) exists, but I wonder if it benefits others. Is there another
reason that lazy_vacuum_heap_rel() uses TidStoreBeginIterateShared()?

Another approach is to restrict TidStoreBeginIterate() to support only
local TidStores. What do you think?

Best regards,
Hayato Kuroda
FUJITSU LIMITED


Re: Parallel heap vacuum

From
Masahiko Sawada
Date:
On Mon, Nov 11, 2024 at 5:08 AM Hayato Kuroda (Fujitsu)
<kuroda.hayato@fujitsu.com> wrote:
>
> Dear Sawda-san,
>
> >
> > I've attached new version patches that fixes failures reported by
> > cfbot. I hope these changes make cfbot happy.
>
> Thanks for updating the patch and sorry for delaying the reply. I confirmed cfbot
> for Linux/Windows said ok.
> I'm still learning the feature so I can post only one comment :-(.
>
> I wanted to know whether TidStoreBeginIterateShared() was needed. IIUC, pre-existing API,
> TidStoreBeginIterate(), has already accepted the shared TidStore. The only difference
> is whether elog(ERROR) exists, but I wonder if it benefits others. Is there another
> reason that lazy_vacuum_heap_rel() uses TidStoreBeginIterateShared()?

TidStoreBeginIterateShared() is designed for multiple parallel workers
to iterate a shared TidStore. During an iteration, parallel workers
share the iteration state and iterate the underlying radix tree while
taking appropriate locks. Therefore, it's available only for a shared
TidStore. This is required to implement the parallel heap vacuum,
where multiple parallel workers do the iteration on the shared
TidStore.

On the other hand, TidStoreBeginIterate() is designed for a single
process to iterate a TidStore. It accepts even a shared TidStore as
you mentioned, but during an iteration there is no inter-process
coordination such as locking. When it comes to parallel vacuum,
supporting TidStoreBeginIterate() on a shared TidStore is necessary to
cover the case where we use only parallel index vacuum but not
parallel heap scan/vacuum. In this case, we need to store dead tuple
TIDs on the shared TidStore during heap scan so parallel workers can
use it during index vacuum. But it's not necessary to use
TidStoreBeginIterateShared() because only one (leader) process does
heap vacuum.
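
To put it in code, the existing single-process pattern (which works for
both local and shared TidStores) is roughly the following;
TidStoreBeginIterateShared() adds the cross-worker coordination on top
of this:

```
/* Single-process iteration over dead items, as lazy_vacuum_heap_rel()
 * does today (simplified) */
TidStoreIter *iter = TidStoreBeginIterate(dead_items);
TidStoreIterResult *result;

while ((result = TidStoreIterateNext(iter)) != NULL)
{
    /* vacuum the heap page result->blkno using its dead item offsets */
}

TidStoreEndIterate(iter);
```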

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com



Re: Parallel heap vacuum

From
vignesh C
Date:
On Wed, 30 Oct 2024 at 22:48, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
>
> I've attached new version patches that fixes failures reported by
> cfbot. I hope these changes make cfbot happy.
>

I just started reviewing the patch and found the following comments
while going through the patch:
1) I felt we should add some documentation for this at [1].

2) Can we add some tests in vacuum_parallel with tables having no
indexes and having dead tuples.

3) This should be included in typedefs.list:
3.a)
+/*
+ * Relation statistics collected during heap scanning and need to be shared
+ * among parallel vacuum workers.
+ */
+typedef struct LVRelScanStats
+{
+       BlockNumber scanned_pages;      /* # pages examined (not skipped via VM) */
+       BlockNumber removed_pages;      /* # pages removed by relation truncation */
+       BlockNumber frozen_pages;       /* # pages with newly frozen tuples */

3.b) Similarly this too:
+/*
+ * Struct for information that need to be shared among parallel vacuum workers
+ */
+typedef struct PHVShared
+{
+       bool            aggressive;
+       bool            skipwithvm;
+

3.c) Similarly this too:
+/* Per-worker scan state for parallel heap vacuum scan */
+typedef struct PHVScanWorkerState
+{
+       bool            initialized;

3.d)  Similarly this too:
+/* Struct for parallel heap vacuum */
+typedef struct PHVState
+{
+       /* Parallel scan description shared among parallel workers */

4) Since we are initializing almost all the members of structure,
should we use palloc0 in this case:
+       scan_stats = palloc(sizeof(LVRelScanStats));
+       scan_stats->scanned_pages = 0;
+       scan_stats->removed_pages = 0;
+       scan_stats->frozen_pages = 0;
+       scan_stats->lpdead_item_pages = 0;
+       scan_stats->missed_dead_pages = 0;
+       scan_stats->nonempty_pages = 0;
+
+       /* Initialize remaining counters (be tidy) */
+       scan_stats->tuples_deleted = 0;
+       scan_stats->tuples_frozen = 0;
+       scan_stats->lpdead_items = 0;
+       scan_stats->live_tuples = 0;
+       scan_stats->recently_dead_tuples = 0;
+       scan_stats->missed_dead_tuples = 0;

5) Typo: "paralle" should be "parallel":
+/*
+ * Return the number of parallel workers for a parallel vacuum scan of this
+ * relation.
+ */
+static inline int
+table_paralle_vacuum_compute_workers(Relation rel, int requested)
+{
+       return rel->rd_tableam->parallel_vacuum_compute_workers(rel, requested);
+}

[1] - https://www.postgresql.org/docs/devel/sql-vacuum.html

Regards,
Vignesh



RE: Parallel heap vacuum

From
"Hayato Kuroda (Fujitsu)"
Date:
Dear Sawada-san,

> TidStoreBeginIterateShared() is designed for multiple parallel workers
> to iterate a shared TidStore. During an iteration, parallel workers
> share the iteration state and iterate the underlying radix tree while
> taking appropriate locks. Therefore, it's available only for a shared
> TidStore. This is required to implement the parallel heap vacuum,
> where multiple parallel workers do the iteration on the shared
> TidStore.
> 
> On the other hand, TidStoreBeginIterate() is designed for a single
> process to iterate a TidStore. It accepts even a shared TidStore as
> you mentioned, but during an iteration there is no inter-process
> coordination such as locking. When it comes to parallel vacuum,
> supporting TidStoreBeginIterate() on a shared TidStore is necessary to
> cover the case where we use only parallel index vacuum but not
> parallel heap scan/vacuum. In this case, we need to store dead tuple
> TIDs on the shared TidStore during heap scan so parallel workers can
> use it during index vacuum. But it's not necessary to use
> TidStoreBeginIterateShared() because only one (leader) process does
> heap vacuum.

Okay, thanks for the explanation. I think it is OK to keep it.

I read 0001 again and here are comments.

01. vacuumlazy.c
```
+#define LV_PARALLEL_SCAN_SHARED         0xFFFF0001
+#define LV_PARALLEL_SCAN_DESC           0xFFFF0002
+#define LV_PARALLEL_SCAN_DESC_WORKER    0xFFFF0003
```

I checked other DSM keys used for parallel operations, and they seem to have names
like PARALLEL_KEY_XXX. Can we follow that convention?

02. LVRelState
```
+    BlockNumber next_fsm_block_to_vacuum;
```

Only this attribute does not have a comment. Can we add something like
"next freespace map page to be checked"?

03. parallel_heap_vacuum_gather_scan_stats
```
+        vacrel->scan_stats->vacuumed_pages += ss->vacuumed_pages;
```

Note that `scan_stats->vacuumed_pages` does not exist in 0001; it is defined
in 0004. Can you move it?

04. heap_parallel_vacuum_estimate
```
+
+    heap_parallel_estimate_shared_memory_size(rel, nworkers, &pscan_len,
+                                              &shared_len, &pscanwork_len);
+
+    /* space for PHVShared */
+    shm_toc_estimate_chunk(&pcxt->estimator, shared_len);
+    shm_toc_estimate_keys(&pcxt->estimator, 1);
+
+    /* space for ParallelBlockTableScanDesc */
+    pscan_len = table_block_parallelscan_estimate(rel);
+    shm_toc_estimate_chunk(&pcxt->estimator, pscan_len);
+    shm_toc_estimate_keys(&pcxt->estimator, 1);
+
+    /* space for per-worker scan state, PHVScanWorkerState */
+    pscanwork_len = mul_size(sizeof(PHVScanWorkerState), nworkers);
+    shm_toc_estimate_chunk(&pcxt->estimator, pscanwork_len);
+    shm_toc_estimate_keys(&pcxt->estimator, 1);
```

I feel pscan_len and pscanwork_len are already calculated in heap_parallel_estimate_shared_memory_size().
Can we remove the table_block_parallelscan_estimate() and mul_size() calls from here?

05. Idea

Can you update the documentation?

06. Idea

AFAICS pg_stat_progress_vacuum does not contain information related to
parallel execution. What do you think about adding an attribute which shows a list of PIDs?
I'm not sure it is helpful for users, but it could show the parallelism.

Best regards,
Hayato Kuroda
FUJITSU LIMITED

Re: Parallel heap vacuum

From
Masahiko Sawada
Date:
On Wed, Nov 13, 2024 at 3:10 AM Hayato Kuroda (Fujitsu)
<kuroda.hayato@fujitsu.com> wrote:
>
> Dear Sawada-san,
>
> > TidStoreBeginIterateShared() is designed for multiple parallel workers
> > to iterate a shared TidStore. During an iteration, parallel workers
> > share the iteration state and iterate the underlying radix tree while
> > taking appropriate locks. Therefore, it's available only for a shared
> > TidStore. This is required to implement the parallel heap vacuum,
> > where multiple parallel workers do the iteration on the shared
> > TidStore.
> >
> > On the other hand, TidStoreBeginIterate() is designed for a single
> > process to iterate a TidStore. It accepts even a shared TidStore as
> > you mentioned, but during an iteration there is no inter-process
> > coordination such as locking. When it comes to parallel vacuum,
> > supporting TidStoreBeginIterate() on a shared TidStore is necessary to
> > cover the case where we use only parallel index vacuum but not
> > parallel heap scan/vacuum. In this case, we need to store dead tuple
> > TIDs on the shared TidStore during heap scan so parallel workers can
> > use it during index vacuum. But it's not necessary to use
> > TidStoreBeginIterateShared() because only one (leader) process does
> > heap vacuum.
>
> Okay, thanks for the description. I felt it is OK to keep.
>
> I read 0001 again and here are comments.

Thank you for the review comments!

>
> 01. vacuumlazy.c
> ```
> +#define LV_PARALLEL_SCAN_SHARED         0xFFFF0001
> +#define LV_PARALLEL_SCAN_DESC           0xFFFF0002
> +#define LV_PARALLEL_SCAN_DESC_WORKER    0xFFFF0003
> ```
>
> I checked other DMS keys used for parallel work, and they seems to have name
> like PARALEL_KEY_XXX. Can we follow it?

Yes. How about LV_PARALLEL_KEY_XXX?

>
> 02. LVRelState
> ```
> +    BlockNumber next_fsm_block_to_vacuum;
> ```
>
> Only the attribute does not have comments Can we add like:
> "Next freespace map page to be checked"?

Agreed. I'll add a comment: "next block to check for FSM vacuum".

>
> 03. parallel_heap_vacuum_gather_scan_stats
> ```
> +        vacrel->scan_stats->vacuumed_pages += ss->vacuumed_pages;
> ```
>
> Note that `scan_stats->vacuumed_pages` does not exist in 0001, it is defined
> in 0004. Can you move it?

Will remove.

>
> 04. heap_parallel_vacuum_estimate
> ```
> +
> +    heap_parallel_estimate_shared_memory_size(rel, nworkers, &pscan_len,
> +                                              &shared_len, &pscanwork_len);
> +
> +    /* space for PHVShared */
> +    shm_toc_estimate_chunk(&pcxt->estimator, shared_len);
> +    shm_toc_estimate_keys(&pcxt->estimator, 1);
> +
> +    /* space for ParallelBlockTableScanDesc */
> +    pscan_len = table_block_parallelscan_estimate(rel);
> +    shm_toc_estimate_chunk(&pcxt->estimator, pscan_len);
> +    shm_toc_estimate_keys(&pcxt->estimator, 1);
> +
> +    /* space for per-worker scan state, PHVScanWorkerState */
> +    pscanwork_len = mul_size(sizeof(PHVScanWorkerState), nworkers);
> +    shm_toc_estimate_chunk(&pcxt->estimator, pscanwork_len);
> +    shm_toc_estimate_keys(&pcxt->estimator, 1);
> ```
>
> I feel pscan_len and pscanwork_len are calclated in heap_parallel_estimate_shared_memory_size().
> Can we remove table_block_parallelscan_estimate() and mul_size() from here?

Yes, it's an oversight. Will remove.

>
> 05. Idea
>
> Can you update documentations?

Will update the doc as well.

>
> 06. Idea
>
> AFAICS pg_stat_progress_vacuum does not contain information related with the
> parallel execution. How do you think adding an attribute which shows a list of pids?
> Not sure it is helpful for users but it can show the parallelism.

I think it's possible to show the parallelism even today (for parallel
index vacuuming):

=# select leader.pid, leader.query, array_agg(worker.pid) from
pg_stat_activity as leader, pg_stat_activity as worker,
pg_stat_progress_vacuum as v where leader.pid = worker.leader_pid and
leader.pid = v.pid group by 1, 2;
   pid   |        query        |     array_agg
---------+---------------------+-------------------
 2952103 | vacuum (verbose) t; | {2952257,2952258}
(1 row)

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com



Re: Parallel heap vacuum

From
Masahiko Sawada
Date:
On Tue, Nov 12, 2024 at 3:21 AM vignesh C <vignesh21@gmail.com> wrote:
>
> On Wed, 30 Oct 2024 at 22:48, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> >
> >
> > I've attached new version patches that fixes failures reported by
> > cfbot. I hope these changes make cfbot happy.
> >
>
> I just started reviewing the patch and found the following comments
> while going through the patch:
> 1) I felt we should add some documentation for this at [1].
>
> 2) Can we add some tests in vacuum_parallel with tables having no
> indexes and having dead tuples.
>
> 3) This should be included in typedefs.list:
> 3.a)
> +/*
> + * Relation statistics collected during heap scanning and need to be
> shared among
> + * parallel vacuum workers.
> + */
> +typedef struct LVRelScanStats
> +{
> +       BlockNumber scanned_pages;      /* # pages examined (not
> skipped via VM) */
> +       BlockNumber removed_pages;      /* # pages removed by relation
> truncation */
> +       BlockNumber frozen_pages;       /* # pages with newly frozen tuples */
>
> 3.b) Similarly this too:
> +/*
> + * Struct for information that need to be shared among parallel vacuum workers
> + */
> +typedef struct PHVShared
> +{
> +       bool            aggressive;
> +       bool            skipwithvm;
> +
>
> 3.c) Similarly this too:
> +/* Per-worker scan state for parallel heap vacuum scan */
> +typedef struct PHVScanWorkerState
> +{
> +       bool            initialized;
>
> 3.d)  Similarly this too:
> +/* Struct for parallel heap vacuum */
> +typedef struct PHVState
> +{
> +       /* Parallel scan description shared among parallel workers */
>
> 4) Since we are initializing almost all the members of structure,
> should we use palloc0 in this case:
> +       scan_stats = palloc(sizeof(LVRelScanStats));
> +       scan_stats->scanned_pages = 0;
> +       scan_stats->removed_pages = 0;
> +       scan_stats->frozen_pages = 0;
> +       scan_stats->lpdead_item_pages = 0;
> +       scan_stats->missed_dead_pages = 0;
> +       scan_stats->nonempty_pages = 0;
> +
> +       /* Initialize remaining counters (be tidy) */
> +       scan_stats->tuples_deleted = 0;
> +       scan_stats->tuples_frozen = 0;
> +       scan_stats->lpdead_items = 0;
> +       scan_stats->live_tuples = 0;
> +       scan_stats->recently_dead_tuples = 0;
> +       scan_stats->missed_dead_tuples = 0;
>
> 5) typo paralle should be parallel
> +/*
> + * Return the number of parallel workers for a parallel vacuum scan of this
> + * relation.
> + */
> +static inline int
> +table_paralle_vacuum_compute_workers(Relation rel, int requested)
> +{
> +       return rel->rd_tableam->parallel_vacuum_compute_workers(rel, requested);
> +}
>

Thank you for the comments! I'll address them in the next version of
the patch.

BTW, while updating the patch, I found that we might want to launch
different numbers of workers for scanning the heap and vacuuming the
heap. The number of parallel workers is determined based on the number
of blocks in the table. However, even if this number is high, it could
happen that we want to launch fewer workers to vacuum heap pages when
not many pages have garbage. And the number of workers for vacuuming
the heap could vary on each vacuum pass. I'm considering implementing
it.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com