Thread: Parallel heap vacuum

Parallel heap vacuum

From
Masahiko Sawada
Date:
Hi all,

The parallel vacuum we have today supports only index vacuuming.
Therefore, while multiple workers can work on different indexes in
parallel, the heap table is always processed by a single process.
I'd like to propose $subject, which enables us to have multiple
workers running on a single heap table. This would be helpful to
speed up vacuuming for tables without indexes or tables with
INDEX_CLEANUP = off.

I've attached a PoC patch for this feature. It implements only
parallel heap scans in lazy vacuum. We can extend this feature to
support parallel heap vacuum as well, either in the future or in the
same patch.

# Overall idea (for parallel heap scan in lazy vacuum)

At the beginning of vacuum, we determine how many workers to launch
based on the table size, like other parallel query operations. The
number of workers is capped by max_parallel_maintenance_workers. Once
we decide to use parallel heap scan, we prepare DSM to share data
among the parallel workers and the leader. The shared information
includes at least the vacuum options such as aggressive, the counters
collected during lazy vacuum such as scanned_pages, the vacuum cutoffs
such as VacuumCutoffs and GlobalVisState, and the parallel scan
description.
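
For illustration, here is a rough sketch of the kind of struct that could
live in the DSM segment (layout and name are approximate; see PHVShared
and friends in the patch for the real definitions):

```
/*
 * Rough sketch only (assumes the usual PostgreSQL headers); see PHVShared
 * in the patch for the real layout.
 */
typedef struct ParallelHeapVacuumSharedSketch
{
    /* vacuum options fixed by the leader */
    bool        aggressive;
    bool        skipwithvm;

    /* cutoffs computed once by the leader, read-only for workers */
    struct VacuumCutoffs cutoffs;

    /*
     * Counters such as scanned_pages are kept per worker (not shown here)
     * and summed up by the leader once the workers have exited.
     */

    /* the ParallelBlockTableScanDesc and per-worker scan states follow */
} ParallelHeapVacuumSharedSketch;
```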

Before starting the heap scan in lazy vacuum, we launch parallel
workers, and then each worker (and the leader) processes different
blocks. Each worker does HOT-pruning on pages and collects dead tuple
TIDs. When adding dead tuple TIDs, workers need to hold an exclusive
lock on the TidStore. At the end of the heap scan phase, the workers
exit and the leader waits for all of them to exit. After that, the
leader process gathers the counters collected by the parallel workers
and computes the oldest relfrozenxid (and relminmxid). Then, if
parallel index vacuum is also enabled, we launch other parallel
workers for parallel index vacuuming.
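
A condensed sketch of that leader-side flow, using the existing parallel
infrastructure (LaunchParallelWorkers / WaitForParallelWorkersToFinish);
the two helper functions are hypothetical stand-ins for what the patch
actually does:

```
/* Leader-side flow for the parallel heap scan phase (error handling omitted) */
LaunchParallelWorkers(pcxt);

/* The leader participates in the scan as well */
do_parallel_lazy_scan_heap(vacrel);         /* hypothetical helper */

/* Workers exit at the end of the scan phase; wait for them */
WaitForParallelWorkersToFinish(pcxt);

/* Fold worker counters into the leader's state, then compute the oldest
 * relfrozenxid/relminmxid */
gather_worker_scan_counters(vacrel);        /* hypothetical helper */
```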

When it comes to parallel heap scan in lazy vacuum, I think we can use
the table_block_parallelscan_XXX() family. One tricky thing we need to
deal with is that if the TidStore memory usage reaches the limit, we
stop the parallel scan, do index vacuum and table vacuum, and then
resume the parallel scan from the previous state. In order to do that,
in the patch, we store ParallelBlockTableScanWorker, the per-worker
parallel scan state, in DSM so that different parallel workers can
resume the scan using the same parallel scan state.
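
Roughly, a worker's scan loop could look like the sketch below, built on
the existing table_block_parallelscan_* functions; the suspend/resume
handling around the per-worker state in DSM is the part the patch adds,
and the fullness check is a hypothetical placeholder:

```
#include "access/relscan.h"
#include "access/tableam.h"

/*
 * Sketch of a worker's block loop.  ParallelBlockTableScanWorkerData lives
 * in DSM so the scan can be suspended (e.g. when the TidStore is full) and
 * later resumed with the same state, possibly by the leader.
 */
static void
parallel_lazy_scan_worker(Relation rel,
                          ParallelBlockTableScanDesc pbscan,
                          ParallelBlockTableScanWorkerData *workstate,
                          bool resuming)
{
    BlockNumber blkno;

    if (!resuming)
        table_block_parallelscan_startblock_init(rel, workstate, pbscan);

    while ((blkno = table_block_parallelscan_nextpage(rel, workstate,
                                                      pbscan)) != InvalidBlockNumber)
    {
        /* HOT-prune the page and collect its dead item pointers here */

        if (dead_items_reached_limit())     /* hypothetical predicate */
            break;                          /* leave workstate in DSM and suspend */
    }
}
```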

In addition to that, since we could end up launching fewer workers
than requested, it could happen that some ParallelBlockTableScanWorker
data is used once and then never again, even though unprocessed blocks
remain. To handle this case, in the patch, the leader process checks
at the end of the parallel scan whether there is an uncompleted
parallel scan. If so, the leader completes the scan using the worker's
ParallelBlockTableScanWorker data on behalf of that worker.

# Discussions

I'm somewhat confident in the overall design of this feature, but
there are some points regarding the implementation we need to discuss.

In the patch, I extended vacuumparallel.c to support parallel table
scan (and vacuum in the future). To do so, I needed to add some table
AM callbacks, such as DSM size estimation, DSM initialization, and the
actual table scan. We need to verify whether these APIs are
appropriate. Specifically, if we want to support both parallel heap
scan and parallel heap vacuum, do we want to add separate callbacks
for them? It could be overkill, since such a 2-pass vacuum strategy is
specific to the heap AM.
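
As a sketch, the additions to TableAmRoutine are roughly of this shape
(approximate signatures modeled on the wrapper functions in the patch;
the patch itself is authoritative):

```
/* Sketch of the new TableAmRoutine members (approximate signatures) */
int     (*parallel_vacuum_compute_workers) (Relation rel, int requested);

void    (*parallel_vacuum_estimate) (Relation rel, ParallelContext *pcxt,
                                     int nworkers, void *state);

void    (*parallel_vacuum_initialize) (Relation rel, ParallelContext *pcxt,
                                       int nworkers, void *state);

void    (*parallel_vacuum_scan_worker) (Relation rel, ParallelVacuumState *pvs,
                                        ParallelWorkerContext *pwcxt);
```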

As another implementation idea, we might want to implement parallel
heap scan/vacuum in vacuumlazy.c while minimizing the changes to
vacuumparallel.c. That way, we would not need to add table AM
callbacks. However, we would end up having duplicate code related to
parallel operation in vacuum, such as the vacuum delays.

Also, we might need to add some functions to share GlobalVisState
among parallel workers, since GlobalVisState is a private struct.
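
For example, something along these lines might be needed (all three
function names are hypothetical; GlobalVisState is currently opaque
outside procarray.c, and the serialize/restore naming just follows the
convention used for other parallel state):

```
/* Hypothetical additions to procarray.c for DSM-based sharing */

/* Size of the serialized state, so the leader can reserve DSM space */
extern Size EstimateGlobalVisStateSpace(void);

/* Leader: copy its GlobalVisState for the relation into the DSM area */
extern void SerializeGlobalVisState(GlobalVisState *vistest, char *dest);

/* Worker: reconstruct a usable GlobalVisState from the shared copy */
extern GlobalVisState *RestoreGlobalVisState(const char *src);
```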

Other points that I'm somewhat uncomfortable with, or that need to be
discussed, remain in the code as XXX comments.

# Benchmark results

* Test-1: parallel heap scan on the table without indexes

I created a 20GB table, generated garbage on it, and ran vacuum while
changing the parallel degree:

create unlogged table test (a int) with (autovacuum_enabled = off);
insert into test select generate_series(1, 600000000); --- 20GB table
delete from test where a % 5 = 0;
vacuum (verbose, parallel 0) test;

Here are the results (total time and heap scan time):

PARALLEL 0: 21.99 s (single process)
PARALLEL 1: 11.39 s
PARALLEL 2:   8.36 s
PARALLEL 3:   6.14 s
PARALLEL 4:   5.08 s

* Test-2: parallel heap scan on the table with one index

I used a table similar to the one in test case 1 but created one btree index on it:

create unlogged table test (a int) with (autovacuum_enabled = off);
insert into test select generate_series(1, 600000000); --- 20GB table
create index on test (a);
delete from test where a % 5 = 0;
vacuum (verbose, parallel 0) test;

I've measured the total execution time as well as the time of each
vacuum phase (from left to right: heap scan time, index vacuum time,
and heap vacuum time):

PARALLEL 0: 45.11 s (21.89, 16.74, 6.48)
PARALLEL 1: 42.13 s (12.75, 22.04, 7.23)
PARALLEL 2: 39.27 s (8.93, 22.78, 7.45)
PARALLEL 3: 36.53 s (6.76, 22.00, 7.65)
PARALLEL 4: 35.84 s (5.85, 22.04, 7.83)

Overall, I can see that the parallel heap scan in lazy vacuum has
decent scalability; in both test-1 and test-2, the execution time of
the heap scan got ~4x faster with 4 parallel workers. On the other
hand, when it comes to the total vacuum execution time, I could not
see much performance improvement in test-2 (45.11 vs. 35.84). Looking
at the results of PARALLEL 0 vs. PARALLEL 1 in test-2, the heap scan
got faster (21.89 vs. 12.75) whereas index vacuum got slower (16.74
vs. 22.04), and the heap scan in test-2 was not as fast as in test-1
with 1 parallel worker (12.75 vs. 11.39).

I think the reason is that the shared TidStore is not very scalable,
since we have a single lock on it. In all cases in test-1, we don't
use the shared TidStore since all dead tuples are removed during heap
pruning, so the scalability was better overall than in test-2. In the
parallel 0 case in test-2 we use the local TidStore, and from a
parallel degree of 1 in test-2 we use the shared TidStore and parallel
workers concurrently update it. Also, I guess that the lookup
performance of the local TidStore is better than the shared TidStore's
lookup performance because of the differences between a bump context
and a DSA area. I think this difference contributed to the fact that
index vacuuming got slower (16.74 vs. 22.04).
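
For reference, adding the dead items of one page to the shared store
looks roughly like the sketch below with the existing TidStore API;
every writer funnels through the same exclusive lock, which is what
limits scalability here:

```
#include "access/tidstore.h"

/* Sketch: add one page's dead item offsets to a shared TidStore */
static void
add_page_dead_items(TidStore *dead_items, BlockNumber blkno,
                    OffsetNumber *offsets, int num_offsets)
{
    TidStoreLockExclusive(dead_items);
    TidStoreSetBlockOffsets(dead_items, blkno, offsets, num_offsets);
    TidStoreUnlock(dead_items);
}
```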

There are two obvious ideas to improve the overall vacuum execution
time: (1) improve the shared TidStore's scalability and (2) support
parallel heap vacuum. For (1), several ideas are proposed by the ART
authors[1]. I've not tried these ideas, but they might be applicable
to our ART implementation. But I prefer to start with (2) since it
would be easier. Feedback is very welcome.

Regards,

[1] https://db.in.tum.de/~leis/papers/artsync.pdf

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com

Attachment

Re: Parallel heap vacuum

From
Amit Kapila
Date:
On Fri, Jun 28, 2024 at 9:44 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
> # Benchmark results
>
> * Test-1: parallel heap scan on the table without indexes
>
> I created 20GB table, made garbage on the table, and run vacuum while
> changing parallel degree:
>
> create unlogged table test (a int) with (autovacuum_enabled = off);
> insert into test select generate_series(1, 600000000); --- 20GB table
> delete from test where a % 5 = 0;
> vacuum (verbose, parallel 0) test;
>
> Here are the results (total time and heap scan time):
>
> PARALLEL 0: 21.99 s (single process)
> PARALLEL 1: 11.39 s
> PARALLEL 2:   8.36 s
> PARALLEL 3:   6.14 s
> PARALLEL 4:   5.08 s
>
> * Test-2: parallel heap scan on the table with one index
>
> I used a similar table to the test case 1 but created one btree index on it:
>
> create unlogged table test (a int) with (autovacuum_enabled = off);
> insert into test select generate_series(1, 600000000); --- 20GB table
> create index on test (a);
> delete from test where a % 5 = 0;
> vacuum (verbose, parallel 0) test;
>
> I've measured the total execution time as well as the time of each
> vacuum phase (from left heap scan time, index vacuum time, and heap
> vacuum time):
>
> PARALLEL 0: 45.11 s (21.89, 16.74, 6.48)
> PARALLEL 1: 42.13 s (12.75, 22.04, 7.23)
> PARALLEL 2: 39.27 s (8.93, 22.78, 7.45)
> PARALLEL 3: 36.53 s (6.76, 22.00, 7.65)
> PARALLEL 4: 35.84 s (5.85, 22.04, 7.83)
>
> Overall, I can see the parallel heap scan in lazy vacuum has a decent
> scalability; In both test-1 and test-2, the execution time of heap
> scan got ~4x faster with 4 parallel workers. On the other hand, when
> it comes to the total vacuum execution time, I could not see much
> performance improvement in test-2 (45.11 vs. 35.84). Looking at the
> results PARALLEL 0 vs. PARALLEL 1 in test-2, the heap scan got faster
> (21.89 vs. 12.75) whereas index vacuum got slower (16.74 vs. 22.04),
> and heap scan in case 2 was not as fast as in case 1 with 1 parallel
> worker (12.75 vs. 11.39).
>
> I think the reason is the shared TidStore is not very scalable since
> we have a single lock on it. In all cases in the test-1, we don't use
> the shared TidStore since all dead tuples are removed during heap
> pruning. So the scalability was better overall than in test-2. In
> parallel 0 case in test-2, we use the local TidStore, and from
> parallel degree of 1 in test-2, we use the shared TidStore and
> parallel worker concurrently update it. Also, I guess that the lookup
> performance of the local TidStore is better than the shared TidStore's
> lookup performance because of the differences between a bump context
> and an DSA area. I think that this difference contributed the fact
> that index vacuuming got slower (16.74 vs. 22.04).
>
> There are two obvious improvement ideas to improve overall vacuum
> execution time: (1) improve the shared TidStore scalability and (2)
> support parallel heap vacuum. For (1), several ideas are proposed by
> the ART authors[1]. I've not tried these ideas but it might be
> applicable to our ART implementation. But I prefer to start with (2)
> since it would be easier. Feedback is very welcome.
>

Starting with (2) sounds like a reasonable approach. We should study a
few more things, like (a) the performance results where there are 3-4
indexes, and (b) the reason for the performance improvement seen with
heap scans only. We normally get the benefits of parallelism from
using multiple CPUs, but parallelizing scans (I/O) shouldn't give much
benefit. Is it possible that you are seeing benefits because most of
the data is either in shared_buffers or in memory? We can probably try
vacuuming tables after restarting the nodes to ensure the data is not
in memory.

--
With Regards,
Amit Kapila.



RE: Parallel heap vacuum

From
"Hayato Kuroda (Fujitsu)"
Date:
Dear Sawada-san,

> The parallel vacuum we have today supports only for index vacuuming.
> Therefore, while multiple workers can work on different indexes in
> parallel, the heap table is always processed by the single process.
> I'd like to propose $subject, which enables us to have multiple
> workers running on the single heap table. This would be helpful to
> speedup vacuuming for tables without indexes or tables with
> INDEX_CLENAUP = off.

Sounds great. IIUC, vacuuming is still one of the main weak points of Postgres.

> I've attached a PoC patch for this feature. It implements only
> parallel heap scans in lazyvacum. We can extend this feature to
> support parallel heap vacuum as well in the future or in the same
> patch.

Before diving deep into it, I tested your PoC but found an unclear point.
When vacuum is requested with parallel > 0 on almost the same workload
as yours, only the first page was scanned and cleaned up.

When parallel was set to zero, I got:
```
INFO:  vacuuming "postgres.public.test"
INFO:  finished vacuuming "postgres.public.test": index scans: 0
pages: 0 removed, 2654868 remain, 2654868 scanned (100.00% of total)
tuples: 120000000 removed, 480000000 remain, 0 are dead but not yet removable
removable cutoff: 752, which was 0 XIDs old when operation ended
new relfrozenxid: 739, which is 1 XIDs ahead of previous value
frozen: 0 pages from table (0.00% of total) had 0 tuples frozen
index scan not needed: 0 pages from table (0.00% of total) had 0 dead item identifiers removed
avg read rate: 344.639 MB/s, avg write rate: 344.650 MB/s
buffer usage: 2655045 hits, 2655527 misses, 2655606 dirtied
WAL usage: 1 records, 1 full page images, 937 bytes
system usage: CPU: user: 39.45 s, system: 20.74 s, elapsed: 60.19 s
```

This means that all pages were indeed scanned and the dead tuples were removed.
However, when parallel was set to one, I got a different result:

```
INFO:  vacuuming "postgres.public.test"
INFO:  launched 1 parallel vacuum worker for table scanning (planned: 1)
INFO:  finished vacuuming "postgres.public.test": index scans: 0
pages: 0 removed, 2654868 remain, 1 scanned (0.00% of total)
tuples: 12 removed, 0 remain, 0 are dead but not yet removable
removable cutoff: 752, which was 0 XIDs old when operation ended
frozen: 0 pages from table (0.00% of total) had 0 tuples frozen
index scan not needed: 0 pages from table (0.00% of total) had 0 dead item identifiers removed
avg read rate: 92.952 MB/s, avg write rate: 0.845 MB/s
buffer usage: 96 hits, 660 misses, 6 dirtied
WAL usage: 1 records, 1 full page images, 937 bytes
system usage: CPU: user: 0.05 s, system: 0.00 s, elapsed: 0.05 s
```

It looks like only one page was scanned and 12 tuples were removed.
This is very strange to me...

The attached script emulates my test. IIUC it is almost the same as
yours, except that the instance was restarted before vacuuming.

Can you reproduce this and see the reason? I can provide further
information as needed.

Best Regards,
Hayato Kuroda
FUJITSU LIMITED
https://www.fujitsu.com/ 


Attachment

Re: Parallel heap vacuum

From
Masahiko Sawada
Date:
On Fri, Jul 5, 2024 at 6:51 PM Hayato Kuroda (Fujitsu)
<kuroda.hayato@fujitsu.com> wrote:
>
> Dear Sawada-san,
>
> > The parallel vacuum we have today supports only for index vacuuming.
> > Therefore, while multiple workers can work on different indexes in
> > parallel, the heap table is always processed by the single process.
> > I'd like to propose $subject, which enables us to have multiple
> > workers running on the single heap table. This would be helpful to
> > speedup vacuuming for tables without indexes or tables with
> > INDEX_CLENAUP = off.
>
> Sounds great. IIUC, vacuuming is still one of the main weak point of postgres.
>
> > I've attached a PoC patch for this feature. It implements only
> > parallel heap scans in lazyvacum. We can extend this feature to
> > support parallel heap vacuum as well in the future or in the same
> > patch.
>
> Before diving into deep, I tested your PoC but found unclear point.
> When the vacuuming is requested with parallel > 0 with almost the same workload
> as yours, only the first page was scanned and cleaned up.
>
> When parallel was set to zero, I got:
> ```
> INFO:  vacuuming "postgres.public.test"
> INFO:  finished vacuuming "postgres.public.test": index scans: 0
> pages: 0 removed, 2654868 remain, 2654868 scanned (100.00% of total)
> tuples: 120000000 removed, 480000000 remain, 0 are dead but not yet removable
> removable cutoff: 752, which was 0 XIDs old when operation ended
> new relfrozenxid: 739, which is 1 XIDs ahead of previous value
> frozen: 0 pages from table (0.00% of total) had 0 tuples frozen
> index scan not needed: 0 pages from table (0.00% of total) had 0 dead item identifiers removed
> avg read rate: 344.639 MB/s, avg write rate: 344.650 MB/s
> buffer usage: 2655045 hits, 2655527 misses, 2655606 dirtied
> WAL usage: 1 records, 1 full page images, 937 bytes
> system usage: CPU: user: 39.45 s, system: 20.74 s, elapsed: 60.19 s
> ```
>
> This meant that all pages were surely scanned and dead tuples were removed.
> However, when parallel was set to one, I got another result:
>
> ```
> INFO:  vacuuming "postgres.public.test"
> INFO:  launched 1 parallel vacuum worker for table scanning (planned: 1)
> INFO:  finished vacuuming "postgres.public.test": index scans: 0
> pages: 0 removed, 2654868 remain, 1 scanned (0.00% of total)
> tuples: 12 removed, 0 remain, 0 are dead but not yet removable
> removable cutoff: 752, which was 0 XIDs old when operation ended
> frozen: 0 pages from table (0.00% of total) had 0 tuples frozen
> index scan not needed: 0 pages from table (0.00% of total) had 0 dead item identifiers removed
> avg read rate: 92.952 MB/s, avg write rate: 0.845 MB/s
> buffer usage: 96 hits, 660 misses, 6 dirtied
> WAL usage: 1 records, 1 full page images, 937 bytes
> system usage: CPU: user: 0.05 s, system: 0.00 s, elapsed: 0.05 s
> ```
>
> It looked like that only a page was scanned and 12 tuples were removed.
> It looks very strange for me...
>
> Attached script emulate my test. IIUC it was almost the same as yours, but
> the instance was restarted before vacuuming.
>
> Can you reproduce and see the reason? Based on the requirement I can
> provide further information.

Thank you for the test!

I could reproduce this issue and it's a bug; it skipped even
non-all-visible pages. I've attached a new version of the patch.

BTW, since we compute the number of parallel workers for the heap scan
based on the table size, it's possible that we launch multiple workers
even if most blocks are all-visible. It seems better to calculate it
based on (relpages - relallvisible).
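
A minimal sketch of that calculation, assuming we reuse the tripling
heuristic of compute_parallel_worker() and the existing GUCs
min_parallel_table_scan_size and max_parallel_maintenance_workers
(illustrative only; the actual callback in the patch may differ):

```
/* Illustrative only: size the worker count by the not-all-visible pages */
static int
heap_parallel_vacuum_compute_workers(Relation rel, int requested)
{
    BlockNumber relpages = (BlockNumber) rel->rd_rel->relpages;
    BlockNumber relallvisible = (BlockNumber) rel->rd_rel->relallvisible;
    BlockNumber pages_to_scan;
    BlockNumber threshold = (BlockNumber) min_parallel_table_scan_size;
    int         workers = 0;

    pages_to_scan = (relpages > relallvisible) ? relpages - relallvisible : 0;

    /* one more worker each time the remaining work triples */
    while (pages_to_scan >= threshold * 3 &&
           workers < max_parallel_maintenance_workers)
    {
        workers++;
        threshold *= 3;
    }

    return (requested > 0) ? Min(workers, requested) : workers;
}
```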

Regards,


--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com

Attachment

Re: Parallel heap vacuum

From
Masahiko Sawada
Date:
On Fri, Jun 28, 2024 at 9:06 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Fri, Jun 28, 2024 at 9:44 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> >
> > # Benchmark results
> >
> > * Test-1: parallel heap scan on the table without indexes
> >
> > I created 20GB table, made garbage on the table, and run vacuum while
> > changing parallel degree:
> >
> > create unlogged table test (a int) with (autovacuum_enabled = off);
> > insert into test select generate_series(1, 600000000); --- 20GB table
> > delete from test where a % 5 = 0;
> > vacuum (verbose, parallel 0) test;
> >
> > Here are the results (total time and heap scan time):
> >
> > PARALLEL 0: 21.99 s (single process)
> > PARALLEL 1: 11.39 s
> > PARALLEL 2:   8.36 s
> > PARALLEL 3:   6.14 s
> > PARALLEL 4:   5.08 s
> >
> > * Test-2: parallel heap scan on the table with one index
> >
> > I used a similar table to the test case 1 but created one btree index on it:
> >
> > create unlogged table test (a int) with (autovacuum_enabled = off);
> > insert into test select generate_series(1, 600000000); --- 20GB table
> > create index on test (a);
> > delete from test where a % 5 = 0;
> > vacuum (verbose, parallel 0) test;
> >
> > I've measured the total execution time as well as the time of each
> > vacuum phase (from left heap scan time, index vacuum time, and heap
> > vacuum time):
> >
> > PARALLEL 0: 45.11 s (21.89, 16.74, 6.48)
> > PARALLEL 1: 42.13 s (12.75, 22.04, 7.23)
> > PARALLEL 2: 39.27 s (8.93, 22.78, 7.45)
> > PARALLEL 3: 36.53 s (6.76, 22.00, 7.65)
> > PARALLEL 4: 35.84 s (5.85, 22.04, 7.83)
> >
> > Overall, I can see the parallel heap scan in lazy vacuum has a decent
> > scalability; In both test-1 and test-2, the execution time of heap
> > scan got ~4x faster with 4 parallel workers. On the other hand, when
> > it comes to the total vacuum execution time, I could not see much
> > performance improvement in test-2 (45.11 vs. 35.84). Looking at the
> > results PARALLEL 0 vs. PARALLEL 1 in test-2, the heap scan got faster
> > (21.89 vs. 12.75) whereas index vacuum got slower (16.74 vs. 22.04),
> > and heap scan in case 2 was not as fast as in case 1 with 1 parallel
> > worker (12.75 vs. 11.39).
> >
> > I think the reason is the shared TidStore is not very scalable since
> > we have a single lock on it. In all cases in the test-1, we don't use
> > the shared TidStore since all dead tuples are removed during heap
> > pruning. So the scalability was better overall than in test-2. In
> > parallel 0 case in test-2, we use the local TidStore, and from
> > parallel degree of 1 in test-2, we use the shared TidStore and
> > parallel worker concurrently update it. Also, I guess that the lookup
> > performance of the local TidStore is better than the shared TidStore's
> > lookup performance because of the differences between a bump context
> > and an DSA area. I think that this difference contributed the fact
> > that index vacuuming got slower (16.74 vs. 22.04).
> >

Thank you for the comments!

> > There are two obvious improvement ideas to improve overall vacuum
> > execution time: (1) improve the shared TidStore scalability and (2)
> > support parallel heap vacuum. For (1), several ideas are proposed by
> > the ART authors[1]. I've not tried these ideas but it might be
> > applicable to our ART implementation. But I prefer to start with (2)
> > since it would be easier. Feedback is very welcome.
> >
>
> Starting with (2) sounds like a reasonable approach. We should study a
> few more things like (a) the performance results where there are 3-4
> indexes,

Here are the results with 4 indexes (and restarting the server before
the benchmark):

PARALLEL 0: 115.48 s (32.76, 64.46, 18.24)
PARALLEL 1:  74.88 s (17.11, 44.43, 13.25)
PARALLEL 2:  71.15 s (14.13, 44.82, 12.12)
PARALLEL 3:  46.78 s (10.74, 24.50, 11.43)
PARALLEL 4:  46.42 s (8.95, 24.96, 12.39) (launched 4 workers for heap
scan and 3 workers for index vacuum)

> (b) What is the reason for performance improvement seen with
> only heap scans. We normally get benefits of parallelism because of
> using multiple CPUs but parallelizing scans (I/O) shouldn't give much
> benefits. Is it possible that you are seeing benefits because most of
> the data is either in shared_buffers or in memory? We can probably try
> vacuuming tables by restarting the nodes to ensure the data is not in
> memory.

I think it depends on the storage performance. FYI I use an EC2
instance (m6id.metal).

I've run the same benchmark script (table with no index), restarting
the server before executing the vacuum, and here are the results:

PARALLEL 0:  32.75 s
PARALLEL 1:  17.46 s
PARALLEL 2:  13.41 s
PARALLEL 3:  10.31 s
PARALLEL 4:    8.48 s

With the above two tests, I used the updated patch that I just submitted[1].

Regards,

[1] https://www.postgresql.org/message-id/CAD21AoAWHHnCg9OvtoEJnnvCc-3isyOyAGn%2B2KYoSXEv%3DvXauw%40mail.gmail.com

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com



RE: Parallel heap vacuum

From
"Hayato Kuroda (Fujitsu)"
Date:
Dear Sawada-san,

> Thank you for the test!
> 
> I could reproduce this issue and it's a bug; it skipped even
> non-all-visible pages. I've attached the new version patch.
> 
> BTW since we compute the number of parallel workers for the heap scan
> based on the table size, it's possible that we launch multiple workers
> even if most blocks are all-visible. It seems to be better if we
> calculate it based on (relpages - relallvisible).

Thanks for updating the patch. I applied it and confirmed that all pages are scanned.
I used almost the same script (just changed max_parallel_maintenance_workers)
and got the results below. I think the tendency is the same as yours.

```
parallel 0: 61114.369 ms
parallel 1: 34870.316 ms
parallel 2: 23598.115 ms
parallel 3: 17732.363 ms
parallel 4: 15203.271 ms
parallel 5: 13836.025 ms
```

I started to read your code, but it takes much time because I've never seen it before...
The part below contains my initial comments.

1.
This patch cannot be built when debug mode is enabled. See [1].
IIUC, this is because NewRelminMxid was moved from struct LVRelState to PHVShared,
so you should update it to something like "vacrel->counters->NewRelminMxid".

2.
I compared parallel heap scan and found that it does not have compute_worker API.
Can you clarify the reason why there is an inconsistency?
(I feel it is intentional because the calculation logic seems to depend on the heap structure,
so should we add the API for table scan as well?)

[1]:
```
vacuumlazy.c: In function ‘lazy_scan_prune’:
vacuumlazy.c:1666:34: error: ‘LVRelState’ {aka ‘struct LVRelState’} has no member named ‘NewRelminMxid’
  Assert(MultiXactIdIsValid(vacrel->NewRelminMxid));
                                  ^~
....
```

Best regards,
Hayato Kuroda
FUJITSU LIMITED


Re: Parallel heap vacuum

From
Masahiko Sawada
Date:
On Thu, Jul 25, 2024 at 2:58 AM Hayato Kuroda (Fujitsu)
<kuroda.hayato@fujitsu.com> wrote:
>
> Dear Sawada-san,
>
> > Thank you for the test!
> >
> > I could reproduce this issue and it's a bug; it skipped even
> > non-all-visible pages. I've attached the new version patch.
> >
> > BTW since we compute the number of parallel workers for the heap scan
> > based on the table size, it's possible that we launch multiple workers
> > even if most blocks are all-visible. It seems to be better if we
> > calculate it based on (relpages - relallvisible).
>
> Thanks for updating the patch. I applied and confirmed all pages are scanned.
> I used almost the same script (just changed max_parallel_maintenance_workers)
> and got below result. I think the tendency was the same as yours.
>
> ```
> parallel 0: 61114.369 ms
> parallel 1: 34870.316 ms
> parallel 2: 23598.115 ms
> parallel 3: 17732.363 ms
> parallel 4: 15203.271 ms
> parallel 5: 13836.025 ms
> ```

Thank you for testing!

>
> I started to read your codes but takes much time because I've never seen before...
> Below part contains initial comments.
>
> 1.
> This patch cannot be built when debug mode is enabled. See [1].
> IIUC, this was because NewRelminMxid was moved from struct LVRelState to PHVShared.
> So you should update like " vacrel->counters->NewRelminMxid".

Right, will fix.

> 2.
> I compared parallel heap scan and found that it does not have compute_worker API.
> Can you clarify the reason why there is an inconsistency?
> (I feel it is intentional because the calculation logic seems to depend on the heap structure,
> so should we add the API for table scan as well?)

There is room to consider a better API design, but yes, the reason is
that the calculation logic depends on the table AM implementation. For
example, I thought it might make sense to take the number of
all-visible pages into account when calculating the number of parallel
workers, as we don't want to launch many workers on a table where most
pages are all-visible. That might not work for other table AMs.

I'm updating the patch to implement parallel heap vacuum and will
share the updated patch. It might take time, as it requires
implementing shared iteration support in the radix tree.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com



RE: Parallel heap vacuum

From
"Hayato Kuroda (Fujitsu)"
Date:
Dear Sawada-san,

> Thank you for testing!

I tried to profile the vacuuming with a larger case (40 workers for the 20GB table),
and the attached FlameGraph shows the result. IIUC, I cannot find any bottlenecks.

> > 2.
> > I compared parallel heap scan and found that it does not have compute_worker
> API.
> > Can you clarify the reason why there is an inconsistency?
> > (I feel it is intentional because the calculation logic seems to depend on the
> heap structure,
> > so should we add the API for table scan as well?)
> 
> There is room to consider a better API design, but yes, the reason is
> that the calculation logic depends on table AM implementation. For
> example, I thought it might make sense to consider taking the number
> of all-visible pages into account for the calculation of the number of
> parallel workers as we don't want to launch many workers on the table
> where most pages are all-visible. Which might not work for other table
> AMs.
>

Okay, thanks for confirming. I wanted to ask others as well.


> I'm updating the patch to implement parallel heap vacuum and will
> share the updated patch. It might take time as it requires to
> implement shared iteration support in radx tree.

Here are other preliminary comments for the v2 patch. These do not include
cosmetic ones.

1.
The shared data structure PHVShared does not contain a mutex lock. Is that intentional
because its fields are accessed by the leader only after the parallel workers exit?

2.
Per my understanding, the vacuuming goes through the steps below.

a. parallel workers are launched for scanning pages
b. the leader waits until the scans are done
c. the leader does vacuum alone (you may extend here...)
d. parallel workers are launched again to clean up indexes

If so, can we reuse the parallel workers for the cleanup? Or is this more
engineering pain than benefit?

3.
According to LaunchParallelWorkers(), the bgw_name and bgw_type are hardcoded as
"parallel worker ...". Can we extend this to improve trackability in
pg_stat_activity?

4.
I'm not an expert on TidStore, but as you said, TidStoreLockExclusive() might be a
bottleneck when TIDs are added to the shared TidStore. Another primitive idea of mine
is to prepare per-worker TidStores (in the PHVScanWorkerState or LVRelCounters?)
and gather them after the heap scan. If you extend this so that parallel workers also
do the vacuuming, the gathering may not be needed: each worker can access its own
TidStore and clean up. One downside is that the memory consumption may be quite large.


What do you think?

Best regards,
Hayato Kuroda
FUJITSU LIMITED 

Attachment

Re: Parallel heap vacuum

From
Masahiko Sawada
Date:
Sorry for the very late reply.

On Tue, Jul 30, 2024 at 8:54 PM Hayato Kuroda (Fujitsu)
<kuroda.hayato@fujitsu.com> wrote:
>
> Dear Sawada-san,
>
> > Thank you for testing!
>
> I tried to profile the vacuuming with the larger case (40 workers for the 20G table)
> and attached FlameGraph showed the result. IIUC, I cannot find bottlenecks.
>
> > > 2.
> > > I compared parallel heap scan and found that it does not have compute_worker
> > API.
> > > Can you clarify the reason why there is an inconsistency?
> > > (I feel it is intentional because the calculation logic seems to depend on the
> > heap structure,
> > > so should we add the API for table scan as well?)
> >
> > There is room to consider a better API design, but yes, the reason is
> > that the calculation logic depends on table AM implementation. For
> > example, I thought it might make sense to consider taking the number
> > of all-visible pages into account for the calculation of the number of
> > parallel workers as we don't want to launch many workers on the table
> > where most pages are all-visible. Which might not work for other table
> > AMs.
> >
>
> Okay, thanks for confirming. I wanted to ask others as well.
>
>
> > I'm updating the patch to implement parallel heap vacuum and will
> > share the updated patch. It might take time as it requires to
> > implement shared iteration support in radx tree.
>
> Here are other preliminary comments for v2 patch. This does not contain
> cosmetic ones.
>
> 1.
> Shared data structure PHVShared does not contain the mutex lock. Is it intentional
> because they are accessed by leader only after parallel workers exit?
>

Yes, the fields in PHVShared are read-only for workers. Since no
concurrent reads/writes happen on these fields we don't need to
protect them.

> 2.
> Per my understanding, the vacuuming goes like below steps.
>
> a. paralell workers are launched for scanning pages
> b. leader waits until scans are done
> c. leader does vacuum alone (you may extend here...)
> d. parallel workers are launched again to cleanup indeces
>
> If so, can we reuse parallel workers for the cleanup? Or, this is painful
> engineering than the benefit?

I hadn't thought of this idea, but I think it's possible from a
technical perspective. It saves the overhead of relaunching workers,
but I'm not sure how much it would help performance, and I'm concerned
it would make the code complex. For example, different numbers of
workers might be required for table vacuuming and index vacuuming, so
we would end up increasing or decreasing the number of workers.

> 3.
> According to LaunchParallelWorkers(), the bgw_name and bgw_type are hardcoded as
> "parallel worker ..." Can we extend this to improve the trackability in the
> pg_stat_activity?

It would be a good improvement for better trackability. But I think we
should do that in a separate patch as it's not just a problem for
parallel heap vacuum.

>
> 4.
> I'm not the expert TidStore, but as you said TidStoreLockExclusive() might be a
> bottleneck when tid is added to the shared TidStore. My another primitive idea
> is that to prepare per-worker TidStores (in the PHVScanWorkerState or LVRelCounters?)
> and gather after the heap scanning. If you extend like parallel workers do vacuuming,
> the gathering may not be needed: they can access own TidStore and clean up.
> One downside is that the memory consumption may be quite large.
>

Interesting idea. Supposing we support parallel heap vacuum as well, we
wouldn't need locks, nor would we need to support shared iteration on
the TidStore. I think each worker should use a fraction of
maintenance_work_mem. However, one downside would be that we need to
check as many TidStores as there are workers during index vacuuming.
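
To make the downside concrete, index vacuuming would then have to probe
every worker's store for each index tuple, roughly like the sketch below
(the array of per-worker stores is hypothetical; TidStoreIsMember() is
the existing API):

```
/* Sketch: is this heap TID dead according to any worker's TidStore? */
static bool
tid_is_dead(TidStore **worker_stores, int nstores, ItemPointer tid)
{
    for (int i = 0; i < nstores; i++)
    {
        if (TidStoreIsMember(worker_stores[i], tid))
            return true;
    }

    return false;
}
```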

FYI I've implemented the parallel heap vacuum part and am doing some
benchmark tests. I'll share the updated patches along with test
results this week.

Regards,


--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com



RE: Parallel heap vacuum

From
"Hayato Kuroda (Fujitsu)"
Date:
Dear Sawada-san,

> 
> I've attached new version patches that fixes failures reported by
> cfbot. I hope these changes make cfbot happy.

Thanks for updating the patch and sorry for the delayed reply. I confirmed that cfbot
for Linux/Windows says ok.
I'm still learning the feature so I can post only one comment :-(.

I wanted to know whether TidStoreBeginIterateShared() is needed. IIUC, the pre-existing API,
TidStoreBeginIterate(), already accepts a shared TidStore. The only difference
is whether the elog(ERROR) exists, but I wonder whether it benefits others. Is there another
reason that lazy_vacuum_heap_rel() uses TidStoreBeginIterateShared()?

Another approach is to restrict TidStoreBeginIterate() to support only the local one.
What do you think?

Best regards,
Hayato Kuroda
FUJITSU LIMITED


Re: Parallel heap vacuum

From
Masahiko Sawada
Date:
On Mon, Nov 11, 2024 at 5:08 AM Hayato Kuroda (Fujitsu)
<kuroda.hayato@fujitsu.com> wrote:
>
> Dear Sawda-san,
>
> >
> > I've attached new version patches that fixes failures reported by
> > cfbot. I hope these changes make cfbot happy.
>
> Thanks for updating the patch and sorry for delaying the reply. I confirmed cfbot
> for Linux/Windows said ok.
> I'm still learning the feature so I can post only one comment :-(.
>
> I wanted to know whether TidStoreBeginIterateShared() was needed. IIUC, pre-existing API,
> TidStoreBeginIterate(), has already accepted the shared TidStore. The only difference
> is whether elog(ERROR) exists, but I wonder if it benefits others. Is there another
> reason that lazy_vacuum_heap_rel() uses TidStoreBeginIterateShared()?

TidStoreBeginIterateShared() is designed for multiple parallel workers
to iterate a shared TidStore. During an iteration, parallel workers
share the iteration state and iterate the underlying radix tree while
taking appropriate locks. Therefore, it's available only for a shared
TidStore. This is required to implement the parallel heap vacuum,
where multiple parallel workers do the iteration on the shared
TidStore.

On the other hand, TidStoreBeginIterate() is designed for a single
process to iterate a TidStore. It accepts even a shared TidStore as
you mentioned, but during an iteration there is no inter-process
coordination such as locking. When it comes to parallel vacuum,
supporting TidStoreBeginIterate() on a shared TidStore is necessary to
cover the case where we use only parallel index vacuum but not
parallel heap scan/vacuum. In this case, we need to store dead tuple
TIDs on the shared TidStore during heap scan so parallel workers can
use it during index vacuum. But it's not necessary to use
TidStoreBeginIterateShared() because only one (leader) process does
heap vacuum.
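
For reference, single-process iteration with the existing API looks
roughly like the fragment below; in the patch, TidStoreBeginIterateShared()
would take the place of the begin call when multiple workers iterate the
same shared store (dead_items is the TidStore of dead tuple TIDs, and
vacuum_one_page() is a hypothetical placeholder):

```
/* Sketch: single-process iteration over the dead-item TidStore */
TidStoreIter       *iter = TidStoreBeginIterate(dead_items);
TidStoreIterResult *result;

while ((result = TidStoreIterateNext(iter)) != NULL)
{
    /* result->blkno identifies one heap page whose dead items to remove */
    vacuum_one_page(result->blkno);     /* hypothetical placeholder */
}

TidStoreEndIterate(iter);
```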

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com



Re: Parallel heap vacuum

From
vignesh C
Date:
On Wed, 30 Oct 2024 at 22:48, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
>
> I've attached new version patches that fixes failures reported by
> cfbot. I hope these changes make cfbot happy.
>

I just started reviewing the patch and found the following comments
while going through it:
1) I felt we should add some documentation for this at [1].

2) Can we add some tests in vacuum_parallel with tables having no
indexes and having dead tuples.

3) This should be included in typedefs.list:
3.a)
+/*
+ * Relation statistics collected during heap scanning and need to be
+ * shared among parallel vacuum workers.
+ */
+typedef struct LVRelScanStats
+{
+       BlockNumber scanned_pages;      /* # pages examined (not skipped via VM) */
+       BlockNumber removed_pages;      /* # pages removed by relation truncation */
+       BlockNumber frozen_pages;       /* # pages with newly frozen tuples */

3.b) Similarly this too:
+/*
+ * Struct for information that need to be shared among parallel vacuum workers
+ */
+typedef struct PHVShared
+{
+       bool            aggressive;
+       bool            skipwithvm;
+

3.c) Similarly this too:
+/* Per-worker scan state for parallel heap vacuum scan */
+typedef struct PHVScanWorkerState
+{
+       bool            initialized;

3.d)  Similarly this too:
+/* Struct for parallel heap vacuum */
+typedef struct PHVState
+{
+       /* Parallel scan description shared among parallel workers */

4) Since we are initializing almost all the members of the structure,
should we use palloc0 in this case:
+       scan_stats = palloc(sizeof(LVRelScanStats));
+       scan_stats->scanned_pages = 0;
+       scan_stats->removed_pages = 0;
+       scan_stats->frozen_pages = 0;
+       scan_stats->lpdead_item_pages = 0;
+       scan_stats->missed_dead_pages = 0;
+       scan_stats->nonempty_pages = 0;
+
+       /* Initialize remaining counters (be tidy) */
+       scan_stats->tuples_deleted = 0;
+       scan_stats->tuples_frozen = 0;
+       scan_stats->lpdead_items = 0;
+       scan_stats->live_tuples = 0;
+       scan_stats->recently_dead_tuples = 0;
+       scan_stats->missed_dead_tuples = 0;

5) Typo: "paralle" should be "parallel":
+/*
+ * Return the number of parallel workers for a parallel vacuum scan of this
+ * relation.
+ */
+static inline int
+table_paralle_vacuum_compute_workers(Relation rel, int requested)
+{
+       return rel->rd_tableam->parallel_vacuum_compute_workers(rel, requested);
+}

[1] - https://www.postgresql.org/docs/devel/sql-vacuum.html

Regards,
Vignesh



RE: Parallel heap vacuum

From
"Hayato Kuroda (Fujitsu)"
Date:
Dear Sawada-san,

> TidStoreBeginIterateShared() is designed for multiple parallel workers
> to iterate a shared TidStore. During an iteration, parallel workers
> share the iteration state and iterate the underlying radix tree while
> taking appropriate locks. Therefore, it's available only for a shared
> TidStore. This is required to implement the parallel heap vacuum,
> where multiple parallel workers do the iteration on the shared
> TidStore.
> 
> On the other hand, TidStoreBeginIterate() is designed for a single
> process to iterate a TidStore. It accepts even a shared TidStore as
> you mentioned, but during an iteration there is no inter-process
> coordination such as locking. When it comes to parallel vacuum,
> supporting TidStoreBeginIterate() on a shared TidStore is necessary to
> cover the case where we use only parallel index vacuum but not
> parallel heap scan/vacuum. In this case, we need to store dead tuple
> TIDs on the shared TidStore during heap scan so parallel workers can
> use it during index vacuum. But it's not necessary to use
> TidStoreBeginIterateShared() because only one (leader) process does
> heap vacuum.

Okay, thanks for the description. I felt it is OK to keep.

I read 0001 again and here are comments.

01. vacuumlazy.c
```
+#define LV_PARALLEL_SCAN_SHARED         0xFFFF0001
+#define LV_PARALLEL_SCAN_DESC           0xFFFF0002
+#define LV_PARALLEL_SCAN_DESC_WORKER    0xFFFF0003
```

I checked other DSM keys used for parallel work, and they seem to have names
like PARALLEL_KEY_XXX. Can we follow that?

02. LVRelState
```
+    BlockNumber next_fsm_block_to_vacuum;
```

Only this attribute does not have a comment. Can we add something like
"Next freespace map page to be checked"?

03. parallel_heap_vacuum_gather_scan_stats
```
+        vacrel->scan_stats->vacuumed_pages += ss->vacuumed_pages;
```

Note that `scan_stats->vacuumed_pages` does not exist in 0001; it is defined
in 0004. Can you move it?

04. heap_parallel_vacuum_estimate
```
+
+    heap_parallel_estimate_shared_memory_size(rel, nworkers, &pscan_len,
+                                              &shared_len, &pscanwork_len);
+
+    /* space for PHVShared */
+    shm_toc_estimate_chunk(&pcxt->estimator, shared_len);
+    shm_toc_estimate_keys(&pcxt->estimator, 1);
+
+    /* space for ParallelBlockTableScanDesc */
+    pscan_len = table_block_parallelscan_estimate(rel);
+    shm_toc_estimate_chunk(&pcxt->estimator, pscan_len);
+    shm_toc_estimate_keys(&pcxt->estimator, 1);
+
+    /* space for per-worker scan state, PHVScanWorkerState */
+    pscanwork_len = mul_size(sizeof(PHVScanWorkerState), nworkers);
+    shm_toc_estimate_chunk(&pcxt->estimator, pscanwork_len);
+    shm_toc_estimate_keys(&pcxt->estimator, 1);
```

I feel pscan_len and pscanwork_len are already calculated in heap_parallel_estimate_shared_memory_size().
Can we remove the table_block_parallelscan_estimate() and mul_size() calls from here?

05. Idea

Can you update the documentation?

06. Idea

AFAICS pg_stat_progress_vacuum does not contain information related to
parallel execution. What do you think about adding an attribute that shows a list of PIDs?
I'm not sure it is helpful for users, but it can show the parallelism.

Best regards,
Hayato Kuroda
FUJITSU LIMITED

Re: Parallel heap vacuum

From
Masahiko Sawada
Date:
On Wed, Nov 13, 2024 at 3:10 AM Hayato Kuroda (Fujitsu)
<kuroda.hayato@fujitsu.com> wrote:
>
> Dear Sawada-san,
>
> > TidStoreBeginIterateShared() is designed for multiple parallel workers
> > to iterate a shared TidStore. During an iteration, parallel workers
> > share the iteration state and iterate the underlying radix tree while
> > taking appropriate locks. Therefore, it's available only for a shared
> > TidStore. This is required to implement the parallel heap vacuum,
> > where multiple parallel workers do the iteration on the shared
> > TidStore.
> >
> > On the other hand, TidStoreBeginIterate() is designed for a single
> > process to iterate a TidStore. It accepts even a shared TidStore as
> > you mentioned, but during an iteration there is no inter-process
> > coordination such as locking. When it comes to parallel vacuum,
> > supporting TidStoreBeginIterate() on a shared TidStore is necessary to
> > cover the case where we use only parallel index vacuum but not
> > parallel heap scan/vacuum. In this case, we need to store dead tuple
> > TIDs on the shared TidStore during heap scan so parallel workers can
> > use it during index vacuum. But it's not necessary to use
> > TidStoreBeginIterateShared() because only one (leader) process does
> > heap vacuum.
>
> Okay, thanks for the description. I felt it is OK to keep.
>
> I read 0001 again and here are comments.

Thank you for the review comments!

>
> 01. vacuumlazy.c
> ```
> +#define LV_PARALLEL_SCAN_SHARED         0xFFFF0001
> +#define LV_PARALLEL_SCAN_DESC           0xFFFF0002
> +#define LV_PARALLEL_SCAN_DESC_WORKER    0xFFFF0003
> ```
>
> I checked other DMS keys used for parallel work, and they seems to have name
> like PARALEL_KEY_XXX. Can we follow it?

Yes. How about LV_PARALLEL_KEY_XXX?

>
> 02. LVRelState
> ```
> +    BlockNumber next_fsm_block_to_vacuum;
> ```
>
> Only the attribute does not have comments Can we add like:
> "Next freespace map page to be checked"?

Agreed. I'll add a comment "next block to check for FSM vacuum".

>
> 03. parallel_heap_vacuum_gather_scan_stats
> ```
> +        vacrel->scan_stats->vacuumed_pages += ss->vacuumed_pages;
> ```
>
> Note that `scan_stats->vacuumed_pages` does not exist in 0001, it is defined
> in 0004. Can you move it?

Will remove.

>
> 04. heap_parallel_vacuum_estimate
> ```
> +
> +    heap_parallel_estimate_shared_memory_size(rel, nworkers, &pscan_len,
> +                                              &shared_len, &pscanwork_len);
> +
> +    /* space for PHVShared */
> +    shm_toc_estimate_chunk(&pcxt->estimator, shared_len);
> +    shm_toc_estimate_keys(&pcxt->estimator, 1);
> +
> +    /* space for ParallelBlockTableScanDesc */
> +    pscan_len = table_block_parallelscan_estimate(rel);
> +    shm_toc_estimate_chunk(&pcxt->estimator, pscan_len);
> +    shm_toc_estimate_keys(&pcxt->estimator, 1);
> +
> +    /* space for per-worker scan state, PHVScanWorkerState */
> +    pscanwork_len = mul_size(sizeof(PHVScanWorkerState), nworkers);
> +    shm_toc_estimate_chunk(&pcxt->estimator, pscanwork_len);
> +    shm_toc_estimate_keys(&pcxt->estimator, 1);
> ```
>
> I feel pscan_len and pscanwork_len are calclated in heap_parallel_estimate_shared_memory_size().
> Can we remove table_block_parallelscan_estimate() and mul_size() from here?

Yes, it's an oversight. Will remove.

>
> 05. Idea
>
> Can you update documentations?

Will update the doc as well.

>
> 06. Idea
>
> AFAICS pg_stat_progress_vacuum does not contain information related with the
> parallel execution. How do you think adding an attribute which shows a list of pids?
> Not sure it is helpful for users but it can show the parallelism.

I think it's possible to show the parallelism even today (for parallel
index vacuuming):

=# select leader.pid, leader.query, array_agg(worker.pid) from
pg_stat_activity as leader, pg_stat_activity as worker,
pg_stat_progress_vacuum as v where leader.pid = worker.leader_pid and
leader.pid = v.pid group by 1, 2;
   pid   |        query        |     array_agg
---------+---------------------+-------------------
 2952103 | vacuum (verbose) t; | {2952257,2952258}
(1 row)

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com



Re: Parallel heap vacuum

From
Masahiko Sawada
Date:
On Tue, Nov 12, 2024 at 3:21 AM vignesh C <vignesh21@gmail.com> wrote:
>
> On Wed, 30 Oct 2024 at 22:48, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> >
> >
> > I've attached new version patches that fixes failures reported by
> > cfbot. I hope these changes make cfbot happy.
> >
>
> I just started reviewing the patch and found the following comments
> while going through the patch:
> 1) I felt we should add some documentation for this at [1].
>
> 2) Can we add some tests in vacuum_parallel with tables having no
> indexes and having dead tuples.
>
> 3) This should be included in typedefs.list:
> 3.a)
> +/*
> + * Relation statistics collected during heap scanning and need to be
> shared among
> + * parallel vacuum workers.
> + */
> +typedef struct LVRelScanStats
> +{
> +       BlockNumber scanned_pages;      /* # pages examined (not
> skipped via VM) */
> +       BlockNumber removed_pages;      /* # pages removed by relation
> truncation */
> +       BlockNumber frozen_pages;       /* # pages with newly frozen tuples */
>
> 3.b) Similarly this too:
> +/*
> + * Struct for information that need to be shared among parallel vacuum workers
> + */
> +typedef struct PHVShared
> +{
> +       bool            aggressive;
> +       bool            skipwithvm;
> +
>
> 3.c) Similarly this too:
> +/* Per-worker scan state for parallel heap vacuum scan */
> +typedef struct PHVScanWorkerState
> +{
> +       bool            initialized;
>
> 3.d)  Similarly this too:
> +/* Struct for parallel heap vacuum */
> +typedef struct PHVState
> +{
> +       /* Parallel scan description shared among parallel workers */
>
> 4) Since we are initializing almost all the members of structure,
> should we use palloc0 in this case:
> +       scan_stats = palloc(sizeof(LVRelScanStats));
> +       scan_stats->scanned_pages = 0;
> +       scan_stats->removed_pages = 0;
> +       scan_stats->frozen_pages = 0;
> +       scan_stats->lpdead_item_pages = 0;
> +       scan_stats->missed_dead_pages = 0;
> +       scan_stats->nonempty_pages = 0;
> +
> +       /* Initialize remaining counters (be tidy) */
> +       scan_stats->tuples_deleted = 0;
> +       scan_stats->tuples_frozen = 0;
> +       scan_stats->lpdead_items = 0;
> +       scan_stats->live_tuples = 0;
> +       scan_stats->recently_dead_tuples = 0;
> +       scan_stats->missed_dead_tuples = 0;
>
> 5) typo paralle should be parallel
> +/*
> + * Return the number of parallel workers for a parallel vacuum scan of this
> + * relation.
> + */
> +static inline int
> +table_paralle_vacuum_compute_workers(Relation rel, int requested)
> +{
> +       return rel->rd_tableam->parallel_vacuum_compute_workers(rel, requested);
> +}
>

Thank you for the comments! I'll address these comments in the next
version of the patch.

BTW, while updating the patch, I found that we might want to launch
different numbers of workers for scanning the heap and vacuuming the
heap. The number of parallel workers is determined based on the number
of blocks in the table. However, even if this number is high, it could
happen that we want to launch fewer workers to vacuum heap pages when
not many pages have garbage. And the number of workers for vacuuming
the heap could vary on each vacuum pass. I'm considering implementing
it.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com



RE: Parallel heap vacuum

From
"Hayato Kuroda (Fujitsu)"
Date:
Dear Sawada-san,

> 
> BTW while updating the patch, I found that we might want to launch
> different numbers of workers for scanning heap and vacuuming heap. The
> number of parallel workers is determined based on the number of blocks
> in the table. However, even if this number is high, it could happen
> that we want to launch fewer workers to vacuum heap pages when there
> are not many pages having garbage. And the number of workers for
> vacuuming heap could vary on each vacuum pass. I'm considering
> implementing it.

Just to clarify - this idea looks good to me. I imagine you will add new table AM APIs
like parallel_vacuum_compute_workers_for_scanning and parallel_vacuum_compute_workers_for_vacuuming.
If other table AM developers want to use the same number of workers as for scanning,
they can assign the same function to the pointer. Is that right?
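
For instance, the split could look like this (both names are taken from
the question above and are purely hypothetical, as is the
pages_with_garbage parameter):

```
/* Hypothetical pair of TableAmRoutine callbacks, one per vacuum phase */
int (*parallel_vacuum_compute_workers_for_scanning) (Relation rel, int requested);
int (*parallel_vacuum_compute_workers_for_vacuuming) (Relation rel, int requested,
                                                      BlockNumber pages_with_garbage);
```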

Best regards,
Hayato Kuroda
FUJITSU LIMITED


Re: Parallel heap vacuum

From
Peter Smith
Date:
Hi Sawada-San,

FYI, the patch 0001 fails to build stand-alone

vacuumlazy.c: In function ‘parallel_heap_vacuum_gather_scan_stats’:
vacuumlazy.c:3739:21: error: ‘LVRelScanStats’ has no member named
‘vacuumed_pages’
   vacrel->scan_stats->vacuumed_pages += ss->vacuumed_pages;
                     ^
vacuumlazy.c:3739:43: error: ‘LVRelScanStats’ has no member named
‘vacuumed_pages’
   vacrel->scan_stats->vacuumed_pages += ss->vacuumed_pages;
                                           ^
make[4]: *** [vacuumlazy.o] Error 1


It appears to be using a struct field which is not introduced
until patch 0004 of the patch set.

======
Kind Regards,
Peter Smith.
Fujitsu Australia.



Re: Parallel heap vacuum

From
Peter Smith
Date:
Hi Sawada-San,

I started to look at patch v4-0001 in this thread.

It is quite a big patch, so this is a WIP; below are just the
comments I have so far.

======
src/backend/access/heap/vacuumlazy.c

1.1.
+/*
+ * Relation statistics collected during heap scanning and need to be
shared among
+ * parallel vacuum workers.
+ */
+typedef struct LVRelScanStats

The comment wording is not quite right.

/Relation statistics collected during heap scanning/Relation
statistics that are collected during heap scanning/

~~~

1.2
+/*
+ * Struct for information that need to be shared among parallel vacuum workers
+ */
+typedef struct PHVShared

The comment wording is not quite right.

/that need to be shared/that needs to be shared/

~~~

1.3.
+/* Struct for parallel heap vacuum */
+typedef struct PHVState
+{
+ /* Parallel scan description shared among parallel workers */
+ ParallelBlockTableScanDesc pscandesc;
+
+ /* Shared information */
+ PHVShared  *shared;

If 'pscandesc' is described as 'shared among parallel workers', should
that field be within 'PHVShared' instead?

~~~

1.4.
  /* Initialize page counters explicitly (be tidy) */
- vacrel->scanned_pages = 0;
- vacrel->removed_pages = 0;
- vacrel->frozen_pages = 0;
- vacrel->lpdead_item_pages = 0;
- vacrel->missed_dead_pages = 0;
- vacrel->nonempty_pages = 0;
- /* dead_items_alloc allocates vacrel->dead_items later on */
+ scan_stats = palloc(sizeof(LVRelScanStats));
+ scan_stats->scanned_pages = 0;
+ scan_stats->removed_pages = 0;
+ scan_stats->frozen_pages = 0;
+ scan_stats->lpdead_item_pages = 0;
+ scan_stats->missed_dead_pages = 0;
+ scan_stats->nonempty_pages = 0;
+
+ /* Initialize remaining counters (be tidy) */
+ scan_stats->tuples_deleted = 0;
+ scan_stats->tuples_frozen = 0;
+ scan_stats->lpdead_items = 0;
+ scan_stats->live_tuples = 0;
+ scan_stats->recently_dead_tuples = 0;
+ scan_stats->missed_dead_tuples = 0;
+
+ vacrel->scan_stats = scan_stats;

1.4a.
Or, maybe just palloc0 this and provide a comment to say all counters
have been zapped to 0.

~

1.4b.
Maybe you don't need that 'scan_stats' variable; just assign the
palloc0 directly to the field instead.

~~~

1.5.
- vacrel->missed_dead_tuples = 0;
+ /* dead_items_alloc allocates vacrel->dead_items later on */

The patch seems to have moved this "dead_items_alloc" comment to now
be below the "Allocate/initialize output statistics state" stuff (??).

======
src/backend/commands/vacuumparallel.c

parallel_vacuum_init:

1.6.
  int parallel_workers = 0;
+ int nworkers_table;
+ int nworkers_index;

The local vars and function params are named like this (here and in
other functions). But the struct field names say 'nworkers_for_XXX'
e.g.
shared->nworkers_for_table = nworkers_table;
shared->nworkers_for_index = nworkers_index;

It may be better to use these 'nworkers_for_table' and
'nworkers_for_index' names consistently everywhere.

~~~

parallel_vacuum_compute_workers:

1.7.
- int parallel_workers;
+ int parallel_workers_table = 0;
+ int parallel_workers_index = 0;
+
+ *nworkers_table = 0;
+ *nworkers_index = 0;

The local variables 'parallel_workers_table' and
'parallel_workers_index' are hardly needed because AFAICT those
results can be directly assigned to *nworkers_table and
*nworkers_index.

~~~

parallel_vacuum_process_all_indexes:

1.8.
  /* Reinitialize parallel context to relaunch parallel workers */
- if (num_index_scans > 0)
+ if (num_index_scans > 0 || pvs->num_table_scans > 0)
  ReinitializeParallelDSM(pvs->pcxt);

I don't know if it is feasible or even makes sense to change, but
somehow it seemed strange that the 'num_index_scans' field is not
co-located with the 'num_table_scans' in the ParallelVacuumState. If
this is doable, then lots of functions also would no longer need to
pass 'num_index_scans' since they are already passing 'pvs'.

~~~

parallel_vacuum_table_scan_begin:

1.9.
+ (errmsg(ngettext("launched %d parallel vacuum worker for table
processing (planned: %d)",
+ "launched %d parallel vacuum workers for table processing (planned: %d)",
+ pvs->pcxt->nworkers_launched),

Isn't this the same as errmsg_plural?

~~~

1.10.

+/* Return the array of indexes associated to the given table to be vacuumed */
+Relation *
+parallel_vacuum_get_table_indexes(ParallelVacuumState *pvs, int *nindexes)

Even though the function comment can fit on one line it is nicer to
use a block-style comment with a period, like below. It then will be
consistent with other function comments (e.g.
parallel_vacuum_table_scan_end, parallel_vacuum_process_table, etc).
There are multiple places that this review comment can apply to.

(also typo /associated to/associated with/)

SUGGESTION
/*
 * Return the array of indexes associated with the given table to be vacuumed.
 */

~~~

parallel_vacuum_get_nworkers_table:
parallel_vacuum_get_nworkers_index:

1.11.
+/* Return the number of workers for parallel table vacuum */
+int
+parallel_vacuum_get_nworkers_table(ParallelVacuumState *pvs)
+{
+ return pvs->shared->nworkers_for_table;
+}
+
+/* Return the number of workers for parallel index processing */
+int
+parallel_vacuum_get_nworkers_index(ParallelVacuumState *pvs)
+{
+ return pvs->shared->nworkers_for_index;
+}
+

Are these functions needed? AFAICT, they are called only from macros
where it seems just as easy to reference the pvs fields directly.

~~~

parallel_vacuum_process_table:

1.12.
+/*
+ * A parallel worker invokes table-AM specified vacuum scan callback.
+ */
+static void
+parallel_vacuum_process_table(ParallelVacuumState *pvs)
+{
+ Assert(VacuumActiveNWorkers);

Maybe here also we should Assert(pvs->shared->do_vacuum_table_scan);
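
i.e. (untested):

Assert(VacuumActiveNWorkers);
Assert(pvs->shared->do_vacuum_table_scan);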

~~~

1.13.
- /* Process indexes to perform vacuum/cleanup */
- parallel_vacuum_process_safe_indexes(&pvs);
+ if (pvs.shared->do_vacuum_table_scan)
+ {
+ parallel_vacuum_process_table(&pvs);
+ }
+ else
+ {
+ ErrorContextCallback errcallback;
+
+ /* Setup error traceback support for ereport() */
+ errcallback.callback = parallel_vacuum_error_callback;
+ errcallback.arg = &pvs;
+ errcallback.previous = error_context_stack;
+ error_context_stack = &errcallback;
+
+ /* Process indexes to perform vacuum/cleanup */
+ parallel_vacuum_process_safe_indexes(&pvs);
+
+ /* Pop the error context stack */
+ error_context_stack = errcallback.previous;
+ }

There are still some functions following this code (like
'shm_toc_lookup') that could potentially raise ERRORs. But now the
error_context_stack is getting assigned/reset earlier than was
previously the case. Is that going to be a potential problem?

======
src/include/access/tableam.h

1.14.
+ /*
+ * Compute the amount of DSM space AM need in the parallel table vacuum.
+ *

Maybe reword this comment to be more like table_parallelscan_estimate.

SUGGESTION
Estimate the size of shared memory that the parallel table vacuum needs for AM.


~~~

1.15.
+/*
+ * Estimate the size of shared memory needed for a parallel vacuum scan of this
+ * of this relation.
+ */
+static inline void
+table_parallel_vacuum_estimate(Relation rel, ParallelContext *pcxt,
int nworkers,
+    void *state)
+{
+ rel->rd_tableam->parallel_vacuum_estimate(rel, pcxt, nworkers, state);
+}
+
+/*
+ * Initialize shared memory area for a parallel vacuum scan of this relation.
+ */
+static inline void
+table_parallel_vacuum_initialize(Relation rel, ParallelContext *pcxt,
int nworkers,
+ void *state)
+{
+ rel->rd_tableam->parallel_vacuum_initialize(rel, pcxt, nworkers, state);
+}
+
+/*
+ * Start a parallel vacuum scan of this relation.
+ */
+static inline void
+table_parallel_vacuum_scan(Relation rel, ParallelVacuumState *pvs,
+    ParallelWorkerContext *pwcxt)
+{
+ rel->rd_tableam->parallel_vacuum_scan_worker(rel, pvs, pwcxt);
+}
+

All of the "Callbacks for parallel table vacuum." had comments saying
"Not called if parallel table vacuum is disabled.". So, IIUC that
means all of these table_parallel_vacuum_XXX functions (other than the
compute_workers one) could have Assert(nworkers > 0); just to
double-check that is true.
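
For example, for the first wrapper (untested):

static inline void
table_parallel_vacuum_estimate(Relation rel, ParallelContext *pcxt, int nworkers,
                               void *state)
{
    Assert(nworkers > 0);
    rel->rd_tableam->parallel_vacuum_estimate(rel, pcxt, nworkers, state);
}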

~~~

table_paralle_vacuum_compute_workers:

1.16.
+static inline int
+table_paralle_vacuum_compute_workers(Relation rel, int requested)
+{
+ return rel->rd_tableam->parallel_vacuum_compute_workers(rel, requested);
+}

Typo in function name. /paralle/parallel/

======
Kind Regards,
Peter Smith.
Fujitsu Australia



Re: Parallel heap vacuum

From
Peter Smith
Date:
Hi Sawada-San.

FWIW, here are the remainder of my review comments for the patch v4-0001

======
src/backend/access/heap/vacuumlazy.c

lazy_scan_heap:

2.1.

+ /*
+ * Do the actual work. If parallel heap vacuum is active, we scan and
+ * vacuum heap with parallel workers.
+ */

/with/using/

~~~

2.2.
+ if (ParallelHeapVacuumIsActive(vacrel))
+ do_parallel_lazy_scan_heap(vacrel);
+ else
+ do_lazy_scan_heap(vacrel);

The do_lazy_scan_heap() returns a boolean and according to that
function comment it should always be true if it is not using the
parallel heap scan. So should we get the function return value here
and Assert that it is true?
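
e.g. something like this (untested):

if (ParallelHeapVacuumIsActive(vacrel))
    do_parallel_lazy_scan_heap(vacrel);
else
{
    bool        scan_done PG_USED_FOR_ASSERTS_ONLY;

    scan_done = do_lazy_scan_heap(vacrel);
    Assert(scan_done);
}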

~~~

2.3.

Start all the single-line comments with uppercase, for consistency
with existing code.

e.g.
+ /* report that everything is now scanned */

e.g
+ /* now we can compute the new value for pg_class.reltuples */

e.g.
+ /* report all blocks vacuumed */

~~~

heap_vac_scan_next_block_parallel:

2.4.
+/*
+ * A parallel scan variant of heap_vac_scan_next_block.
+ *
+ * In parallel vacuum scan, we don't use the SKIP_PAGES_THRESHOLD optimization.
+ */
+static bool
+heap_vac_scan_next_block_parallel(LVRelState *vacrel, BlockNumber *blkno,
+   bool *all_visible_according_to_vm)

The function comment should explain the return value.

~~~

2.5.
+ if ((mapbits & VISIBILITYMAP_ALL_FROZEN) == 0)
+ {
+
+ if (vacrel->aggressive)
+ break;

Unnecessary whitespace.

~~~

dead_items_alloc:

2.6.
+ /*
+ * We initialize parallel heap scan/vacuuming or index vacuuming
+ * or both based on the table size and the number of indexes. Note
+ * that only one worker can be used for an index, we invoke
+ * parallelism for index vacuuming only if there are at least two
+ * indexes on a table.
+ */
  vacrel->pvs = parallel_vacuum_init(vacrel->rel, vacrel->indrels,
     vacrel->nindexes, nworkers,
     vac_work_mem,
     vacrel->verbose ? INFO : DEBUG2,
-    vacrel->bstrategy);
+    vacrel->bstrategy, (void *) vacrel);

Is this information misplaced? Why describe "only one worker" and
"at least two indexes on a table" here? I don't see anything here
checking those conditions.

~~~

heap_parallel_vacuum_compute_workers:

2.7.
+ /*
+ * Select the number of workers based on the log of the size of the
+ * relation.  This probably needs to be a good deal more
+ * sophisticated, but we need something here for now.  Note that the
+ * upper limit of the min_parallel_table_scan_size GUC is chosen to
+ * prevent overflow here.
+ */

The "This probably needs to..." part maybe should have an "XXX" marker
in the comment which AFAIK is used to highlight current decisions and
potential for future changes.

~~~

heap_parallel_vacuum_initialize:

2.8.
There is inconsistent capitalization of the single-line comments in
this function. The same occurs in many functions in this file, but it
is just a bit more obvious in this one. Please see all the others too.

~~~

parallel_heap_complete_unfinised_scan:

2.9.
+static void
+parallel_heap_complete_unfinised_scan(LVRelState *vacrel)

TYPO in function name /unfinised/unfinished/

~~~

2.10.
+ if (!wstate->maybe_have_blocks)
+
+ continue;

Unnecessary blank line.

~~~

2.11.
+
+ /* Attache the worker's scan state and do heap scan */
+ vacrel->phvstate->myscanstate = wstate;
+ scan_done = do_lazy_scan_heap(vacrel);

/Attache/Attach/

~~~

2.12.
+ /*
+ * We don't need to gather the scan statistics here because statistics
+ * have already been accumulated the leaders statistics directly.
+ */

"have already been accumulated the leaders" -- missing word there somewhere?

~~~

do_parallel_lazy_scan_heap:

2.13.
+ /*
+ * If the heap scan paused in the middle of the table due to full of
+ * dead_items TIDs, perform a round of index and heap vacuuming.
+ */
+ if (!scan_done)
+ {
+ /* Perform a round of index and heap vacuuming */
+ vacrel->consider_bypass_optimization = false;
+ lazy_vacuum(vacrel);
+
+ /*
+ * Vacuum the Free Space Map to make newly-freed space visible on
+ * upper-level FSM pages.
+ */
+ if (vacrel->phvstate->min_blkno > vacrel->next_fsm_block_to_vacuum)
+ {
+ /*
+ * min_blkno should have already been updated when gathering
+ * statistics
+ */
+ FreeSpaceMapVacuumRange(vacrel->rel, vacrel->next_fsm_block_to_vacuum,
+ vacrel->phvstate->min_blkno + 1);
+ vacrel->next_fsm_block_to_vacuum = vacrel->phvstate->min_blkno;
+ }
+
+ /* Report that we are once again scanning the heap */
+ pgstat_progress_update_param(PROGRESS_VACUUM_PHASE,
+ PROGRESS_VACUUM_PHASE_SCAN_HEAP);
+
+ /* re-launcher workers */
+ vacrel->phvstate->nworkers_launched =
+ parallel_vacuum_table_scan_begin(vacrel->pvs);
+
+ continue;
+ }
+
+ /* We reach the end of the table */
+ break;

Instead of:

if (!scan_done)
{
   <other code ...>
   continue;
}

break;

Won't it be better to refactor like:

SUGGESTION
if (scan_done)
  break;

<other code...>

~~~

2.14.
+ /*
+ * The parallel heap vacuum finished, but it's possible that some workers
+ * have allocated blocks but not processed yet. This can happen for
+ * example when workers exit because of full of dead_items TIDs and the
+ * leader process could launch fewer workers in the next cycle.
+ */

There seem to be some missing words:

e.g. /not processed yet./not processed them yet./
e.g. /because of full of dead_items/because they are full of dead_items/

======
Kind Regards,
Peter Smith.
Fujitsu Australia



Re: Parallel heap vacuum

From
Tomas Vondra
Date:
Hi,

Thanks for working on this. I took a quick look at this today, to do
some basic review. I plan to do a bunch of testing, but that's mostly to
get a better idea of what kind of improvements to expect - the initial
results look quite nice and sensible.

A couple basic comments:

1) I really like the idea of introducing a "compute_workers" callback to
the heap AM interface. I faced a similar issue with calculating workers
for index builds, because right now plan_create_index_workers is doing
that, and the logic works for btree, but not really for brin etc. It
didn't occur to me we might make this part of the index AM ...

2) I find it a bit weird vacuumlazy.c needs to include optimizer/paths.h
because it really has nothing to do with planning / paths. I realize it
needs the min_parallel_table_scan_size, but it doesn't seem right. I
guess it's a sign this bit of code (calculating parallel workers based
on log of relation size) should live in some "shared" location.

3) The difference in naming ParallelVacuumState vs. PHVState is a bit
weird. I suggest ParallelIndexVacuumState and ParallelHeapVacuumState to
make it consistent and clear.

4) I think it would be good to have some sort of README explaining how
the parallel heap vacuum works, i.e. how it's driven by FSM. Took me a
while to realize how the workers coordinate which blocks to scan.

5) Wouldn't it be better to introduce the scan_stats (grouping some of
the fields) in a separate patch? It seems entirely independent from the
parallel part, so doing it separately would make it easier to review.
Also, maybe reference the fields through scan_stats->x, instead of
through vacrel->scan_stats->x, when there's the pointer.

6) Is it a good idea to move NewRelfrozenXID/... to the scan_stats?
AFAIK it's not a statistic, it's actually a parameter affecting the
decisions, right?

7) I find it a bit strange that heap_vac_scan_next_block() needs to
check if it's a parallel scan, and redirect to the parallel callback. I
mean, shouldn't the caller know which callback to invoke? Why should the
serial callback care about this?

8) It's not clear to me why do_lazy_scan_heap() needs to "advertise" the
current block. Can you explain?

9) I'm a bit confused why the code checks IsParallelWorker() in so many
places. Doesn't that mean the leader can't participate in the vacuum
like a regular worker?

10) I'm not quite sure I understand the comments at the end of
do_lazy_scan_heap - it says "do a cycle of vacuuming" but I guess that
means "index vacuuming", right? And then it says "pause without invoking
index and heap vacuuming" but isn't the whole point of this block to do
that cleanup so that the TidStore can be discarded? Maybe I just don't
understand how the work is divided between the leader and workers ...

11) Why does GlobalVisState need to move to snapmgr.h? If I undo this
the patch still builds fine for me.


thanks

-- 
Tomas Vondra




RE: Parallel heap vacuum

From
"Hayato Kuroda (Fujitsu)"
Date:
Dear Tomas,

> 1) I really like the idea of introducing "compute_workers" callback to
> the heap AM interface. I faced a similar issue with calculating workers
> for index builds, because right now plan_create_index_workers is doing
> that the logic works for btree, but really not brin etc. It didn't occur
> to me we might make this part of the index AM ...

+1, so let's keep the proposed style. Or can we even propose adding
the idea to the table/index access method APIs?
I've considered it a bit, and the key point seems to be which arguments
should be required.

> 4) I think it would be good to have some sort of README explaining how
> the parallel heap vacuum works, i.e. how it's driven by FSM. Took me a
> while to realize how the workers coordinate which blocks to scan.

I love the idea, it is quite helpful for reviewers like me.

Best regards,
Hayato Kuroda
FUJITSU LIMITED


Re: Parallel heap vacuum

From
Masahiko Sawada
Date:
On Mon, Dec 9, 2024 at 2:11 PM Tomas Vondra <tomas@vondra.me> wrote:
>
> Hi,
>
> Thanks for working on this. I took a quick look at this today, to do
> some basic review. I plan to do a bunch of testing, but that's mostly to
> get a better idea of what kind of improvements to expect - the initial
> results look quite nice and sensible.

Thank you for reviewing the patch!

> A couple basic comments:
>
> 1) I really like the idea of introducing "compute_workers" callback to
> the heap AM interface. I faced a similar issue with calculating workers
> for index builds, because right now plan_create_index_workers is doing
> that the logic works for btree, but really not brin etc. It didn't occur
> to me we might make this part of the index AM ...

Thanks.

>
> 2) I find it a bit weird vacuumlazy.c needs to include optimizer/paths.h
> because it really has nothing to do with planning / paths. I realize it
> needs the min_parallel_table_scan_size, but it doesn't seem right. I
> guess it's a sign this bit of code (calculating parallel workers based
> on log of relation size) should in some "shared" location.

True. The same is actually true also for vacuumparallel.c. It includes
optimizer/paths.h to use min_parallel_index_scan_size.

>
> 3) The difference in naming ParallelVacuumState vs. PHVState is a bit
> weird. I suggest ParallelIndexVacuumState and ParallelHeapVacuumState to
> make it consistent and clear.

With the patch, since ParallelVacuumState is no longer dedicated to
parallel index vacuuming, we cannot rename them in this way. Both
parallel table scanning/vacuuming and parallel index vacuuming can use
the same ParallelVacuumState instance. The heap-specific data needed
for parallel heap scanning and vacuuming are stored in PHVState.

>
> 4) I think it would be good to have some sort of README explaining how
> the parallel heap vacuum works, i.e. how it's driven by FSM. Took me a
> while to realize how the workers coordinate which blocks to scan.

+1. I will add README in the next version patch.

>
> 5) Wouldn't it be better to introduce the scan_stats (grouping some of
> the fields in a separate patch)? Seems entirely independent from the
> parallel part, so doing it separately would make it easier to review.
> Also, maybe reference the fields through scan_stats->x, instead of
> through vacrel->scan_stats->x, when there's the pointer.

Agreed.

> 6) Is it a good idea to move NewRelfrozenXID/... to the scan_stats?
> AFAIK it's not a statistic, it's actually a parameter affecting the
> decisions, right?

Right. It would be better to move them to a separate struct or somewhere.

>
> 7) I find it a bit strange that heap_vac_scan_next_block() needs to
> check if it's a parallel scan, and redirect to the parallel callback. I
> mean, shouldn't the caller know which callback to invoke? Why should the
> serial callback care about this?

do_lazy_scan_heap(), the sole caller of heap_vac_scan_next_block(), is
called in both the serial vacuum and parallel vacuum cases. I wanted to
make heap_vac_scan_next_block() work in both cases. That said, I think
it also makes sense to have do_lazy_scan_heap() call either function
depending on whether the parallel scan is enabled.

>
> 8) It's not clear to me why do_lazy_scan_heap() needs to "advertise" the
> current block. Can you explain?

The workers' current block numbers are used to calculate the minimum
block number we've scanned so far. In the serial scan case, we vacuum
the FSM of a particular block range every VACUUM_FSM_EVERY_PAGES
pages. On the other hand, in the parallel scan case, it doesn't make
sense to vacuum the FSM that way because we might not have processed
some blocks in that range. So the idea is to calculate the minimum
block number we've scanned so far and vacuum the FSM of the range of
consecutive already-scanned blocks.
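
To illustrate, the leader conceptually does something like the
following when updating the FSM vacuum range (a simplified sketch; the
per-worker scan state array and its block field are placeholders, not
the actual names in the patch):

/* Compute the minimum block number scanned so far across all workers */
BlockNumber min_blkno = MaxBlockNumber;

for (int i = 0; i < vacrel->phvstate->nworkers_launched; i++)
    min_blkno = Min(min_blkno, worker_scanstates[i].current_blkno); /* placeholders */
vacrel->phvstate->min_blkno = min_blkno;

Then the leader vacuums the FSM only up to min_blkno, as in
do_parallel_lazy_scan_heap().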

>
> 9) I'm a bit confused why the code checks IsParallelWorker() in so many
> places. Doesn't that mean the leader can't participate in the vacuum
> like a regular worker?

I used '!IsParallelWorker()' for some jobs that should be done only by
the leader process, for example checking the failsafe mode, vacuuming
the FSM, etc.

>
> 10) I'm not quite sure I understand the comments at the end of
> do_lazy_scan_heap - it says "do a cycle of vacuuming" but I guess that
> means "index vacuuming", right?

It means both index vacuuming and heap vacuuming.

> And then it says "pause without invoking
> index and heap vacuuming" but isn't the whole point of this block to do
> that cleanup so that the TidStore can be discarded? Maybe I just don't
> understand how the work is divided between the leader and workers ...

The comment needs to be updated. But what the patch does is that when
the memory usage of the shared TidStore reaches the limit, worker
processes exit after updating the shared statistics, and then the
leader invokes (parallel) index vacuuming and parallel heap vacuuming.
Since a different number of workers could be used for the parallel
heap scan, parallel index vacuuming, and parallel heap vacuuming, the
leader process waits for all workers to finish at the end of each phase.

> 11) Why does GlobalVisState need to move to snapmgr.h? If I undo this
> the patch still builds fine for me.

Oh, I might have missed something. I'll check if it's really necessary.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com



Re: Parallel heap vacuum

From
Tomas Vondra
Date:
On 12/13/24 00:04, Tomas Vondra wrote:
> ...
>
> The main difference is here:
> 
> 
> master / no parallel workers:
> 
>   pages: 0 removed, 221239 remain, 221239 scanned (100.00% of total)
> 
> 1 parallel worker:
> 
>   pages: 0 removed, 221239 remain, 10001 scanned (4.52% of total)
> 
> 
> Clearly, with parallel vacuum we scan only a tiny fraction of the pages,
> essentially just those with deleted tuples, which is ~1/20 of pages.
> That's close to the 15x speedup.
> 
> This effect is clearest without indexes, but it does affect even runs
> with indexes - having to scan the indexes makes it much less pronounced,
> though. However, these indexes are pretty massive (about the same size
> as the table) - multiple times larger than the table. Chances are it'd
> be clearer on realistic data sets.
> 
> So the question is - is this correct? And if yes, why doesn't the
> regular (serial) vacuum do that?
> 
> There's some more strange things, though. For example, how come the avg
> read rate is 0.000 MB/s?
> 
>   avg read rate: 0.000 MB/s, avg write rate: 525.533 MB/s
> 
> It scanned 10k pages, i.e. ~80MB of data in 0.15 seconds. Surely that's
> not 0.000 MB/s? I guess it's calculated from buffer misses, and all the
> pages are in shared buffers (thanks to the DELETE earlier in that session).
> 

OK, after looking into this a bit more I think the reason is rather
simple - SKIP_PAGES_THRESHOLD.

With serial runs, we end up scanning all pages, because even with an
update every 5000 tuples, that's still only ~25 pages apart, well within
the 32-page window. So we end up skipping no pages, and scan and vacuum
everything.

But parallel runs have this skipping logic disabled, or rather the logic
that switches to sequential scans if the gap is less than 32 pages.


IMHO this raises two questions:

1) Shouldn't parallel runs use SKIP_PAGES_THRESHOLD too, i.e. switch to
sequential scans if the pages are close enough? Maybe there is a reason
for this difference? Workers can reduce the difference between random
and sequential I/O, similarly to prefetching. But that just means the
workers should use a lower threshold, e.g. as

    SKIP_PAGES_THRESHOLD / nworkers

or something like that? I don't see this discussed in this thread.

2) It seems the current SKIP_PAGES_THRESHOLD is awfully high for good
storage. If I can get an order of magnitude improvement (or more than
that) by disabling the threshold, and just doing random I/O, maybe it's
time to adjust it a bit.


regards

-- 
Tomas Vondra




Re: Parallel heap vacuum

From
Masahiko Sawada
Date:
On Sat, Dec 14, 2024 at 1:24 PM Tomas Vondra <tomas@vondra.me> wrote:
>
> On 12/13/24 00:04, Tomas Vondra wrote:
> > ...
> >
> > The main difference is here:
> >
> >
> > master / no parallel workers:
> >
> >   pages: 0 removed, 221239 remain, 221239 scanned (100.00% of total)
> >
> > 1 parallel worker:
> >
> >   pages: 0 removed, 221239 remain, 10001 scanned (4.52% of total)
> >
> >
> > Clearly, with parallel vacuum we scan only a tiny fraction of the pages,
> > essentially just those with deleted tuples, which is ~1/20 of pages.
> > That's close to the 15x speedup.
> >
> > This effect is clearest without indexes, but it does affect even runs
> > with indexes - having to scan the indexes makes it much less pronounced,
> > though. However, these indexes are pretty massive (about the same size
> > as the table) - multiple times larger than the table. Chances are it'd
> > be clearer on realistic data sets.
> >
> > So the question is - is this correct? And if yes, why doesn't the
> > regular (serial) vacuum do that?
> >
> > There's some more strange things, though. For example, how come the avg
> > read rate is 0.000 MB/s?
> >
> >   avg read rate: 0.000 MB/s, avg write rate: 525.533 MB/s
> >
> > It scanned 10k pages, i.e. ~80MB of data in 0.15 seconds. Surely that's
> > not 0.000 MB/s? I guess it's calculated from buffer misses, and all the
> > pages are in shared buffers (thanks to the DELETE earlier in that session).
> >
>
> OK, after looking into this a bit more I think the reason is rather
> simple - SKIP_PAGES_THRESHOLD.
>
> With serial runs, we end up scanning all pages, because even with an
> update every 5000 tuples, that's still only ~25 pages apart, well within
> the 32-page window. So we end up skipping no pages, scan and vacuum all
> everything.
>
> But parallel runs have this skipping logic disabled, or rather the logic
> that switches to sequential scans if the gap is less than 32 pages.
>
>
> IMHO this raises two questions:
>
> 1) Shouldn't parallel runs use SKIP_PAGES_THRESHOLD too, i.e. switch to
> sequential scans is the pages are close enough. Maybe there is a reason
> for this difference? Workers can reduce the difference between random
> and sequential I/0, similarly to prefetching. But that just means the
> workers should use a lower threshold, e.g. as
>
>     SKIP_PAGES_THRESHOLD / nworkers
>
> or something like that? I don't see this discussed in this thread.

Each parallel heap scan worker allocates a chunk of blocks, which is
at most 8192 blocks, so we would need to use the SKIP_PAGES_THRESHOLD
optimization within the chunk. I agree that we need to evaluate the
differences anyway. Will do the benchmark test and share the results.

>
> 2) It seems the current SKIP_PAGES_THRESHOLD is awfully high for good
> storage. If I can get an order of magnitude improvement (or more than
> that) by disabling the threshold, and just doing random I/O, maybe
> there's time to adjust it a bit.

Yeah, you've started a thread for this so let's discuss it there.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com



Re: Parallel heap vacuum

From
Tomas Vondra
Date:

On 12/19/24 23:05, Masahiko Sawada wrote:
> On Sat, Dec 14, 2024 at 1:24 PM Tomas Vondra <tomas@vondra.me> wrote:
>>
>> On 12/13/24 00:04, Tomas Vondra wrote:
>>> ...
>>>
>>> The main difference is here:
>>>
>>>
>>> master / no parallel workers:
>>>
>>>   pages: 0 removed, 221239 remain, 221239 scanned (100.00% of total)
>>>
>>> 1 parallel worker:
>>>
>>>   pages: 0 removed, 221239 remain, 10001 scanned (4.52% of total)
>>>
>>>
>>> Clearly, with parallel vacuum we scan only a tiny fraction of the pages,
>>> essentially just those with deleted tuples, which is ~1/20 of pages.
>>> That's close to the 15x speedup.
>>>
>>> This effect is clearest without indexes, but it does affect even runs
>>> with indexes - having to scan the indexes makes it much less pronounced,
>>> though. However, these indexes are pretty massive (about the same size
>>> as the table) - multiple times larger than the table. Chances are it'd
>>> be clearer on realistic data sets.
>>>
>>> So the question is - is this correct? And if yes, why doesn't the
>>> regular (serial) vacuum do that?
>>>
>>> There's some more strange things, though. For example, how come the avg
>>> read rate is 0.000 MB/s?
>>>
>>>   avg read rate: 0.000 MB/s, avg write rate: 525.533 MB/s
>>>
>>> It scanned 10k pages, i.e. ~80MB of data in 0.15 seconds. Surely that's
>>> not 0.000 MB/s? I guess it's calculated from buffer misses, and all the
>>> pages are in shared buffers (thanks to the DELETE earlier in that session).
>>>
>>
>> OK, after looking into this a bit more I think the reason is rather
>> simple - SKIP_PAGES_THRESHOLD.
>>
>> With serial runs, we end up scanning all pages, because even with an
>> update every 5000 tuples, that's still only ~25 pages apart, well within
>> the 32-page window. So we end up skipping no pages, scan and vacuum all
>> everything.
>>
>> But parallel runs have this skipping logic disabled, or rather the logic
>> that switches to sequential scans if the gap is less than 32 pages.
>>
>>
>> IMHO this raises two questions:
>>
>> 1) Shouldn't parallel runs use SKIP_PAGES_THRESHOLD too, i.e. switch to
>> sequential scans is the pages are close enough. Maybe there is a reason
>> for this difference? Workers can reduce the difference between random
>> and sequential I/0, similarly to prefetching. But that just means the
>> workers should use a lower threshold, e.g. as
>>
>>     SKIP_PAGES_THRESHOLD / nworkers
>>
>> or something like that? I don't see this discussed in this thread.
> 
> Each parallel heap scan worker allocates a chunk of blocks which is
> 8192 blocks at maximum, so we would need to use the
> SKIP_PAGE_THRESHOLD optimization within the chunk. I agree that we
> need to evaluate the differences anyway. WIll do the benchmark test
> and share the results.
> 

Right. I don't think this really matters for small tables, and for large
tables the chunks should be fairly large (possibly up to 8192 blocks),
in which case we could apply SKIP_PAGES_THRESHOLD just like in the serial
case. There might be differences at boundaries between chunks, but that
seems like a minor / expected detail. I haven't checked if the code
would need to change, or how much.

>>
>> 2) It seems the current SKIP_PAGES_THRESHOLD is awfully high for good
>> storage. If I can get an order of magnitude improvement (or more than
>> that) by disabling the threshold, and just doing random I/O, maybe
>> there's time to adjust it a bit.
> 
> Yeah, you've started a thread for this so let's discuss it there.
> 

OK. FWIW as suggested in the other thread, it doesn't seem to be merely
a question of VACUUM performance, as not skipping pages gives vacuum the
opportunity to do cleanup that would otherwise need to happen later.

If only for this reason, I think it would be good to keep the serial and
parallel vacuum consistent.


regards

-- 
Tomas Vondra