Thread: Parallel heap vacuum
Hi all,

The parallel vacuum we have today supports only index vacuuming. Therefore, while multiple workers can work on different indexes in parallel, the heap table is always processed by a single process. I'd like to propose $subject, which enables us to have multiple workers running on a single heap table. This would be helpful to speed up vacuuming for tables without indexes or tables with INDEX_CLEANUP = off.

I've attached a PoC patch for this feature. It implements only parallel heap scans in lazyvacuum.c. We can extend this feature to support parallel heap vacuum as well, either in the future or in the same patch.

# Overall idea (for parallel heap scan in lazy vacuum)

At the beginning of vacuum, we determine how many workers to launch based on the table size, as in other parallel query operations. The number of workers is capped by max_parallel_maintenance_workers. Once we decide to use a parallel heap scan, we prepare DSM to share data among the parallel workers and the leader. The shared information includes at least the vacuum options such as aggressive, the counters collected during lazy vacuum such as scanned_pages, vacuum cutoffs such as VacuumCutoffs and GlobalVisState, and the parallel scan description.

Before starting the heap scan in lazy vacuum, we launch parallel workers, and then each worker (and the leader) processes different blocks. Each worker does HOT-pruning on pages and collects dead tuple TIDs. When adding dead tuple TIDs, workers need to hold an exclusive lock on the TidStore. At the end of the heap scan phase, workers exit and the leader waits for all workers to exit. After that, the leader process gathers the counters collected by the parallel workers and computes the oldest relfrozenxid (and relminmxid). Then, if parallel index vacuum is also enabled, we launch other parallel workers for parallel index vacuuming.

When it comes to the parallel heap scan in lazy vacuum, I think we can use the table_block_parallelscan_XXX() family.
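As a rough illustration of how the table_block_parallelscan_XXX() family divides blocks among workers, here is a self-contained C sketch. The struct and function names are made up; the real implementation keeps an atomic counter in DSM and uses adaptive chunk sizes.

```c
#include <assert.h>

/*
 * Hypothetical simplification of parallel block allocation: workers grab
 * fixed-size block ranges from a shared counter.  In PostgreSQL the counter
 * lives in DSM and is advanced atomically; a plain variable stands in here.
 */
typedef struct SharedScanState
{
    long nblocks;       /* total blocks in the heap */
    long next_block;    /* next unallocated block */
    int  chunk_size;    /* blocks handed out per request */
} SharedScanState;

/*
 * Claim the next chunk.  Returns the chunk's first block, or -1 when the
 * scan is complete; *chunk_blocks receives the chunk's length in blocks.
 */
long
scan_get_next_chunk(SharedScanState *ss, long *chunk_blocks)
{
    long start = ss->next_block;

    if (start >= ss->nblocks)
        return -1;
    *chunk_blocks = (start + ss->chunk_size <= ss->nblocks)
        ? ss->chunk_size
        : ss->nblocks - start;
    ss->next_block = start + *chunk_blocks;
    return start;
}
```

With this shape, each worker (and the leader) loops on scan_get_next_chunk() until it returns -1, so every block is processed exactly once regardless of how many workers were actually launched.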
One tricky thing we need to deal with is that if the TidStore memory usage reaches the limit, we stop the parallel scan, do index vacuum and table vacuum, and then resume the parallel scan from the previous state. In order to do that, in the patch, we store ParallelBlockTableScanWorker, the per-worker parallel scan state, in DSM so that different parallel workers can resume the scan using the same parallel scan state.

In addition, since we could end up launching fewer workers than requested, it could happen that some ParallelBlockTableScanWorker data is used only once and then never again, leaving some blocks unprocessed. To handle this case, in the patch, the leader process checks at the end of the parallel scan whether there is an uncompleted parallel scan. If so, the leader process does the scan using the workers' ParallelBlockTableScanWorker data on their behalf.

# Discussions

I'm fairly convinced of the overall design of this feature, but there are some points regarding the implementation that we need to discuss. In the patch, I extended vacuumparallel.c to support parallel table scan (and vacuum in the future). So I was required to add some table AM callbacks such as DSM size estimation, DSM initialization, actual table scans, etc. We need to verify that these APIs are appropriate. Specifically, if we want to support both parallel heap scan and parallel heap vacuum, do we want to add separate callbacks for them? It could be overkill since such a two-pass vacuum strategy is specific to the heap AM.

As another implementation idea, we might want to implement parallel heap scan/vacuum in lazyvacuum.c while minimizing changes to vacuumparallel.c. That way, we would not need to add table AM callbacks. However, we would end up with duplicated code related to parallel operation in vacuum, such as vacuum delays. Also, we might need to add some functions to share GlobalVisState among parallel workers, since GlobalVisState is a private struct.
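The pause/resume flow described above can be sketched in standalone C (all names are hypothetical; in the patch the surviving state is the ParallelBlockTableScanWorker kept in DSM): the scan position simply outlives each index-vacuum cycle.

```c
#include <assert.h>

/* Hypothetical stand-in for the per-worker scan state kept in DSM. */
typedef struct ScanPos
{
    long next_block;
} ScanPos;

/*
 * Simulates one heap-scan pass: "collect" dead TIDs per block until the
 * memory budget would be exceeded, then return so the caller can run
 * index/table vacuum and resume from the saved position.  Returns the
 * number of blocks processed in this pass.
 */
long
scan_until_budget(ScanPos *pos, long nblocks, long dead_per_block,
                  long budget, long *mem_used)
{
    long done = 0;

    while (pos->next_block < nblocks &&
           *mem_used + dead_per_block <= budget)
    {
        *mem_used += dead_per_block;    /* TidStore grows */
        pos->next_block++;
        done++;
    }
    return done;
}
```

Each time the budget is hit, the caller vacuums indexes, resets the memory accounting, and calls again with the same ScanPos; after enough passes every block has been visited exactly once.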
Other points I'm somewhat uncomfortable with, or that need to be discussed, remain in the code with XXX comments.

# Benchmark results

* Test-1: parallel heap scan on a table without indexes

I created a 20GB table, made garbage on it, and ran vacuum while changing the parallel degree:

create unlogged table test (a int) with (autovacuum_enabled = off);
insert into test select generate_series(1, 600000000); --- 20GB table
delete from test where a % 5 = 0;
vacuum (verbose, parallel 0) test;

Here are the results (total time and heap scan time):

PARALLEL 0: 21.99 s (single process)
PARALLEL 1: 11.39 s
PARALLEL 2: 8.36 s
PARALLEL 3: 6.14 s
PARALLEL 4: 5.08 s

* Test-2: parallel heap scan on a table with one index

I used a table similar to test case 1 but created one btree index on it:

create unlogged table test (a int) with (autovacuum_enabled = off);
insert into test select generate_series(1, 600000000); --- 20GB table
create index on test (a);
delete from test where a % 5 = 0;
vacuum (verbose, parallel 0) test;

I've measured the total execution time as well as the time of each vacuum phase (from left: heap scan time, index vacuum time, and heap vacuum time):

PARALLEL 0: 45.11 s (21.89, 16.74, 6.48)
PARALLEL 1: 42.13 s (12.75, 22.04, 7.23)
PARALLEL 2: 39.27 s (8.93, 22.78, 7.45)
PARALLEL 3: 36.53 s (6.76, 22.00, 7.65)
PARALLEL 4: 35.84 s (5.85, 22.04, 7.83)

Overall, I can see that the parallel heap scan in lazy vacuum has decent scalability; in both test-1 and test-2, the execution time of the heap scan got ~4x faster with 4 parallel workers. On the other hand, when it comes to the total vacuum execution time, I could not see much performance improvement in test-2 (45.11 vs. 35.84). Looking at the results of PARALLEL 0 vs. PARALLEL 1 in test-2, the heap scan got faster (21.89 vs. 12.75) whereas index vacuum got slower (16.74 vs. 22.04), and the heap scan in test-2 was not as fast as in test-1 with 1 parallel worker (12.75 vs. 11.39).
I think the reason is that the shared TidStore is not very scalable, since we have a single lock on it. In all cases in test-1, we don't use the shared TidStore since all dead tuples are removed during heap pruning, so the scalability was better overall than in test-2. In the parallel 0 case in test-2, we use the local TidStore, and from a parallel degree of 1 in test-2, we use the shared TidStore and parallel workers concurrently update it. Also, I guess that the lookup performance of the local TidStore is better than the shared TidStore's because of the differences between a bump context and a DSA area. I think this difference contributed to the fact that index vacuuming got slower (16.74 vs. 22.04).

There are two obvious ideas to improve the overall vacuum execution time: (1) improve the shared TidStore's scalability and (2) support parallel heap vacuum. For (1), several ideas are proposed by the ART authors[1]. I've not tried these ideas but they might be applicable to our ART implementation. But I prefer to start with (2) since it would be easier. Feedback is very welcome.

Regards,

[1] https://db.in.tum.de/~leis/papers/artsync.pdf

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com
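For context, one of the techniques proposed in the cited ART synchronization paper[1] is optimistic lock coupling: readers validate a per-node version counter instead of taking a lock, and retry if a writer changed the node in the meantime. A minimal single-threaded C sketch of just the validation protocol (not our radix tree code; a real implementation needs atomics and memory barriers):

```c
#include <assert.h>
#include <stdbool.h>

/* A node protected by a version counter; writers bump it on each update. */
typedef struct OptNode
{
    unsigned long version;  /* even = unlocked, odd = write in progress */
    long value;
} OptNode;

void
opt_write(OptNode *n, long v)
{
    n->version++;   /* becomes odd: concurrent readers must retry */
    n->value = v;
    n->version++;   /* becomes even again, at a new version */
}

/* Optimistic read: succeeds only if no writer intervened. */
bool
opt_read(const OptNode *n, long *out)
{
    unsigned long v = n->version;

    if (v & 1)                  /* write in progress, retry later */
        return false;
    *out = n->value;
    return n->version == v;     /* validate: version unchanged? */
}
```

Readers loop on opt_read() until it returns true, which avoids lock acquisition entirely on the (common) uncontended read path.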
Attachment
On Fri, Jun 28, 2024 at 9:44 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
> # Benchmark results
>
> * Test-1: parallel heap scan on the table without indexes
>
> I created 20GB table, made garbage on the table, and run vacuum while
> changing parallel degree:
>
> create unlogged table test (a int) with (autovacuum_enabled = off);
> insert into test select generate_series(1, 600000000); --- 20GB table
> delete from test where a % 5 = 0;
> vacuum (verbose, parallel 0) test;
>
> Here are the results (total time and heap scan time):
>
> PARALLEL 0: 21.99 s (single process)
> PARALLEL 1: 11.39 s
> PARALLEL 2: 8.36 s
> PARALLEL 3: 6.14 s
> PARALLEL 4: 5.08 s
>
> * Test-2: parallel heap scan on the table with one index
>
> I used a similar table to the test case 1 but created one btree index on it:
>
> create unlogged table test (a int) with (autovacuum_enabled = off);
> insert into test select generate_series(1, 600000000); --- 20GB table
> create index on test (a);
> delete from test where a % 5 = 0;
> vacuum (verbose, parallel 0) test;
>
> I've measured the total execution time as well as the time of each
> vacuum phase (from left heap scan time, index vacuum time, and heap
> vacuum time):
>
> PARALLEL 0: 45.11 s (21.89, 16.74, 6.48)
> PARALLEL 1: 42.13 s (12.75, 22.04, 7.23)
> PARALLEL 2: 39.27 s (8.93, 22.78, 7.45)
> PARALLEL 3: 36.53 s (6.76, 22.00, 7.65)
> PARALLEL 4: 35.84 s (5.85, 22.04, 7.83)
>
> Overall, I can see the parallel heap scan in lazy vacuum has a decent
> scalability; In both test-1 and test-2, the execution time of heap
> scan got ~4x faster with 4 parallel workers. On the other hand, when
> it comes to the total vacuum execution time, I could not see much
> performance improvement in test-2 (45.11 vs. 35.84). Looking at the
> results PARALLEL 0 vs. PARALLEL 1 in test-2, the heap scan got faster
> (21.89 vs. 12.75) whereas index vacuum got slower (16.74 vs.
> 22.04),
> and heap scan in case 2 was not as fast as in case 1 with 1 parallel
> worker (12.75 vs. 11.39).
>
> I think the reason is the shared TidStore is not very scalable since
> we have a single lock on it. In all cases in the test-1, we don't use
> the shared TidStore since all dead tuples are removed during heap
> pruning. So the scalability was better overall than in test-2. In
> parallel 0 case in test-2, we use the local TidStore, and from
> parallel degree of 1 in test-2, we use the shared TidStore and
> parallel worker concurrently update it. Also, I guess that the lookup
> performance of the local TidStore is better than the shared TidStore's
> lookup performance because of the differences between a bump context
> and an DSA area. I think that this difference contributed the fact
> that index vacuuming got slower (16.74 vs. 22.04).
>
> There are two obvious improvement ideas to improve overall vacuum
> execution time: (1) improve the shared TidStore scalability and (2)
> support parallel heap vacuum. For (1), several ideas are proposed by
> the ART authors[1]. I've not tried these ideas but it might be
> applicable to our ART implementation. But I prefer to start with (2)
> since it would be easier. Feedback is very welcome.
>

Starting with (2) sounds like a reasonable approach. We should study a few more things, like (a) the performance results where there are 3-4 indexes, and (b) what the reason is for the performance improvement seen with only heap scans. We normally get the benefits of parallelism from using multiple CPUs, but parallelizing scans (I/O) shouldn't give much benefit. Is it possible that you are seeing benefits because most of the data is either in shared_buffers or in memory? We can probably try vacuuming tables after restarting the nodes to ensure the data is not in memory.

--
With Regards,
Amit Kapila.
Dear Sawada-san,

> The parallel vacuum we have today supports only for index vacuuming.
> Therefore, while multiple workers can work on different indexes in
> parallel, the heap table is always processed by the single process.
> I'd like to propose $subject, which enables us to have multiple
> workers running on the single heap table. This would be helpful to
> speedup vacuuming for tables without indexes or tables with
> INDEX_CLENAUP = off.

Sounds great. IIUC, vacuuming is still one of the main weak points of Postgres.

> I've attached a PoC patch for this feature. It implements only
> parallel heap scans in lazyvacum. We can extend this feature to
> support parallel heap vacuum as well in the future or in the same
> patch.

Before diving in deep, I tested your PoC but found an unclear point: when vacuuming is requested with parallel > 0 under almost the same workload as yours, only the first page is scanned and cleaned up.

When parallel was set to zero, I got:

```
INFO: vacuuming "postgres.public.test"
INFO: finished vacuuming "postgres.public.test": index scans: 0
pages: 0 removed, 2654868 remain, 2654868 scanned (100.00% of total)
tuples: 120000000 removed, 480000000 remain, 0 are dead but not yet removable
removable cutoff: 752, which was 0 XIDs old when operation ended
new relfrozenxid: 739, which is 1 XIDs ahead of previous value
frozen: 0 pages from table (0.00% of total) had 0 tuples frozen
index scan not needed: 0 pages from table (0.00% of total) had 0 dead item identifiers removed
avg read rate: 344.639 MB/s, avg write rate: 344.650 MB/s
buffer usage: 2655045 hits, 2655527 misses, 2655606 dirtied
WAL usage: 1 records, 1 full page images, 937 bytes
system usage: CPU: user: 39.45 s, system: 20.74 s, elapsed: 60.19 s
```

This means that all pages were surely scanned and dead tuples were removed.
However, when parallel was set to one, I got another result:

```
INFO: vacuuming "postgres.public.test"
INFO: launched 1 parallel vacuum worker for table scanning (planned: 1)
INFO: finished vacuuming "postgres.public.test": index scans: 0
pages: 0 removed, 2654868 remain, 1 scanned (0.00% of total)
tuples: 12 removed, 0 remain, 0 are dead but not yet removable
removable cutoff: 752, which was 0 XIDs old when operation ended
frozen: 0 pages from table (0.00% of total) had 0 tuples frozen
index scan not needed: 0 pages from table (0.00% of total) had 0 dead item identifiers removed
avg read rate: 92.952 MB/s, avg write rate: 0.845 MB/s
buffer usage: 96 hits, 660 misses, 6 dirtied
WAL usage: 1 records, 1 full page images, 937 bytes
system usage: CPU: user: 0.05 s, system: 0.00 s, elapsed: 0.05 s
```

It looks like only one page was scanned and 12 tuples were removed, which seems very strange to me...

The attached script emulates my test. IIUC it is almost the same as yours, except that the instance was restarted before vacuuming. Can you reproduce this and find the reason? I can provide further information as needed.

Best Regards,
Hayato Kuroda
FUJITSU LIMITED
https://www.fujitsu.com/
Attachment
On Fri, Jul 5, 2024 at 6:51 PM Hayato Kuroda (Fujitsu) <kuroda.hayato@fujitsu.com> wrote:
>
> Dear Sawada-san,
>
> > The parallel vacuum we have today supports only for index vacuuming.
> > Therefore, while multiple workers can work on different indexes in
> > parallel, the heap table is always processed by the single process.
> > I'd like to propose $subject, which enables us to have multiple
> > workers running on the single heap table. This would be helpful to
> > speedup vacuuming for tables without indexes or tables with
> > INDEX_CLENAUP = off.
>
> Sounds great. IIUC, vacuuming is still one of the main weak point of postgres.
>
> > I've attached a PoC patch for this feature. It implements only
> > parallel heap scans in lazyvacum. We can extend this feature to
> > support parallel heap vacuum as well in the future or in the same
> > patch.
>
> Before diving into deep, I tested your PoC but found unclear point.
> When the vacuuming is requested with parallel > 0 with almost the same workload
> as yours, only the first page was scanned and cleaned up.
>
> When parallel was set to zero, I got:
> ```
> INFO: vacuuming "postgres.public.test"
> INFO: finished vacuuming "postgres.public.test": index scans: 0
> pages: 0 removed, 2654868 remain, 2654868 scanned (100.00% of total)
> tuples: 120000000 removed, 480000000 remain, 0 are dead but not yet removable
> removable cutoff: 752, which was 0 XIDs old when operation ended
> new relfrozenxid: 739, which is 1 XIDs ahead of previous value
> frozen: 0 pages from table (0.00% of total) had 0 tuples frozen
> index scan not needed: 0 pages from table (0.00% of total) had 0 dead item identifiers removed
> avg read rate: 344.639 MB/s, avg write rate: 344.650 MB/s
> buffer usage: 2655045 hits, 2655527 misses, 2655606 dirtied
> WAL usage: 1 records, 1 full page images, 937 bytes
> system usage: CPU: user: 39.45 s, system: 20.74 s, elapsed: 60.19 s
> ```
>
> This meant that all pages were surely scanned and dead tuples were removed.
> However, when parallel was set to one, I got another result:
>
> ```
> INFO: vacuuming "postgres.public.test"
> INFO: launched 1 parallel vacuum worker for table scanning (planned: 1)
> INFO: finished vacuuming "postgres.public.test": index scans: 0
> pages: 0 removed, 2654868 remain, 1 scanned (0.00% of total)
> tuples: 12 removed, 0 remain, 0 are dead but not yet removable
> removable cutoff: 752, which was 0 XIDs old when operation ended
> frozen: 0 pages from table (0.00% of total) had 0 tuples frozen
> index scan not needed: 0 pages from table (0.00% of total) had 0 dead item identifiers removed
> avg read rate: 92.952 MB/s, avg write rate: 0.845 MB/s
> buffer usage: 96 hits, 660 misses, 6 dirtied
> WAL usage: 1 records, 1 full page images, 937 bytes
> system usage: CPU: user: 0.05 s, system: 0.00 s, elapsed: 0.05 s
> ```
>
> It looked like that only a page was scanned and 12 tuples were removed.
> It looks very strange for me...
>
> Attached script emulate my test.
> IIUC it was almost the same as yours, but
> the instance was restarted before vacuuming.
>
> Can you reproduce and see the reason? Based on the requirement I can
> provide further information.

Thank you for the test!

I could reproduce this issue and it's a bug; it skipped even non-all-visible pages. I've attached the new version of the patch.

BTW, since we compute the number of parallel workers for the heap scan based on the table size, it's possible that we launch multiple workers even if most blocks are all-visible. It seems better to calculate it based on (relpages - relallvisible).

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com
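A sketch of that calculation, assuming a made-up pages-per-worker threshold (the actual patch caps the count by max_parallel_maintenance_workers; the policy below is purely illustrative):

```c
#include <assert.h>

#define PAGES_PER_WORKER 8192   /* hypothetical threshold */

/*
 * Pick a heap-scan worker count from the number of pages vacuum actually
 * has to visit, i.e. (relpages - relallvisible), capped by
 * max_parallel_maintenance_workers.
 */
int
compute_heap_scan_workers(long relpages, long relallvisible, int max_workers)
{
    long pages = relpages - relallvisible;
    int  workers = 0;

    while (workers < max_workers &&
           pages >= (long) PAGES_PER_WORKER * (workers + 1))
        workers++;
    return workers;
}
```

With this shape, a large table whose pages are mostly all-visible gets few or no workers even though relpages is large.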
Attachment
On Fri, Jun 28, 2024 at 9:06 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Fri, Jun 28, 2024 at 9:44 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> >
> > # Benchmark results
> >
> > * Test-1: parallel heap scan on the table without indexes
> >
> > I created 20GB table, made garbage on the table, and run vacuum while
> > changing parallel degree:
> >
> > create unlogged table test (a int) with (autovacuum_enabled = off);
> > insert into test select generate_series(1, 600000000); --- 20GB table
> > delete from test where a % 5 = 0;
> > vacuum (verbose, parallel 0) test;
> >
> > Here are the results (total time and heap scan time):
> >
> > PARALLEL 0: 21.99 s (single process)
> > PARALLEL 1: 11.39 s
> > PARALLEL 2: 8.36 s
> > PARALLEL 3: 6.14 s
> > PARALLEL 4: 5.08 s
> >
> > * Test-2: parallel heap scan on the table with one index
> >
> > I used a similar table to the test case 1 but created one btree index on it:
> >
> > create unlogged table test (a int) with (autovacuum_enabled = off);
> > insert into test select generate_series(1, 600000000); --- 20GB table
> > create index on test (a);
> > delete from test where a % 5 = 0;
> > vacuum (verbose, parallel 0) test;
> >
> > I've measured the total execution time as well as the time of each
> > vacuum phase (from left heap scan time, index vacuum time, and heap
> > vacuum time):
> >
> > PARALLEL 0: 45.11 s (21.89, 16.74, 6.48)
> > PARALLEL 1: 42.13 s (12.75, 22.04, 7.23)
> > PARALLEL 2: 39.27 s (8.93, 22.78, 7.45)
> > PARALLEL 3: 36.53 s (6.76, 22.00, 7.65)
> > PARALLEL 4: 35.84 s (5.85, 22.04, 7.83)
> >
> > Overall, I can see the parallel heap scan in lazy vacuum has a decent
> > scalability; In both test-1 and test-2, the execution time of heap
> > scan got ~4x faster with 4 parallel workers. On the other hand, when
> > it comes to the total vacuum execution time, I could not see much
> > performance improvement in test-2 (45.11 vs. 35.84). Looking at the
> > results PARALLEL 0 vs.
> > PARALLEL 1 in test-2, the heap scan got faster
> > (21.89 vs. 12.75) whereas index vacuum got slower (16.74 vs. 22.04),
> > and heap scan in case 2 was not as fast as in case 1 with 1 parallel
> > worker (12.75 vs. 11.39).
> >
> > I think the reason is the shared TidStore is not very scalable since
> > we have a single lock on it. In all cases in the test-1, we don't use
> > the shared TidStore since all dead tuples are removed during heap
> > pruning. So the scalability was better overall than in test-2. In
> > parallel 0 case in test-2, we use the local TidStore, and from
> > parallel degree of 1 in test-2, we use the shared TidStore and
> > parallel worker concurrently update it. Also, I guess that the lookup
> > performance of the local TidStore is better than the shared TidStore's
> > lookup performance because of the differences between a bump context
> > and an DSA area. I think that this difference contributed the fact
> > that index vacuuming got slower (16.74 vs. 22.04).
> >

Thank you for the comments!

> > There are two obvious improvement ideas to improve overall vacuum
> > execution time: (1) improve the shared TidStore scalability and (2)
> > support parallel heap vacuum. For (1), several ideas are proposed by
> > the ART authors[1]. I've not tried these ideas but it might be
> > applicable to our ART implementation. But I prefer to start with (2)
> > since it would be easier. Feedback is very welcome.
> >
>
> Starting with (2) sounds like a reasonable approach.
> We should study a
> few more things like (a) the performance results where there are 3-4
> indexes,

Here are the results with 4 indexes (and restarting the server before the benchmark):

PARALLEL 0: 115.48 s (32.76, 64.46, 18.24)
PARALLEL 1: 74.88 s (17.11, 44.43, 13.25)
PARALLEL 2: 71.15 s (14.13, 44.82, 12.12)
PARALLEL 3: 46.78 s (10.74, 24.50, 11.43)
PARALLEL 4: 46.42 s (8.95, 24.96, 12.39)
(launched 4 workers for heap scan and 3 workers for index vacuum)

> (b) What is the reason for performance improvement seen with
> only heap scans. We normally get benefits of parallelism because of
> using multiple CPUs but parallelizing scans (I/O) shouldn't give much
> benefits. Is it possible that you are seeing benefits because most of
> the data is either in shared_buffers or in memory? We can probably try
> vacuuming tables by restarting the nodes to ensure the data is not in
> memory.

I think it depends on the storage performance. FYI I use an EC2 instance (m6id.metal). I've run the same benchmark script (table with no index) while restarting the server before executing the vacuum, and here are the results:

PARALLEL 0: 32.75 s
PARALLEL 1: 17.46 s
PARALLEL 2: 13.41 s
PARALLEL 3: 10.31 s
PARALLEL 4: 8.48 s

For the above two tests, I used the updated patch that I just submitted[1].

Regards,

[1] https://www.postgresql.org/message-id/CAD21AoAWHHnCg9OvtoEJnnvCc-3isyOyAGn%2B2KYoSXEv%3DvXauw%40mail.gmail.com

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com
Dear Sawada-san,

> Thank you for the test!
>
> I could reproduce this issue and it's a bug; it skipped even
> non-all-visible pages. I've attached the new version patch.
>
> BTW since we compute the number of parallel workers for the heap scan
> based on the table size, it's possible that we launch multiple workers
> even if most blocks are all-visible. It seems to be better if we
> calculate it based on (relpages - relallvisible).

Thanks for updating the patch. I applied it and confirmed that all pages are scanned. I used almost the same script (just changed max_parallel_maintenance_workers) and got the results below. I think the tendency is the same as yours.

```
parallel 0: 61114.369 ms
parallel 1: 34870.316 ms
parallel 2: 23598.115 ms
parallel 3: 17732.363 ms
parallel 4: 15203.271 ms
parallel 5: 13836.025 ms
```

I've started to read your code, but it's taking a lot of time because I've never seen it before... The part below contains my initial comments.

1.
This patch cannot be built when debug mode is enabled. See [1]. IIUC, this is because NewRelminMxid was moved from struct LVRelState to PHVShared, so you should update it like "vacrel->counters->NewRelminMxid".

2.
I compared this with parallel heap scan and found that it does not have a compute_worker API. Can you clarify the reason for this inconsistency? (I feel it is intentional because the calculation logic seems to depend on the heap structure, so should we add the API for table scan as well?)

[1]:
```
vacuumlazy.c: In function ‘lazy_scan_prune’:
vacuumlazy.c:1666:34: error: ‘LVRelState’ {aka ‘struct LVRelState’} has no member named ‘NewRelminMxid’
  Assert(MultiXactIdIsValid(vacrel->NewRelminMxid));
                                  ^~
....
```

Best regards,
Hayato Kuroda
FUJITSU LIMITED
On Thu, Jul 25, 2024 at 2:58 AM Hayato Kuroda (Fujitsu) <kuroda.hayato@fujitsu.com> wrote:
>
> Dear Sawada-san,
>
> > Thank you for the test!
> >
> > I could reproduce this issue and it's a bug; it skipped even
> > non-all-visible pages. I've attached the new version patch.
> >
> > BTW since we compute the number of parallel workers for the heap scan
> > based on the table size, it's possible that we launch multiple workers
> > even if most blocks are all-visible. It seems to be better if we
> > calculate it based on (relpages - relallvisible).
>
> Thanks for updating the patch. I applied and confirmed all pages are scanned.
> I used almost the same script (just changed max_parallel_maintenance_workers)
> and got below result. I think the tendency was the same as yours.
>
> ```
> parallel 0: 61114.369 ms
> parallel 1: 34870.316 ms
> parallel 2: 23598.115 ms
> parallel 3: 17732.363 ms
> parallel 4: 15203.271 ms
> parallel 5: 13836.025 ms
> ```

Thank you for testing!

> I started to read your codes but takes much time because I've never seen before...
> Below part contains initial comments.
>
> 1.
> This patch cannot be built when debug mode is enabled. See [1].
> IIUC, this was because NewRelminMxid was moved from struct LVRelState to PHVShared.
> So you should update like " vacrel->counters->NewRelminMxid".

Right, will fix.

> 2.
> I compared parallel heap scan and found that it does not have compute_worker API.
> Can you clarify the reason why there is an inconsistency?
> (I feel it is intentional because the calculation logic seems to depend on the heap structure,
> so should we add the API for table scan as well?)

There is room to consider a better API design, but yes, the reason is that the calculation logic depends on table AM implementation.
For example, I thought it might make sense to consider taking the number of all-visible pages into account for the calculation of the number of parallel workers, as we don't want to launch many workers on a table where most pages are all-visible. That might not work for other table AMs.

I'm updating the patch to implement parallel heap vacuum and will share the updated patch. It might take time, as it requires implementing shared iteration support in the radix tree.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com
Dear Sawada-san,

> Thank you for testing!

I tried to profile the vacuuming with a larger case (40 workers for the 20GB table), and the attached FlameGraph shows the result. AFAICS, I cannot find any bottlenecks.

> > 2.
> > I compared parallel heap scan and found that it does not have compute_worker
> API.
> > Can you clarify the reason why there is an inconsistency?
> > (I feel it is intentional because the calculation logic seems to depend on the
> heap structure,
> > so should we add the API for table scan as well?)
>
> There is room to consider a better API design, but yes, the reason is
> that the calculation logic depends on table AM implementation. For
> example, I thought it might make sense to consider taking the number
> of all-visible pages into account for the calculation of the number of
> parallel workers as we don't want to launch many workers on the table
> where most pages are all-visible. Which might not work for other table
> AMs.
>

Okay, thanks for confirming. I wanted to ask others as well.

> I'm updating the patch to implement parallel heap vacuum and will
> share the updated patch. It might take time as it requires to
> implement shared iteration support in radx tree.

Here are other preliminary comments for the v2 patch. This does not contain cosmetic ones.

1.
The shared data structure PHVShared does not contain a mutex lock. Is that intentional because its fields are accessed by the leader only after the parallel workers exit?

2.
Per my understanding, the vacuuming goes through the steps below.

a. parallel workers are launched for scanning pages
b. the leader waits until the scans are done
c. the leader does vacuum alone (you may extend here...)
d. parallel workers are launched again to clean up indexes

If so, can we reuse the parallel workers for the cleanup? Or is this more painful engineering than it's worth?

3.
According to LaunchParallelWorkers(), the bgw_name and bgw_type are hardcoded as "parallel worker ...". Can we extend this to improve trackability in pg_stat_activity?

4.
I'm not an expert on TidStore, but as you said, TidStoreLockExclusive() might be a bottleneck when a TID is added to the shared TidStore. Another primitive idea of mine is to prepare per-worker TidStores (in the PHVScanWorkerState or LVRelCounters?) and gather them after the heap scan. If you extend the patch so that parallel workers also do the vacuuming, the gathering may not be needed: each worker can access its own TidStore and clean up. One downside is that the memory consumption may be quite large. What do you think?

Best regards,
Hayato Kuroda
FUJITSU LIMITED
Attachment
Sorry for the very late reply.

On Tue, Jul 30, 2024 at 8:54 PM Hayato Kuroda (Fujitsu) <kuroda.hayato@fujitsu.com> wrote:
>
> Dear Sawada-san,
>
> > Thank you for testing!
>
> I tried to profile the vacuuming with the larger case (40 workers for the 20G table)
> and attached FlameGraph showed the result. IIUC, I cannot find bottlenecks.
>
> > > 2.
> > > I compared parallel heap scan and found that it does not have compute_worker
> > API.
> > > Can you clarify the reason why there is an inconsistency?
> > > (I feel it is intentional because the calculation logic seems to depend on the
> > heap structure,
> > > so should we add the API for table scan as well?)
> >
> > There is room to consider a better API design, but yes, the reason is
> > that the calculation logic depends on table AM implementation. For
> > example, I thought it might make sense to consider taking the number
> > of all-visible pages into account for the calculation of the number of
> > parallel workers as we don't want to launch many workers on the table
> > where most pages are all-visible. Which might not work for other table
> > AMs.
> >
>
> Okay, thanks for confirming. I wanted to ask others as well.
>
> > I'm updating the patch to implement parallel heap vacuum and will
> > share the updated patch. It might take time as it requires to
> > implement shared iteration support in radx tree.
>
> Here are other preliminary comments for v2 patch. This does not contain
> cosmetic ones.
>
> 1.
> Shared data structure PHVShared does not contain the mutex lock. Is it intentional
> because they are accessed by leader only after parallel workers exit?
>

Yes, the fields in PHVShared are read-only for workers. Since no concurrent reads/writes happen on these fields, we don't need to protect them.

> 2.
> Per my understanding, the vacuuming goes like below steps.
>
> a. paralell workers are launched for scanning pages
> b. leader waits until scans are done
> c.
> leader does vacuum alone (you may extend here...)
> d. parallel workers are launched again to cleanup indeces
>
> If so, can we reuse parallel workers for the cleanup? Or, this is painful
> engineering than the benefit?

I hadn't thought of this idea, but I think it's possible from a technical perspective. It saves the overhead of relaunching workers, but I'm not sure how much it would improve performance, and I'm concerned it would make the code complex. For example, different numbers of workers might be required for table vacuuming and index vacuuming, so we would end up increasing or decreasing the number of workers.

> 3.
> According to LaunchParallelWorkers(), the bgw_name and bgw_type are hardcoded as
> "parallel worker ..." Can we extend this to improve the trackability in the
> pg_stat_activity?

It would be a good improvement for better trackability, but I think we should do that in a separate patch, as it's not just a problem for parallel heap vacuum.

> 4.
> I'm not the expert TidStore, but as you said TidStoreLockExclusive() might be a
> bottleneck when tid is added to the shared TidStore. My another primitive idea
> is that to prepare per-worker TidStores (in the PHVScanWorkerState or LVRelCounters?)
> and gather after the heap scanning. If you extend like parallel workers do vacuuming,
> the gathering may not be needed: they can access own TidStore and clean up.
> One downside is that the memory consumption may be quite large.

Interesting idea. Supposing we support parallel heap vacuum as well, we wouldn't need locks or shared-iteration support on TidStore. I think each worker should use a fraction of maintenance_work_mem. However, one downside would be that we need to check as many TidStores as there are workers during index vacuuming.

FYI, I've implemented the parallel heap vacuum part and am doing some benchmark tests. I'll share the updated patches along with the test results this week.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com
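To make that downside concrete, here is a standalone C sketch (names made up; linear search stands in for real TidStore lookups) of what index vacuum's dead-TID check would look like with per-worker stores: each probe has to consult every worker's store instead of one shared TidStore.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-worker store of dead-tuple TIDs (simplified to longs). */
typedef struct WorkerTidStore
{
    const long *tids;   /* dead-tuple TIDs collected by one worker */
    size_t      ntids;
} WorkerTidStore;

bool
store_contains(const WorkerTidStore *s, long tid)
{
    /* linear scan for clarity; a real store has a radix-tree lookup */
    for (size_t i = 0; i < s->ntids; i++)
        if (s->tids[i] == tid)
            return true;
    return false;
}

/* Each lookup now costs one probe per worker instead of a single probe. */
bool
tid_is_dead(const WorkerTidStore *stores, int nworkers, long tid)
{
    for (int w = 0; w < nworkers; w++)
        if (store_contains(&stores[w], tid))
            return true;
    return false;
}
```

The trade-off: no lock contention while the stores are filled during the heap scan, at the cost of O(nworkers) probes per index tuple during index vacuum.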
Dear Sawada-san,

> I've attached new version patches that fixes failures reported by
> cfbot. I hope these changes make cfbot happy.

Thanks for updating the patch, and sorry for the delayed reply. I confirmed that cfbot for Linux/Windows said ok. I'm still learning the feature, so I can post only one comment :-(.

I wanted to know whether TidStoreBeginIterateShared() was needed. IIUC, the pre-existing API, TidStoreBeginIterate(), already accepts a shared TidStore. The only difference is whether the elog(ERROR) exists, but I wonder if it benefits others. Is there another reason that lazy_vacuum_heap_rel() uses TidStoreBeginIterateShared()? Another approach is to restrict TidStoreBeginIterate() to support only the local one. What do you think?

Best regards,
Hayato Kuroda
FUJITSU LIMITED
On Mon, Nov 11, 2024 at 5:08 AM Hayato Kuroda (Fujitsu) <kuroda.hayato@fujitsu.com> wrote: > > Dear Sawda-san, > > > > > I've attached new version patches that fixes failures reported by > > cfbot. I hope these changes make cfbot happy. > > Thanks for updating the patch and sorry for delaying the reply. I confirmed cfbot > for Linux/Windows said ok. > I'm still learning the feature so I can post only one comment :-(. > > I wanted to know whether TidStoreBeginIterateShared() was needed. IIUC, pre-existing API, > TidStoreBeginIterate(), has already accepted the shared TidStore. The only difference > is whether elog(ERROR) exists, but I wonder if it benefits others. Is there another > reason that lazy_vacuum_heap_rel() uses TidStoreBeginIterateShared()? TidStoreBeginIterateShared() is designed for multiple parallel workers to iterate a shared TidStore. During an iteration, parallel workers share the iteration state and iterate the underlying radix tree while taking appropriate locks. Therefore, it's available only for a shared TidStore. This is required to implement the parallel heap vacuum, where multiple parallel workers do the iteration on the shared TidStore. On the other hand, TidStoreBeginIterate() is designed for a single process to iterate a TidStore. It accepts even a shared TidStore as you mentioned, but during an iteration there is no inter-process coordination such as locking. When it comes to parallel vacuum, supporting TidStoreBeginIterate() on a shared TidStore is necessary to cover the case where we use only parallel index vacuum but not parallel heap scan/vacuum. In this case, we need to store dead tuple TIDs on the shared TidStore during heap scan so parallel workers can use it during index vacuum. But it's not necessary to use TidStoreBeginIterateShared() because only one (leader) process does heap vacuum. Regards, -- Masahiko Sawada Amazon Web Services: https://aws.amazon.com
On Wed, 30 Oct 2024 at 22:48, Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > > I've attached new version patches that fixes failures reported by > cfbot. I hope these changes make cfbot happy. > I just started reviewing the patch and found the following comments while going through the patch: 1) I felt we should add some documentation for this at [1]. 2) Can we add some tests in vacuum_parallel with tables having no indexes and having dead tuples. 3) This should be included in typedefs.list: 3.a) +/* + * Relation statistics collected during heap scanning and need to be shared among + * parallel vacuum workers. + */ +typedef struct LVRelScanStats +{ + BlockNumber scanned_pages; /* # pages examined (not skipped via VM) */ + BlockNumber removed_pages; /* # pages removed by relation truncation */ + BlockNumber frozen_pages; /* # pages with newly frozen tuples */ 3.b) Similarly this too: +/* + * Struct for information that need to be shared among parallel vacuum workers + */ +typedef struct PHVShared +{ + bool aggressive; + bool skipwithvm; + 3.c) Similarly this too: +/* Per-worker scan state for parallel heap vacuum scan */ +typedef struct PHVScanWorkerState +{ + bool initialized; 3.d) Similarly this too: +/* Struct for parallel heap vacuum */ +typedef struct PHVState +{ + /* Parallel scan description shared among parallel workers */ 4) Since we are initializing almost all the members of structure, should we use palloc0 in this case: + scan_stats = palloc(sizeof(LVRelScanStats)); + scan_stats->scanned_pages = 0; + scan_stats->removed_pages = 0; + scan_stats->frozen_pages = 0; + scan_stats->lpdead_item_pages = 0; + scan_stats->missed_dead_pages = 0; + scan_stats->nonempty_pages = 0; + + /* Initialize remaining counters (be tidy) */ + scan_stats->tuples_deleted = 0; + scan_stats->tuples_frozen = 0; + scan_stats->lpdead_items = 0; + scan_stats->live_tuples = 0; + scan_stats->recently_dead_tuples = 0; + scan_stats->missed_dead_tuples = 0; 5) typo paralle should be 
parallel +/* + * Return the number of parallel workers for a parallel vacuum scan of this + * relation. + */ +static inline int +table_paralle_vacuum_compute_workers(Relation rel, int requested) +{ + return rel->rd_tableam->parallel_vacuum_compute_workers(rel, requested); +} [1] - https://www.postgresql.org/docs/devel/sql-vacuum.html Regards, Vignesh
Dear Sawada-san,

> TidStoreBeginIterateShared() is designed for multiple parallel workers
> to iterate a shared TidStore. During an iteration, parallel workers
> share the iteration state and iterate the underlying radix tree while
> taking appropriate locks. Therefore, it's available only for a shared
> TidStore. This is required to implement the parallel heap vacuum,
> where multiple parallel workers do the iteration on the shared
> TidStore.
>
> On the other hand, TidStoreBeginIterate() is designed for a single
> process to iterate a TidStore. It accepts even a shared TidStore as
> you mentioned, but during an iteration there is no inter-process
> coordination such as locking. When it comes to parallel vacuum,
> supporting TidStoreBeginIterate() on a shared TidStore is necessary to
> cover the case where we use only parallel index vacuum but not
> parallel heap scan/vacuum. In this case, we need to store dead tuple
> TIDs on the shared TidStore during heap scan so parallel workers can
> use it during index vacuum. But it's not necessary to use
> TidStoreBeginIterateShared() because only one (leader) process does
> heap vacuum.

Okay, thanks for the description. I felt it was OK to keep it.

I read 0001 again and here are some comments.

01. vacuumlazy.c
```
+#define LV_PARALLEL_SCAN_SHARED			0xFFFF0001
+#define LV_PARALLEL_SCAN_DESC			0xFFFF0002
+#define LV_PARALLEL_SCAN_DESC_WORKER	0xFFFF0003
```

I checked other DSM keys used for parallel work, and they seem to have names like PARALLEL_KEY_XXX. Can we follow that convention?

02. LVRelState
```
+	BlockNumber next_fsm_block_to_vacuum;
```

Only this attribute does not have a comment. Can we add something like: "Next freespace map page to be checked"?

03. parallel_heap_vacuum_gather_scan_stats
```
+		vacrel->scan_stats->vacuumed_pages += ss->vacuumed_pages;
```

Note that `scan_stats->vacuumed_pages` does not exist in 0001; it is defined in 0004. Can you move it?

04.
heap_parallel_vacuum_estimate
```
+
+	heap_parallel_estimate_shared_memory_size(rel, nworkers, &pscan_len,
+											  &shared_len, &pscanwork_len);
+
+	/* space for PHVShared */
+	shm_toc_estimate_chunk(&pcxt->estimator, shared_len);
+	shm_toc_estimate_keys(&pcxt->estimator, 1);
+
+	/* space for ParallelBlockTableScanDesc */
+	pscan_len = table_block_parallelscan_estimate(rel);
+	shm_toc_estimate_chunk(&pcxt->estimator, pscan_len);
+	shm_toc_estimate_keys(&pcxt->estimator, 1);
+
+	/* space for per-worker scan state, PHVScanWorkerState */
+	pscanwork_len = mul_size(sizeof(PHVScanWorkerState), nworkers);
+	shm_toc_estimate_chunk(&pcxt->estimator, pscanwork_len);
+	shm_toc_estimate_keys(&pcxt->estimator, 1);
```

pscan_len and pscanwork_len are already calculated in heap_parallel_estimate_shared_memory_size(). Can we remove the table_block_parallelscan_estimate() and mul_size() calls from here?

05. Idea

Can you update the documentation?

06. Idea

AFAICS pg_stat_progress_vacuum does not contain information related to parallel execution. What do you think about adding an attribute which shows a list of PIDs? I'm not sure it would be helpful for users, but it could show the parallelism.

Best regards,
Hayato Kuroda
FUJITSU LIMITED
On Wed, Nov 13, 2024 at 3:10 AM Hayato Kuroda (Fujitsu) <kuroda.hayato@fujitsu.com> wrote: > > Dear Sawada-san, > > > TidStoreBeginIterateShared() is designed for multiple parallel workers > > to iterate a shared TidStore. During an iteration, parallel workers > > share the iteration state and iterate the underlying radix tree while > > taking appropriate locks. Therefore, it's available only for a shared > > TidStore. This is required to implement the parallel heap vacuum, > > where multiple parallel workers do the iteration on the shared > > TidStore. > > > > On the other hand, TidStoreBeginIterate() is designed for a single > > process to iterate a TidStore. It accepts even a shared TidStore as > > you mentioned, but during an iteration there is no inter-process > > coordination such as locking. When it comes to parallel vacuum, > > supporting TidStoreBeginIterate() on a shared TidStore is necessary to > > cover the case where we use only parallel index vacuum but not > > parallel heap scan/vacuum. In this case, we need to store dead tuple > > TIDs on the shared TidStore during heap scan so parallel workers can > > use it during index vacuum. But it's not necessary to use > > TidStoreBeginIterateShared() because only one (leader) process does > > heap vacuum. > > Okay, thanks for the description. I felt it is OK to keep. > > I read 0001 again and here are comments. Thank you for the review comments! > > 01. vacuumlazy.c > ``` > +#define LV_PARALLEL_SCAN_SHARED 0xFFFF0001 > +#define LV_PARALLEL_SCAN_DESC 0xFFFF0002 > +#define LV_PARALLEL_SCAN_DESC_WORKER 0xFFFF0003 > ``` > > I checked other DMS keys used for parallel work, and they seems to have name > like PARALEL_KEY_XXX. Can we follow it? Yes. How about LV_PARALLEL_KEY_XXX? > > 02. LVRelState > ``` > + BlockNumber next_fsm_block_to_vacuum; > ``` > > Only the attribute does not have comments Can we add like: > "Next freespace map page to be checked"? Agreed. 
I'll add a comment "next block to check for FSM vacuum".

> 03. parallel_heap_vacuum_gather_scan_stats
> ```
> +		vacrel->scan_stats->vacuumed_pages += ss->vacuumed_pages;
> ```
>
> Note that `scan_stats->vacuumed_pages` does not exist in 0001, it is defined
> in 0004. Can you move it?

Will remove.

> 04. heap_parallel_vacuum_estimate
> ```
> +
> +	heap_parallel_estimate_shared_memory_size(rel, nworkers, &pscan_len,
> +											  &shared_len, &pscanwork_len);
> +
> +	/* space for PHVShared */
> +	shm_toc_estimate_chunk(&pcxt->estimator, shared_len);
> +	shm_toc_estimate_keys(&pcxt->estimator, 1);
> +
> +	/* space for ParallelBlockTableScanDesc */
> +	pscan_len = table_block_parallelscan_estimate(rel);
> +	shm_toc_estimate_chunk(&pcxt->estimator, pscan_len);
> +	shm_toc_estimate_keys(&pcxt->estimator, 1);
> +
> +	/* space for per-worker scan state, PHVScanWorkerState */
> +	pscanwork_len = mul_size(sizeof(PHVScanWorkerState), nworkers);
> +	shm_toc_estimate_chunk(&pcxt->estimator, pscanwork_len);
> +	shm_toc_estimate_keys(&pcxt->estimator, 1);
> ```
>
> I feel pscan_len and pscanwork_len are calclated in heap_parallel_estimate_shared_memory_size().
> Can we remove table_block_parallelscan_estimate() and mul_size() from here?

Yes, it's an oversight. Will remove.

> 05. Idea
>
> Can you update documentations?

Will update the doc as well.

> 06. Idea
>
> AFAICS pg_stat_progress_vacuum does not contain information related with the
> parallel execution. How do you think adding an attribute which shows a list of pids?
> Not sure it is helpful for users but it can show the parallelism.
I think it's possible to show the parallelism even today (for parallel index vacuuming):

=# select leader.pid, leader.query, array_agg(worker.pid)
   from pg_stat_activity as leader,
        pg_stat_activity as worker,
        pg_stat_progress_vacuum as v
   where leader.pid = worker.leader_pid and leader.pid = v.pid
   group by 1, 2;
   pid   |        query        |     array_agg
---------+---------------------+-------------------
 2952103 | vacuum (verbose) t; | {2952257,2952258}
(1 row)

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com
On Tue, Nov 12, 2024 at 3:21 AM vignesh C <vignesh21@gmail.com> wrote: > > On Wed, 30 Oct 2024 at 22:48, Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > > > > > I've attached new version patches that fixes failures reported by > > cfbot. I hope these changes make cfbot happy. > > > > I just started reviewing the patch and found the following comments > while going through the patch: > 1) I felt we should add some documentation for this at [1]. > > 2) Can we add some tests in vacuum_parallel with tables having no > indexes and having dead tuples. > > 3) This should be included in typedefs.list: > 3.a) > +/* > + * Relation statistics collected during heap scanning and need to be > shared among > + * parallel vacuum workers. > + */ > +typedef struct LVRelScanStats > +{ > + BlockNumber scanned_pages; /* # pages examined (not > skipped via VM) */ > + BlockNumber removed_pages; /* # pages removed by relation > truncation */ > + BlockNumber frozen_pages; /* # pages with newly frozen tuples */ > > 3.b) Similarly this too: > +/* > + * Struct for information that need to be shared among parallel vacuum workers > + */ > +typedef struct PHVShared > +{ > + bool aggressive; > + bool skipwithvm; > + > > 3.c) Similarly this too: > +/* Per-worker scan state for parallel heap vacuum scan */ > +typedef struct PHVScanWorkerState > +{ > + bool initialized; > > 3.d) Similarly this too: > +/* Struct for parallel heap vacuum */ > +typedef struct PHVState > +{ > + /* Parallel scan description shared among parallel workers */ > > 4) Since we are initializing almost all the members of structure, > should we use palloc0 in this case: > + scan_stats = palloc(sizeof(LVRelScanStats)); > + scan_stats->scanned_pages = 0; > + scan_stats->removed_pages = 0; > + scan_stats->frozen_pages = 0; > + scan_stats->lpdead_item_pages = 0; > + scan_stats->missed_dead_pages = 0; > + scan_stats->nonempty_pages = 0; > + > + /* Initialize remaining counters (be tidy) */ > + scan_stats->tuples_deleted = 0; > + 
scan_stats->tuples_frozen = 0; > + scan_stats->lpdead_items = 0; > + scan_stats->live_tuples = 0; > + scan_stats->recently_dead_tuples = 0; > + scan_stats->missed_dead_tuples = 0; > > 5) typo paralle should be parallel > +/* > + * Return the number of parallel workers for a parallel vacuum scan of this > + * relation. > + */ > +static inline int > +table_paralle_vacuum_compute_workers(Relation rel, int requested) > +{ > + return rel->rd_tableam->parallel_vacuum_compute_workers(rel, requested); > +} > Thank you for the comments! I'll address these comments in the next version patch. BTW while updating the patch, I found that we might want to launch different numbers of workers for scanning heap and vacuuming heap. The number of parallel workers is determined based on the number of blocks in the table. However, even if this number is high, it could happen that we want to launch fewer workers to vacuum heap pages when there are not many pages having garbage. And the number of workers for vacuuming heap could vary on each vacuum pass. I'm considering implementing it. Regards, -- Masahiko Sawada Amazon Web Services: https://aws.amazon.com
Dear Sawada-san,

> BTW while updating the patch, I found that we might want to launch
> different numbers of workers for scanning heap and vacuuming heap. The
> number of parallel workers is determined based on the number of blocks
> in the table. However, even if this number is high, it could happen
> that we want to launch fewer workers to vacuum heap pages when there
> are not many pages having garbage. And the number of workers for
> vacuuming heap could vary on each vacuum pass. I'm considering
> implementing it.

Just to clarify: this idea looks good to me. I imagine you will add new table AM APIs like parallel_vacuum_compute_workers_for_scanning and parallel_vacuum_compute_workers_for_vacuuming. If other table AM developers want to use the same number of workers as for scanning, they can point both callbacks at the same function. Is that right?

Best regards,
Hayato Kuroda
FUJITSU LIMITED
Hi Sawada-San,

FYI, the patch 0001 fails to build stand-alone:

vacuumlazy.c: In function ‘parallel_heap_vacuum_gather_scan_stats’:
vacuumlazy.c:3739:21: error: ‘LVRelScanStats’ has no member named ‘vacuumed_pages’
   vacrel->scan_stats->vacuumed_pages += ss->vacuumed_pages;
                     ^
vacuumlazy.c:3739:43: error: ‘LVRelScanStats’ has no member named ‘vacuumed_pages’
   vacrel->scan_stats->vacuumed_pages += ss->vacuumed_pages;
                                           ^
make[4]: *** [vacuumlazy.o] Error 1

It appears to be using a struct field which is not even introduced until patch 0004 of the patch set.

======
Kind Regards,
Peter Smith.
Fujitsu Australia.
Hi Sawada-San, I started to look at patch v4-0001 in this thread. It is quite a big patch so this is a WIP, and these below are just the comments I have so far. ====== src/backend/access/heap/vacuumlazy.c 1.1. +/* + * Relation statistics collected during heap scanning and need to be shared among + * parallel vacuum workers. + */ +typedef struct LVRelScanStats The comment wording is not quite right. /Relation statistics collected during heap scanning/Relation statistics that are collected during heap scanning/ ~~~ 1.2 +/* + * Struct for information that need to be shared among parallel vacuum workers + */ +typedef struct PHVShared The comment wording is not quite right. /that need to be shared/that needs to be shared/ ~~~ 1.3. +/* Struct for parallel heap vacuum */ +typedef struct PHVState +{ + /* Parallel scan description shared among parallel workers */ + ParallelBlockTableScanDesc pscandesc; + + /* Shared information */ + PHVShared *shared; If 'pscandesc' is described as 'shared among parallel workers', should that field be within 'PHVShared' instead? ~~~ 1.4. /* Initialize page counters explicitly (be tidy) */ - vacrel->scanned_pages = 0; - vacrel->removed_pages = 0; - vacrel->frozen_pages = 0; - vacrel->lpdead_item_pages = 0; - vacrel->missed_dead_pages = 0; - vacrel->nonempty_pages = 0; - /* dead_items_alloc allocates vacrel->dead_items later on */ + scan_stats = palloc(sizeof(LVRelScanStats)); + scan_stats->scanned_pages = 0; + scan_stats->removed_pages = 0; + scan_stats->frozen_pages = 0; + scan_stats->lpdead_item_pages = 0; + scan_stats->missed_dead_pages = 0; + scan_stats->nonempty_pages = 0; + + /* Initialize remaining counters (be tidy) */ + scan_stats->tuples_deleted = 0; + scan_stats->tuples_frozen = 0; + scan_stats->lpdead_items = 0; + scan_stats->live_tuples = 0; + scan_stats->recently_dead_tuples = 0; + scan_stats->missed_dead_tuples = 0; + + vacrel->scan_stats = scan_stats; 1.4a. 
Or, maybe just palloc0 this and provide a comment to say all counters have been zapped to 0. ~ 1.4b. Maybe you don't need that 'scan_stats' variable; just assign the palloc0 directly to the field instead. ~~~ 1.5. - vacrel->missed_dead_tuples = 0; + /* dead_items_alloc allocates vacrel->dead_items later on */ The patch seems to have moved this "dead_items_alloc" comment to now be below the "Allocate/initialize output statistics state" stuff (??). ====== src/backend/commands/vacuumparallel.c parallel_vacuum_init: 1.6. int parallel_workers = 0; + int nworkers_table; + int nworkers_index; The local vars and function params are named like this (here and in other functions). But the struct field names say 'nworkers_for_XXX' e.g. shared->nworkers_for_table = nworkers_table; shared->nworkers_for_index = nworkers_index; It may be better to use these 'nworkers_for_table' and 'nworkers_for_index' names consistently everywhere. ~~~ parallel_vacuum_compute_workers: 1.7. - int parallel_workers; + int parallel_workers_table = 0; + int parallel_workers_index = 0; + + *nworkers_table = 0; + *nworkers_index = 0; The local variables 'parallel_workers_table' and 'parallel_workers_index; are hardly needed because AFAICT those results can be directly assigned to *nworkers_table and *nworkers_index. ~~~ parallel_vacuum_process_all_indexes: 1.8. /* Reinitialize parallel context to relaunch parallel workers */ - if (num_index_scans > 0) + if (num_index_scans > 0 || pvs->num_table_scans > 0) ReinitializeParallelDSM(pvs->pcxt); I don't know if it is feasible or even makes sense to change, but somehow it seemed strange that the 'num_index_scans' field is not co-located with the 'num_table_scans' in the ParallelVacuumState. If this is doable, then lots of functions also would no longer need to pass 'num_index_scans' since they are already passing 'pvs'. ~~~ parallel_vacuum_table_scan_begin: 1.9. 
+ (errmsg(ngettext("launched %d parallel vacuum worker for table processing (planned: %d)", + "launched %d parallel vacuum workers for table processing (planned: %d)", + pvs->pcxt->nworkers_launched), Isn't this the same as errmsg_plural? ~~~ 1.10. +/* Return the array of indexes associated to the given table to be vacuumed */ +Relation * +parallel_vacuum_get_table_indexes(ParallelVacuumState *pvs, int *nindexes) Even though the function comment can fit on one line it is nicer to use a block-style comment with a period, like below. It then will be consistent with other function comments (e.g. parallel_vacuum_table_scan_end, parallel_vacuum_process_table, etc). There are multiple places that this review comment can apply to. (also typo /associated to/associated with/) SUGGESTION /* * Return the array of indexes associated with the given table to be vacuumed. */ ~~~ parallel_vacuum_get_nworkers_table: parallel_vacuum_get_nworkers_index: 1.11. +/* Return the number of workers for parallel table vacuum */ +int +parallel_vacuum_get_nworkers_table(ParallelVacuumState *pvs) +{ + return pvs->shared->nworkers_for_table; +} + +/* Return the number of workers for parallel index processing */ +int +parallel_vacuum_get_nworkers_index(ParallelVacuumState *pvs) +{ + return pvs->shared->nworkers_for_index; +} + Are these functions needed? AFAICT, they are called only from macros where it seems just as easy to reference the pvs fields directly. ~~~ parallel_vacuum_process_table: 1.12. +/* + * A parallel worker invokes table-AM specified vacuum scan callback. + */ +static void +parallel_vacuum_process_table(ParallelVacuumState *pvs) +{ + Assert(VacuumActiveNWorkers); Maybe here also we should Assert(pvs.shared->do_vacuum_table_scan); ~~~ 1.13. 
- /* Process indexes to perform vacuum/cleanup */ - parallel_vacuum_process_safe_indexes(&pvs); + if (pvs.shared->do_vacuum_table_scan) + { + parallel_vacuum_process_table(&pvs); + } + else + { + ErrorContextCallback errcallback; + + /* Setup error traceback support for ereport() */ + errcallback.callback = parallel_vacuum_error_callback; + errcallback.arg = &pvs; + errcallback.previous = error_context_stack; + error_context_stack = &errcallback; + + /* Process indexes to perform vacuum/cleanup */ + parallel_vacuum_process_safe_indexes(&pvs); + + /* Pop the error context stack */ + error_context_stack = errcallback.previous; + } There are still some functions following this code (like 'shm_toc_lookup') that could potentially raise ERRORs. But, now the error_context_stack is getting assigned/reset earlier than previously was the case. Is that going to be a potential problem? ====== src/include/access/tableam.h 1.14. + /* + * Compute the amount of DSM space AM need in the parallel table vacuum. + * Maybe reword this comment to be more like table_parallelscan_estimate. SUGGESTION Estimate the size of shared memory that the parallel table vacuum needs for AM. ~~~ 1.15. +/* + * Estimate the size of shared memory needed for a parallel vacuum scan of this + * of this relation. + */ +static inline void +table_parallel_vacuum_estimate(Relation rel, ParallelContext *pcxt, int nworkers, + void *state) +{ + rel->rd_tableam->parallel_vacuum_estimate(rel, pcxt, nworkers, state); +} + +/* + * Initialize shared memory area for a parallel vacuum scan of this relation. + */ +static inline void +table_parallel_vacuum_initialize(Relation rel, ParallelContext *pcxt, int nworkers, + void *state) +{ + rel->rd_tableam->parallel_vacuum_initialize(rel, pcxt, nworkers, state); +} + +/* + * Start a parallel vacuum scan of this relation. 
+ */ +static inline void +table_parallel_vacuum_scan(Relation rel, ParallelVacuumState *pvs, + ParallelWorkerContext *pwcxt) +{ + rel->rd_tableam->parallel_vacuum_scan_worker(rel, pvs, pwcxt); +} + All of the "Callbacks for parallel table vacuum." had comments saying "Not called if parallel table vacuum is disabled.". So, IIUC that means all of these table_parallel_vacuum_XXX functions (other than the compute_workers one) could have Assert(nworkers > 0); just to double-check that is true. ~~~ table_paralle_vacuum_compute_workers: 1.16. +static inline int +table_paralle_vacuum_compute_workers(Relation rel, int requested) +{ + return rel->rd_tableam->parallel_vacuum_compute_workers(rel, requested); +} Typo in function name. /paralle/parallel/ ====== Kind Regards, Peter Smith. Fujitsu Australia
Hi Sawada-San.

FWIW, here is the remainder of my review comments for the patch v4-0001.

======
src/backend/access/heap/vacuumlazy.c

lazy_scan_heap:

2.1.
+	/*
+	 * Do the actual work. If parallel heap vacuum is active, we scan and
+	 * vacuum heap with parallel workers.
+	 */

/with/using/

~~~

2.2.
+	if (ParallelHeapVacuumIsActive(vacrel))
+		do_parallel_lazy_scan_heap(vacrel);
+	else
+		do_lazy_scan_heap(vacrel);

do_lazy_scan_heap() returns a boolean, and according to that function's comment it should always be true when not using the parallel heap scan. So should we get the function's return value here and Assert that it is true?

~~~

2.3.
Start uppercase even for all the single-line comments, for consistency with existing code.

e.g.
+	/* report that everything is now scanned */

e.g.
+	/* now we can compute the new value for pg_class.reltuples */

e.g.
+	/* report all blocks vacuumed */

~~~

heap_vac_scan_next_block_parallel:

2.4.
+/*
+ * A parallel scan variant of heap_vac_scan_next_block.
+ *
+ * In parallel vacuum scan, we don't use the SKIP_PAGES_THRESHOLD optimization.
+ */
+static bool
+heap_vac_scan_next_block_parallel(LVRelState *vacrel, BlockNumber *blkno,
+								  bool *all_visible_according_to_vm)

The function comment should explain the return value.

~~~

2.5.
+		if ((mapbits & VISIBILITYMAP_ALL_FROZEN) == 0)
+		{
+
+			if (vacrel->aggressive)
+				break;

Unnecessary whitespace.

~~~

dead_items_alloc:

2.6.
+		/*
+		 * We initialize parallel heap scan/vacuuming or index vacuuming
+		 * or both based on the table size and the number of indexes. Note
+		 * that only one worker can be used for an index, we invoke
+		 * parallelism for index vacuuming only if there are at least two
+		 * indexes on a table.
+		 */
		vacrel->pvs = parallel_vacuum_init(vacrel->rel, vacrel->indrels,
										   vacrel->nindexes, nworkers,
										   vac_work_mem,
										   vacrel->verbose ? INFO : DEBUG2,
-										   vacrel->bstrategy);
+										   vacrel->bstrategy, (void *) vacrel);

Is this information misplaced?
Why describe here "only one worker" and "at least two indexes on a table" I don't see anything here checking those conditions. ~~~ heap_parallel_vacuum_compute_workers: 2.7. + /* + * Select the number of workers based on the log of the size of the + * relation. This probably needs to be a good deal more + * sophisticated, but we need something here for now. Note that the + * upper limit of the min_parallel_table_scan_size GUC is chosen to + * prevent overflow here. + */ The "This probably needs to..." part maybe should have an "XXX" marker in the comment which AFAIK is used to highlight current decisions and potential for future changes. ~~~ heap_parallel_vacuum_initialize: 2.8. There is inconsistent capitalization of the single-line comments in this function. The same occurs in many functions in this file. but it is just a bit more obvious in this one. Please see all the others too. ~~~ parallel_heap_complete_unfinised_scan: 2.9. +static void +parallel_heap_complete_unfinised_scan(LVRelState *vacrel) TYPO in function name /unfinised/unfinished/ ~~~ 2.10. + if (!wstate->maybe_have_blocks) + + continue; Unnecessary blank line. ~~~ 2.11. + + /* Attache the worker's scan state and do heap scan */ + vacrel->phvstate->myscanstate = wstate; + scan_done = do_lazy_scan_heap(vacrel); /Attache/Attach/ ~~~ 2.12. + /* + * We don't need to gather the scan statistics here because statistics + * have already been accumulated the leaders statistics directly. + */ "have already been accumulated the leaders" -- missing word there somewhere? ~~~ do_parallel_lazy_scan_heap: 2.13. + /* + * If the heap scan paused in the middle of the table due to full of + * dead_items TIDs, perform a round of index and heap vacuuming. + */ + if (!scan_done) + { + /* Perform a round of index and heap vacuuming */ + vacrel->consider_bypass_optimization = false; + lazy_vacuum(vacrel); + + /* + * Vacuum the Free Space Map to make newly-freed space visible on + * upper-level FSM pages. 
+ */ + if (vacrel->phvstate->min_blkno > vacrel->next_fsm_block_to_vacuum) + { + /* + * min_blkno should have already been updated when gathering + * statistics + */ + FreeSpaceMapVacuumRange(vacrel->rel, vacrel->next_fsm_block_to_vacuum, + vacrel->phvstate->min_blkno + 1); + vacrel->next_fsm_block_to_vacuum = vacrel->phvstate->min_blkno; + } + + /* Report that we are once again scanning the heap */ + pgstat_progress_update_param(PROGRESS_VACUUM_PHASE, + PROGRESS_VACUUM_PHASE_SCAN_HEAP); + + /* re-launcher workers */ + vacrel->phvstate->nworkers_launched = + parallel_vacuum_table_scan_begin(vacrel->pvs); + + continue; + } + + /* We reach the end of the table */ + break; Instead of: if (!scan_done) { <other code ...> continue; } break; Won't it be better to refactor like: SUGGESTION if (scan_done) break; <other code...> ~~~ 2.14. + /* + * The parallel heap vacuum finished, but it's possible that some workers + * have allocated blocks but not processed yet. This can happen for + * example when workers exit because of full of dead_items TIDs and the + * leader process could launch fewer workers in the next cycle. + */ There seem to be some missing words: e.g. /not processed yet./not processed them yet./ e.g. /because of full of dead_items/because they are full of dead_items/ ====== Kind Regards, Peter Smith. Fujitsu Australia
Hi,

Thanks for working on this. I took a quick look at this today, to do some basic review. I plan to do a bunch of testing, but that's mostly to get a better idea of what kind of improvements to expect; the initial results look quite nice and sensible.

A couple of basic comments:

1) I really like the idea of introducing a "compute_workers" callback to the heap AM interface. I faced a similar issue with calculating workers for index builds, because right now plan_create_index_workers is doing that; the logic works for btree, but really not for brin etc. It didn't occur to me we might make this part of the index AM ...

2) I find it a bit weird that vacuumlazy.c needs to include optimizer/paths.h, because it really has nothing to do with planning / paths. I realize it needs min_parallel_table_scan_size, but it doesn't seem right. I guess it's a sign this bit of code (calculating parallel workers based on the log of the relation size) should be in some "shared" location.

3) The difference in naming ParallelVacuumState vs. PHVState is a bit weird. I suggest ParallelIndexVacuumState and ParallelHeapVacuumState to make it consistent and clear.

4) I think it would be good to have some sort of README explaining how the parallel heap vacuum works, i.e. how it's driven by FSM. Took me a while to realize how the workers coordinate which blocks to scan.

5) Wouldn't it be better to introduce the scan_stats (grouping some of the fields) in a separate patch? It seems entirely independent from the parallel part, so doing it separately would make it easier to review. Also, maybe reference the fields through scan_stats->x, instead of through vacrel->scan_stats->x, when there's the pointer.

6) Is it a good idea to move NewRelfrozenXID/... to the scan_stats? AFAIK it's not a statistic; it's actually a parameter affecting the decisions, right?

7) I find it a bit strange that heap_vac_scan_next_block() needs to check if it's a parallel scan, and redirect to the parallel callback.
I mean, shouldn't the caller know which callback to invoke? Why should the serial callback care about this? 8) It's not clear to me why do_lazy_scan_heap() needs to "advertise" the current block. Can you explain? 9) I'm a bit confused why the code checks IsParallelWorker() in so many places. Doesn't that mean the leader can't participate in the vacuum like a regular worker? 10) I'm not quite sure I understand the comments at the end of do_lazy_scan_heap - it says "do a cycle of vacuuming" but I guess that means "index vacuuming", right? And then it says "pause without invoking index and heap vacuuming" but isn't the whole point of this block to do that cleanup so that the TidStore can be discarded? Maybe I just don't understand how the work is divided between the leader and workers ... 11) Why does GlobalVisState need to move to snapmgr.h? If I undo this the patch still builds fine for me. thanks -- Tomas Vondra
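The log-scale worker heuristic mentioned in point 2 - the one used by
compute_parallel_worker for parallel queries, which the patch borrows
for vacuum - can be sketched as follows. The function name and argument
list here are illustrative, not the proposed shared helper:

```c
#include <stdint.h>

/*
 * Sketch of the log-3 worker heuristic discussed above (modeled on
 * compute_parallel_worker): one worker once the table reaches the
 * minimum scan size, plus one more every time the size triples, capped
 * by max_parallel_maintenance_workers. Assumes min_scan_size_pages >= 1.
 */
static int
compute_heap_vacuum_workers(int64_t rel_pages,
                            int64_t min_scan_size_pages,
                            int max_workers)
{
    int     workers;
    int64_t threshold = min_scan_size_pages;

    if (rel_pages < threshold)
        return 0;               /* table too small for parallelism */

    /* one worker, plus one more each time the size triples */
    for (workers = 1; rel_pages >= threshold * 3; workers++)
        threshold *= 3;

    return workers < max_workers ? workers : max_workers;
}
```

With min_parallel_table_scan_size at its default of 1024 blocks (8MB), a
10240-block table would get three workers before the
max_parallel_maintenance_workers cap applies.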
Dear Tomas,

> 1) I really like the idea of introducing "compute_workers" callback to
> the heap AM interface. I faced a similar issue with calculating workers
> for index builds, because right now plan_create_index_workers is doing
> that the logic works for btree, but really not brin etc. It didn't occur
> to me we might make this part of the index AM ...

+1, so let's keep the proposed style. Or, can we even propose the idea
for the table/index access method API? I've considered it a bit, and the
point seems to be which arguments should be required.

> 4) I think it would be good to have some sort of README explaining how
> the parallel heap vacuum works, i.e. how it's driven by FSM. Took me a
> while to realize how the workers coordinate which blocks to scan.

I love the idea; it is quite helpful for reviewers like me.

Best regards,
Hayato Kuroda
FUJITSU LIMITED
On Mon, Dec 9, 2024 at 2:11 PM Tomas Vondra <tomas@vondra.me> wrote:
>
> Hi,
>
> Thanks for working on this. I took a quick look at this today, to do
> some basic review. I plan to do a bunch of testing, but that's mostly to
> get a better idea of what kind of improvements to expect - the initial
> results look quite nice and sensible.

Thank you for reviewing the patch!

> A couple basic comments:
>
> 1) I really like the idea of introducing "compute_workers" callback to
> the heap AM interface. I faced a similar issue with calculating workers
> for index builds, because right now plan_create_index_workers is doing
> that the logic works for btree, but really not brin etc. It didn't occur
> to me we might make this part of the index AM ...

Thanks.

> 2) I find it a bit weird vacuumlazy.c needs to include optimizer/paths.h
> because it really has nothing to do with planning / paths. I realize it
> needs the min_parallel_table_scan_size, but it doesn't seem right. I
> guess it's a sign this bit of code (calculating parallel workers based
> on log of relation size) should in some "shared" location.

True. The same is actually true also for vacuumparallel.c. It includes
optimizer/paths.h to use min_parallel_index_scan_size.

> 3) The difference in naming ParallelVacuumState vs. PHVState is a bit
> weird. I suggest ParallelIndexVacuumState and ParallelHeapVacuumState to
> make it consistent and clear.

With the patch, since ParallelVacuumState is no longer dedicated to
parallel index vacuuming, we cannot rename them in this way. Both
parallel table scanning/vacuuming and parallel index vacuuming can use
the same ParallelVacuumState instance. The heap-specific data necessary
for parallel heap scanning and vacuuming are stored in PHVState.

> 4) I think it would be good to have some sort of README explaining how
> the parallel heap vacuum works, i.e. how it's driven by FSM. Took me a
> while to realize how the workers coordinate which blocks to scan.

+1.
I will add a README in the next version of the patch.

> 5) Wouldn't it be better to introduce the scan_stats (grouping some of
> the fields in a separate patch)? Seems entirely independent from the
> parallel part, so doing it separately would make it easier to review.
> Also, maybe reference the fields through scan_stats->x, instead of
> through vacrel->scan_stats->x, when there's the pointer.

Agreed.

> 6) Is it a good idea to move NewRelfrozenXID/... to the scan_stats?
> AFAIK it's not a statistic, it's actually a parameter affecting the
> decisions, right?

Right. It would be better to move them to a separate struct or somewhere.

> 7) I find it a bit strange that heap_vac_scan_next_block() needs to
> check if it's a parallel scan, and redirect to the parallel callback. I
> mean, shouldn't the caller know which callback to invoke? Why should the
> serial callback care about this?

do_lazy_scan_heap(), the sole caller of heap_vac_scan_next_block(), is
called in both the serial vacuum and parallel vacuum cases. I wanted to
make heap_vac_scan_next_block() workable in both cases. I think it also
makes sense to have do_lazy_scan_heap() call either function depending
on whether the parallel scan is enabled.

> 8) It's not clear to me why do_lazy_scan_heap() needs to "advertise" the
> current block. Can you explain?

The workers' current block numbers are used to calculate the minimum
block number where we've scanned so far. In the serial scan case, we
vacuum the FSM of a particular block range every VACUUM_FSM_EVERY_PAGES
pages. In the parallel scan case, on the other hand, it doesn't make
sense to vacuum the FSM that way because we might not have processed
some blocks in the block range. So the idea is to calculate the minimum
block number where we've scanned so far and vacuum the FSM of the range
of consecutive already-scanned blocks.

> 9) I'm a bit confused why the code checks IsParallelWorker() in so many
> places.
> Doesn't that mean the leader can't participate in the vacuum
> like a regular worker?

I used '!IsParallelWorker()' for some jobs that should be done only by
the leader process, for example checking failsafe mode, vacuuming the
FSM, etc.

> 10) I'm not quite sure I understand the comments at the end of
> do_lazy_scan_heap - it says "do a cycle of vacuuming" but I guess that
> means "index vacuuming", right?

It means both index vacuuming and heap vacuuming.

> And then it says "pause without invoking
> index and heap vacuuming" but isn't the whole point of this block to do
> that cleanup so that the TidStore can be discarded? Maybe I just don't
> understand how the work is divided between the leader and workers ...

The comment needs to be updated. What the patch does is that when the
memory usage of the shared TidStore reaches the limit, worker processes
exit after updating the shared statistics, and then the leader invokes
(parallel) index vacuuming and parallel heap vacuuming. Since a
different number of workers could be used for the parallel heap scan,
parallel index vacuuming, and parallel heap vacuuming, the leader
process waits for all workers to finish at the end of each phase.

> 11) Why does GlobalVisState need to move to snapmgr.h? If I undo this
> the patch still builds fine for me.

Oh, I might have missed something. I'll check if it's really necessary.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com
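The "advertised block" explanation above boils down to a minimum over
the workers' positions: the leader may only vacuum the FSM up to the
smallest block any worker is still working on, since everything below
that is a fully-scanned prefix. A sketch with hypothetical names (in the
patch this state lives in DSM, and min_blkno is computed when gathering
statistics):

```c
#include <stdint.h>

typedef uint32_t BlockNumber;
#define InvalidBlockNumber ((BlockNumber) 0xFFFFFFFF)

/*
 * Illustrative sketch: each worker advertises the block it is currently
 * scanning; the leader vacuums the FSM only up to the minimum of those,
 * i.e. the end of the range of consecutive already-scanned blocks.
 * Returns InvalidBlockNumber if no worker is running.
 */
static BlockNumber
min_advertised_block(const BlockNumber *current_blocks, int nworkers)
{
    BlockNumber min_blkno = InvalidBlockNumber;

    for (int i = 0; i < nworkers; i++)
    {
        /* workers that are not running report InvalidBlockNumber */
        if (current_blocks[i] != InvalidBlockNumber &&
            current_blocks[i] < min_blkno)
            min_blkno = current_blocks[i];
    }
    return min_blkno;
}
```

So even if one worker is far ahead at block 100, a straggler at block 40
limits FreeSpaceMapVacuumRange() to the range below block 40.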
On 12/13/24 00:04, Tomas Vondra wrote:
> ...
>
> The main difference is here:
>
>
> master / no parallel workers:
>
> pages: 0 removed, 221239 remain, 221239 scanned (100.00% of total)
>
> 1 parallel worker:
>
> pages: 0 removed, 221239 remain, 10001 scanned (4.52% of total)
>
>
> Clearly, with parallel vacuum we scan only a tiny fraction of the pages,
> essentially just those with deleted tuples, which is ~1/20 of pages.
> That's close to the 15x speedup.
>
> This effect is clearest without indexes, but it does affect even runs
> with indexes - having to scan the indexes makes it much less pronounced,
> though. However, these indexes are pretty massive (about the same size
> as the table) - multiple times larger than the table. Chances are it'd
> be clearer on realistic data sets.
>
> So the question is - is this correct? And if yes, why doesn't the
> regular (serial) vacuum do that?
>
> There's some more strange things, though. For example, how come the avg
> read rate is 0.000 MB/s?
>
> avg read rate: 0.000 MB/s, avg write rate: 525.533 MB/s
>
> It scanned 10k pages, i.e. ~80MB of data in 0.15 seconds. Surely that's
> not 0.000 MB/s? I guess it's calculated from buffer misses, and all the
> pages are in shared buffers (thanks to the DELETE earlier in that session).
>

OK, after looking into this a bit more I think the reason is rather
simple - SKIP_PAGES_THRESHOLD.

With serial runs, we end up scanning all pages, because even with an
update every 5000 tuples, that's still only ~25 pages apart, well within
the 32-page window. So we end up skipping no pages, and scan and vacuum
everything.

But parallel runs have this skipping logic disabled, or rather the logic
that switches to sequential scans if the gap is less than 32 pages.

IMHO this raises two questions:

1) Shouldn't parallel runs use SKIP_PAGES_THRESHOLD too, i.e. switch to
sequential scans if the pages are close enough? Maybe there is a reason
for this difference? Workers can reduce the difference between random
and sequential I/O, similarly to prefetching. But that just means the
workers should use a lower threshold, e.g.

    SKIP_PAGES_THRESHOLD / nworkers

or something like that? I don't see this discussed in this thread.

2) It seems the current SKIP_PAGES_THRESHOLD is awfully high for good
storage. If I can get an order of magnitude improvement (or more than
that) by disabling the threshold, and just doing random I/O, maybe it's
time to adjust it a bit.

regards

--
Tomas Vondra
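The skipping rule in question amounts to a simple gap test; a sketch of
the decision (a hypothetical helper, not the actual
heap_vac_scan_next_block() code, which also handles the visibility-map
lookups):

```c
#include <stdint.h>
#include <stdbool.h>

typedef uint32_t BlockNumber;

#define SKIP_PAGES_THRESHOLD 32 /* same value as vacuumlazy.c */

/*
 * Only skip ahead when the run of skippable (all-visible) pages is at
 * least SKIP_PAGES_THRESHOLD long; otherwise read the gap sequentially.
 */
static bool
should_skip_gap(BlockNumber next_block, BlockNumber next_unskippable_block)
{
    return (next_unskippable_block - next_block) >= SKIP_PAGES_THRESHOLD;
}
```

With an update every ~25 pages the gap never reaches 32, so serial
vacuum reads every page - which is exactly the behavior observed above.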
On Sat, Dec 14, 2024 at 1:24 PM Tomas Vondra <tomas@vondra.me> wrote: > > On 12/13/24 00:04, Tomas Vondra wrote: > > ... > > > > The main difference is here: > > > > > > master / no parallel workers: > > > > pages: 0 removed, 221239 remain, 221239 scanned (100.00% of total) > > > > 1 parallel worker: > > > > pages: 0 removed, 221239 remain, 10001 scanned (4.52% of total) > > > > > > Clearly, with parallel vacuum we scan only a tiny fraction of the pages, > > essentially just those with deleted tuples, which is ~1/20 of pages. > > That's close to the 15x speedup. > > > > This effect is clearest without indexes, but it does affect even runs > > with indexes - having to scan the indexes makes it much less pronounced, > > though. However, these indexes are pretty massive (about the same size > > as the table) - multiple times larger than the table. Chances are it'd > > be clearer on realistic data sets. > > > > So the question is - is this correct? And if yes, why doesn't the > > regular (serial) vacuum do that? > > > > There's some more strange things, though. For example, how come the avg > > read rate is 0.000 MB/s? > > > > avg read rate: 0.000 MB/s, avg write rate: 525.533 MB/s > > > > It scanned 10k pages, i.e. ~80MB of data in 0.15 seconds. Surely that's > > not 0.000 MB/s? I guess it's calculated from buffer misses, and all the > > pages are in shared buffers (thanks to the DELETE earlier in that session). > > > > OK, after looking into this a bit more I think the reason is rather > simple - SKIP_PAGES_THRESHOLD. > > With serial runs, we end up scanning all pages, because even with an > update every 5000 tuples, that's still only ~25 pages apart, well within > the 32-page window. So we end up skipping no pages, scan and vacuum all > everything. > > But parallel runs have this skipping logic disabled, or rather the logic > that switches to sequential scans if the gap is less than 32 pages. 
>
>
> IMHO this raises two questions:
>
> 1) Shouldn't parallel runs use SKIP_PAGES_THRESHOLD too, i.e. switch to
> sequential scans is the pages are close enough. Maybe there is a reason
> for this difference? Workers can reduce the difference between random
> and sequential I/0, similarly to prefetching. But that just means the
> workers should use a lower threshold, e.g. as
>
> SKIP_PAGES_THRESHOLD / nworkers
>
> or something like that? I don't see this discussed in this thread.

Each parallel heap scan worker allocates a chunk of blocks, which is
8192 blocks at maximum, so we would need to use the
SKIP_PAGES_THRESHOLD optimization within the chunk. I agree that we
need to evaluate the differences anyway. Will do the benchmark test
and share the results.

>
> 2) It seems the current SKIP_PAGES_THRESHOLD is awfully high for good
> storage. If I can get an order of magnitude improvement (or more than
> that) by disabling the threshold, and just doing random I/O, maybe
> there's time to adjust it a bit.

Yeah, you've started a thread for this so let's discuss it there.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com
On 12/19/24 23:05, Masahiko Sawada wrote: > On Sat, Dec 14, 2024 at 1:24 PM Tomas Vondra <tomas@vondra.me> wrote: >> >> On 12/13/24 00:04, Tomas Vondra wrote: >>> ... >>> >>> The main difference is here: >>> >>> >>> master / no parallel workers: >>> >>> pages: 0 removed, 221239 remain, 221239 scanned (100.00% of total) >>> >>> 1 parallel worker: >>> >>> pages: 0 removed, 221239 remain, 10001 scanned (4.52% of total) >>> >>> >>> Clearly, with parallel vacuum we scan only a tiny fraction of the pages, >>> essentially just those with deleted tuples, which is ~1/20 of pages. >>> That's close to the 15x speedup. >>> >>> This effect is clearest without indexes, but it does affect even runs >>> with indexes - having to scan the indexes makes it much less pronounced, >>> though. However, these indexes are pretty massive (about the same size >>> as the table) - multiple times larger than the table. Chances are it'd >>> be clearer on realistic data sets. >>> >>> So the question is - is this correct? And if yes, why doesn't the >>> regular (serial) vacuum do that? >>> >>> There's some more strange things, though. For example, how come the avg >>> read rate is 0.000 MB/s? >>> >>> avg read rate: 0.000 MB/s, avg write rate: 525.533 MB/s >>> >>> It scanned 10k pages, i.e. ~80MB of data in 0.15 seconds. Surely that's >>> not 0.000 MB/s? I guess it's calculated from buffer misses, and all the >>> pages are in shared buffers (thanks to the DELETE earlier in that session). >>> >> >> OK, after looking into this a bit more I think the reason is rather >> simple - SKIP_PAGES_THRESHOLD. >> >> With serial runs, we end up scanning all pages, because even with an >> update every 5000 tuples, that's still only ~25 pages apart, well within >> the 32-page window. So we end up skipping no pages, scan and vacuum all >> everything. >> >> But parallel runs have this skipping logic disabled, or rather the logic >> that switches to sequential scans if the gap is less than 32 pages. 
>>
>>
>> IMHO this raises two questions:
>>
>> 1) Shouldn't parallel runs use SKIP_PAGES_THRESHOLD too, i.e. switch to
>> sequential scans is the pages are close enough. Maybe there is a reason
>> for this difference? Workers can reduce the difference between random
>> and sequential I/0, similarly to prefetching. But that just means the
>> workers should use a lower threshold, e.g. as
>>
>> SKIP_PAGES_THRESHOLD / nworkers
>>
>> or something like that? I don't see this discussed in this thread.
>
> Each parallel heap scan worker allocates a chunk of blocks which is
> 8192 blocks at maximum, so we would need to use the
> SKIP_PAGE_THRESHOLD optimization within the chunk. I agree that we
> need to evaluate the differences anyway. WIll do the benchmark test
> and share the results.
>

Right. I don't think this really matters for small tables, and for large
tables the chunks should be fairly large (possibly up to 8192 blocks),
in which case we could apply SKIP_PAGES_THRESHOLD just like in the
serial case. There might be differences at boundaries between chunks,
but that seems like a minor / expected detail. I haven't checked if the
code would need to change, or how much.

>>
>> 2) It seems the current SKIP_PAGES_THRESHOLD is awfully high for good
>> storage. If I can get an order of magnitude improvement (or more than
>> that) by disabling the threshold, and just doing random I/O, maybe
>> there's time to adjust it a bit.
>
> Yeah, you've started a thread for this so let's discuss it there.
>

OK. FWIW as suggested in the other thread, it doesn't seem to be merely
a question of VACUUM performance, as not skipping pages gives vacuum the
opportunity to do cleanup that would otherwise need to happen later. If
only for this reason, I think it would be good to keep the serial and
parallel vacuum consistent.

regards

--
Tomas Vondra
On Tue, Jan 21, 2025 at 11:05 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
> On Sun, Jan 19, 2025 at 7:50 AM Tomas Vondra <tomas@vondra.me> wrote:
> >
> > Hi,
> >
> > Thanks for the new patches. I've repeated my benchmarking on v8, and I
> > agree this looks fine - the speedups are reasonable and match what I'd
> > expect on this hardware. I don't see any suspicious results like with
> > the earlier patches, where it got much faster thanks to the absence of
> > SKIP_PAGE_THRESHOLD logic.
> >
> > Attached is the benchmarking script, CSV with raw results, and then also
> > two PDF reports comparing visualizing the impact of the patch by
> > comparing it to current master.
> >
> > * parallel-vacuum-duration.pdf - Duration of the vacuum, and duration
> > relative to master (green - faster, read - slower). The patch is clearly
> > an improvement, with speedup up to ~3x depending on the index count and
> > a fraction of updated rows.
> >
> > * parallel-vacuum-reads.pdf - Average read speed, as reported by VACUUM
> > VERBOSE. With the patch it can reach up to ~3GB/s, which is about the
> > max value possible on this hardware - so that's nice. I'll try to test
> > this on a better storage, to see how far it can go.
>
> Thank you for doing a performance benchmark. These results make sense to me.
>
> > I haven't done any actual code review on the new patches, I'll try to
> > find time for that sometime next week.
>
> Thank you!

Since we introduced the eager vacuum scan (052026c9b9), I need to update
the parallel heap vacuum patch. After thinking about how to integrate
these two features, I found some complexities. The region size used by
the eager vacuum scan and the chunk size used by the parallel table scan
are different. While the region is a fixed size, the chunk becomes
smaller as we scan the table. A chunk of the table that a parallel
vacuum worker takes could span different regions or be within one
region, and different parallel heap vacuum workers might scan the same
region. And parallel heap vacuum workers could be scanning different
regions of the table simultaneously.

During the eager vacuum scan, we reset the eager_scan_remaining_fails
counter when we start to scan a new region. So if we want to make
parallel heap vacuum behave exactly the same way as the single-process
vacuum in terms of the eager vacuum scan, we would need to have an
eager_scan_remaining_fails counter for each region so that the workers
can decrement the one corresponding to the region of the block that the
worker is scanning. But I'm concerned that it makes the logic very
complex. I'd like to avoid making the newly introduced code more complex
by adding yet another new code path on top of that.

Another idea is to disable the eager vacuum scan when parallel heap
vacuum is enabled. It might look like just avoiding difficult things,
but it could make sense in a way. The eager vacuum scan is aimed at
amortizing the aggressive vacuum by incrementally freezing pages that
would potentially be frozen by the next aggressive vacuum. On the other
hand, parallel heap vacuum is available only in manual VACUUM and would
be used to remove garbage on a large table as soon as possible or to
freeze the entire table to avoid reaching the XID limit. So I think it
might make sense to disable the eager vacuum scan when using parallel
vacuum.

Thoughts?

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com
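The per-region counter idea described above could look roughly like this
(purely a hypothetical sketch: in the patch these counters would live in
DSM, the region size is the eager scan's fixed region, and the decrement
would need to be atomic since several workers can touch the same
region):

```c
#include <stdint.h>
#include <stdbool.h>

typedef uint32_t BlockNumber;

#define EAGER_SCAN_REGION_SIZE 4096 /* blocks per region; illustrative */

/*
 * One remaining-fails counter per fixed-size region, decremented by
 * whichever worker fails to freeze a page in that region. Returns false
 * once the region's failure budget is exhausted, i.e. eager scanning
 * should stop for blocks in that region.
 */
static bool
eager_scan_try_fail(int *remaining_fails, BlockNumber blkno)
{
    int region = blkno / EAGER_SCAN_REGION_SIZE;

    if (remaining_fails[region] == 0)
        return false;           /* budget exhausted for this region */
    remaining_fails[region]--;
    return true;
}
```

The complexity Sawada-san is worried about comes from maintaining an
array like this (one slot per region) in shared memory, rather than the
single counter the serial scan resets at each region boundary.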
On Thu, Feb 13, 2025 at 5:37 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > During eager vacuum scan, we reset the eager_scan_remaining_fails > counter when we start to scan the new region. So if we want to make > parallel heap vacuum behaves exactly the same way as the > single-progress vacuum in terms of the eager vacuum scan, we would > need to have the eager_scan_remaining_fails counters for each region > so that the workers can decrement it corresponding to the region of > the block that the worker is scanning. But I'm concerned that it makes > the logic very complex. I'd like to avoid making newly introduced > codes more complex by adding yet another new code on top of that. Would it be simpler to make only phase III parallel? In other words, how much of the infrastructure and complexity needed for parallel phase I is also needed for phase III? -- John Naylor Amazon Web Services
On Thu, Feb 13, 2025 at 8:16 PM John Naylor <johncnaylorls@gmail.com> wrote:
>
> On Thu, Feb 13, 2025 at 5:37 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> >
> > During eager vacuum scan, we reset the eager_scan_remaining_fails
> > counter when we start to scan the new region. So if we want to make
> > parallel heap vacuum behaves exactly the same way as the
> > single-progress vacuum in terms of the eager vacuum scan, we would
> > need to have the eager_scan_remaining_fails counters for each region
> > so that the workers can decrement it corresponding to the region of
> > the block that the worker is scanning. But I'm concerned that it makes
> > the logic very complex. I'd like to avoid making newly introduced
> > codes more complex by adding yet another new code on top of that.
>
> Would it be simpler to make only phase III parallel?

Yes, I think so.

> In other words,
> how much of the infrastructure and complexity needed for parallel
> phase I is also needed for phase III?

Both phases need some common changes to the parallel vacuum
infrastructure so that we can launch parallel workers using
ParallelVacuumContext for phases I and III, and so that parallel vacuum
workers can do their task on the heap table. Specifically, we need to
change vacuumparallel.c to work also in the parallel heap vacuum case,
and add some table AM callbacks. Other than that, these phases have
different complexity. As for supporting parallelism for phase III, the
changes to lazy vacuum would not be very complex; it does need changes
to both the radix tree and TidStore to support shared iteration, though.
Supporting parallelism for phase I is more complex since it integrates
the parallel table scan with the lazy heap scan.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com
On Wed, Feb 12, 2025 at 5:37 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > Since we introduced the eagar vacuum scan (052026c9b9), I need to > update the parallel heap vacuum patch. After thinking of how to > integrate these two features, I find some complexities. The region > size used by eager vacuum scan and the chunk size used by parallel > table scan are different. While the region is fixed size the chunk > becomes smaller as we scan the table. A chunk of the table that a > parallel vacuum worker took could be across different regions or be > within one region, and different parallel heap vacuum workers might > scan the same region. And parallel heap vacuum workers could be > scanning different regions of the table simultaneously. Ah, I see. What are the chunk size ranges? I picked a 32 MB region size after a little testing and mostly because it seemed reasonable. I think it would be fine to use different region size. Parallel workers could just consider the chunk they get an eager scan region (unless it is too small or too large -- then it might not make sense). > During eager vacuum scan, we reset the eager_scan_remaining_fails > counter when we start to scan the new region. So if we want to make > parallel heap vacuum behaves exactly the same way as the > single-progress vacuum in terms of the eager vacuum scan, we would > need to have the eager_scan_remaining_fails counters for each region > so that the workers can decrement it corresponding to the region of > the block that the worker is scanning. But I'm concerned that it makes > the logic very complex. I'd like to avoid making newly introduced > codes more complex by adding yet another new code on top of that. I don't think it would have to behave exactly the same. I think we just don't want to add a lot of complexity or make it hard to reason about. 
Since the failure rate is defined as a percent, couldn't we just have parallel workers set eager_scan_remaining_fails when they get their chunk assignment (as a percentage of their chunk size)? (I haven't looked at the code, so maybe this doesn't make sense). For the success cap, we could have whoever hits it first disable eager scanning for all future assigned chunks. > Another idea is to disable the eager vacuum scan when parallel heap > vacuum is enabled. It might look like just avoiding difficult things > but it could make sense in a sense. The eager vacuum scan is aimed to > amortize the aggressive vacuum by incrementally freezing pages that > are potentially frozen by the next aggressive vacuum. On the other > hand, parallel heap vacuum is available only in manual VACUUM and > would be used to remove garbage on a large table as soon as possible > or to freeze the entire table to avoid reaching the XID limit. So I > think it might make sense to disable the eager vacuum scan when > parallel vacuum. Do we only do parallelism in manual vacuum because we don't want to use up too many parallel workers for a maintenance subsystem? I never really tried to find out why parallel index vacuuming is only in manual vacuum. I assume you made the same choice they did for the same reasons. If the idea is to never allow parallelism in vacuum, then I think disabling eager scanning during manual parallel vacuum seems reasonable. People could use vacuum freeze if they want more freezing. Also, if you start with only doing parallelism for the third phase of heap vacuuming (second pass over the heap), this wouldn't be a problem because eager scanning only impacts the first phase. - Melanie
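The per-chunk budget suggested above could be computed like this
(hypothetical helper; the 3% rate in the test below is just an
illustrative stand-in for the actual eager-scan failure cap, not a value
from the patch):

```c
/*
 * Scale the eager-scan failure budget to the size of the chunk a worker
 * was just assigned: remaining_fails = failure_rate * chunk_size,
 * truncated, with a floor of 1 so a worker can always attempt at least
 * one eager freeze per chunk.
 */
static int
chunk_remaining_fails(int chunk_size, double failure_rate)
{
    int fails = (int) (chunk_size * failure_rate);

    return fails > 0 ? fails : 1;
}
```

Note the degenerate case: as the parallel scan's chunk size ramps down
toward 1, the per-chunk budget collapses to a single failure, which is
quite different from the serial per-region behavior.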
On Fri, Feb 14, 2025 at 2:21 PM Melanie Plageman <melanieplageman@gmail.com> wrote: > > On Wed, Feb 12, 2025 at 5:37 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > > > Since we introduced the eagar vacuum scan (052026c9b9), I need to > > update the parallel heap vacuum patch. After thinking of how to > > integrate these two features, I find some complexities. The region > > size used by eager vacuum scan and the chunk size used by parallel > > table scan are different. While the region is fixed size the chunk > > becomes smaller as we scan the table. A chunk of the table that a > > parallel vacuum worker took could be across different regions or be > > within one region, and different parallel heap vacuum workers might > > scan the same region. And parallel heap vacuum workers could be > > scanning different regions of the table simultaneously. Thank you for your feedback. > Ah, I see. What are the chunk size ranges? I picked a 32 MB region > size after a little testing and mostly because it seemed reasonable. I > think it would be fine to use different region size. Parallel workers > could just consider the chunk they get an eager scan region (unless it > is too small or too large -- then it might not make sense). The maximum chunk size is 8192 blocks, 64MB. As we scan the table, we ramp down the chunk size. It would eventually become 1. > > > During eager vacuum scan, we reset the eager_scan_remaining_fails > > counter when we start to scan the new region. So if we want to make > > parallel heap vacuum behaves exactly the same way as the > > single-progress vacuum in terms of the eager vacuum scan, we would > > need to have the eager_scan_remaining_fails counters for each region > > so that the workers can decrement it corresponding to the region of > > the block that the worker is scanning. But I'm concerned that it makes > > the logic very complex. 
I'd like to avoid making newly introduced > > codes more complex by adding yet another new code on top of that. > > I don't think it would have to behave exactly the same. I think we > just don't want to add a lot of complexity or make it hard to reason > about. > > Since the failure rate is defined as a percent, couldn't we just have > parallel workers set eager_scan_remaining_fails when they get their > chunk assignment (as a percentage of their chunk size)? (I haven't > looked at the code, so maybe this doesn't make sense). IIUC since the chunk size eventually becomes 1, we cannot simply just have parallel workers set the failure rate to its assigned chunk. > > For the success cap, we could have whoever hits it first disable eager > scanning for all future assigned chunks. Agreed. > > > Another idea is to disable the eager vacuum scan when parallel heap > > vacuum is enabled. It might look like just avoiding difficult things > > but it could make sense in a sense. The eager vacuum scan is aimed to > > amortize the aggressive vacuum by incrementally freezing pages that > > are potentially frozen by the next aggressive vacuum. On the other > > hand, parallel heap vacuum is available only in manual VACUUM and > > would be used to remove garbage on a large table as soon as possible > > or to freeze the entire table to avoid reaching the XID limit. So I > > think it might make sense to disable the eager vacuum scan when > > parallel vacuum. > > Do we only do parallelism in manual vacuum because we don't want to > use up too many parallel workers for a maintenance subsystem? I never > really tried to find out why parallel index vacuuming is only in > manual vacuum. I assume you made the same choice they did for the same > reasons. > > If the idea is to never allow parallelism in vacuum, then I think > disabling eager scanning during manual parallel vacuum seems > reasonable. People could use vacuum freeze if they want more freezing. 
IIUC the purpose of parallel vacuum is incompatible with the purpose of
autovacuum. The former is aimed at executing the vacuum as fast as
possible using more resources, whereas the latter is aimed at executing
the vacuum while not affecting foreground transaction processing. It's
probably worth considering enabling parallel vacuum even for autovacuum
in a wraparound situation, but the purpose would remain the same.

> Also, if you start with only doing parallelism for the third phase of
> heap vacuuming (second pass over the heap), this wouldn't be a problem
> because eager scanning only impacts the first phase.

Right. I'm inclined to support only the second heap pass as the first
step. If we support parallelism only for the second pass, it cannot help
speed up freezing the entire table in emergency situations, but it would
be beneficial for cases where a big table has a large amount of spread
garbage.

At least, I'm going to reorganize the patch set to support parallelism
for the second pass first and then the first heap pass.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com
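The chunk ramp-down mentioned earlier in this exchange (chunks of up to
8192 blocks, shrinking to 1 as the scan nears the end of the table) can
be illustrated with a simplified allocator. This is not the exact
table_block_parallelscan_nextpage() arithmetic, just the shape of the
behavior; the 1/8 divisor is an assumption made for the sketch:

```c
#include <stdint.h>

#define MAX_CHUNK_SIZE 8192     /* matches the 8192-block cap discussed */

/*
 * Illustrative ramp-down: hand out at most a fraction of the remaining
 * blocks, capped at MAX_CHUNK_SIZE and never less than 1. Large chunks
 * keep per-assignment overhead low early on; small chunks near the end
 * keep the workers evenly loaded.
 */
static int64_t
next_chunk_size(int64_t blocks_remaining)
{
    int64_t chunk = blocks_remaining / 8;

    if (chunk > MAX_CHUNK_SIZE)
        chunk = MAX_CHUNK_SIZE;
    if (chunk < 1)
        chunk = 1;
    return chunk;
}
```

This is why a chunk cannot simply be treated as an eager-scan region:
early chunks are region-sized or larger, but the final chunks are single
blocks.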
On Mon, Feb 17, 2025 at 1:11 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > On Fri, Feb 14, 2025 at 2:21 PM Melanie Plageman > <melanieplageman@gmail.com> wrote: > > > > Since the failure rate is defined as a percent, couldn't we just have > > parallel workers set eager_scan_remaining_fails when they get their > > chunk assignment (as a percentage of their chunk size)? (I haven't > > looked at the code, so maybe this doesn't make sense). > > IIUC since the chunk size eventually becomes 1, we cannot simply just > have parallel workers set the failure rate to its assigned chunk. Yep. The ranges are too big (1-8192). The behavior would be too different from serial. > > Also, if you start with only doing parallelism for the third phase of > > heap vacuuming (second pass over the heap), this wouldn't be a problem > > because eager scanning only impacts the first phase. > > Right. I'm inclined to support only the second heap pass as the first > step. If we support parallelism only for the second pass, it cannot > help speed up freezing the entire table in emergency situations, but > it would be beneficial for cases where a big table have a large amount > of spread garbage. > > At least, I'm going to reorganize the patch set to support parallelism > for the second pass first and then the first heap pass. Makes sense. - Melanie
On Tue, Feb 18, 2025 at 4:43 PM Melanie Plageman <melanieplageman@gmail.com> wrote: > > On Mon, Feb 17, 2025 at 1:11 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > > > On Fri, Feb 14, 2025 at 2:21 PM Melanie Plageman > > <melanieplageman@gmail.com> wrote: > > > > > > Since the failure rate is defined as a percent, couldn't we just have > > > parallel workers set eager_scan_remaining_fails when they get their > > > chunk assignment (as a percentage of their chunk size)? (I haven't > > > looked at the code, so maybe this doesn't make sense). > > > > IIUC since the chunk size eventually becomes 1, we cannot simply just > > have parallel workers set the failure rate to its assigned chunk. > > Yep. The ranges are too big (1-8192). The behavior would be too > different from serial. > > > > Also, if you start with only doing parallelism for the third phase of > > > heap vacuuming (second pass over the heap), this wouldn't be a problem > > > because eager scanning only impacts the first phase. > > > > Right. I'm inclined to support only the second heap pass as the first > > step. If we support parallelism only for the second pass, it cannot > > help speed up freezing the entire table in emergency situations, but > > it would be beneficial for cases where a big table have a large amount > > of spread garbage. > > > > At least, I'm going to reorganize the patch set to support parallelism > > for the second pass first and then the first heap pass. > > Makes sense. I've attached the updated patches. In this version, I focused on parallelizing only the second pass over the heap. It's more straightforward than supporting the first pass, it still requires many preliminary changes though. Regards, -- Masahiko Sawada Amazon Web Services: https://aws.amazon.com
Attachment
- v9-0003-radixtree.h-support-shared-iteration.patch
- v9-0002-vacuumparallel.c-Support-parallel-table-vacuuming.patch
- v9-0004-tidstore.c-support-shared-iteration-on-TidStore.patch
- v9-0005-Move-some-fields-of-LVRelState-to-LVVacCounters-s.patch
- v9-0006-Support-parallelism-for-removing-dead-items-durin.patch
- v9-0001-Introduces-table-AM-APIs-for-parallel-table-vacuu.patch
On Tue, Feb 18, 2025 at 1:11 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > Right. I'm inclined to support only the second heap pass as the first > step. If we support parallelism only for the second pass, it cannot > help speed up freezing the entire table in emergency situations, but > it would be beneficial for cases where a big table have a large amount > of spread garbage. I started looking at the most recent patch set for this, and while looking back over the thread, a couple random thoughts occurred to me: On Sat, Oct 26, 2024 at 2:26 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > - Patched > parallel 0: 53468.734 (15990, 28296, 9465) > parallel 1: 35712.613 ( 8254, 23569, 3700) These measurements were done before phase I and III were stream-ified, but even here those phases were already the quickest ones for a modest number of indexes and a single worker. I think it would make an even bigger difference when only a small percentage of pages need scanning/vacuuming, because index vacuuming still has to read the entire index. It's a bit unfortunate that the simplest heap phase to parallelize is also the quickest to begin with. Pre-existing issue: We don't get anywhere near 2x improvement in phase II for 2 parallel index scans. We've known for a while that the shared memory TID store has more overhead than private memory, and here that overhead is about the same as the entirety of phase III with a single worker! It may be time to look into mitigations there, independently of this thread. The same commit that made the parallel scanning patch more difficult should also reduce the risk of having a large amount of freezing work at once in the first place. (There are still plenty of systemic things that can go wrong, of course, and it's always good if unpleasant work finishes faster.) I seem to recall a proposal from David Rowley to (something like) batch gathering xids for visibility checks during executor scans, but if so I can't find it in the archives. 
It's possible some similar work might speed up heap scanning in a more localized fashion. On Tue, Nov 12, 2024 at 1:16 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > On Mon, Nov 11, 2024 at 5:08 AM Hayato Kuroda (Fujitsu) > <kuroda.hayato@fujitsu.com> wrote: > > I wanted to know whether TidStoreBeginIterateShared() was needed. IIUC, pre-existing API, > > TidStoreBeginIterate(), has already accepted the shared TidStore. The only difference > > is whether elog(ERROR) exists, but I wonder if it benefits others. Is there another > > reason that lazy_vacuum_heap_rel() uses TidStoreBeginIterateShared()? > > TidStoreBeginIterateShared() is designed for multiple parallel workers > to iterate a shared TidStore. During an iteration, parallel workers > share the iteration state and iterate the underlying radix tree while > taking appropriate locks. Therefore, it's available only for a shared > TidStore. This is required to implement the parallel heap vacuum, > where multiple parallel workers do the iteration on the shared > TidStore. > > On the other hand, TidStoreBeginIterate() is designed for a single > process to iterate a TidStore. It accepts even a shared TidStore as > you mentioned, but during an iteration there is no inter-process > coordination such as locking. When it comes to parallel vacuum, > supporting TidStoreBeginIterate() on a shared TidStore is necessary to > cover the case where we use only parallel index vacuum but not > parallel heap scan/vacuum. In this case, we need to store dead tuple > TIDs on the shared TidStore during heap scan so parallel workers can > use it during index vacuum. But it's not necessary to use > TidStoreBeginIterateShared() because only one (leader) process does > heap vacuum. The takeaway I got is that the word "shared" is used for two different concepts. Future readers are not going to think to look back at this email for explanation. At the least there needs to be a different word for the new concept. 
There are quite a lot of additions to radix_tree.h and tidstore.c to make it work this way, including a new "control object", new locks, new atomic variables. I'm not sure it's really necessary. Is it possible to just store the "most recent block heap-vacuumed" in the shared vacuum state? Then each backend would lock the TID store, iterate starting at the next block, and unlock the store when it has enough blocks. Sometimes my brainstorms are unworkable for some reason I failed to think about, but this way seems 95% simpler -- we would only need to teach the existing iteration machinery to take a "start key". -- John Naylor Amazon Web Services
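For what it's worth, the block-claiming scheme sketched above can be modeled outside PostgreSQL. Below is a hypothetical Python sketch (all names such as claim_batch are invented, not from the patch) of workers sharing a single "start key" cursor over a sorted dead-items map: each worker takes the store lock only long enough to claim the next batch of blocks, then processes its batch without holding the lock.

```python
import threading

# Hypothetical model of a shared TID store: block number -> dead offsets.
dead_items = {blk: [1, 2] for blk in range(0, 1000, 3)}
sorted_blocks = sorted(dead_items)

store_lock = threading.Lock()   # stands in for the TID store lock
cursor = 0                      # index of the next unclaimed block (the "start key")
BATCH = 32

processed = []
processed_lock = threading.Lock()

def claim_batch():
    """Under the store lock, claim the next BATCH blocks and advance the cursor."""
    global cursor
    with store_lock:
        start = cursor
        cursor = min(cursor + BATCH, len(sorted_blocks))
        return sorted_blocks[start:cursor]

def worker():
    while True:
        batch = claim_batch()
        if not batch:
            return
        for blk in batch:       # "vacuum" each claimed block, lock not held
            with processed_lock:
                processed.append(blk)

threads = [threading.Thread(target=worker) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

In this toy version every block is claimed exactly once, which is the property the "start key" idea relies on; the shared iterator machinery in the patch achieves the same thing with considerably more state.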
On Mon, Feb 17, 2025 at 12:11 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> > If the idea is to never allow parallelism in vacuum, then I think
> > disabling eager scanning during manual parallel vacuum seems
> > reasonable. People could use vacuum freeze if they want more freezing.
> IIUC the purpose of parallel vacuum is incompatible with the purpose
> of auto vacuum. The former is aimed to execute the vacuum as fast as
> possible using more resources, whereas the latter is aimed to execute
> the vacuum while not affecting foreground transaction processing. It's
> probably worth considering to enable parallel vacuum even for
> autovacuum in a wraparound situation, but the purpose would remain the
> same.
That's assuming that the database is running with autovacuum_cost_delay > 0. There's presumably some number of systems that intentionally do not throttle autovacuum. Another consideration is that people are free to set vacuum_cost_page_miss = 0; in those cases it would still be useful to parallelize at least phase 1 and 2. Of course, folks can also set vacuum_cost_page_dirty = 0, in which case parallelization would help phase 2 and 3.
In terms of less hypothetical scenarios, I've definitely run into cases where 1 table accounts for 90%+ of the space used by a database. Parallelism would help ensure the single table is still processed in a timely fashion. (Assuming non-default settings for vacuum throttling.)
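To make the throttling point concrete, here is a simplified Python model of cost-based vacuum delay, the accumulate-and-sleep scheme behind the vacuum_cost_* settings. The default cost values below are assumptions (check the docs for the authoritative values), and real vacuum scales the sleep length by the balance rather than sleeping a fixed amount; this sketch only counts how often the limit is hit.

```python
def count_sleeps(page_hits, page_misses, page_dirties,
                 cost_page_hit=1, cost_page_miss=2, cost_page_dirty=20,
                 cost_limit=200):
    """Model of cost-based vacuum delay: accumulate a cost balance per page
    touched; every time it reaches cost_limit, 'sleep' once and reset."""
    balance = 0
    sleeps = 0
    events = ([cost_page_hit] * page_hits
              + [cost_page_miss] * page_misses
              + [cost_page_dirty] * page_dirties)
    for cost in events:
        balance += cost
        if balance >= cost_limit:
            sleeps += 1          # real code sleeps here, then resets the balance
            balance = 0
    return sleeps

# A read-heavy vacuum is throttled often with the default miss cost;
# with vacuum_cost_page_miss = 0 the same workload sleeps far less,
# leaving only the dirty-page cost to throttle it.
throttled = count_sleeps(0, 100_000, 1_000)
unthrottled = count_sleeps(0, 100_000, 1_000, cost_page_miss=0)
```

This is why zeroing individual cost GUCs changes which vacuum phases would benefit from parallelism: whichever cost remains nonzero dominates the sleep budget.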
On Sun, Feb 23, 2025 at 8:51 PM John Naylor <johncnaylorls@gmail.com> wrote: > > On Tue, Feb 18, 2025 at 1:11 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > > > Right. I'm inclined to support only the second heap pass as the first > > step. If we support parallelism only for the second pass, it cannot > > help speed up freezing the entire table in emergency situations, but > > it would be beneficial for cases where a big table have a large amount > > of spread garbage. > > I started looking at the most recent patch set for this, and while > looking back over the thread, a couple random thoughts occurred to me: > > On Sat, Oct 26, 2024 at 2:26 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > - Patched > > parallel 0: 53468.734 (15990, 28296, 9465) > > parallel 1: 35712.613 ( 8254, 23569, 3700) > > These measurements were done before before phase I and III were > stream-ified, but even here those phases were already the quickest > ones for a modest number of indexes and a single worker. I think it > would have an even bigger difference when only a small percentage of > pages need scanning/vacuuming, because it still has to read the entire > index. It's a bit unfortunate that the simplest heap phase to > parallelize is also the quickest to begin with. > > Pre-existing issue: We don't get anywhere near 2x improvement in phase > II for 2 parallel index scans. We've known for a while that the shared > memory TID store has more overhead than private memory, and here that > overhead is about the same as the entirety of phase III with a single > worker! It may be time to look into mitigations there, independently > of this thread. Thank you for the comments. I've done simple performance benchmarks and attached the results. In this test, I prepared a table with one integer column, 3GB in size. The 'fraction' column means the fraction of pages with deleted rows; for example, fraction=1000 means that the table has pages with deleted rows every 1000 pages. 
The results show the duration for each phase in addition to the total duration (for PATCHED only) and compare the total duration between the HEAD and the PATCHED. What I can see from these results is that we might not benefit much from parallelizing phase III, unfortunately. Although in the best case phase III got about a 2x speedup, the total duration improved by only about 10%. My analysis of these results matches what John mentioned; phase III is already the fastest phase and accounts for only ~10% of the total execution time, and the overhead of the shared TidStore offsets the speedup of phase III. > The same commit that made the parallel scanning patch more difficult > should also reduce the risk of having a large amount of freezing work > at once in the first place. (There are still plenty of systemic things > that can go wrong, of course, and it's always good if unpleasant work > finishes faster.) I think that vacuum would still need to scan a large number of blocks of the table, especially when it is very large and heavily modified. Parallel heap vacuum (only phase I) would be beneficial in cases where autovacuum could not catch up. And we might want to consider using parallel heap vacuum also in autovacuum while integrating it with eager freeze scanning. > I seem to recall a proposal from David Rowley to (something like) > batch gathering xids for visibility checks during executor scans, but > if so I can't find it in the archives. It's possible some similar work > might speed up heap scanning in a more localized fashion. Interesting. > On Tue, Nov 12, 2024 at 1:16 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > > > On Mon, Nov 11, 2024 at 5:08 AM Hayato Kuroda (Fujitsu) > > <kuroda.hayato@fujitsu.com> wrote: > > > > I wanted to know whether TidStoreBeginIterateShared() was needed. IIUC, pre-existing API, > > > TidStoreBeginIterate(), has already accepted the shared TidStore. 
The only difference > > > is whether elog(ERROR) exists, but I wonder if it benefits others. Is there another > > > reason that lazy_vacuum_heap_rel() uses TidStoreBeginIterateShared()? > > > > TidStoreBeginIterateShared() is designed for multiple parallel workers > > to iterate a shared TidStore. During an iteration, parallel workers > > share the iteration state and iterate the underlying radix tree while > > taking appropriate locks. Therefore, it's available only for a shared > > TidStore. This is required to implement the parallel heap vacuum, > > where multiple parallel workers do the iteration on the shared > > TidStore. > > > > On the other hand, TidStoreBeginIterate() is designed for a single > > process to iterate a TidStore. It accepts even a shared TidStore as > > you mentioned, but during an iteration there is no inter-process > > coordination such as locking. When it comes to parallel vacuum, > > supporting TidStoreBeginIterate() on a shared TidStore is necessary to > > cover the case where we use only parallel index vacuum but not > > parallel heap scan/vacuum. In this case, we need to store dead tuple > > TIDs on the shared TidStore during heap scan so parallel workers can > > use it during index vacuum. But it's not necessary to use > > TidStoreBeginIterateShared() because only one (leader) process does > > heap vacuum. > > The takeaway I got is that the word "shared" is used for two different > concepts. Future readers are not going to think to look back at this > email for explanation. At the least there needs to be a different word > for the new concept. > > There are quite a lot of additions to radix_tree.h and tidstore.c to > make it work this way, including a new "control object", new locks, > new atomic variables. I'm not sure it's really necessary. Is it > possible to just store the "most recent block heap-vacuumed" in the > shared vacuum state? 
Then each backend would lock the TID store, > iterate starting at the next block, and unlock the store when it has > enough blocks. Sometimes my brainstorms are unworkable for some reason > I failed to think about, but this way seems 95% simpler -- we would > only need to teach the existing iteration machinery to take a "start > key". Interesting idea. While it might be worth evaluating this idea, given that phase III accounts for only a small portion of the total vacuum execution time, it might be better to switch focus to parallelizing phase I. Regards, -- Masahiko Sawada Amazon Web Services: https://aws.amazon.com
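The conclusion that a ~2x phase III speedup barely moves the total is essentially Amdahl's law; a quick sanity check in Python:

```python
def overall_speedup(phase_fraction, phase_speedup):
    """Amdahl's law: overall speedup when only a fraction of the
    runtime is accelerated by phase_speedup."""
    return 1.0 / ((1.0 - phase_fraction) + phase_fraction / phase_speedup)

# If phase III is ~10% of a vacuum and gets a 2x speedup,
# the whole vacuum improves by only ~5%.
s = overall_speedup(0.10, 2.0)

# Even an infinite speedup of that phase caps the overall gain at ~11%.
ceiling = overall_speedup(0.10, float("inf"))
```

So unless phase III is a much larger share of the runtime than in these benchmarks, parallelizing it cannot move the total by much, which supports shifting focus to phase I.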
Attachment
On Mon, Feb 24, 2025 at 8:15 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > What I can see from these results was that we might not benefit much > from parallelizing phase III, unfortunately. Although in the best case > the phase III got about 2x speedup, as for the total duration it's > about only 10% speedup. My analysis for these results matches what > John mentioned; phase III is already the fastest phase and accounts > only ~10% of the total execution time, and the overhead of shared > TidStore offsets the speedup of phase III. So, are you proposing to drop the patches for parallelizing phase III for now? If so, are you planning on posting a set of patches just to parallelize phase I? I haven't looked at the prelim refactoring patches to see if they have independent value. What do you think is reasonable for us to try and do in the next few weeks? > > The same commit that made the parallel scanning patch more difficult > > should also reduce the risk of having a large amount of freezing work > > at once in the first place. (There are still plenty of systemic things > > that can go wrong, of course, and it's always good if unpleasant work > > finishes faster.) > > I think that vacuum would still need to scan a large amount of blocks > of the table especially when it is very large and heavily modified. > Parallel heap vacuum (only phase I) would be beneficial in case where > autovacuum could not catch up. And we might want to consider using > parallel heap vacuum also in autovacuum while integrating it with > eagar freeze scan. I'd be interested to hear more about this. - Melanie
On Tue, Feb 25, 2025 at 9:59 AM Melanie Plageman <melanieplageman@gmail.com> wrote: > > On Mon, Feb 24, 2025 at 8:15 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > > > What I can see from these results was that we might not benefit much > > from parallelizing phase III, unfortunately. Although in the best case > > the phase III got about 2x speedup, as for the total duration it's > > about only 10% speedup. My analysis for these results matches what > > John mentioned; phase III is already the fastest phase and accounts > > only ~10% of the total execution time, and the overhead of shared > > TidStore offsets the speedup of phase III. > > So, are you proposing to drop the patches for parallelizing phase III > for now? If so, are you planning on posting a set of patches just to > parallelize phase I? I haven't looked at the prelim refactoring > patches to see if they have independent value. What do you think is > reasonable for us to try and do in the next few weeks? Given that we have only about one month until the feature freeze, I think it's realistic to introduce only one of the two parallelism features for PG18, and we should first implement the one that is more beneficial and helpful for users. Since we found that parallel phase III is not very efficient in many cases, I'm thinking that in terms of PG18 development, we might want to switch focus to parallel phase I, and then go for phase III if we have time. Regards, -- Masahiko Sawada Amazon Web Services: https://aws.amazon.com
On Tue, Feb 25, 2025 at 5:14 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > Given that we have only about one month until the feature freeze, I > find that it's realistic to introduce either one parallelism for PG18 > and at least we might want to implement the one first that is more > beneficial and helpful for users. Since we found that parallel phase > III is not very efficient in many cases, I'm thinking that in terms of > PG18 development, we might want to switch focus to parallel phase I, > and then go for phase III if we have time. Okay, well let me know how I can be helpful. Should I be reviewing a version that is already posted? - Melanie
On Tue, Feb 25, 2025 at 2:44 PM Melanie Plageman <melanieplageman@gmail.com> wrote: > > On Tue, Feb 25, 2025 at 5:14 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > > > Given that we have only about one month until the feature freeze, I > > find that it's realistic to introduce either one parallelism for PG18 > > and at least we might want to implement the one first that is more > > beneficial and helpful for users. Since we found that parallel phase > > III is not very efficient in many cases, I'm thinking that in terms of > > PG18 development, we might want to switch focus to parallel phase I, > > and then go for phase III if we have time. > > Okay, well let me know how I can be helpful. Should I be reviewing a > version that is already posted? Thank you so much. I'm going to submit the latest patches in a few days for parallelizing the phase I. I would appreciate it if you could review that version. Regards, -- Masahiko Sawada Amazon Web Services: https://aws.amazon.com
On Tue, Feb 25, 2025 at 4:49 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > On Tue, Feb 25, 2025 at 2:44 PM Melanie Plageman > <melanieplageman@gmail.com> wrote: > > > > On Tue, Feb 25, 2025 at 5:14 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > > > > > Given that we have only about one month until the feature freeze, I > > > find that it's realistic to introduce either one parallelism for PG18 > > > and at least we might want to implement the one first that is more > > > beneficial and helpful for users. Since we found that parallel phase > > > III is not very efficient in many cases, I'm thinking that in terms of > > > PG18 development, we might want to switch focus to parallel phase I, > > > and then go for phase III if we have time. > > > > Okay, well let me know how I can be helpful. Should I be reviewing a > > version that is already posted? > > Thank you so much. I'm going to submit the latest patches in a few > days for parallelizing the phase I. I would appreciate it if you could > review that version. > I've attached the updated patches that make the phase I (heap scanning) parallel. I'll share the benchmark results soon. Regards, -- Masahiko Sawada Amazon Web Services: https://aws.amazon.com
Attachment
- v10-0005-Support-parallelism-for-collecting-dead-items-du.patch
- v10-0004-Move-GlobalVisState-definition-to-snapmgr_intern.patch
- v10-0002-vacuumparallel.c-Support-parallel-table-vacuumin.patch
- v10-0003-Move-lazy-heap-scan-related-variables-to-new-str.patch
- v10-0001-Introduces-table-AM-APIs-for-parallel-table-vacu.patch
On Mon, Mar 3, 2025 at 1:28 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > On Tue, Feb 25, 2025 at 4:49 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > > > On Tue, Feb 25, 2025 at 2:44 PM Melanie Plageman > > <melanieplageman@gmail.com> wrote: > > > > > > On Tue, Feb 25, 2025 at 5:14 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > > > > > > > Given that we have only about one month until the feature freeze, I > > > > find that it's realistic to introduce either one parallelism for PG18 > > > > and at least we might want to implement the one first that is more > > > > beneficial and helpful for users. Since we found that parallel phase > > > > III is not very efficient in many cases, I'm thinking that in terms of > > > > PG18 development, we might want to switch focus to parallel phase I, > > > > and then go for phase III if we have time. > > > > > > Okay, well let me know how I can be helpful. Should I be reviewing a > > > version that is already posted? > > > > Thank you so much. I'm going to submit the latest patches in a few > > days for parallelizing the phase I. I would appreciate it if you could > > review that version. > > > > I've attached the updated patches that make the phase I (heap > scanning) parallel. I'll share the benchmark results soon. > I've attached the benchmark test results. Overall, with the parallel heap scan (phase I), the vacuum sped up considerably. On the other hand, looking at each phase I can see performance regressions in some cases: First, we can see a regression on a table with one index due to the overhead of the shared TidStore. Currently, we disable parallel index vacuuming if the table has only one index, as the leader process always takes one index. 
In the benchmark test, while the regression due to that overhead is about 25%, the speedup from the parallel heap scan is 50% or more, so the performance numbers are good overall. I think we can improve the shared TidStore in the future. Another performance regression I can see in the results is that the heap vacuum phase (phase III) got slower with the patch. It's weird to me since I don't touch the code of the heap vacuum phase. I'm still investigating the cause. Regards, -- Masahiko Sawada Amazon Web Services: https://aws.amazon.com
Attachment
On Mon, Mar 3, 2025 at 3:24 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > > Another performance regression I can see in the results is that heap > vacuum phase (phase III) got slower with the patch. It's weired to me > since I don't touch the code of heap vacuum phase. I'm still > investigating the cause. I have investigated this regression. I've confirmed that in both scenarios (patched and unpatched), the entire table and its associated indexes were loaded into the shared buffer before the vacuum. Then, the 'perf record' analysis, focused specifically on the heap vacuum phase of the patched code, revealed numerous soft page faults occurring:

 62.37%    13.90%  postgres  postgres  [.] lazy_vacuum_heap_rel
            |
            |--52.44%--lazy_vacuum_heap_rel
            |          |
            |          |--46.33%--lazy_vacuum_heap_page (inlined)
            |          |          |
            |          |          |--32.42%--heap_page_is_all_visible (inlined)
            |          |          |          |
            |          |          |          |--26.46%--HeapTupleSatisfiesVacuum
            |          |          |          |          HeapTupleSatisfiesVacuumHorizon
            |          |          |          |          HeapTupleHeaderXminCommitted (inlined)
            |          |          |          |          |
            |          |          |          |           --18.52%--page_fault
            |          |          |          |                     do_page_fault
            |          |          |          |                     __do_page_fault
            |          |          |          |                     handle_mm_fault
            |          |          |          |                     __handle_mm_fault
            |          |          |          |                     handle_pte_fault
            |          |          |          |                     |
            |          |          |          |                     |--16.53%--filemap_map_pages
            |          |          |          |                     |          |
            |          |          |          |                     |           --2.63%--alloc_set_pte
            |          |          |          |                     |                     pfn_pte
            |          |          |          |                     |
            |          |          |          |                      --1.99%--pmd_page_vaddr
            |          |          |          |
            |          |          |           --1.99%--TransactionIdPrecedes

I did not observe these page faults in the 'perf record' results for the HEAD version. Furthermore, when I disabled parallel heap vacuum while keeping parallel index vacuuming enabled, the regression disappeared. Based on these findings, the likely cause of the regression appears to be that during parallel heap vacuum operations, table blocks were loaded into the shared buffer by parallel vacuum workers. However, in the heap vacuum phase, the leader process needed to process all blocks, resulting in soft page faults while creating Page Table Entries (PTEs). 
Without the patch, the backend process had already created PTEs during the heap scan, thus preventing these faults from occurring during the heap vacuum phase. It appears to be an inherent side effect of utilizing parallel queries. Given this understanding, it's likely an acceptable trade-off that we can accommodate. Regards, -- Masahiko Sawada Amazon Web Services: https://aws.amazon.com
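The PTE effect described above is easy to reproduce in isolation. This Python sketch (Linux/macOS only; it relies on ru_minflt from getrusage being populated) touches an anonymous mapping twice: the first pass triggers minor ("soft") faults to populate page table entries, the second pass does not. This mirrors why the leader paid a fault cost for pages its workers had already read into shared memory.

```python
import mmap
import resource

PAGE = mmap.PAGESIZE
NPAGES = 2000

def minor_faults():
    """Minor (soft) page faults taken by this process so far."""
    return resource.getrusage(resource.RUSAGE_SELF).ru_minflt

m = mmap.mmap(-1, PAGE * NPAGES)    # anonymous mapping: no PTEs exist yet

before = minor_faults()
for i in range(NPAGES):
    m[i * PAGE]                     # first touch: kernel faults in each page
first_pass = minor_faults() - before

before = minor_faults()
for i in range(NPAGES):
    m[i * PAGE]                     # second touch: PTEs already present
second_pass = minor_faults() - before

m.close()
```

In a parallel vacuum the "first touch" happens in the worker processes, so the leader's second pass over the same buffers starts from a cold page table, just as this sketch's first pass does.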
Some minor review comments for patch v10-0001. ====== src/include/access/tableam.h 1. struct IndexInfo; +struct ParallelVacuumState; +struct ParallelContext; +struct ParallelWorkerContext; struct SampleScanState; Use alphabetical order for consistency with existing code. ~~~ 2. + /* + * Estimate the size of shared memory that the parallel table vacuum needs + * for AM + * 2a. Missing period (.) 2b. Change the wording to be like below, for consistency with the other 'estimate' function comments, and for consistency with the comment where this function is implemented. Estimate the size of shared memory needed for a parallel table vacuum of this relation. ~~~ 3. + * The state_out is the output parameter so that an arbitrary data can be + * passed to the subsequent callback, parallel_vacuum_remove_dead_items. Typo? "an arbitrary data" ~~~ 4. General/Asserts All the below functions have a comment saying "Not called if parallel table vacuum is disabled." - parallel_vacuum_estimate - parallel_vacuum_initialize - parallel_vacuum_initialize_worker - parallel_vacuum_collect_dead_items But, it's only a comment. I wondered if they should all have some Assert as an integrity check on that. ~~~ 5. +/* + * Return the number of parallel workers for a parallel vacuum scan of this + * relation. + */ "Return the number" or "Compute the number"? The comment should match the comment in the fwd declaration of this function. ~~~ 6. +/* + * Perform a parallel vacuums scan to collect dead items. + */ 6a. "Perform" or "Execute"? The comment should match the one in the fwd declaration of this function. 6b. Typo "vacuums" ====== Kind Regards, Peter Smith. Fujitsu Australia
Hi Sawada-San. FYI. I am observing the following test behaviour: I apply patch v10-0001, do a clean rebuild and run 'make check', and all tests are OK. Then, after I apply just patch v10-0002 on top of 0001, do a clean rebuild and run 'make check' there are many test fails. ====== Kind Regards, Peter Smith. Fujitsu Australia
Hi Sawada-San. Here are some review comments for patch v10-0002. ====== src/backend/access/heap/heapam_handler.c 1. .scan_bitmap_next_block = heapam_scan_bitmap_next_block, .scan_bitmap_next_tuple = heapam_scan_bitmap_next_tuple, .scan_sample_next_block = heapam_scan_sample_next_block, - .scan_sample_next_tuple = heapam_scan_sample_next_tuple + .scan_sample_next_tuple = heapam_scan_sample_next_tuple, + + .parallel_vacuum_compute_workers = heap_parallel_vacuum_compute_workers, Pretty much every other heap AM method here has a 'heapam_' prefix. Is it better to be consistent and call the new function 'heapam_parallel_vacuum_compute_workers' instead of 'heap_parallel_vacuum_compute_workers'? ====== src/backend/access/heap/vacuumlazy.c 2. +/* + * Compute the number of workers for parallel heap vacuum. + * + * Disabled so far. + */ +int +heap_parallel_vacuum_compute_workers(Relation rel, int max_workers) +{ + return 0; +} + Instead of saying "Disabled so far", maybe the function comment should say: "Return 0 means parallel heap vacuum is disabled." Then the comment doesn't need to churn later when the function gets implemented in later patches. ====== src/backend/commands/vacuumparallel.c 3. +/* The kind of parallel vacuum work */ +typedef enum +{ + PV_WORK_ITEM_PROCESS_INDEXES, /* index vacuuming or cleanup */ + PV_WORK_ITEM_COLLECT_DEAD_ITEMS, /* collect dead tuples */ +} PVWorkItem; + Isn't this more like a PVWorkPhase instead of PVWorkItem? Ditto for the field name: 'work_phase' seems more appropriate. ~~~ 4. + /* + * Processing indexes or removing dead tuples from the table. + */ + PVWorkItem work_item; Missing question mark for this comment? ~~~ 5. + /* + * The number of workers for parallel table vacuuming. If > 0, the + * parallel table vacuum is enabled. + */ + int nworkers_for_table; + I guess this field will never be negative. So is it simpler to modify the comment to say: "If 0, parallel table vacuum is disabled." ~~~ parallel_vacuum_init: 6. 
+ /* A parallel vacuum must be requested */ Assert(nrequested_workers >= 0); It's not very intuitive to say the user requested a parallel vacuum when the 'nrequested_workers' is 0. I felt some more commentary is needed here or in the function header; it seems nrequested_workers == 0 has a special meaning of having the system decide the number of workers. ~~~ parallel_vacuum_compute_workers: 7. static int -parallel_vacuum_compute_workers(Relation *indrels, int nindexes, int nrequested, +parallel_vacuum_compute_workers(Relation rel, Relation *indrels, int nindexes, + int nrequested, int *nworkers_table_p, bool *will_parallel_vacuum) { int nindexes_parallel = 0; int nindexes_parallel_bulkdel = 0; int nindexes_parallel_cleanup = 0; - int parallel_workers; + int nworkers_table = 0; + int nworkers_index = 0; + + *nworkers_table_p = 0; 7a. AFAICT you can just remove the variable 'nworkers_table', and instead call the 'nworkers_table_p' parameter as 'nworkers_table' ~ 7b. IIUC this 'will_parallel_vacuum' only has meaning for indexes, but now that the patch introduces parallel table vacuuming it makes this existing 'will_parallel_vacuum' name generic/misleading. Maybe it now needs to be called 'will_parallel_index_vacuum' or similar (in all places). ~~~ 8. + if (nrequested > 0) + { + /* + * If the parallel degree is specified, accept that as the number of + * parallel workers for table vacuum (though still cap at + * max_parallel_maintenance_workers). + */ + nworkers_table = Min(nrequested, max_parallel_maintenance_workers); + } + else + { + /* Compute the number of workers for parallel table scan */ + nworkers_table = table_parallel_vacuum_compute_workers(rel, + max_parallel_maintenance_workers); + + Assert(nworkers_table <= max_parallel_maintenance_workers); + } + The Assert can be outside of the if/else because it is the same for both. ~~~ 9. 
/* No index supports parallel vacuum */ - if (nindexes_parallel <= 0) - return 0; - - /* Compute the parallel degree */ - parallel_workers = (nrequested > 0) ? - Min(nrequested, nindexes_parallel) : nindexes_parallel; + if (nindexes_parallel > 0) + { + /* Take into account the requested number of workers */ + nworkers_index = (nrequested > 0) ? + Min(nrequested, nindexes_parallel) : nindexes_parallel; - /* Cap by max_parallel_maintenance_workers */ - parallel_workers = Min(parallel_workers, max_parallel_maintenance_workers); + /* Cap by max_parallel_maintenance_workers */ + nworkers_index = Min(nworkers_index, max_parallel_maintenance_workers); + } "No index supports..." seems to be an old comment that is not correct for this new code block. ~~~ 10. + pg_atomic_sub_fetch_u32(VacuumActiveNWorkers, 1); +} + + Double blank line after function 'parallel_vacuum_process_table' ~~~ main: 11. + /* Initialize AM-specific vacuum state for parallel table vacuuming */ + if (shared->work_item == PV_WORK_ITEM_COLLECT_DEAD_ITEMS) + { + ParallelWorkerContext pwcxt; + + pwcxt.toc = toc; + pwcxt.seg = seg; + table_parallel_vacuum_initialize_worker(rel, &pvs, &pwcxt, + &state); + } + Wondering if this code can be done in the same if block later: "if (pvs.shared->work_item == PV_WORK_ITEM_COLLECT_DEAD_ITEMS)" ~~~ 12. + if (pvs.shared->work_item == PV_WORK_ITEM_COLLECT_DEAD_ITEMS) + { + /* Scan the table to collect dead items */ + parallel_vacuum_process_table(&pvs, state); + } + else + { + Assert(pvs.shared->work_item == PV_WORK_ITEM_PROCESS_INDEXES); + + /* Process indexes to perform vacuum/cleanup */ + parallel_vacuum_process_safe_indexes(&pvs); + } Would this if/else be better implemented as a 'switch' for the possible phases? ====== Kind Regards, Peter Smith. Fujitsu Australia
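Regarding review comments 7 and 8 above: the worker-count logic being reviewed reduces to a small amount of arithmetic. A hypothetical Python restating of it (function and parameter names invented, not from the patch), with the cap hoisted out of the if/else as comment 8 suggests:

```python
def compute_table_vacuum_workers(nrequested, am_computed_workers,
                                 max_parallel_maintenance_workers):
    """If the user requested a parallel degree, honor it; otherwise use the
    number computed by the table AM. Either way, cap the result by
    max_parallel_maintenance_workers, so a single cap after the if/else
    suffices for both arms."""
    if nrequested > 0:
        nworkers_table = nrequested
    else:
        nworkers_table = am_computed_workers
    return min(nworkers_table, max_parallel_maintenance_workers)

# An explicit request is capped; with no request, the AM's estimate is capped.
capped_request = compute_table_vacuum_workers(5, 0, 2)
am_estimate = compute_table_vacuum_workers(0, 3, 2)
```

This also makes the special meaning of nrequested == 0 (raised in comment 6: "let the system decide") explicit in one place.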
On Mon, Mar 3, 2025 at 3:24 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
> On Mon, Mar 3, 2025 at 1:28 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> >
> > On Tue, Feb 25, 2025 at 4:49 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> > >
> > > On Tue, Feb 25, 2025 at 2:44 PM Melanie Plageman
> > > <melanieplageman@gmail.com> wrote:
> > > >
> > > > On Tue, Feb 25, 2025 at 5:14 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> > > > >
> > > > > Given that we have only about one month until the feature freeze, I
> > > > > find that it's realistic to introduce either one parallelism for PG18
> > > > > and at least we might want to implement the one first that is more
> > > > > beneficial and helpful for users. Since we found that parallel phase
> > > > > III is not very efficient in many cases, I'm thinking that in terms of
> > > > > PG18 development, we might want to switch focus to parallel phase I,
> > > > > and then go for phase III if we have time.
> > > >
> > > > Okay, well let me know how I can be helpful. Should I be reviewing a
> > > > version that is already posted?
> > >
> > > Thank you so much. I'm going to submit the latest patches in a few
> > > days for parallelizing the phase I. I would appreciate it if you could
> > > review that version.
> >
> > I've attached the updated patches that make the phase I (heap
> > scanning) parallel. I'll share the benchmark results soon.
>
> I've attached the benchmark test results.
>
> Overall, with the parallel heap scan (phase I), the vacuum got sped up
> a lot. On the other hand, looking at each phase I can see performance
> regressions in some cases:
>
> First, we can see the regression on a table with one index due to
> overhead of the shared TidStore. Currently, we disable parallel index
> vacuuming if the table has only one index as the leader process always
> takes one index. With this patch, we enable parallel heap scan even if
> the parallel index vacuuming is disabled, ending up using the shared
> TidStore. In the benchmark test, while the regression due to that
> overhead is about ~25%, the speedup by parallel heap scan is 50% or
> more, so the performance number is good overall. I think we can
> improve the shared TidStore in the future.
>
> Another performance regression I can see in the results is that the
> heap vacuum phase (phase III) got slower with the patch. It's weird to
> me since I don't touch the code of the heap vacuum phase. I'm still
> investigating the cause.

Discussing with Amit offlist, I've run another benchmark test where no
data is loaded in the shared buffer. In the previous test, I loaded all
table blocks before running vacuum, so it was the best case. The
attached test results show the worst case.

Overall, while the numbers are not stable, phase I got sped up a bit,
but not as scalable as expected, which is not surprising. Please note
that the test results show that phase III also got sped up, but this is
because in parallel vacuum we use more ring buffers than a
single-process vacuum. So we need to compare only the phase I time to
assess the benefit of the parallelism.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com
Attachment
On Thu, Mar 6, 2025 at 5:33 PM Peter Smith <smithpb2250@gmail.com> wrote: > > Some minor review comments for patch v10-0001. > > ====== > src/include/access/tableam.h > > 1. > struct IndexInfo; > +struct ParallelVacuumState; > +struct ParallelContext; > +struct ParallelWorkerContext; > struct SampleScanState; > > Use alphabetical order for consistency with existing code. > > ~~~ > > 2. > + /* > + * Estimate the size of shared memory that the parallel table vacuum needs > + * for AM > + * > > 2a. > Missing period (.) > > 2b. > Change the wording to be like below, for consistency with the other > 'estimate' function comments, and for consistency with the comment > where this function is implemented. > > Estimate the size of shared memory needed for a parallel table vacuum > of this relation. > > ~~~ > > 3. > + * The state_out is the output parameter so that an arbitrary data can be > + * passed to the subsequent callback, parallel_vacuum_remove_dead_items. > > Typo? "an arbitrary data" > > ~~~ > > 4. General/Asserts > > All the below functions have a comment saying "Not called if parallel > table vacuum is disabled." > - parallel_vacuum_estimate > - parallel_vacuum_initialize > - parallel_vacuum_initialize_worker > - parallel_vacuum_collect_dead_items > > But, it's only a comment. I wondered if they should all have some > Assert as an integrity check on that. > > ~~~ > > 5. > +/* > + * Return the number of parallel workers for a parallel vacuum scan of this > + * relation. > + */ > > "Return the number" or "Compute the number"? > The comment should match the comment in the fwd declaration of this function. > > ~~~ > > 6. > +/* > + * Perform a parallel vacuums scan to collect dead items. > + */ > > 6a. > "Perform" or "Execute"? > The comment should match the one in the fwd declaration of this function. > > 6b. > Typo "vacuums" > Thank you for reviewing the patch. I'll address these comments and submit the updated version patches soon. 
Regards, -- Masahiko Sawada Amazon Web Services: https://aws.amazon.com
On Thu, Mar 6, 2025 at 10:56 PM Peter Smith <smithpb2250@gmail.com> wrote: > > Hi Sawada-San. Here are some review comments for patch v10-0002. Thank you for reviewing the patch. > > ====== > src/backend/access/heap/heapam_handler.c > > 1. > .scan_bitmap_next_block = heapam_scan_bitmap_next_block, > .scan_bitmap_next_tuple = heapam_scan_bitmap_next_tuple, > .scan_sample_next_block = heapam_scan_sample_next_block, > - .scan_sample_next_tuple = heapam_scan_sample_next_tuple > + .scan_sample_next_tuple = heapam_scan_sample_next_tuple, > + > + .parallel_vacuum_compute_workers = heap_parallel_vacuum_compute_workers, > > Pretty much every other heam AM method here has a 'heapam_' prefix. Is > it better to be consistent and call the new function > 'heapam_parallel_vacuum_compute_workers' instead of > 'heap_parallel_vacuum_compute_workers'? Hmm, given that the existing vacuum-related callback, heap_vacuum_rel, uses 'heap' I guess it's okay with 'heap_' prefix for parallel vacuum callbacks. > > ====== > src/backend/access/heap/vacuumlazy.c > > 2. > +/* > + * Compute the number of workers for parallel heap vacuum. > + * > + * Disabled so far. > + */ > +int > +heap_parallel_vacuum_compute_workers(Relation rel, int max_workers) > +{ > + return 0; > +} > + > > Instead of saying "Disabled so far", the function comment maybe should say: > "Return 0 means parallel heap vacuum is disabled." > > Then the comment doesn't need to churn later when the function gets > implemented in later patches. Updated. > > ====== > src/backend/commands/vacuumparallel.c > > 3. > +/* The kind of parallel vacuum work */ > +typedef enum > +{ > + PV_WORK_ITEM_PROCESS_INDEXES, /* index vacuuming or cleanup */ > + PV_WORK_ITEM_COLLECT_DEAD_ITEMS, /* collect dead tuples */ > +} PVWorkItem; > + > > Isn't this more like a PVWorkPhase instead of PVWorkItem? Ditto for > the field name: 'work_phase' seems more appropriate. Agreed. > > ~~~ > > 4. 
> + /* > + * Processing indexes or removing dead tuples from the table. > + */ > + PVWorkItem work_item; > > Missing question mark for this comment? Updated. > > ~~~ > > 5. > + /* > + * The number of workers for parallel table vacuuming. If > 0, the > + * parallel table vacuum is enabled. > + */ > + int nworkers_for_table; > + > > I guess this field will never be negative. So is it simpler to modify > the comment to say: > "If 0, parallel table vacuum is disabled." Agreed. > > ~~~ > > parallel_vacuum_init: > > 6. > + /* A parallel vacuum must be requested */ > Assert(nrequested_workers >= 0); > > It's not very intuitive to say the user requested a parallel vacuum > when the 'nrequested_workers' is 0. I felt some more commentary is > needed here or in the function header; it seems nrequested_workers == > 0 has a special meaning of having the system decide the number of > workers. Added some comments. > > ~~~ > > parallel_vacuum_compute_workers: > > 7. > static int > -parallel_vacuum_compute_workers(Relation *indrels, int nindexes, int > nrequested, > +parallel_vacuum_compute_workers(Relation rel, Relation *indrels, int nindexes, > + int nrequested, int *nworkers_table_p, > bool *will_parallel_vacuum) > { > int nindexes_parallel = 0; > int nindexes_parallel_bulkdel = 0; > int nindexes_parallel_cleanup = 0; > - int parallel_workers; > + int nworkers_table = 0; > + int nworkers_index = 0; > + > + *nworkers_table_p = 0; > > 7a. > AFAICT you can just remove the variable 'nworkers_table', and instead > call the 'nworkers_table_p' parameter as 'nworkers_table' Yes, but I wanted to not modify the output parameter many times. > > ~ > > 7b. > IIUC this 'will_parallel_vacuum' only has meaning for indexes, but now > that the patch introduces parallel table vacuuming it makes this > existing 'will_parallel_vacuum' name generic/misleading. Maybe it now > needs to be called 'will_parallel_index_vacuum' or similar (in all > places). Agreed. > > ~~~ > > 8. 
> + if (nrequested > 0) > + { > + /* > + * If the parallel degree is specified, accept that as the number of > + * parallel workers for table vacuum (though still cap at > + * max_parallel_maintenance_workers). > + */ > + nworkers_table = Min(nrequested, max_parallel_maintenance_workers); > + } > + else > + { > + /* Compute the number of workers for parallel table scan */ > + nworkers_table = table_parallel_vacuum_compute_workers(rel, > + max_parallel_maintenance_workers); > + > + Assert(nworkers_table <= max_parallel_maintenance_workers); > + } > + > > The Assert can be outside of the if/else because it is the same for both. This part has been removed by modifying the parallel degree computation logic. > > ~~~ > > 9. > /* No index supports parallel vacuum */ > - if (nindexes_parallel <= 0) > - return 0; > - > - /* Compute the parallel degree */ > - parallel_workers = (nrequested > 0) ? > - Min(nrequested, nindexes_parallel) : nindexes_parallel; > + if (nindexes_parallel > 0) > + { > + /* Take into account the requested number of workers */ > + nworkers_index = (nrequested > 0) ? > + Min(nrequested, nindexes_parallel) : nindexes_parallel; > > - /* Cap by max_parallel_maintenance_workers */ > - parallel_workers = Min(parallel_workers, max_parallel_maintenance_workers); > + /* Cap by max_parallel_maintenance_workers */ > + nworkers_index = Min(nworkers_index, max_parallel_maintenance_workers); > + } > > "No index supports..." seems to be an old comment that is not correct > for this new code block. Removed. > > ~~~ > > 10. > + pg_atomic_sub_fetch_u32(VacuumActiveNWorkers, 1); > +} > + > + > > Double blank line after function 'parallel_vacuum_process_table' Removed. > > ~~~ > > main: > > 11. 
> + /* Initialize AM-specific vacuum state for parallel table vacuuming */ > + if (shared->work_item == PV_WORK_ITEM_COLLECT_DEAD_ITEMS) > + { > + ParallelWorkerContext pwcxt; > + > + pwcxt.toc = toc; > + pwcxt.seg = seg; > + table_parallel_vacuum_initialize_worker(rel, &pvs, &pwcxt, > + &state); > + } > + > > Wondering if this code can be done in the same if block later: "if > (pvs.shared->work_item == PV_WORK_ITEM_COLLECT_DEAD_ITEMS)" If we do that, we would end up calling InstrStartParallelQuery() before the worker initialization. I guess we want to avoid counting this usage during the initialization. > > ~~~ > > 12. > + if (pvs.shared->work_item == PV_WORK_ITEM_COLLECT_DEAD_ITEMS) > + { > + /* Scan the table to collect dead items */ > + parallel_vacuum_process_table(&pvs, state); > + } > + else > + { > + Assert(pvs.shared->work_item == PV_WORK_ITEM_PROCESS_INDEXES); > + > + /* Process indexes to perform vacuum/cleanup */ > + parallel_vacuum_process_safe_indexes(&pvs); > + } > > Would this if/else be better implemented as a 'switch' for the possible phases? > Okay, changed. I've attached the updated version patches. Regards, -- Masahiko Sawada Amazon Web Services: https://aws.amazon.com
Attachment
- v11-0001-Introduces-table-AM-APIs-for-parallel-table-vacu.patch
- v11-0004-Move-GlobalVisState-definition-to-snapmgr_intern.patch
- v11-0005-Support-parallelism-for-collecting-dead-items-du.patch
- v11-0003-Move-lazy-heap-scan-related-variables-to-new-str.patch
- v11-0002-vacuumparallel.c-Support-parallel-table-vacuumin.patch
Hi Sawada-San, Here are some review comments for patch v11-0001. ====== src/backend/access/heap/vacuumlazy.c 1. +/* + * Compute the number of workers for parallel heap vacuum. + * + * Return 0 to disable parallel vacuum so far. + */ +int +heap_parallel_vacuum_compute_workers(Relation rel, int nworkers_requested) You don't need to say "so far". ====== src/backend/access/table/tableamapi.c 2. Assert(routine->relation_vacuum != NULL); + Assert(routine->parallel_vacuum_compute_workers != NULL); Assert(routine->scan_analyze_next_block != NULL); Is it better to keep these Asserts in the same order that the TableAmRoutine fields are assigned (in heapam_handler.c)? ~~~ 3. + /* + * Callbacks for parallel vacuum are also optional (except for + * parallel_vacuum_compute_workers). But one callback implies presence of + * the others. + */ + Assert(((((routine->parallel_vacuum_estimate == NULL) == + (routine->parallel_vacuum_initialize == NULL)) == + (routine->parallel_vacuum_initialize_worker == NULL)) == + (routine->parallel_vacuum_collect_dead_items == NULL))); /also optional/optional/ ====== src/include/access/heapam.h +extern int heap_parallel_vacuum_compute_workers(Relation rel, int nworkers_requested); 4. wrong tab/space after 'int'. ====== src/include/access/tableam.h 5. + /* + * Compute the number of parallel workers for parallel table vacuum. The + * parallel degree for parallel vacuum is further limited by + * max_parallel_maintenance_workers. The function must return 0 to disable + * parallel table vacuum. + * + * 'nworkers_requested' is a >=0 number and the requested number of + * workers. This comes from the PARALLEL option. 0 means to choose the + * parallel degree based on the table AM specific factors such as table + * size. + */ + int (*parallel_vacuum_compute_workers) (Relation rel, + int nworkers_requested); The comment here says "This comes from the PARALLEL option." and "0 means to choose the parallel degree...". 
But, the PG docs [1] says "To disable this feature, one can use PARALLEL option and specify parallel workers as zero.". These two descriptions "disable this feature" (PG docs) and letting the system "choose the parallel degree" (code comment) don't sound the same. Should this 0001 patch update the PG documentation for the effect of setting PARALLEL value zero? ~~~ 6. + /* + * Initialize DSM space for parallel table vacuum. + * + * Not called if parallel table vacuum is disabled. + * + * Optional callback, but either all other parallel vacuum callbacks need + * to exist, or neither. + */ "or neither"? Also, saying "all other" seems incorrect because the parallel_vacuum_compute_workers callback must exist even if parallel_vacuum_initialize does not exist. IMO you meant to say "all optional", and "or none". SUGGESTION: Optional callback. Either all optional parallel vacuum callbacks need to exist, or none. (this same issue is repeated in multiple places). ====== [1] https://www.postgresql.org/docs/devel/sql-vacuum.html Kind Regards, Peter Smith. Fujitsu Australia
Hi Sawada-San. Here are some review comments for patch v11-0002 ====== Commit message. 1. Heap table AM disables the parallel heap vacuuming for now, but an upcoming patch uses it. This function implementation was moved into patch 0001, so probably this part of the commit message comment also belongs now in patch 0001. ====== src/backend/commands/vacuumparallel.c 2. + * For processing indexes in parallel, individual indexes are processed by one + * vacuum process. We launch parallel worker processes at the start of parallel index + * bulk-deletion and index cleanup and once all indexes are processed, the parallel + * worker processes exit. + * "are processed by one vacuum process." -- Did you mean "are processed by separate vacuum processes." ? ~~~ 3. + * + * Each time we process table or indexes in parallel, the parallel context is + * re-initialized so that the same DSM can be used for multiple passes of table vacuum + * or index bulk-deletion and index cleanup. Maybe I am mistaken, but it seems like the logic is almost always re-initializing this. I wonder if it might be simpler to just remove the 'need_reinitialize_dsm' field and initialize unconditionally. ~~~ 4. + * nrequested_workers is >= 0 number and the requested parallel degree. 0 + * means that the parallel degrees for table and indexes vacuum are decided + * differently. See the comments of parallel_vacuum_compute_workers() for + * details. + * * On success, return parallel vacuum state. Otherwise return NULL. */ SUGGESTION nrequested_workers is the requested parallel degree (>=0). 0 means that... ~~~ 5. static int -parallel_vacuum_compute_workers(Relation *indrels, int nindexes, int nrequested, - bool *will_parallel_vacuum) +parallel_vacuum_compute_workers(Relation rel, Relation *indrels, int nindexes, + int nrequested, int *nworkers_table_p, + bool *idx_will_parallel_vacuum) IIUC the returns for this function seem inconsistent. 
AFAIK, it previously would return the number of workers for parallel index vacuuming. But now (after this patch) the return value is returned Max(nworkers_table, nworkers_index). Meanwhile, the number of workers for parallel table vacuuming is returned as a by-reference parameter 'nworkers_table_p'. In other words, it is returning the number of workers in 2 different ways. Why not make this a void function, but introduce another parameter 'nworkers_index_p', similar to 'nworkers_table_p'? ====== Kind Regards, Peter Smith. Fujitsu Australia
On Wed, Mar 5, 2025 at 6:25 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
> On Mon, Mar 3, 2025 at 3:24 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> >
> > Another performance regression I can see in the results is that heap
> > vacuum phase (phase III) got slower with the patch. It's weird to me
> > since I don't touch the code of heap vacuum phase. I'm still
> > investigating the cause.
>
> I have investigated this regression. I've confirmed that in both
> scenarios (patched and unpatched), the entire table and its associated
> indexes were loaded into the shared buffer before the vacuum. Then,
> the 'perf record' analysis, focused specifically on the heap vacuum
> phase of the patched code, revealed numerous soft page faults
> occurring:
>
>   62.37%  13.90%  postgres  postgres  [.] lazy_vacuum_heap_rel
>     |
>     |--52.44%--lazy_vacuum_heap_rel
>     |    |
>     |    |--46.33%--lazy_vacuum_heap_page (inlined)
>     |    |    |
>     |    |    |--32.42%--heap_page_is_all_visible (inlined)
>     |    |    |    |
>     |    |    |    |--26.46%--HeapTupleSatisfiesVacuum
>     |    |    |    |          HeapTupleSatisfiesVacuumHorizon
>     |    |    |    |          HeapTupleHeaderXminCommitted (inlined)
>     |    |    |    |          |
>     |    |    |    |           --18.52%--page_fault
>     |    |    |    |                     do_page_fault
>     |    |    |    |                     __do_page_fault
>     |    |    |    |                     handle_mm_fault
>     |    |    |    |                     __handle_mm_fault
>     |    |    |    |                     handle_pte_fault
>     |    |    |    |                     |
>     |    |    |    |                     |--16.53%--filemap_map_pages
>     |    |    |    |                     |    |
>     |    |    |    |                     |     --2.63%--alloc_set_pte
>     |    |    |    |                     |               pfn_pte
>     |    |    |    |                     |
>     |    |    |    |                      --1.99%--pmd_page_vaddr
>     |    |    |    |
>     |    |    |     --1.99%--TransactionIdPrecedes
>
> I did not observe these page faults in the 'perf record' results for
> the HEAD version. Furthermore, when I disabled parallel heap vacuum
> while keeping parallel index vacuuming enabled, the regression
> disappeared. Based on these findings, the likely cause of the
> regression appears to be that during parallel heap vacuum operations,
> table blocks were loaded into the shared buffer by parallel vacuum
> workers.
>

In the previous paragraph, you mentioned that the entire table and its
associated indexes were loaded into the shared buffer before the
vacuum. If that is true, then why does the parallel vacuum need to
reload the table blocks into shared buffers?

> However, in the heap vacuum phase, the leader process needed
> to process all blocks, resulting in soft page faults while creating
> Page Table Entries (PTEs). Without the patch, the backend process had
> already created PTEs during the heap scan, thus preventing these
> faults from occurring during the heap vacuum phase.
>

This part is again not clear to me because I am assuming all the data
exists in shared buffers before the vacuum, so why the page faults
will occur in the first place.

--
With Regards,
Amit Kapila.
On Fri, Mar 7, 2025 at 11:06 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
> Discussing with Amit offlist, I've run another benchmark test where no
> data is loaded in the shared buffer. In the previous test, I loaded
> all table blocks before running vacuum, so it was the best case. The
> attached test results showed the worst case.
>
> Overall, while the numbers seem not stable, the phase I got sped up a
> bit, but not as scalable as expected, which is not surprising.
>

Sorry, but it is difficult for me to understand this data because it
doesn't contain the schema or details like what exactly is a fraction.
It is also not clear how the workers are divided among heap and
indexes: do we use parallelism for both phases of heap or only the
first phase, and do we reuse those workers for index vacuuming? These
tests were probably discussed earlier, but it would be better to
either add a summary of the required information to understand the
results or at least a link to a previous email that has such details.

> Please note that the test results show that phase III also got sped up
> but this is because in parallel vacuum we use more ring buffers than
> the single-process vacuum. So we need to compare only the phase I time
> in terms of the benefit of the parallelism.
>

Does phase 3 also use parallelism? If so, can we try to divide the
ring buffers among workers, or at least try vacuum with an increased
number of ring buffers? This would be good to do for both the phases,
if they both use parallelism.

--
With Regards,
Amit Kapila.
Hi Sawada-San. Some review comments for patch v11-0003.

======
src/backend/access/heap/vacuumlazy.c

1.
+typedef struct LVScanData
+{
+	BlockNumber rel_pages;		/* total number of pages */
+
+	BlockNumber scanned_pages;	/* # pages examined (not skipped via VM) */
+
+	/*
+	 * Count of all-visible blocks eagerly scanned (for logging only). This
+	 * does not include skippable blocks scanned due to SKIP_PAGES_THRESHOLD.
+	 */
+	BlockNumber eager_scanned_pages;
+
+	BlockNumber removed_pages;	/* # pages removed by relation truncation */
+	BlockNumber new_frozen_tuple_pages; /* # pages with newly frozen tuples */
+
+
+	/* # pages newly set all-visible in the VM */
+	BlockNumber vm_new_visible_pages;
+
+	/*
+	 * # pages newly set all-visible and all-frozen in the VM. This is a
+	 * subset of vm_new_visible_pages. That is, vm_new_visible_pages includes
+	 * all pages set all-visible, but vm_new_visible_frozen_pages includes
+	 * only those which were also set all-frozen.
+	 */
+	BlockNumber vm_new_visible_frozen_pages;
+
+	/* # all-visible pages newly set all-frozen in the VM */
+	BlockNumber vm_new_frozen_pages;
+
+	BlockNumber lpdead_item_pages;	/* # pages with LP_DEAD items */
+	BlockNumber missed_dead_pages;	/* # pages with missed dead tuples */
+	BlockNumber nonempty_pages; /* actually, last nonempty page + 1 */
+
+	/* Counters that follow are only for scanned_pages */
+	int64		tuples_deleted; /* # deleted from table */
+	int64		tuples_frozen;	/* # newly frozen */
+	int64		lpdead_items;	/* # deleted from indexes */
+	int64		live_tuples;	/* # live tuples remaining */
+	int64		recently_dead_tuples;	/* # dead, but not yet removable */
+	int64		missed_dead_tuples; /* # removable, but not removed */
+
+	/* Tracks oldest extant XID/MXID for setting relfrozenxid/relminmxid. */
+	TransactionId NewRelfrozenXid;
+	MultiXactId NewRelminMxid;
+	bool		skippedallvis;
+} LVScanData;
+

1a.
Double blank line after 'new_frozen_tuple_pages' field.

~

1b.
I know you have only refactored existing code but IMO, the use of "#"
meaning "Number of" or "Count of" doesn't seem nice. It *may* be
justifiable to prevent wrapping of the short comments on the same line
as the fields, but for all the other larger comments it looks strange
not to be spelled out properly as words.

~~~

heap_vacuum_rel:

2.
 	/* Initialize page counters explicitly (be tidy) */
-	vacrel->scanned_pages = 0;
-	vacrel->eager_scanned_pages = 0;
-	vacrel->removed_pages = 0;
-	vacrel->new_frozen_tuple_pages = 0;
-	vacrel->lpdead_item_pages = 0;
-	vacrel->missed_dead_pages = 0;
-	vacrel->nonempty_pages = 0;
+	scan_data = palloc(sizeof(LVScanData));
+	scan_data->scanned_pages = 0;
+	scan_data->eager_scanned_pages = 0;
+	scan_data->removed_pages = 0;
+	scan_data->new_frozen_tuple_pages = 0;
+	scan_data->lpdead_item_pages = 0;
+	scan_data->missed_dead_pages = 0;
+	scan_data->nonempty_pages = 0;
+	scan_data->tuples_deleted = 0;
+	scan_data->tuples_frozen = 0;
+	scan_data->lpdead_items = 0;
+	scan_data->live_tuples = 0;
+	scan_data->recently_dead_tuples = 0;
+	scan_data->missed_dead_tuples = 0;
+	scan_data->vm_new_visible_pages = 0;
+	scan_data->vm_new_visible_frozen_pages = 0;
+	scan_data->vm_new_frozen_pages = 0;
+	scan_data->rel_pages = orig_rel_pages = RelationGetNumberOfBlocks(rel);
+	vacrel->scan_data = scan_data;

Or, you could replace most of these assignments by using palloc0, and
write a comment to say all the counters were zapped to 0.

~~~

3.
+	scan_data->vm_new_frozen_pages = 0;
+	scan_data->rel_pages = orig_rel_pages = RelationGetNumberOfBlocks(rel);
+	vacrel->scan_data = scan_data;

 	/* dead_items_alloc allocates vacrel->dead_items later on */

That comment about dead_items_alloc seems misplaced now because it is
grouped with all the scan_data stuff.

======
Kind Regards,
Peter Smith.
Fujitsu Australia
On Sun, Mar 9, 2025 at 11:12 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Wed, Mar 5, 2025 at 6:25 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> >
> > On Mon, Mar 3, 2025 at 3:24 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> > >
> > > Another performance regression I can see in the results is that heap
> > > vacuum phase (phase III) got slower with the patch. It's weird to me
> > > since I don't touch the code of heap vacuum phase. I'm still
> > > investigating the cause.
> >
> > I have investigated this regression. I've confirmed that in both
> > scenarios (patched and unpatched), the entire table and its associated
> > indexes were loaded into the shared buffer before the vacuum. Then,
> > the 'perf record' analysis, focused specifically on the heap vacuum
> > phase of the patched code, revealed numerous soft page faults
> > occurring:
> >
> > [perf call graph quoted in full upthread]
> >
> > I did not observe these page faults in the 'perf record' results for
> > the HEAD version. Furthermore, when I disabled parallel heap vacuum
> > while keeping parallel index vacuuming enabled, the regression
> > disappeared. Based on these findings, the likely cause of the
> > regression appears to be that during parallel heap vacuum operations,
> > table blocks were loaded into the shared buffer by parallel vacuum
> > workers.
>
> In the previous paragraph, you mentioned that the entire table and its
> associated indexes were loaded into the shared buffer before the
> vacuum. If that is true, then why does the parallel vacuum need to
> reload the table blocks into shared buffers?

Hmm, my above sentences are wrong. All pages are loaded into the shared
buffer before the vacuum test. In the parallel heap vacuum case, the
leader process reads only a part of the table during phase I, whereas in
the single-process vacuum case, the process reads the entire table. So
there will be differences in phase III in terms of PTEs, as described
below.

> > However, in the heap vacuum phase, the leader process needed
> > to process all blocks, resulting in soft page faults while creating
> > Page Table Entries (PTEs). Without the patch, the backend process had
> > already created PTEs during the heap scan, thus preventing these
> > faults from occurring during the heap vacuum phase.
>
> This part is again not clear to me because I am assuming all the data
> exists in shared buffers before the vacuum, so why the page faults
> will occur in the first place.

IIUC, PTEs are process-local data. So even if physical pages are loaded
into PostgreSQL's shared buffer (and the OS page cache), soft page
faults (or minor page faults)[1] can occur if those pages are not yet
mapped in the process's page table.

Regards,

[1] https://en.wikipedia.org/wiki/Page_fault#Minor

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com
On Sun, Mar 9, 2025 at 11:28 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > On Fri, Mar 7, 2025 at 11:06 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > > > Discussing with Amit offlist, I've run another benchmark test where no > > data is loaded on the shared buffer. In the previous test, I loaded > > all table blocks before running vacuum, so it was the best case. The > > attached test results showed the worst case. > > > > Overall, while the numbers seem not stable, the phase I got sped up a > > bit, but not as scalable as expected, which is not surprising. > > > > Sorry, but it is difficult for me to understand this data because it > doesn't contain the schema or details like what exactly is a fraction. > It is also not clear how the workers are divided among heap and > indexes, like do we use parallelism for both phases of heap or only > first phase and do we reuse those workers for index vacuuming. These > tests were probably discussed earlier, but it would be better to > either add a summary of the required information to understand the > results or at least a link to a previous email that has such details. 
The testing configurations are:

max_wal_size = 50GB
shared_buffers = 25GB
max_parallel_maintenance_workers = 10
max_parallel_workers = 20
max_worker_processes = 30

The test scripts are ($m and $p are a fraction and a parallel degree, respectively):

create unlogged table test_vacuum (a bigint) with (autovacuum_enabled=off);
insert into test_vacuum select i from generate_series(1,200000000) s(i);
create index idx_0 on test_vacuum (a);
create index idx_1 on test_vacuum (a);
create index idx_2 on test_vacuum (a);
create index idx_3 on test_vacuum (a);
create index idx_4 on test_vacuum (a);
delete from test_vacuum where mod(a, $m) = 0;
vacuum (verbose, parallel $p) test_vacuum; -- measured the execution time

> > Please note that the test results show that phase III also got sped up
> > but this is because in parallel vacuum we use more ring buffers than
> > the single-process vacuum. So we need to compare only the phase I time
> > in terms of the benefit of the parallelism.
>
> Does phase 3 also use parallelism? If so, can we try to divide the
> ring buffers among workers or at least try vacuum with an increased
> number of ring buffers. This would be good to do for both the phases,
> if they both use parallelism.

No, only phase 1 was parallelized in this test. In parallel vacuum, since it uses (ring_buffer_size * parallel_degree) memory, more pages are loaded during phase 1, increasing cache hits during phase 3.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com
On Sat, Mar 8, 2025 at 1:42 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > > I've attached the updated version patches.

I've started trying to review this and realized that, while I'm familiar with the heap vacuuming code, I'm not familiar enough with the vacuumparallel.c machinery to be of help without much additional study. As such, I have mainly focused on reading the comments in your code.

I think your comment in vacuumlazy.c describing the design could use more detail and a bit of massaging. For example, I don't know what you mean when you say:

* We could require different number of parallel vacuum workers for each phase
* for various factors such as table size and number of indexes.

Does that refer to something you did implement, or are you saying we could do that in the future?

- Melanie
On Tue, Mar 11, 2025 at 5:00 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > On Sun, Mar 9, 2025 at 11:28 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > > > Does phase 3 also use parallelism? If so, can we try to divide the > > ring buffers among workers or at least try vacuum with an increased > > number of ring buffers. This would be good to do for both the phases, > > if they both use parallelism. > > No, only phase 1 was parallelized in this test. In parallel vacuum, > since it uses (ring_buffer_size * parallel_degree) memory, more pages > are loaded during phase 1, increasing cache hits during phase 3. > Shouldn't we ideally try with a vacuum without parallelism with ring_buffer_size * parallel_degree to make the comparison better? Also, what could be the reason for the variation in data of phase-I? Do you restart the system after each run to ensure there is nothing in the memory? If not, then shouldn't we try at least a few runs by restarting the system before each run to ensure there is nothing leftover in memory? -- With Regards, Amit Kapila.
On Mon, Mar 10, 2025 at 11:57 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > On Sun, Mar 9, 2025 at 11:12 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > > > > However, in the heap vacuum phase, the leader process needed > > > to process all blocks, resulting in soft page faults while creating > > > Page Table Entries (PTEs). Without the patch, the backend process had > > > already created PTEs during the heap scan, thus preventing these > > > faults from occurring during the heap vacuum phase. > > > > > > > This part is again not clear to me because I am assuming all the data > > exists in shared buffers before the vacuum, so why the page faults > > will occur in the first place. > > IIUC PTEs are process-local data. So even if physical pages are loaded > to PostgreSQL's shared buffer (and page caches), soft page faults (or > minor page faults)[1] can occur if these pages are not yet mapped in > its page table. > Okay, I got your point. BTW, I noticed that even for the case where all the data is in shared_buffers, the performance improvement for workers greater than two does decrease marginally. Am I reading the data correctly? If so, what is the theory, and do we have recommendations for a parallel degree? -- With Regards, Amit Kapila.
On Mon, Mar 10, 2025 at 5:03 PM Melanie Plageman <melanieplageman@gmail.com> wrote: > > On Sat, Mar 8, 2025 at 1:42 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > > > > > I've attached the updated version patches. > > I've started trying to review this and realized that, while I'm > familiar with heap vacuuming code, I'm not familiar enough with the > vacuumparallel.c machinery to be of help without much additional > study. As such, I have mainly focused on reading the comments in your > code. Thank you for looking at the patch. > > I think your comment in vacuumlazy.c describing the design could use > more detail and a bit of massaging. > > For example, I don't know what you mean when you say: > > * We could require different number of parallel vacuum workers for each phase > * for various factors such as table size and number of indexes. > > Does that refer to something you did implement or you are saying we > could do that in the future? It referred to the parallel heap vacuum implementation that I wrote. Since the parallel degrees for parallel heap scan and parallel index vacuuming are chosen separately based on different factors, we launch a different number of workers for each phase and they exit at the end of each phase. Regards, -- Masahiko Sawada Amazon Web Services: https://aws.amazon.com
On Tue, Mar 11, 2025 at 6:00 AM Amit Kapila <amit.kapila16@gmail.com> wrote: > > On Mon, Mar 10, 2025 at 11:57 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > > > On Sun, Mar 9, 2025 at 11:12 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > > > > > > > However, in the heap vacuum phase, the leader process needed > > > > to process all blocks, resulting in soft page faults while creating > > > > Page Table Entries (PTEs). Without the patch, the backend process had > > > > already created PTEs during the heap scan, thus preventing these > > > > faults from occurring during the heap vacuum phase. > > > > > > > > > > This part is again not clear to me because I am assuming all the data > > > exists in shared buffers before the vacuum, so why the page faults > > > will occur in the first place. > > > > IIUC PTEs are process-local data. So even if physical pages are loaded > > to PostgreSQL's shared buffer (and page caches), soft page faults (or > > minor page faults)[1] can occur if these pages are not yet mapped in > > its page table. > > > > Okay, I got your point. BTW, I noticed that even for the case where > all the data is in shared_buffers, the performance improvement for > workers greater than two does decrease marginally. Am I reading the > data correctly? If so, what is the theory, and do we have > recommendations for a parallel degree? Is the decrease you referred to the total vacuum execution time? When it comes to the execution time of phase 1, it seems we have good scalability. For example, with 2 workers (i.e., 3 workers working in total, including the leader) it got about a 3x speedup, and with 4 workers it got about a 5x speedup. Regarding the other phases, phase 3 got slower probably because of the PTE overhead, but I haven't investigated why phase 2 also got slightly slower with more than 2 workers.
In the current patch, the parallel degree for phase 1 is chosen based on the table size, which is almost the same as the calculation of the degree for parallel seq scan. But thinking further, we might want to account for the number of all-visible pages and all-frozen pages here so that we can avoid launching many workers for mostly-frozen big tables. Regards, -- Masahiko Sawada Amazon Web Services: https://aws.amazon.com
On Tue, Mar 11, 2025 at 5:51 AM Amit Kapila <amit.kapila16@gmail.com> wrote: > > On Tue, Mar 11, 2025 at 5:00 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > > > On Sun, Mar 9, 2025 at 11:28 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > > > > > > Does phase 3 also use parallelism? If so, can we try to divide the > > > ring buffers among workers or at least try vacuum with an increased > > > number of ring buffers. This would be good to do for both the phases, > > > if they both use parallelism. > > > > No, only phase 1 was parallelized in this test. In parallel vacuum, > > since it uses (ring_buffer_size * parallel_degree) memory, more pages > > are loaded during phase 1, increasing cache hits during phase 3. > > > > Shouldn't we ideally try with a vacuum without parallelism with > ring_buffer_size * parallel_degree to make the comparison better? Right. I'll share the benchmark test results with such configuration. > Also, what could be the reason for the variation in data of phase-I? > Do you restart the system after each run to ensure there is nothing in > the memory? If not, then shouldn't we try at least a few runs by > restarting the system before each run to ensure there is nothing > leftover in memory? I dropped all page caches by executing 'echo 3 > /proc/sys/vm/drop_caches' before each run and these results are the median of 3 runs. I'll investigate it further. Regards, -- Masahiko Sawada Amazon Web Services: https://aws.amazon.com
On Wed, Mar 12, 2025 at 3:17 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > On Tue, Mar 11, 2025 at 5:51 AM Amit Kapila <amit.kapila16@gmail.com> wrote: > > Some thoughts/questions on the idea I notice that we are always considering block-level parallelism for heaps and object-level parallelism for indexes. I'm wondering, when multiple tables are being vacuumed together—either because the user has provided a list of tables or has specified a partitioned table with multiple children—does it still make sense to default to block-level parallelism? Or could we consider table-level parallelism in such cases? For example, if there are 4 tables and 6 workers, with 2 tables being small and the other 2 being large, perhaps we could allocate 4 workers to vacuum all 4 tables in parallel. For the larger tables, we could apply block-level parallelism, using more workers for internal parallelism. On the other hand, if all tables are small, we could just apply table-level parallelism without needing block-level parallelism at all. This approach could offer more flexibility, couldn't it? -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
On Wed, Mar 12, 2025 at 11:24 AM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > On Wed, Mar 12, 2025 at 3:17 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > Some more specific comments on the patch set

-- v11-0001

1. This introduces functions like parallel_vacuum_estimate(), parallel_vacuum_initialize(), etc., but these functions haven't been assigned to the respective members in TableAMRoutine. As shown in the hunk below, only parallel_vacuum_compute_workers is initialized, while parallel_vacuum_estimate and the others are not. These are assigned in later patches, which makes the patch division appear a bit unclean.

+++ b/src/backend/access/heap/heapam_handler.c
@@ -2688,7 +2688,9 @@ static const TableAmRoutine heapam_methods = {
     .scan_bitmap_next_block = heapam_scan_bitmap_next_block,
     .scan_bitmap_next_tuple = heapam_scan_bitmap_next_tuple,
     .scan_sample_next_block = heapam_scan_sample_next_block,
-    .scan_sample_next_tuple = heapam_scan_sample_next_tuple
+    .scan_sample_next_tuple = heapam_scan_sample_next_tuple,
+
+    .parallel_vacuum_compute_workers = heap_parallel_vacuum_compute_workers,
 };

-- v11-0002

2. In the commit message, you mentioned a function name "parallel_vacuum_remove_dead_items_begin()" but there is no such function introduced in the patch; I think you meant 'parallel_vacuum_collect_dead_items_begin()'?

3. I would suggest a rewrite for this: /launched for the table vacuuming and index processing./launched for the table pruning phase and index vacuuming.

4. These comments are talking about DSA and DSM, but we'd better clarify what we are using DSM and DSA for, or give a reference to the location where we have explained that.

+ * In a parallel vacuum, we perform table scan or both index bulk deletion and
+ * index cleanup or all of them with parallel worker processes. Different
+ * numbers of workers are launched for the table vacuuming and index processing.
+ * ParallelVacuumState contains shared information as well as the memory space
+ * for storing dead items allocated in the DSA area.
+ *
+ * When initializing parallel table vacuum scan, we invoke table AM routines for
+ * estimating DSM sizes and initializing DSM memory. Parallel table vacuum
+ * workers invoke the table AM routine for vacuuming the table.

5. Is there any particular reason why marking the dead TIDs as reusable is not done in parallel? Is it because parallelizing that phase wouldn't make sense due to the work involved, or is there another reason?

+/* The kind of parallel vacuum phases */
+typedef enum
+{
+    PV_WORK_PHASE_PROCESS_INDEXES, /* index vacuuming or cleanup */
+    PV_WORK_PHASE_COLLECT_DEAD_ITEMS, /* collect dead tuples */
+} PVWorkPhase;

6. In the hunk below, the sentence "nrequested_workers is >= 0 number and the requested parallel degree." doesn't make sense to me; can you rephrase it?

+ * nrequested_workers is >= 0 number and the requested parallel degree. 0
+ * means that the parallel degrees for table and indexes vacuum are decided
+ * differently. See the comments of parallel_vacuum_compute_workers() for
+ * details.

That's what I got so far; I will continue reviewing 0002 and the remaining patches from the thread.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
On Wed, Mar 12, 2025 at 3:12 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > On Tue, Mar 11, 2025 at 6:00 AM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > On Mon, Mar 10, 2025 at 11:57 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > > > > > On Sun, Mar 9, 2025 at 11:12 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > > > > > > > > > > However, in the heap vacuum phase, the leader process needed > > > > > to process all blocks, resulting in soft page faults while creating > > > > > Page Table Entries (PTEs). Without the patch, the backend process had > > > > > already created PTEs during the heap scan, thus preventing these > > > > > faults from occurring during the heap vacuum phase. > > > > > > > > > > > > > This part is again not clear to me because I am assuming all the data > > > > exists in shared buffers before the vacuum, so why the page faults > > > > will occur in the first place. > > > > > > IIUC PTEs are process-local data. So even if physical pages are loaded > > > to PostgreSQL's shared buffer (and paga caches), soft page faults (or > > > minor page faults)[1] can occur if these pages are not yet mapped in > > > its page table. > > > > > > > Okay, I got your point. BTW, I noticed that even for the case where > > all the data is in shared_buffers, the performance improvement for > > workers greater than two does decrease marginally. Am I reading the > > data correctly? If so, what is the theory, and do we have > > recommendations for a parallel degree? > > The decrease you referred to is that the total vacuum execution time? > Right. > When it comes to the execution time of phase 1, it seems we have good > scalability. For example, with 2 workers (i.e.3 workers working > including the leader in total) it got about 3x speed up, and with 4 > workers it got about 5x speed up. 
Regarding other phases, the phase 3 > got slower probably because of PTEs stuff, but I don't investigate why > the phase 2 also slightly got slower with more than 2 workers. > Could it be possible that now phase-2 needs to access the shared area for TIDs, and some locking/unlocking causes such slowdown? -- With Regards, Amit Kapila.
On Wed, Mar 12, 2025 at 11:24 AM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > On Wed, Mar 12, 2025 at 3:17 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > > > On Tue, Mar 11, 2025 at 5:51 AM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > > Some thoughts/questions on the idea > > I notice that we are always considering block-level parallelism for > heaps and object-level parallelism for indexes. I'm wondering, when > multiple tables are being vacuumed together—either because the user > has provided a list of tables or has specified a partitioned table > with multiple children—does it still make sense to default to > block-level parallelism? Or could we consider table-level parallelism > in such cases? For example, if there are 4 tables and 6 workers, with > 2 tables being small and the other 2 being large, perhaps we could > allocate 4 workers to vacuum all 4 tables in parallel. For the larger > tables, we could apply block-level parallelism, using more workers for > internal parallelism. On the other hand, if all tables are small, we > could just apply table-level parallelism without needing block-level > parallelism at all. This approach could offer more flexibility, isn't > it? > I have not thought from this angle, but it seems we can build this even on top of block-level vacuum for large tables. -- With Regards, Amit Kapila.
On Wed, Mar 12, 2025 at 3:40 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > On Wed, Mar 12, 2025 at 11:24 AM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > > > On Wed, Mar 12, 2025 at 3:17 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > > > > > On Tue, Mar 11, 2025 at 5:51 AM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > > > > > Some thoughts/questions on the idea > > > > I notice that we are always considering block-level parallelism for > > heaps and object-level parallelism for indexes. I'm wondering, when > > multiple tables are being vacuumed together—either because the user > > has provided a list of tables or has specified a partitioned table > > with multiple children—does it still make sense to default to > > block-level parallelism? Or could we consider table-level parallelism > > in such cases? For example, if there are 4 tables and 6 workers, with > > 2 tables being small and the other 2 being large, perhaps we could > > allocate 4 workers to vacuum all 4 tables in parallel. For the larger > > tables, we could apply block-level parallelism, using more workers for > > internal parallelism. On the other hand, if all tables are small, we > > could just apply table-level parallelism without needing block-level > > parallelism at all. This approach could offer more flexibility, isn't > > it? > > > > I have not thought from this angle, but it seems we can build this > even on top of block-level vacuum for large tables. Yes, that can be built on top of block-level vacuum. In that case, we can utilize the workers more efficiently, depending on how many workers we have and how many tables need to be vacuumed. And yes, that could also be discussed separately. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
On Wed, Mar 12, 2025 at 3:05 AM Amit Kapila <amit.kapila16@gmail.com> wrote: > > On Wed, Mar 12, 2025 at 3:12 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > > > On Tue, Mar 11, 2025 at 6:00 AM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > > > On Mon, Mar 10, 2025 at 11:57 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > > > > > > > On Sun, Mar 9, 2025 at 11:12 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > > > > > > > > > > > > > However, in the heap vacuum phase, the leader process needed > > > > > > to process all blocks, resulting in soft page faults while creating > > > > > > Page Table Entries (PTEs). Without the patch, the backend process had > > > > > > already created PTEs during the heap scan, thus preventing these > > > > > > faults from occurring during the heap vacuum phase. > > > > > > > > > > > > > > > > This part is again not clear to me because I am assuming all the data > > > > > exists in shared buffers before the vacuum, so why the page faults > > > > > will occur in the first place. > > > > > > > > IIUC PTEs are process-local data. So even if physical pages are loaded > > > > to PostgreSQL's shared buffer (and paga caches), soft page faults (or > > > > minor page faults)[1] can occur if these pages are not yet mapped in > > > > its page table. > > > > > > > > > > Okay, I got your point. BTW, I noticed that even for the case where > > > all the data is in shared_buffers, the performance improvement for > > > workers greater than two does decrease marginally. Am I reading the > > > data correctly? If so, what is the theory, and do we have > > > recommendations for a parallel degree? > > > > The decrease you referred to is that the total vacuum execution time? > > > > Right. > > > When it comes to the execution time of phase 1, it seems we have good > > scalability. 
For example, with 2 workers (i.e.3 workers working > > including the leader in total) it got about 3x speed up, and with 4 > > workers it got about 5x speed up. Regarding other phases, the phase 3 > > got slower probably because of PTEs stuff, but I don't investigate why > > the phase 2 also slightly got slower with more than 2 workers. > > > > Could it be possible that now phase-2 needs to access the shared area > for TIDs, and some locking/unlocking causes such slowdown? No, TidStore is shared in this case but we don't take a lock on it during phase 2. Regards, -- Masahiko Sawada Amazon Web Services: https://aws.amazon.com
On Fri, Mar 7, 2025 at 10:41 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > I've attached the updated version patches. When testing multiple passes of table vacuuming, I found an issue. With the current patch, both the leader and parallel worker processes stop phase 1 as soon as the shared TidStore size reaches the limit, and then the leader resumes the parallel heap scan after heap vacuuming and index vacuuming. Therefore, as I described in the patch, one tricky part of this patch is that it's possible that we launch fewer workers than the previous time when resuming phase 1 after phases 2 and 3. In this case, since the previous parallel workers might have already allocated some blocks in their chunk, newly launched workers need to take over their parallel scan state. That's why in the patch we store workers' ParallelBlockTableScanWorkerData in shared memory. However, I found my assumption was wrong; in order to take over the previous scan state properly we need to take over not only ParallelBlockTableScanWorkerData but also ReadStream data, as parallel workers might have already queued some blocks for look-ahead in their ReadStream. Looking at the ReadStream code, I found that it's not realistic to store it in shared memory. One plausible solution would be to not use ReadStream in parallel heap vacuum cases but directly use table_block_parallelscan_xxx() instead. It works, but we end up having two different scan methods for parallel and non-parallel lazy heap scan. I've implemented this idea in the attached v12 patches. Other than the above change, I've fixed some bugs and addressed review comments I got so far. Regards, -- Masahiko Sawada Amazon Web Services: https://aws.amazon.com
Attachment
- v12-0001-Introduces-table-AM-APIs-for-parallel-table-vacu.patch
- v12-0004-Move-GlobalVisState-definition-to-snapmgr_intern.patch
- v12-0002-vacuumparallel.c-Support-parallel-vacuuming-for-.patch
- v12-0003-Move-lazy-heap-scan-related-variables-to-new-str.patch
- v12-0005-Support-parallelism-for-collecting-dead-items-du.patch
On Thu, Mar 20, 2025 at 4:36 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > When testing the multi passes of table vacuuming, I found an issue. > With the current patch, both leader and parallel workers process stop > the phase 1 as soon as the shared TidStore size reaches to the limit, > and then the leader resumes the parallel heap scan after heap > vacuuming and index vacuuming. Therefore, as I described in the patch, > one tricky part of this patch is that it's possible that we launch > fewer workers than the previous time when resuming phase 1 after phase > 2 and 3. In this case, since the previous parallel workers might have > already allocated some blocks in their chunk, newly launched workers > need to take over their parallel scan state. That's why in the patch > we store workers' ParallelBlockTableScanWorkerData in shared memory. > However, I found my assumption is wrong; in order to take over the > previous scan state properly we need to take over not only > ParallelBlockTableScanWorkerData but also ReadStream data as parallel > workers might have already queued some blocks for look-ahead in their > ReadStream. Looking at ReadStream codes, I find that it's not > realistic to store it into the shared memory. It seems like one way to solve this would be to add functionality to the read stream to unpin the buffers it has in the buffers queue without trying to continue calling the callback until the stream is exhausted. We have read_stream_reset(), but that is to restart streams that have already been exhausted. Exhausted streams are where the callback has returned InvalidBlockNumber. In the read_stream_reset() cases, the read stream user knows there are more blocks it would like to scan or that it would like to restart the scan from the beginning. Your case is you want to stop trying to exhaust the read stream and just unpin the remaining buffers. 
As long as the worker which paused phase I knows exactly the last block it processed and can communicate this to whatever worker resumes phase I later, it can initialize vacrel->current_block to the last block processed. > One plausible solution would be that we don't use ReadStream in > parallel heap vacuum cases but directly use > table_block_parallelscan_xxx() instead. It works but we end up having > two different scan methods for parallel and non-parallel lazy heap > scan. I've implemented this idea in the attached v12 patches. One question is which scenarios will parallel vacuum phase I without AIO be faster than read AIO-ified vacuum phase I. Without AIO writes, I suppose it would be trivial for phase I parallel vacuum to be faster without using read AIO. But it's worth thinking about the tradeoff. - Melanie
Hi, On 2025-03-20 01:35:42 -0700, Masahiko Sawada wrote: > One plausible solution would be that we don't use ReadStream in > parallel heap vacuum cases but directly use > table_block_parallelscan_xxx() instead. It works but we end up having > two different scan methods for parallel and non-parallel lazy heap > scan. I've implemented this idea in the attached v12 patches.

I think that's a bad idea - this means we'll never be able to use direct IO for parallel VACUUMs, despite:

a) The CPU overhead of buffered reads being a problem for VACUUM

b) Data ending up in the kernel page cache is rather wasteful for VACUUM, as that's often data that won't otherwise be used again soon. I.e. these reads would particularly benefit from using direct IO.

c) Even disregarding DIO, losing the ability to do larger reads, as provided by read streams, loses a fair bit of efficiency (just try doing a pg_prewarm of a large relation with io_combine_limit=1 vs a larger io_combine_limit).

Greetings, Andres Freund
On Sat, Mar 22, 2025 at 7:16 AM Melanie Plageman <melanieplageman@gmail.com> wrote: > > On Thu, Mar 20, 2025 at 4:36 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > > > When testing the multi passes of table vacuuming, I found an issue. > > With the current patch, both leader and parallel workers process stop > > the phase 1 as soon as the shared TidStore size reaches to the limit, > > and then the leader resumes the parallel heap scan after heap > > vacuuming and index vacuuming. Therefore, as I described in the patch, > > one tricky part of this patch is that it's possible that we launch > > fewer workers than the previous time when resuming phase 1 after phase > > 2 and 3. In this case, since the previous parallel workers might have > > already allocated some blocks in their chunk, newly launched workers > > need to take over their parallel scan state. That's why in the patch > > we store workers' ParallelBlockTableScanWorkerData in shared memory. > > However, I found my assumption is wrong; in order to take over the > > previous scan state properly we need to take over not only > > ParallelBlockTableScanWorkerData but also ReadStream data as parallel > > workers might have already queued some blocks for look-ahead in their > > ReadStream. Looking at ReadStream codes, I find that it's not > > realistic to store it into the shared memory. > > It seems like one way to solve this would be to add functionality to > the read stream to unpin the buffers it has in the buffers queue > without trying to continue calling the callback until the stream is > exhausted. > > We have read_stream_reset(), but that is to restart streams that have > already been exhausted. Exhausted streams are where the callback has > returned InvalidBlockNumber. In the read_stream_reset() cases, the > read stream user knows there are more blocks it would like to scan or > that it would like to restart the scan from the beginning. 
> > Your case is you want to stop trying to exhaust the read stream and > just unpin the remaining buffers. As long as the worker which paused > phase I knows exactly the last block it processed and can communicate > this to whatever worker resumes phase I later, it can initialize > vacrel->current_block to the last block processed. If we use ParallelBlockTableScanDesc with streaming read as the patch did, we would also need to somehow rewind the number of blocks allocated to workers. The problem I had with such usage was that a parallel vacuum worker allocated a new chunk of blocks when doing look-ahead reading and therefore advanced ParallelBlockTableScanDescData.phs_nallocated. In this case, even if we unpin the remaining buffers in the queue with new functionality and a parallel worker resumes phase 1 from the last processed block, we would lose some blocks in already-allocated chunks unless we rewind the ParallelBlockTableScanDescData and ParallelBlockTableScanWorkerData data. However, since a worker might have already allocated multiple chunks, it would not be easy to rewind this scan state data. Another idea is that parallel workers don't exit phase 1 until they consume all pinned buffers in the queue, even if the memory usage of TidStore exceeds the limit. This would require new read stream functionality to disable look-ahead reading. Since processing these buffers could use a lot of memory, exceeding the memory limit, we could trigger this mode when the memory usage of TidStore reaches 70% of the limit or so. On the other hand, it means that we would not use the streaming read for the blocks in this mode, which is not efficient. > > > One plausible solution would be that we don't use ReadStream in > > parallel heap vacuum cases but directly use > > table_block_parallelscan_xxx() instead. It works but we end up having > > two different scan methods for parallel and non-parallel lazy heap > > scan.
I've implemented this idea in the attached v12 patches. > > One question is which scenarios will parallel vacuum phase I without > AIO be faster than read AIO-ified vacuum phase I. Without AIO writes, > I suppose it would be trivial for phase I parallel vacuum to be faster > without using read AIO. But it's worth thinking about the tradeoff. As Andres pointed out, there are major downsides. So we would need to invent a way to stop and resume the read stream in the middle of a parallel scan. Regards, -- Masahiko Sawada Amazon Web Services: https://aws.amazon.com
On Sun, Mar 23, 2025 at 4:46 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > If we use ParallelBlockTableScanDesc with streaming read like the > patch did, we would also need to somehow rewind the number of blocks > allocated to workers. The problem I had with such usage was that a > parallel vacuum worker allocated a new chunk of blocks when doing > look-ahead reading and therefore advanced > ParallelBlockTableScanDescData.phs_nallocated. In this case, even if > we unpin the remaining buffers in the queue by a new functionality and > a parallel worker resumes the phase 1 from the last processed block, > we would lose some blocks in already allocated chunks unless we rewind > ParallelBlockTableScanDescData and ParallelBlockTableScanWorkerData > data. However, since a worker might have already allocated multiple > chunks it would not be easy to rewind these scan state data. Ah I didn't realize rewinding the state would be difficult. It seems like the easiest way to make sure those blocks are done is to add them back to the counter somehow. And I don't suppose there is some way to save these not yet done block assignments somewhere and give them to the workers who restart phase I to process on the second pass? > Another idea is that parallel workers don't exit phase 1 until it > consumes all pinned buffers in the queue, even if the memory usage of > TidStore exceeds the limit. It would need to add new functionality to > the read stream to disable the look-ahead reading. Since we could use > much memory while processing these buffers, exceeding the memory > limit, we can trigger this mode when the memory usage of TidStore > reaches 70% of the limit or so. On the other hand, it means that we > would not use the streaming read for the blocks in this mode, which is > not efficient. That might work. And/or maybe you could start decreasing the size of block assignment chunks when the memory usage of TidStore reaches a certain level. 
I don't know how much that would help or how fiddly it would be. > So we would need to > invent a way to stop and resume the read stream in the middle during > parallel scan. As for needing to add new read stream functionality, we actually probably don't have to. If you use read_stream_end() -> read_stream_reset(), it resets the distance to 0, so then read_stream_next_buffer() should just end up unpinning the buffers and freeing the per buffer data. I think the easiest way to implement this is to think about it as ending a read stream and starting a new one next time you start phase I and not as pausing and resuming the read stream. And anyway, maybe it's better not to keep a bunch of pinned buffers and allocated memory hanging around while doing what could be very long index scans. - Melanie
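For what Melanie's chunk-shrinking idea might look like, here is a rough sketch (the function, thresholds, and signature are all invented for illustration and are not from the patch):

```c
#include <stddef.h>

/*
 * Hypothetical ramp-down of parallel block-assignment chunk sizes as the
 * shared TidStore approaches its memory limit, so that fewer already
 * assigned blocks are stranded when phase 1 has to stop.  The thresholds
 * are arbitrary placeholders; mem_limit is assumed nonzero.
 */
static int
choose_chunk_size(size_t mem_used, size_t mem_limit, int max_chunk)
{
    double frac = (double) mem_used / (double) mem_limit;

    if (frac >= 1.0)
        return 0;               /* at the limit: hand out no more blocks */
    if (frac >= 0.7)
        return 1;               /* close to the limit: one block at a time */
    if (frac >= 0.5)
        return max_chunk / 4 > 0 ? max_chunk / 4 : 1;
    return max_chunk;           /* plenty of room: full-sized chunks */
}
```

Whether a gradual ramp-down like this is worth the added complexity over simply cutting off new blocks at the limit is exactly the open question in the discussion above.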
Hi, On 2025-03-23 01:45:35 -0700, Masahiko Sawada wrote: > Another idea is that parallel workers don't exit phase 1 until it > consumes all pinned buffers in the queue, even if the memory usage of > TidStore exceeds the limit. Yes, that seems a quite reasonable approach to me. > It would need to add new functionality to the read stream to disable the > look-ahead reading. Couldn't your next block callback simply return InvalidBlockNumber once close to the memory limit? > Since we could use much memory while processing these buffers, exceeding the > memory limit, we can trigger this mode when the memory usage of TidStore > reaches 70% of the limit or so. It wouldn't be that much memory, would it? A few 10s-100s of buffers don't increase the size of a TidStore that much? Using 10 parallel vacuum with a m_w_m of 1MB doesn't make sense, we'd constantly start/stop workers. > On the other hand, it means that we would not use the streaming read for the > blocks in this mode, which is not efficient. I don't follow - why wouldn't you be using streaming read? Greetings, Andres Freund
On Sun, Mar 23, 2025 at 10:01 AM Melanie Plageman <melanieplageman@gmail.com> wrote: > > On Sun, Mar 23, 2025 at 4:46 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > > > If we use ParallelBlockTableScanDesc with streaming read like the > > patch did, we would also need to somehow rewind the number of blocks > > allocated to workers. The problem I had with such usage was that a > > parallel vacuum worker allocated a new chunk of blocks when doing > > look-ahead reading and therefore advanced > > ParallelBlockTableScanDescData.phs_nallocated. In this case, even if > > we unpin the remaining buffers in the queue by a new functionality and > > a parallel worker resumes the phase 1 from the last processed block, > > we would lose some blocks in already allocated chunks unless we rewind > > ParallelBlockTableScanDescData and ParallelBlockTableScanWorkerData > > data. However, since a worker might have already allocated multiple > > chunks it would not be easy to rewind these scan state data. > > Ah I didn't realize rewinding the state would be difficult. It seems > like the easiest way to make sure those blocks are done is to add them > back to the counter somehow. And I don't suppose there is some way to > save these not yet done block assignments somewhere and give them to > the workers who restart phase I to process on the second pass? It might be possible to store the not-yet-done blocks in DSA to pass them to the next workers. But it would make the code more complex. > > Another idea is that parallel workers don't exit phase 1 until it > > consumes all pinned buffers in the queue, even if the memory usage of > > TidStore exceeds the limit. It would need to add new functionality to > > the read stream to disable the look-ahead reading. Since we could use > > much memory while processing these buffers, exceeding the memory > > limit, we can trigger this mode when the memory usage of TidStore > > reaches 70% of the limit or so. 
On the other hand, it means that we > > would not use the streaming read for the blocks in this mode, which is > > not efficient. > > That might work. And/or maybe you could start decreasing the size of > block assignment chunks when the memory usage of TidStore reaches a > certain level. I don't know how much that would help or how fiddly it > would be. I've tried this idea in the attached version patch. I've started with a simple approach; once the TidStore reaches the limit, heap_vac_scan_next_block(), a callback for the read stream, begins to return InvalidBlockNumber. We continue phase 1 until the read stream is exhausted. > > > So we would need to > > invent a way to stop and resume the read stream in the middle during > > parallel scan. > > As for needing to add new read stream functionality, we actually > probably don't have to. If you use read_stream_end() -> > read_stream_reset(), it resets the distance to 0, so then > read_stream_next_buffer() should just end up unpinning the buffers and > freeing the per buffer data. I think the easiest way to implement this > is to think about it as ending a read stream and starting a new one > next time you start phase I and not as pausing and resuming the read > stream. And anyway, maybe it's better not to keep a bunch of pinned > buffers and allocated memory hanging around while doing what could be > very long index scans. You're right. I've studied the read stream code and figured out how to use it. In the attached patch, we end the read stream at the end of phase 1 and start a new read stream, as you suggested. Regards, -- Masahiko Sawada Amazon Web Services: https://aws.amazon.com
Attachment
- v13-0002-vacuumparallel.c-Support-parallel-vacuuming-for-.patch
- v13-0005-Support-parallelism-for-collecting-dead-items-du.patch
- v13-0004-Move-GlobalVisState-definition-to-snapmgr_intern.patch
- v13-0003-Move-lazy-heap-scan-related-variables-to-new-str.patch
- v13-0001-Introduces-table-AM-APIs-for-parallel-table-vacu.patch
On Sun, Mar 23, 2025 at 10:13 AM Andres Freund <andres@anarazel.de> wrote: > > Hi, > > On 2025-03-23 01:45:35 -0700, Masahiko Sawada wrote: > > Another idea is that parallel workers don't exit phase 1 until it > > consumes all pinned buffers in the queue, even if the memory usage of > > TidStore exceeds the limit. > > Yes, that seems a quite reasonable approach to me. > > > > It would need to add new functionality to the read stream to disable the > > look-ahead reading. > > Couldn't your next block callback simply return InvalidBlockNumber once close > to the memory limit? Good idea. > > > > Since we could use much memory while processing these buffers, exceeding the > > memory limit, we can trigger this mode when the memory usage of TidStore > > reaches 70% of the limit or so. > > It wouldn't be that much memory, would it? A few 10s-100s of buffers don't > increase the size of a TidStore that much? Using 10 parallel vacuum with a > m_w_m of 1MB doesn't make sense, we'd constantly start/stop workers. > It ultimately depends on how big the next size class of DSA segment is, but typically yes. We configure the max DSA segment size based on the maintenance_work_mem. > > > On the other hand, it means that we would not use the streaming read for the > > blocks in this mode, which is not efficient. > > I don't follow - why wouldn't you be using streaming read? I was thinking of an idea like stopping look-ahead reading at some point and processing pages without the read stream until the TidStore reaches the limit. But it seems simpler if we could stop retrieving the next block for the read stream once the TidStore has reached the limit and then continue phase 1 until the read stream is exhausted. Regards, -- Masahiko Sawada Amazon Web Services: https://aws.amazon.com
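The approach settled on here — the next-block callback stops handing out blocks once the TidStore reaches the limit, and phase 1 then just drains what is already queued — could be sketched as follows. This is a simplified stand-in, not patch code; the struct and names are invented:

```c
#include <stdint.h>
#include <stddef.h>

typedef uint32_t BlockNumber;
#define InvalidBlockNumber ((BlockNumber) 0xFFFFFFFF)

/* Simplified stand-in for the per-worker scan state. */
typedef struct ScanState
{
    BlockNumber next_block;       /* next block to hand to the stream */
    BlockNumber nblocks;          /* relation size in blocks */
    size_t      dead_items_mem;   /* current TidStore memory usage */
    size_t      mem_limit;        /* limit derived from maintenance_work_mem */
} ScanState;

/*
 * Read-stream next-block callback: once the TidStore has reached its
 * memory limit, return InvalidBlockNumber so the stream stops looking
 * ahead.  Phase 1 then continues only until the buffers already queued
 * in the stream have been consumed.
 */
static BlockNumber
vac_scan_next_block_cb(ScanState *state)
{
    if (state->dead_items_mem >= state->mem_limit)
        return InvalidBlockNumber;      /* memory limit reached */

    if (state->next_block >= state->nblocks)
        return InvalidBlockNumber;      /* relation exhausted */

    return state->next_block++;
}
```

In the real patch the callback is heap_vac_scan_next_block(), which additionally consults the parallel scan state and visibility-map skipping logic.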
On Mon, Mar 24, 2025 at 7:58 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > You're right. I've studied the read stream code and figured out how to > use it. In the attached patch, we end the read stream at the end of > phase 1 and start a new read stream, as you suggested.

I've started looking at this patch set some more.

In heap_vac_scan_next_block() if we are in the SKIP_PAGES_THRESHOLD codepath and run out of unskippable blocks in the current chunk and then go back to get another chunk (goto retry) but we are near the memory limit so we can't get another block (!dead_items_check_memory_limit()), could we get an infinite loop?

Or even incorrectly encroach on another worker's block? Asking that because of this math:

end_block = next_block +
    vacrel->plvstate->scanworker->pbscanwork.phsw_chunk_remaining + 1;

if vacrel->plvstate->scanworker->pbscanwork.phsw_chunk_remaining is 0 and we are in the goto retry loop, it seems like we could keep incrementing next_block even when we shouldn't be.

I just want to make sure that the skip pages optimization works with the parallel block assignment and the low memory read stream wind-down.

I also think you do not need to expose table_block_parallelscan_skip_pages_in_chunk() in the table AM. It is only called in heap-specific code and the logic seems very heap-related. If table AMs want something to skip some blocks, they could easily implement it.

On another topic, I think it would be good to have a comment above this code in parallel_lazy_scan_gather_scan_results(), stating why we are very sure it is correct.
Assert(TransactionIdIsValid(data->NewRelfrozenXid));
Assert(MultiXactIdIsValid(data->NewRelminMxid));

if (TransactionIdPrecedes(data->NewRelfrozenXid,
                          vacrel->scan_data->NewRelfrozenXid))
    vacrel->scan_data->NewRelfrozenXid = data->NewRelfrozenXid;

if (MultiXactIdPrecedesOrEquals(data->NewRelminMxid,
                                vacrel->scan_data->NewRelminMxid))
    vacrel->scan_data->NewRelminMxid = data->NewRelminMxid;

if (data->nonempty_pages < vacrel->scan_data->nonempty_pages)
    vacrel->scan_data->nonempty_pages = data->nonempty_pages;

vacrel->scan_data->skippedallvis |= data->skippedallvis;

Parallel relfrozenxid advancement sounds scary, and scary things are best with comments. Even though the way this works is intuitive, I think it is worth pointing out that this part is important to get right so future programmers know how important it is.

One thing I was wondering about is if there are any implications of different workers having different values in their GlobalVisState. GlobalVisState can be updated during vacuum, so even if they start out with the same values, that could diverge. It is probably okay since it just controls what tuples are removable. Some workers may remove fewer tuples than they absolutely could, and this is probably okay.

And if it is okay for each worker to have different GlobalVisState then maybe you shouldn't have a GlobalVisState in shared memory. If you look at GlobalVisTestFor() it just returns the memory address of that global variable in the backend. So, it seems like it might be better to just let each parallel worker use their own backend local GlobalVisState and not try to put it in shared memory and copy it from one worker to the other workers when initializing them. I'm not sure. At the very least, there should be a comment explaining why you've done it the way you have done it.

- Melanie
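As a self-contained model of the gather step quoted above (the struct is invented, and xid_precedes is a simplified stand-in for the wraparound-aware TransactionIdPrecedes()/MultiXactIdPrecedesOrEquals()):

```c
#include <stdint.h>
#include <stdbool.h>

typedef uint32_t TransactionId;
typedef uint32_t MultiXactId;
typedef uint32_t BlockNumber;

/* Per-worker scan results, mirroring the fields in the quoted code. */
typedef struct WorkerScanData
{
    TransactionId NewRelfrozenXid;
    MultiXactId   NewRelminMxid;
    BlockNumber   nonempty_pages;
    bool          skippedallvis;
} WorkerScanData;

/*
 * Simplified "precedes" test.  The real TransactionIdPrecedes() uses the
 * same signed-difference trick to handle 32-bit wraparound; special
 * (permanent) XIDs are ignored here.
 */
static bool
xid_precedes(uint32_t a, uint32_t b)
{
    return (int32_t) (a - b) < 0;
}

/*
 * Fold one worker's results into the leader's: keep the *oldest*
 * relfrozenxid/relminmxid any worker computed -- taking a newer value
 * would claim that no older XIDs remain on disk when some worker may
 * have left them behind -- plus the smallest nonempty_pages, and OR
 * together skippedallvis.
 */
static void
gather_worker_results(WorkerScanData *leader, const WorkerScanData *w)
{
    if (xid_precedes(w->NewRelfrozenXid, leader->NewRelfrozenXid))
        leader->NewRelfrozenXid = w->NewRelfrozenXid;
    if (xid_precedes(w->NewRelminMxid, leader->NewRelminMxid))
        leader->NewRelminMxid = w->NewRelminMxid;
    if (w->nonempty_pages < leader->nonempty_pages)
        leader->nonempty_pages = w->nonempty_pages;
    leader->skippedallvis |= w->skippedallvis;
}
```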
On Wed, Mar 26, 2025 at 1:00 PM Melanie Plageman <melanieplageman@gmail.com> wrote: > > On Mon, Mar 24, 2025 at 7:58 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > > > You're right. I've studied the read stream code and figured out how to > > use it. In the attached patch, we end the read stream at the end of > > phase 1 and start a new read stream, as you suggested. > > I've started looking at this patch set some more. Thank you for reviewing the patch! > > In heap_vac_scan_next_block() if we are in the SKIP_PAGES_THRESHOLD > codepath and run out of unskippable blocks in the current chunk and > then go back to get another chunk (goto retry) but we are near the > memory limit so we can't get another block > (!dead_items_check_memory_limit()), could we get an infinite loop? > > Or even incorrectly encroach on another worker's block? Asking that > because of this math > end_block = next_block + > > vacrel->plvstate->scanworker->pbscanwork.phsw_chunk_remaining + 1; You're right. We should make sure that next_block is reset to InvalidBlockNumber at the beginning of the retry loop. > > if vacrel->plvstate->scanworker->pbscanwork.phsw_chunk_remaining is 0 > and we are in the goto retry loop, it seems like we could keep > incrementing next_block even when we shouldn't be. Right. Will fix. > > I just want to make sure that the skip pages optimization works with > the parallel block assignment and the low memory read stream > wind-down. > > I also think you do not need to expose > table_block_parallelscan_skip_pages_in_chunk() in the table AM. It is > only called in heap-specific code and the logic seems very > heap-related. If table AMs want something to skip some blocks, they > could easily implement it. Agreed. Will remove it. > > On another topic, I think it would be good to have a comment above > this code in parallel_lazy_scan_gather_scan_results(), stating why we > are very sure it is correct. 
> Assert(TransactionIdIsValid(data->NewRelfrozenXid)); > Assert(MultiXactIdIsValid(data->NewRelminMxid)); > > if (TransactionIdPrecedes(data->NewRelfrozenXid, > vacrel->scan_data->NewRelfrozenXid)) > vacrel->scan_data->NewRelfrozenXid = data->NewRelfrozenXid; > > if (MultiXactIdPrecedesOrEquals(data->NewRelminMxid, > vacrel->scan_data->NewRelminMxid)) > vacrel->scan_data->NewRelminMxid = data->NewRelminMxid; > > if (data->nonempty_pages < vacrel->scan_data->nonempty_pages) > vacrel->scan_data->nonempty_pages = data->nonempty_pages; > > vacrel->scan_data->skippedallvis |= data->skippedallvis; > > Parallel relfrozenxid advancement sounds scary, and scary things are > best with comments. Even though the way this works is intuitive, I > think it is worth pointing out that this part is important to get > right so future programmers know how important it is. > > One thing I was wondering about is if there are any implications of > different workers having different values in their GlobalVisState. > GlobalVisState can be updated during vacuum, so even if they start out > with the same values, that could diverge. It is probably okay since it > just controls what tuples are removable. Some workers may remove fewer > tuples than they absolutely could, and this is probably okay. > Good point. > And if it is okay for each worker to have different GlobalVisState > then maybe you shouldn't have a GlobalVisState in shared memory. If > you look at GlobalVisTestFor() it just returns the memory address of > that global variable in the backend. So, it seems like it might be > better to just let each parallel worker use their own backend local > GlobalVisState and not try to put it in shared memory and copy it from > one worker to the other workers when initializing them. I'm not sure. > At the very least, there should be a comment explaining why you've > done it the way you have done it. Agreed. IIUC it's not a problem even if parallel workers use their own GlobalVisState. 
I'll make that change and remove the 0004 patch which exposes GlobalVisState. I'll send the updated patch soon. Regards, -- Masahiko Sawada Amazon Web Services: https://aws.amazon.com
On Thu, Mar 27, 2025 at 8:55 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > [...] > > Agreed. IIUC it's not a problem even if parallel workers use their own > GlobalVisState. I'll make that change and remove the 0004 patch which > exposes GlobalVisState. > > I'll send the updated patch soon. I've attached the updated patches. This version includes the comments from Melanie, some bug fixes, and comment updates. Regards, -- Masahiko Sawada Amazon Web Services: https://aws.amazon.com
Attachment
On Thu, Mar 27, 2025 at 11:40 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > [...] > > I've attached the updated patches. This version includes the comments > from Melanie, some bug fixes, and comment updates. Rebased the patches as they conflicted with recent commits. Regards, -- Masahiko Sawada Amazon Web Services: https://aws.amazon.com
Attachment
On Mon, Mar 31, 2025 at 2:05 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > [...] > > Rebased the patches as they conflicted with recent commits. I've attached the new version patch. There are no major changes; I fixed some typos, improved the comments, and removed duplicated code. Also, I've updated the commit messages. Regards, -- Masahiko Sawada Amazon Web Services: https://aws.amazon.com
Attachment
On Tue, Apr 1, 2025 at 5:30 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > > I've attached the new version patch. There are no major changes; I > fixed some typos, improved the comment, and removed duplicated codes. > Also, I've updated the commit messages. I haven't looked closely at this version but I did notice that you do not document that parallel vacuum disables eager scanning. Imagine you are a user who has set the eager freeze related table storage option (vacuum_max_eager_freeze_failure_rate) and you schedule a regular parallel vacuum. Now that table storage option does nothing. - Melanie
On Fri, Apr 4, 2025 at 11:05 AM Melanie Plageman <melanieplageman@gmail.com> wrote: > > On Tue, Apr 1, 2025 at 5:30 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > > > > > I've attached the new version patch. There are no major changes; I > > fixed some typos, improved the comment, and removed duplicated codes. > > Also, I've updated the commit messages. > > I haven't looked closely at this version but I did notice that you do > not document that parallel vacuum disables eager scanning. Imagine you > are a user who has set the eager freeze related table storage option > (vacuum_max_eager_freeze_failure_rate) and you schedule a regular > parallel vacuum. Now that table storage option does nothing. Good point. That restriction should be mentioned in the documentation. I'll update the patch. Regards, -- Masahiko Sawada Amazon Web Services: https://aws.amazon.com