Re: Parallel heap vacuum - Mailing list pgsql-hackers
From: Masahiko Sawada
Subject: Re: Parallel heap vacuum
Date:
Msg-id: CAD21AoD2PR9XLaAcU92hp=SeLaPQ3AGevztttrh+uf5Ugu3H-Q@mail.gmail.com
In response to: Re: Parallel heap vacuum (Amit Kapila <amit.kapila16@gmail.com>)
List: pgsql-hackers
On Fri, Jun 28, 2024 at 9:06 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Fri, Jun 28, 2024 at 9:44 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> >
> > # Benchmark results
> >
> > * Test-1: parallel heap scan on the table without indexes
> >
> > I created a 20GB table, made garbage on the table, and ran vacuum
> > while changing the parallel degree:
> >
> > create unlogged table test (a int) with (autovacuum_enabled = off);
> > insert into test select generate_series(1, 600000000); --- 20GB table
> > delete from test where a % 5 = 0;
> > vacuum (verbose, parallel 0) test;
> >
> > Here are the results (total time and heap scan time):
> >
> > PARALLEL 0: 21.99 s (single process)
> > PARALLEL 1: 11.39 s
> > PARALLEL 2: 8.36 s
> > PARALLEL 3: 6.14 s
> > PARALLEL 4: 5.08 s
> >
> > * Test-2: parallel heap scan on the table with one index
> >
> > I used a table similar to the one in test case 1 but created one
> > btree index on it:
> >
> > create unlogged table test (a int) with (autovacuum_enabled = off);
> > insert into test select generate_series(1, 600000000); --- 20GB table
> > create index on test (a);
> > delete from test where a % 5 = 0;
> > vacuum (verbose, parallel 0) test;
> >
> > I've measured the total execution time as well as the time of each
> > vacuum phase (from left: heap scan time, index vacuum time, and heap
> > vacuum time):
> >
> > PARALLEL 0: 45.11 s (21.89, 16.74, 6.48)
> > PARALLEL 1: 42.13 s (12.75, 22.04, 7.23)
> > PARALLEL 2: 39.27 s (8.93, 22.78, 7.45)
> > PARALLEL 3: 36.53 s (6.76, 22.00, 7.65)
> > PARALLEL 4: 35.84 s (5.85, 22.04, 7.83)
> >
> > Overall, I can see that the parallel heap scan in lazy vacuum has
> > decent scalability; in both test-1 and test-2, the execution time of
> > the heap scan got ~4x faster with 4 parallel workers. On the other
> > hand, when it comes to the total vacuum execution time, I could not
> > see much performance improvement in test-2 (45.11 vs. 35.84).
> > Comparing PARALLEL 0 vs. PARALLEL 1 in test-2, the heap scan got
> > faster (21.89 vs. 12.75) whereas index vacuum got slower (16.74 vs.
> > 22.04), and the heap scan in test-2 was not as fast as in test-1
> > with 1 parallel worker (12.75 vs. 11.39).
> >
> > I think the reason is that the shared TidStore is not very scalable,
> > since we have a single lock on it. In all cases in test-1 we don't
> > use the shared TidStore, since all dead tuples are removed during
> > heap pruning, so scalability was better overall than in test-2. In
> > the parallel 0 case in test-2 we use the local TidStore, and from a
> > parallel degree of 1 in test-2 we use the shared TidStore, which
> > parallel workers update concurrently. Also, I guess that the lookup
> > performance of the local TidStore is better than the shared
> > TidStore's because of the differences between a bump context and a
> > DSA area. I think this difference contributed to the fact that index
> > vacuuming got slower (16.74 vs. 22.04).

Thank you for the comments!

> > There are two obvious ideas to improve the overall vacuum execution
> > time: (1) improve the shared TidStore's scalability and (2) support
> > parallel heap vacuum. For (1), several ideas are proposed by the ART
> > authors[1]. I've not tried these ideas, but they might be applicable
> > to our ART implementation. But I prefer to start with (2) since it
> > would be easier. Feedback is very welcome.
>
> Starting with (2) sounds like a reasonable approach. We should study a
> few more things like (a) the performance results where there are 3-4
> indexes,

Here are the results with 4 indexes (restarting the server before the
benchmark):

PARALLEL 0: 115.48 s (32.76, 64.46, 18.24)
PARALLEL 1: 74.88 s (17.11, 44.43, 13.25)
PARALLEL 2: 71.15 s (14.13, 44.82, 12.12)
PARALLEL 3: 46.78 s (10.74, 24.50, 11.43)
PARALLEL 4: 46.42 s (8.95, 24.96, 12.39)

(launched 4 workers for heap scan and 3 workers for index vacuum)

> (b) What is the reason for the performance improvement seen with only
> heap scans? We normally get the benefits of parallelism from using
> multiple CPUs, but parallelizing scans (I/O) shouldn't give much
> benefit. Is it possible that you are seeing benefits because most of
> the data is either in shared_buffers or in memory? We can probably try
> vacuuming tables after restarting the nodes to ensure the data is not
> in memory.

I think it depends on the storage performance. FYI, I use an EC2
instance (m6id.metal). I've run the same benchmark script (table with
no index) with a server restart before executing the vacuum, and here
are the results:

PARALLEL 0: 32.75 s
PARALLEL 1: 17.46 s
PARALLEL 2: 13.41 s
PARALLEL 3: 10.31 s
PARALLEL 4: 8.48 s

With the above two tests, I used the updated patch that I just
submitted[1].

Regards,

[1] https://www.postgresql.org/message-id/CAD21AoAWHHnCg9OvtoEJnnvCc-3isyOyAGn%2B2KYoSXEv%3DvXauw%40mail.gmail.com

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com
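For readers who want to verify the "~4x faster with 4 parallel workers" observation in the thread above, the Test-1 heap-scan timings can be converted into speedup and parallel-efficiency figures. This small Python helper is purely illustrative (it is not part of the patch or the thread); it only restates the numbers already posted:

```python
# Test-1 heap scan times (seconds) by parallel degree, from the results above.
# A parallel degree of N means N workers plus the leader, i.e. N + 1 processes.
times = {0: 21.99, 1: 11.39, 2: 8.36, 3: 6.14, 4: 5.08}

base = times[0]  # single-process baseline
for degree, t in sorted(times.items()):
    procs = degree + 1  # workers + leader
    speedup = base / t
    efficiency = speedup / procs
    print(f"parallel {degree}: {speedup:.2f}x speedup, "
          f"{efficiency:.0%} efficiency across {procs} process(es)")
```

At parallel degree 4 this works out to roughly a 4.33x speedup at about 87% efficiency over 5 processes, consistent with the "~4x" characterization.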