Thread: Batch TIDs lookup in ambulkdelete
Hi all,

During index bulk-deletion in lazy vacuum, we currently check the
deletability of each index tuple individually using the vac_tid_reaped()
function. The attached proof-of-concept patches propose batching multiple
TID lookups for deletability checks to reduce overhead. This optimization
aims to minimize redundant function calls and repeated TidStore entry
retrievals for TIDs on the same page.

I have conducted benchmarks across several scenarios to evaluate the
performance impact.

# Case-1 (btree tuples are regular tuples and dead TIDs are concentrated):
create unlogged table test (c int) with (autovacuum_enabled = off);
insert into test select generate_series(1, ${NROWS});
create index on test (c);
delete from test where c < ${NROWS} * 0.3;

# Case-2 (btree tuples are regular tuples and dead TIDs are sparse):
create unlogged table test (c int) with (autovacuum_enabled = off);
insert into test select generate_series(1, ${NROWS});
create index on test (c);
delete from test where random() < 0.3;

# Case-3 (btree tuples are deduplicated tuples):
create unlogged table test (c int) with (autovacuum_enabled = off);
insert into test select c % 1000 from generate_series(1, ${NROWS}) c;
create index on test (c);
select pg_relation_size('test') / 8192 as relpages \gset
delete from test where (ctid::text::point)[0] < ((:'relpages')::int * 0.3);

# Case-4 (btree tuples are deduplicated tuples and table is clustered):
create unlogged table test (c int) with (autovacuum_enabled = off);
insert into test select c % 1000 from generate_series(1, ${NROWS}) c;
create index on test (c);
cluster test using test_c_idx;
select pg_relation_size('test') / 8192 as relpages \gset
delete from test where (ctid::text::point)[0] < ((:'relpages')::int * 0.3);

# Case-5 (btree index on a UUID column):
create unlogged table test (c uuid) with (autovacuum_enabled = off);
insert into test select uuidv4() from generate_series(1, ${NROWS}) c;
create index on test (c);
select pg_relation_size('test') / 8192 as relpages \gset
delete from test where (ctid::text::point)[0] < ((:'relpages')::int * 0.3);

Here are the results (NROWS = 50000000):

         HEAD      PATCHED   DIFF
case-1:  3,021 ms  2,818 ms   93.29%
case-2:  5,697 ms  5,545 ms   97.34%
case-3:  2,833 ms  2,790 ms   98.48%
case-4:  2,564 ms  2,279 ms   88.86%
case-5:  4,657 ms  4,706 ms  101.04%

I've measured a 6-11% improvement in btree bulk-deletion. Here is a
summary of each attached patch:

0001: Introduce TidStoreIsMemberMulti(), which performs the IsMember check
for multiple TIDs in one function call. If the given TIDs are sorted (at
least by block number), we can avoid repeating the radix tree lookup for
the same page entry.

0002: Convert IndexBulkDeleteCallback() to a batched operation.

0003: Use batch TID lookups in btree index bulk-deletion.

In patch 0003, we implement batch TID lookups both for each deduplicated
index tuple and for all remaining regular index tuples, which seems to be
the most straightforward approach. While further optimizations are
possible, such as performing batch TID lookups for all index tuples on a
single page, these could introduce additional overhead from sorting and
re-sorting TIDs. Moreover, considering that radix tree lookups are
relatively inexpensive, the benefits of sorting TIDs and using
TidStoreIsMemberMulti() might be minimal. Nevertheless, these potential
optimizations warrant further evaluation to determine their actual impact
on performance.

Also, the patch includes batch TID lookup support only for btree indexes,
but we could potentially improve other index AMs too.
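To illustrate the idea behind 0001: below is a simplified, standalone
sketch, not code from the patch, and every type and function name in it
is made up. The point is only that with the TID array sorted by block
number, the per-block lookup (the radix tree lookup in the real TidStore)
is done once per block instead of once per TID.

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/* hypothetical, simplified stand-ins for the real TidStore structures */
typedef struct DeadBlock
{
    uint32_t block;          /* heap block number */
    uint64_t offset_bitmap;  /* bit n set => offset n is dead (toy limit: 64) */
} DeadBlock;

typedef struct ToyStore
{
    const DeadBlock *blocks; /* entries sorted by block number */
    int              nblocks;
} ToyStore;

/* one "radix tree" lookup: find the per-block entry, or NULL if none */
static const DeadBlock *
lookup_block(const ToyStore *store, uint32_t block)
{
    for (int i = 0; i < store->nblocks; i++)
    {
        if (store->blocks[i].block == block)
            return &store->blocks[i];
    }
    return NULL;
}

typedef struct Tid
{
    uint32_t block;
    uint16_t offset;
} Tid;

/*
 * Batched membership check.  Because 'tids' is sorted by block number, the
 * per-block entry is fetched once per run of TIDs on the same block rather
 * than once per TID.  Returns the number of members found.
 */
static int
is_member_multi(const ToyStore *store, const Tid *tids, int ntids,
                bool *ismember)
{
    const DeadBlock *entry = NULL;
    bool             have_block = false;
    uint32_t         cur_block = 0;
    int              nfound = 0;

    for (int i = 0; i < ntids; i++)
    {
        if (!have_block || tids[i].block != cur_block)
        {
            /* block changed: do the expensive lookup once for the whole run */
            cur_block = tids[i].block;
            entry = lookup_block(store, cur_block);
            have_block = true;
        }

        ismember[i] = entry != NULL &&
            (entry->offset_bitmap & (UINT64_C(1) << tids[i].offset)) != 0;
        if (ismember[i])
            nfound++;
    }

    return nfound;
}

int
main(void)
{
    DeadBlock blocks[] = {{10, 0x06}, {42, 0x10}};  /* dead: (10,1) (10,2) (42,4) */
    ToyStore  store = {blocks, 2};
    Tid       tids[] = {{10, 1}, {10, 3}, {42, 4}}; /* sorted by block */
    bool      ismember[3];

    printf("members found: %d\n", is_member_multi(&store, tids, 3, ismember));
    return 0;
}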
Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com
On Thu, May 1, 2025 at 5:36 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:

>          HEAD      PATCHED   DIFF
> case-1:  3,021 ms  2,818 ms   93.29%
> case-2:  5,697 ms  5,545 ms   97.34%
> case-3:  2,833 ms  2,790 ms   98.48%
> case-4:  2,564 ms  2,279 ms   88.86%
> case-5:  4,657 ms  4,706 ms  101.04%

1 and 4 look significant -- do the other cases have reproducible
differences or is it just noise?

> Here is a summary of each attached patch:
>
> 0001: Introduce TidStoreIsMemberMulti(), which performs the IsMember check
> for multiple TIDs in one function call. If the given TIDs are sorted (at
> least by block number), we can avoid repeating the radix tree lookup for
> the same page entry.

My only comment is that TidStoreIsMember() is now unused in core (or
maybe just the tests?). It seems like we could just change the API for
it rather than introduce a new function?

> 0003: Use batch TID lookups in btree index bulk-deletion.
>
> In patch 0003, we implement batch TID lookups both for each deduplicated
> index tuple and for all remaining regular index tuples, which seems to be
> the most straightforward approach.

Seems like a good approach. btvacuumpage() needs to sort if there is a
mix of posting tuples and regular index tuples. Was that covered by any
of the tests above?

> While further optimizations are possible, such as performing batch TID
> lookups for all index tuples on a single page, these could introduce
> additional overhead from sorting and re-sorting TIDs. Moreover,
> considering that radix tree lookups are relatively inexpensive, the
> benefits of sorting TIDs and using TidStoreIsMemberMulti() might be
> minimal. Nevertheless, these potential optimizations warrant further
> evaluation to determine their actual impact on performance.

My guess is that always sorting by TID and then back by index tuple
offset is too much overhead to be worth it, but I'm not sure.

--
John Naylor
Amazon Web Services
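To make the API discussion above concrete: here is the existing
single-TID entry point, plus one possible batched shape. The second
prototype is only a guess at what the 0001 patch proposes; its parameter
names, return convention, and actual signature may differ.

/* existing API in src/include/access/tidstore.h */
extern bool TidStoreIsMember(TidStore *ts, ItemPointer tid);

/*
 * Hypothetical batched variant (illustrative only): check ntids TIDs,
 * ideally sorted by block number, in one call, filling ismember[] and
 * returning how many were found.
 */
extern int TidStoreIsMemberMulti(TidStore *ts, ItemPointer tids, int ntids,
                                 bool *ismember);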
On Tue, May 13, 2025 at 2:26 PM Matheus Alcantara
<matheusssilv97@gmail.com> wrote:
>
> Hi,
>
> On 30/04/25 19:36, Masahiko Sawada wrote:
> > Here is a summary of each attached patch:
> >
> > 0001: Introduce TidStoreIsMemberMulti(), which performs the IsMember check
> > for multiple TIDs in one function call. If the given TIDs are sorted (at
> > least by block number), we can avoid repeating the radix tree lookup for
> > the same page entry.
> >
> > 0002: Convert IndexBulkDeleteCallback() to a batched operation.
> >
> > 0003: Use batch TID lookups in btree index bulk-deletion.
> >
> > In patch 0003, we implement batch TID lookups both for each deduplicated
> > index tuple and for all remaining regular index tuples, which seems to be
> > the most straightforward approach. While further optimizations are
> > possible, such as performing batch TID lookups for all index tuples on a
> > single page, these could introduce additional overhead from sorting and
> > re-sorting TIDs. Moreover, considering that radix tree lookups are
> > relatively inexpensive, the benefits of sorting TIDs and using
> > TidStoreIsMemberMulti() might be minimal. Nevertheless, these potential
> > optimizations warrant further evaluation to determine their actual impact
> > on performance.
> >
> > Also, the patch includes batch TID lookup support only for btree indexes,
> > but we could potentially improve other index AMs too.
>
> The code looks good and also +1 for the idea. I just have some small
> points:
> - Maybe it would be good to mention somewhere that the
>   IndexBulkDeleteCallback() callback returns the number of TIDs found
>   to be members of the TidStore?
> - The vac_tid_reaped() docs may need to be updated?

Thank you for looking at the patches. I agree with the above comments.

> I also executed meson tests for each patch individually and the 0002
> patch is not passing on my machine (macOS).
>
> Ok: 39
> Expected Fail: 0
> Fail: 271
> Unexpected Pass: 0
> Skipped: 22
> Timeout: 0
>
> One behaviour that I found by executing the 0002 tests is that it may be
> leaking some shared memory segments. I noticed that because, after
> executing the tests, I tried to re-execute based on master and all tests
> were failing with the "Failed system call was shmget(key=97530599,
> size=56, 03600)" error. I also checked the shared memory segments using
> "ipcs -m" and it returns some segments which are not returned when I
> execute the tests on master (after cleaning up the leaked memory
> segments), and it also doesn't occur when executing based on 0001 or
> 0003.
>
> ~/d/p/batch-tids-lookup-ambulkdelete ❯❯❯ ipcs -m
> IPC status from <running system> as of Tue May 13 18:19:14 -03 2025
> T        ID       KEY        MODE        OWNER    GROUP
> Shared Memory:
> m  18087936  0x05f873bf  --rw-------  matheus  staff
> m  15925250  0x05f966fe  --rw-------  matheus  staff
> m  24248325  0x05f9677e  --rw-------  matheus  staff
> ....
>
> Note that the 0003 patch doesn't have this issue, so in the end we won't
> have a problem with this, I think. But it may be good to mention that,
> although the patches are separate, there is a dependency between them,
> which may cause issues on the buildfarm?

Thank you for the report. With the 0001 and 0002 patches applied, I got a
SEGV. I've fixed this issue in the attached updated version of the
patches. I've confirmed that the patches pass the CI tests, but I'm not
sure whether that fixes the shared memory segment leak problem you
reported.

The attached patches incorporate the comments[1] from John as well.
BTW I found that the constant 'maxblkno' in test_tidstore.sql actually
equals InvalidBlockNumber, not MaxBlockNumber. I don't think it makes
sense for TidStore to use InvalidBlockNumber as a key. The attached 0001
patch fixes it. I think we can fix it separately on HEAD as well as in
the back branches.

Regards,

[1] https://www.postgresql.org/message-id/CANWCAZbiJcwSBCczLfbfiPe1mET+V9PjTZv5VvUBwarLvx1Hfg@mail.gmail.com

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com
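For readers following along, the off-by-one above is easiest to see from
the definitions themselves (quoted from memory from
src/include/storage/block.h; check the header for the authoritative text):

/* from src/include/storage/block.h */
typedef uint32 BlockNumber;

#define InvalidBlockNumber  ((BlockNumber) 0xFFFFFFFF)  /* never a valid block */
#define MaxBlockNumber      ((BlockNumber) 0xFFFFFFFE)  /* largest valid block */

So a test constant of 0xFFFFFFFF is one past the largest block number a
heap can actually contain.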
On Sun, Jun 1, 2025 at 11:01 PM John Naylor <johncnaylorls@gmail.com> wrote:
>
> On Thu, May 1, 2025 at 5:36 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
> >          HEAD      PATCHED   DIFF
> > case-1:  3,021 ms  2,818 ms   93.29%
> > case-2:  5,697 ms  5,545 ms   97.34%
> > case-3:  2,833 ms  2,790 ms   98.48%
> > case-4:  2,564 ms  2,279 ms   88.86%
> > case-5:  4,657 ms  4,706 ms  101.04%
>
> 1 and 4 look significant -- do the other cases have reproducible
> differences or is it just noise?

These results are the average of 3 executions, so they are reproducible
differences.

> > Here is a summary of each attached patch:
> >
> > 0001: Introduce TidStoreIsMemberMulti(), which performs the IsMember check
> > for multiple TIDs in one function call. If the given TIDs are sorted (at
> > least by block number), we can avoid repeating the radix tree lookup for
> > the same page entry.
>
> My only comment is that TidStoreIsMember() is now unused in core (or
> maybe just the tests?). It seems like we could just change the API for
> it rather than introduce a new function?

Good point, changed in the latest patch I posted[1].

> > 0003: Use batch TID lookups in btree index bulk-deletion.
> >
> > In patch 0003, we implement batch TID lookups both for each deduplicated
> > index tuple and for all remaining regular index tuples, which seems to be
> > the most straightforward approach.
>
> Seems like a good approach. btvacuumpage() needs to sort if there is a
> mix of posting tuples and regular index tuples. Was that covered by any
> of the tests above?

Good point, I think this case was not covered. I've measured the
performance with the following queries:

# Case-6:
create unlogged table test (c int) with (autovacuum_enabled = off);
insert into test select (2 * c - 1) from generate_series(1, ${NROWS}) c;
insert into test select c from generate_series(1, ${NROWS}) c;
create index on test (c);
select pg_relation_size('test') / 8192 as relpages \gset
delete from test where c < (${NROWS} * 0.3)::int;
vacuum test;

And here is the result:

         HEAD      PATCHED   DIFF
case-6:  3,320 ms  3,617 ms  108.94%

I'll consider how to deal with the overhead of sorting TIDs.

> > While further optimizations are possible, such as performing batch TID
> > lookups for all index tuples on a single page, these could introduce
> > additional overhead from sorting and re-sorting TIDs. Moreover,
> > considering that radix tree lookups are relatively inexpensive, the
> > benefits of sorting TIDs and using TidStoreIsMemberMulti() might be
> > minimal. Nevertheless, these potential optimizations warrant further
> > evaluation to determine their actual impact on performance.
>
> My guess is that always sorting by TID and then back by index tuple
> offset is too much overhead to be worth it, but I'm not sure.

Agreed. Given the above test results, it's unlikely that always sorting
the array helps speed things up.

Regards,

[1] https://www.postgresql.org/message-id/CAD21AoAv55DhJ%2B19zaemx-_eO7z%2Bu4gtFmeADsMBFqtHhyUySQ%40mail.gmail.com

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com
On Fri, Jun 6, 2025 at 6:58 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:

> Agreed. Given the above test results, it's unlikely that always sorting
> the array helps speed things up.

Did you try specializing the sort? In my experience, it makes a big
difference.

--
Peter Geoghegan
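For what it's worth, "specializing the sort" in PostgreSQL usually means
generating a type-specific sort with lib/sort_template.h instead of
calling qsort() through a comparator function pointer. A minimal sketch
follows; the generated function name sort_dead_tids is made up, and the
real patch may specialize differently.

/* illustrative only: a TID sort specialized via lib/sort_template.h */
#include "postgres.h"
#include "storage/itemptr.h"

#define ST_SORT sort_dead_tids
#define ST_ELEMENT_TYPE ItemPointerData
#define ST_COMPARE(a, b) ItemPointerCompare(a, b)
#define ST_SCOPE static
#define ST_DEFINE
#include "lib/sort_template.h"

/* usage (dead_tids is an ItemPointerData array of length ndead):
 *     sort_dead_tids(dead_tids, ndead);
 */

Inlining the comparison this way avoids the per-element indirect call that
a generic qsort() pays, which is where the speedup typically comes from.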