Re: Batch TIDs lookup in ambulkdelete - Mailing list pgsql-hackers

From Masahiko Sawada
Subject Re: Batch TIDs lookup in ambulkdelete
Date
Msg-id CAD21AoCGzTu66ZjatOR2yxu6nVkpK7=XBh2reZXiJ819rT-38Q@mail.gmail.com
Whole thread Raw
In response to Re: Batch TIDs lookup in ambulkdelete  (John Naylor <johncnaylorls@gmail.com>)
Responses Re: Batch TIDs lookup in ambulkdelete
List pgsql-hackers
On Sun, Jun 1, 2025 at 11:01 PM John Naylor <johncnaylorls@gmail.com> wrote:
>
> On Thu, May 1, 2025 at 5:36 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
> >                   HEAD         PATCHED    DIFF
> > case-1:     3,021 ms      2.818 ms    93.29%
> > case-2:     5, 697 ms     5.545 ms    97.34%
> > case-3:     2,833 ms      2.790 ms    98.48%
> > case-4:     2,564 ms      2.279 ms    88.86%
> > case-5:     4,657 ms      4.706 ms   101.04%
>
> 1 and 4 look significant -- do the other cases have reproducible
> differences or is it just noise?

These results are the average of 3 times executions, so they are
reproducible differences.

>
> > Here are the summary of each attached patch:
> >
> > 0001: Introduce TIdStoreIsMemberMulti() where we can do IsMember check
> > for multiple TIDs in one function call. If the given TIDs are sorted
> > (at least in block number), we can save radix tree lookup for the same
> > page entry.
>
> My only comment is that TidStoreIsMember() is now unused in core (or
> maybe just the tests?). It seems like we could just change the API for
> it rather than introduce a new function?

Good point, changed in the latest patch I posted[1].

>
> > 0003: Use batch TIDs lookup in btree index bulk-deletion.
> >
> > In patch 0003, we implement batch TID lookups for both each
> > deduplicated index tuple and remaining all regular index tuples, which
> > seems to be the most straightforward approach.
>
> Seems like a good approach. btvacuumpage() needs to sort if there is a
> mix of posting tuples and regular index tuples. Was that covered by
> any of the tests above?

Good point, I think this case was not covered. I've measured the
performance with the following queries:

# Case-6:
create unlogged table test (c int) with (autovacuum_enabled = off);
insert into test select (2 * c - 1) from generate_series(1, ${NROWS}) c;
insert into test select c from generate_series(1, ${NROWS}) c;
create index on test (c);
select pg_relation_size('test') / 8192 as relpages \gset
delete from test where c < (${NROWS} * 0.3)::int
vacuum test;

And here is the result:

                  HEAD         PATCHED    DIFF
case-6:     3,320 ms      3,617 ms    108.94%

I'll consider how to deal with overheads of sorting TIDs.

> > While further
> > optimizations are possible, such as performing batch TID lookups for
> > all index tuples on a single page, these could introduce additional
> > overhead from sorting and re-sorting TIDs. Moreover, considering that
> > radix tree lookups are relatively inexpensive, the benefits of sorting
> > TIDs and using TidStoreIsMemberMulti() might be minimal. Nevertheless,
> > these potential optimizations warrant further evaluation to determine
> > their actual impact on performance.
>
> My guess is that always sorting by TID and than back by index tuple
> offset is too much overhead to be worth it, but I'm not sure.

Agreed. Given the above test results, it's unlikely always sorting the
array helps speedups.

Regards,

[1] https://www.postgresql.org/message-id/CAD21AoAv55DhJ%2B19zaemx-_eO7z%2Bu4gtFmeADsMBFqtHhyUySQ%40mail.gmail.com

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com



pgsql-hackers by date:

Previous
From: Jeff Davis
Date:
Subject: Re: CREATE DATABASE command for non-libc providers
Next
From: Peter Geoghegan
Date:
Subject: Re: Batch TIDs lookup in ambulkdelete