Re: [PoC] Improve dead tuple storage for lazy vacuum - Mailing list pgsql-hackers

From Peter Geoghegan
Subject Re: [PoC] Improve dead tuple storage for lazy vacuum
Date
Msg-id CAH2-WzkRoNM991NzH6T6cDK5kMk+_3RzAv6hYH_5oPehRPsK=g@mail.gmail.com
In response to Re: [PoC] Improve dead tuple storage for lazy vacuum  (Hannu Krosing <hannuk@google.com>)
Responses Re: [PoC] Improve dead tuple storage for lazy vacuum  (Hannu Krosing <hannuk@google.com>)
List pgsql-hackers
On Thu, Jul 8, 2021 at 1:53 PM Hannu Krosing <hannuk@google.com> wrote:
> How I am approaching this is separating "page search" to run over a
> (naturally) sorted array of 32-bit page pointers, and only when the
> page is found are the indexes in this array used to look up the
> in-page bitmaps.
> This allows the heavier bsearch activity to run on a smaller range of
> memory, hopefully reducing the cache thrashing.

I think that the really important thing is to figure out roughly the
right data structure first.
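
To make the layout you describe concrete, here is a minimal sketch --
made-up names, not code from any patch -- of a dense sorted array of
block numbers searched with bsearch(), plus a parallel array of
per-page offset bitmaps that is only touched once the block is found:

#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>

#define MAX_OFFSETS   291                   /* ~max heap tuples per 8kB page */
#define BITMAP_BYTES  ((MAX_OFFSETS + 7) / 8)

typedef struct DeadTupleMap
{
    size_t      npages;
    uint32_t   *blocks;                     /* sorted; the only data bsearch touches */
    uint8_t   (*bitmaps)[BITMAP_BYTES];     /* bitmaps[i] belongs to blocks[i] */
} DeadTupleMap;

static int
uint32_cmp(const void *a, const void *b)
{
    uint32_t    x = *(const uint32_t *) a;
    uint32_t    y = *(const uint32_t *) b;

    return (x > y) - (x < y);
}

/* Is (block, offset) in the dead tuple map?  offset is 1-based, as in TIDs. */
static bool
tid_is_dead(const DeadTupleMap *map, uint32_t block, uint16_t offset)
{
    const uint32_t *hit = bsearch(&block, map->blocks, map->npages,
                                  sizeof(uint32_t), uint32_cmp);

    if (hit == NULL)
        return false;           /* block has no dead tuples at all */

    return (map->bitmaps[hit - map->blocks][(offset - 1) / 8]
            >> ((offset - 1) % 8)) & 1;
}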

> There are opportunities to optimise this further for cache hits, by
> collecting the tids from indexes in larger batches and then
> constraining the searches in the main is-deleted-bitmap to run over
> sections of it, but at some point this becomes a very complex
> balancing act, as the manipulation of the bits-to-check from indexes
> also takes time, not to mention the need to release the index pages
> and then later chase the tid pointers in case they have moved while
> checking them.

I would say that 200 TIDs per leaf page is common and ~1350 TIDs per
leaf page is not uncommon (with deduplication). Seems like that might
be enough?

> I have not measured anything yet, but one of my concerns is that in
> the case of very large dead tuple collections, an 8-way parallel
> bsearch could actually get close to saturating RAM bandwidth by
> reading (8 x 32 bits x cache-line-size) bytes from main memory every
> few cycles, so we may need some inner-loop-level throttling similar
> to the current vacuum_cost_limit for data pages.

If it happens then it'll be a nice problem to have, I suppose.

> Maybe not unrolling the full 32 iterations for a 32-bit bsearch, but
> something like an 8-iteration unroll to get most of the benefit.

My current assumption is that we're bound by memory speed right now,
and that that is the big bottleneck to eliminate -- we must keep the
CPU busy with data to process first. That seems like the most
promising thing to focus on right now.
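
On the unrolling point: much of the per-iteration overhead can also be
removed by making the search branchless, so that memory latency is the
only thing left to wait for. Something along these lines, purely as an
illustration of the kind of inner loop being discussed, not code from
any patch:

#include <stddef.h>
#include <stdint.h>

/*
 * Branchless lower-bound search over a sorted uint32 array.  Each
 * iteration is one load plus a conditional move, with no unpredictable
 * branch, so several independent searches can have their cache misses
 * in flight at the same time.
 */
static size_t
lower_bound_u32(const uint32_t *arr, size_t n, uint32_t key)
{
    const uint32_t *base = arr;

    if (n == 0)
        return 0;

    while (n > 1)
    {
        size_t      half = n / 2;

        /* typically compiled to a conditional move, not a branch */
        base = (base[half] < key) ? base + half : base;
        n -= half;
    }

    return (base - arr) + (*base < key);
}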

> While it may make sense to have different bitmap encodings for
> different distributions, it likely would not be good for optimisations
> if all these are used at the same time.

To some degree designs like Roaring bitmaps are just that -- a way of
dynamically figuring out which strategy to use based on data
characteristics.
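
The core of that idea is small enough to sketch: the key space is cut
into fixed-size chunks, and each chunk picks its own representation
based on its local density. Roughly like this (names are illustrative;
the 4096 threshold is where a sorted uint16 array stops being smaller
than the 8kB bitmap):

#include <stdint.h>

#define CHUNK_RANGE     65536   /* values covered by one container */
#define ARRAY_THRESHOLD 4096    /* beyond this, a plain bitmap is smaller */

typedef enum
{
    CONTAINER_ARRAY,            /* sparse chunk: sorted uint16 array */
    CONTAINER_BITMAP            /* dense chunk: one bit per possible value */
} ContainerKind;

typedef struct Container
{
    ContainerKind kind;
    uint32_t    nvalues;
    union
    {
        uint16_t   *array;      /* nvalues sorted entries */
        uint64_t   *bitmap;     /* CHUNK_RANGE / 64 words */
    }           u;
} Container;

/* The "dynamic strategy" in one line: pick by how full the chunk is. */
static ContainerKind
choose_container(uint32_t nvalues)
{
    return (nvalues <= ARRAY_THRESHOLD) ? CONTAINER_ARRAY : CONTAINER_BITMAP;
}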

> This is why I propose that the first bitmap-collecting phase collect
> into a file, and then - when reading it into memory for the lookup
> phase - possibly rewrite the initial structure to something else if
> it sees that that would be more efficient. Like, for example, where
> the first half of the file consists of only empty pages.

Yeah, I agree that something like that could make sense. Although
rewriting it doesn't seem particularly promising, since we can easily
make it cheap to process any TID that falls into a range of blocks
that have no dead tuples. We don't need to rewrite the data structure
to make it do that well, AFAICT.
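
Just to illustrate the kind of thing I mean by "cheap" (a sketch only,
not a concrete proposal): a coarse side summary with one bit per heap
block is enough to reject every TID from an all-clean block range with
a single predictable lookup, without touching the main structure at
all.

#include <stdbool.h>
#include <stdint.h>

/* Coarse summary kept next to the main structure: bit i is set iff heap
 * block i has at least one dead TID.  For a 100GB table this is ~1.6MB. */
typedef struct DeadBlockSummary
{
    uint32_t    nblocks;        /* heap size in blocks */
    uint64_t   *any_dead;
} DeadBlockSummary;

static inline bool
block_maybe_has_dead(const DeadBlockSummary *s, uint32_t block)
{
    if (block >= s->nblocks)
        return false;
    return (s->any_dead[block / 64] >> (block % 64)) & 1;
}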

When I said that I thought of this a little like a hash join, I was
being more serious than you might imagine. Note that the number of
index tuples that VACUUM will delete from each index can now be far
less than the total number of TIDs stored in memory. So even when we
have (say) 20% of all of the TIDs from the table in our in-memory list
managed by vacuumlazy.c, it's now quite possible that VACUUM will only
actually "match"/"join" (i.e. delete) as few as 2% of the index tuples
it finds in the index (there really is no way to predict how many).
The opportunistic deletion stuff could easily be doing most of the
required cleanup in an eager fashion following recent improvements --
VACUUM need only take care of "floating garbage" these days. In other
words, thinking about this as something that is a little bit like a
hash join makes sense because hash joins do very well with high join
selectivity, and high join selectivity is common in the real world.
The intersection of TIDs from each leaf page with the in-memory TID
delete structure will often be very small indeed.
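
In code, the "join" step is really nothing more than a loop like the
one below, and the interesting property is just how rarely the probe
hits. lookup_is_dead() here is a stand-in for whatever interface the
final data structure provides; it isn't a real function.

#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef struct LeafTid
{
    uint32_t    block;
    uint16_t    offset;
} LeafTid;

/* Stand-in for the real lookup against the in-memory dead-TID structure. */
extern bool lookup_is_dead(uint32_t block, uint16_t offset);

/* Probe every TID found on one leaf page; return how many are deletable,
 * recording their positions in 'out' (sized to hold ntids entries). */
static size_t
collect_deletable(const LeafTid *tids, size_t ntids, size_t *out)
{
    size_t      ndeletable = 0;

    for (size_t i = 0; i < ntids; i++)
    {
        /* With only floating garbage left, the vast majority of probes miss. */
        if (lookup_is_dead(tids[i].block, tids[i].offset))
            out[ndeletable++] = i;
    }

    return ndeletable;
}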

> Then again it may be so much extra work that it starts to dominate
> some parts of profiles.
>
> For example, see the work that was done on improving the mini-vacuum
> part, where it was actually faster to copy data out to a separate
> buffer and then back in than to shuffle it around inside the same 8kB
> page.

Some of what I'm saying is based on the experience of improving
similar code used by index tuple deletion in Postgres 14. That did
quite a lot of sorting of TIDs and things like that. In the end the
sorting had no more than a negligible impact on performance. What
really mattered was that we efficiently coordinate with distant heap
pages that describe which index tuples we can delete from a given leaf
page. Sorting hundreds of TIDs is cheap. Reading hundreds of random
locations in memory (or even far fewer) is not so cheap. It might even
be very slow indeed. Sorting in order to batch could end up looking
like cheap insurance that we should be glad to pay for.
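
For what it's worth, the whole batching step can be as simple as the
sketch below (simplified stand-ins here, not PostgreSQL's
ItemPointerData); sorting a few hundred of these with qsort() is
negligible next to even a handful of cache misses.

#include <stdint.h>
#include <stdlib.h>

/* Simplified TID used only for this illustration. */
typedef struct SimpleTid
{
    uint32_t    block;
    uint16_t    offset;
} SimpleTid;

static int
simple_tid_cmp(const void *a, const void *b)
{
    const SimpleTid *ta = (const SimpleTid *) a;
    const SimpleTid *tb = (const SimpleTid *) b;

    if (ta->block != tb->block)
        return (ta->block > tb->block) ? 1 : -1;
    if (ta->offset != tb->offset)
        return (ta->offset > tb->offset) ? 1 : -1;
    return 0;
}

/* Sort so that the batched probes walk memory in ascending TID order. */
static void
sort_tids(SimpleTid *tids, size_t ntids)
{
    qsort(tids, ntids, sizeof(SimpleTid), simple_tid_cmp);
}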

> So only testing will tell.

True.

-- 
Peter Geoghegan


