Re: [PoC] Improve dead tuple storage for lazy vacuum - Mailing list pgsql-hackers

From Hannu Krosing
Subject Re: [PoC] Improve dead tuple storage for lazy vacuum
Date
Msg-id CAMT0RQQftTGZSsTg0SeRYBejbFL65Yv-ETvpjk4sxFK02HOPsg@mail.gmail.com
In response to Re: [PoC] Improve dead tuple storage for lazy vacuum  (Hannu Krosing <hannuk@google.com>)
Responses Re: [PoC] Improve dead tuple storage for lazy vacuum  (Peter Geoghegan <pg@bowt.ie>)
List pgsql-hackers
Resending as I forgot to send to the list (thanks Peter :) )

On Wed, Jul 7, 2021 at 10:24 PM Peter Geoghegan <pg@bowt.ie> wrote:
>
> The loop inside btvacuumpage() makes each loop iteration call the
> callback -- this is always a call to lazy_tid_reaped() in practice.
> And that's where we do binary searches. These binary searches are
> usually where we see a huge number of cycles spent when we look at
> profiles, including the profile that produced your flame graph. But I
> worry that that might be a bit misleading -- the way that profilers
> attribute costs is very complicated and can never be fully trusted.
> While it is true that lazy_tid_reaped() often accesses main memory,
> which will of course add a huge amount of latency and make it a huge
> bottleneck, the "big picture" is still relevant.

This is why I have mainly focused on making it possible to use SIMD and
run 4-8 binary searches in parallel - mostly 8, which is what AVX2
gives for 32-bit values.
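
To illustrate what I mean, here is a minimal sketch of an 8-way
lower_bound over a sorted uint32 array using AVX2 gathers (not the
actual patch code; the names are made up for the example, and it
assumes the values stay below 2^31 so signed compares are enough):

#include <immintrin.h>
#include <stdint.h>

/* Search 8 keys at once in a sorted uint32 array of length n (n > 0).
 * out[i] gets the index of the first element >= keys[i] (lower_bound).
 * Assumes AVX2 and values < 2^31 so the signed compare is sufficient. */
static void
bsearch8_lower_bound(const uint32_t *arr, uint32_t n,
                     const uint32_t keys[8], uint32_t out[8])
{
    __m256i base = _mm256_setzero_si256();
    __m256i key  = _mm256_loadu_si256((const __m256i *) keys);

    /* branchless lower_bound: n shrinks identically in every lane,
     * so one scalar loop drives all 8 searches */
    while (n > 1)
    {
        uint32_t half = n / 2;
        __m256i  idx  = _mm256_add_epi32(base, _mm256_set1_epi32((int) half));
        /* one gather fetches arr[base + half] for all 8 lanes */
        __m256i  vals = _mm256_i32gather_epi32((const int *) arr, idx, 4);
        /* lanes where arr[base + half] < key advance base to base + half */
        __m256i  lt   = _mm256_cmpgt_epi32(key, vals);
        base = _mm256_blendv_epi8(base, idx, lt);
        n -= half;
    }

    /* final fixup: result = base + (arr[base] < key); lt lanes are -1 */
    __m256i vals = _mm256_i32gather_epi32((const int *) arr, base, 4);
    __m256i lt   = _mm256_cmpgt_epi32(key, vals);
    base = _mm256_sub_epi32(base, lt);
    _mm256_storeu_si256((__m256i *) out, base);
}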

The way I am approaching this is to separate the "page search" to run
over a (naturally) sorted array of 32-bit page pointers, and only when
the page is found are the indexes into this array used to look up the
in-page bitmaps.
This allows the heavier bsearch activity to run over a smaller range of
memory, hopefully reducing the cache thrashing.
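
To make the shape of this concrete, a minimal sketch of the two-level
structure and lookup could look like the below (struct and function
names are just for the example, not from the patch):

#include <stdbool.h>
#include <stdint.h>

#define MAX_TUPLES_PER_PAGE 291     /* MaxHeapTuplesPerPage for 8 kB pages */

typedef struct DeadPageBitmap
{
    uint8_t     dead_bits[(MAX_TUPLES_PER_PAGE + 7) / 8];
} DeadPageBitmap;

typedef struct DeadTupleMap
{
    uint32_t        npages;
    uint32_t       *blknos;     /* sorted block numbers, the bsearch target */
    DeadPageBitmap *pages;      /* parallel array of in-page offset bitmaps */
} DeadTupleMap;

/* scalar lower_bound; the SIMD version above does 8 of these at a time */
static uint32_t
lower_bound_u32(const uint32_t *arr, uint32_t n, uint32_t key)
{
    uint32_t base = 0;

    while (n > 1)
    {
        uint32_t half = n / 2;

        base += (arr[base + half] < key) ? half : 0;
        n -= half;
    }
    return base + (n > 0 && arr[base] < key);
}

static bool
dead_tuple_lookup(const DeadTupleMap *map, uint32_t blkno, uint16_t offnum)
{
    /* first level: search only the compact 32-bit block number array */
    uint32_t idx = lower_bound_u32(map->blknos, map->npages, blkno);

    if (idx >= map->npages || map->blknos[idx] != blkno)
        return false;

    /* second level: only now touch the per-page bitmap (offnum is 1-based) */
    return (map->pages[idx].dead_bits[(offnum - 1) / 8] >> ((offnum - 1) % 8)) & 1;
}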

There are opportunities to optimise this further for cache hits, by
collecting the tids from indexes in larger batches and then
constraining the searches in the main is-deleted-bitmap to run over
sections of it, but at some point this becomes a very complex
balancing act, as the manipulation of the bits-to-check from indexes
also takes time, not to mention the need to release the index pages
and then later chase the tid pointers in case they have moved while
we were checking them.

I have not measured anything yet, but one of my concerns is that with
very large dead tuple collections an 8-way parallel bsearch could
actually get close to saturating RAM bandwidth by reading (8 x 32 bits
x cache-line-size) bytes from main memory every few cycles, so we may
need some inner-loop-level throttling similar to the current
vacuum_cost_limit for data pages.
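
To be clear about the kind of throttling I mean, something like the
following, called from the inner search loop (the cost variables are
hypothetical placeholders, not existing GUCs):

#include <unistd.h>     /* usleep(); inside PostgreSQL this would be pg_usleep() */

/* Hypothetical knobs, analogous to vacuum_cost_limit / vacuum_cost_delay. */
static int lookup_cost_limit = 2000;    /* cost units before backing off */
static int lookup_cost_delay_ms = 2;    /* how long to nap */
static int lookup_cost_balance = 0;

/* called from the inner search loop, e.g. once per batch of gathers,
 * charging roughly one unit per cache line pulled from main memory */
static inline void
lookup_cost_point(int ncachelines)
{
    lookup_cost_balance += ncachelines;

    if (lookup_cost_balance >= lookup_cost_limit)
    {
        usleep(lookup_cost_delay_ms * 1000);    /* yield memory bandwidth */
        lookup_cost_balance = 0;
    }
}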

> I think that the compiler currently has to make very conservative
> assumptions when generating the machine code used by the loop inside
> btvacuumpage(), which calls through an opaque function pointer at
> least once per loop iteration -- anything can alias, so the compiler
> must be conservative.

Definitely this! The lookup function needs to be turned into an inline
function or a #define as well, to give the compiler maximum freedom.
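
Roughly this shape (dead_tuple_lookup() being the hypothetical helper
from the earlier sketch, not an existing function):

/* Instead of btvacuumpage() going through vstate->callback (an opaque
 * IndexBulkDeleteCallback function pointer), an inlinable helper lets
 * the compiler see the whole lookup. */
static inline bool
tid_is_dead(const DeadTupleMap *map, const ItemPointerData *itemptr)
{
    return dead_tuple_lookup(map,
                             ItemPointerGetBlockNumber(itemptr),
                             ItemPointerGetOffsetNumber(itemptr));
}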

> The data dependencies are hard for both the
> compiler and the CPU to analyze. The cost of using a function pointer
> compared to a direct function call is usually quite low, but there are
> important exceptions -- cases where it prevents other useful
> optimizations. Maybe this is an exception.

Yes. Also this could be a place where unrolling the loop could make a
real difference.

Maybe not unrolling the full 32 iterations of a 32-bit bsearch, but
something like an 8-iteration unroll to get most of the benefit.

The 32x unroll would not really be that bad for performance if all 32
iterations were needed, but mostly we would only need to jump into the
last 10 to 20 iterations when looking up among 1,000 to 1,000,000
pages, and I suspect this is such a weird corner case that the
compiler is really unlikely to support this optimisation. Of course I
may be wrong and it is a common enough case for the optimiser.
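
For illustration, an 8-step unrolled tail on top of the branchless
formulation could look roughly like this (sketch only, assumes n > 0):

/* one halving step of a branchless lower_bound */
static inline void
bsearch_step(const uint32_t *arr, uint32_t key, uint32_t *base, uint32_t *n)
{
    uint32_t half = *n / 2;

    *base += (arr[*base + half] < key) ? half : 0;
    *n -= half;
}

static inline uint32_t
lower_bound_u32_unrolled(const uint32_t *arr, uint32_t n, uint32_t key)
{
    uint32_t base = 0;

    /* shrink the range with the generic loop until <= 256 elements remain */
    while (n > 256)
        bsearch_step(arr, key, &base, &n);

    /* 8 fixed halving steps finish any range of <= 2^8 elements; with the
     * count known, the compiler can emit these without loop-control branches */
    bsearch_step(arr, key, &base, &n);
    bsearch_step(arr, key, &base, &n);
    bsearch_step(arr, key, &base, &n);
    bsearch_step(arr, key, &base, &n);
    bsearch_step(arr, key, &base, &n);
    bsearch_step(arr, key, &base, &n);
    bsearch_step(arr, key, &base, &n);
    bsearch_step(arr, key, &base, &n);

    return base + (arr[base] < key);
}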

>
> I wonder how much it would help to break up that loop into two loops.
> Make the callback into a batch operation that generates state that
> describes what to do with each and every index tuple on the leaf page.
> The first loop would build a list of TIDs, then you'd call into
> vacuumlazy.c and get it to process the TIDs, and finally the second
> loop would physically delete the TIDs that need to be deleted. This
> would mean that there would be only one call per leaf page per
> btbulkdelete(). This would reduce the number of calls to the callback
> by at least 100x, and maybe more than 1000x.

While it may make sense to have different bitmap encodings for
different distributions, it likely would not be good for optimisations
if all these are used at the same time.

This is why I propose that the first bitmap-collecting phase collect
into a file and then - when reading it into memory for the lookup
phase - possibly rewrite the initial structure into something else if
that looks more efficient, for example when the first half of the
file consists of only empty pages.
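
A minimal sketch of what such a file header could carry so the reader
knows whether to re-encode (the tag and struct names are made up for
the example, nothing here exists in PostgreSQL today):

#include <stdint.h>

/* which encoding the collection phase chose for the payload */
typedef enum DeadTupleEncoding
{
    DT_ENC_SORTED_TID_ARRAY,    /* plain sorted array of (block, offset) */
    DT_ENC_PER_PAGE_BITMAPS,    /* sorted block array + per-page bitmaps */
    DT_ENC_PAGE_RANGES          /* runs of wholly-dead or empty pages */
} DeadTupleEncoding;

typedef struct DeadTupleFileHeader
{
    uint32_t    magic;
    uint32_t    encoding;       /* a DeadTupleEncoding value */
    uint64_t    num_items;
    uint64_t    num_pages;
} DeadTupleFileHeader;

/* e.g. if the reader sees that the first half of the file is only empty
 * pages, it can load that part as DT_ENC_PAGE_RANGES instead */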

> This approach would make btbulkdelete() similar to
> _bt_simpledel_pass() + _bt_delitems_delete_check(). This is not really
> an independent idea to your ideas -- I imagine that this would work
> far better when combined with a more compact data structure, which is
> naturally more capable of batch processing than a simple array of
> TIDs. Maybe this will help the compiler and the CPU to fully
> understand the *natural* data dependencies, so that they can be as
> effective as possible in making the code run fast. It's possible that
> a modern CPU will be able to *hide* the latency more intelligently
> than what we have today. The latency is such a big problem that we may
> be able to justify "wasting" other CPU resources, just because it
> sometimes helps with hiding the latency. For example, it might
> actually be okay to sort all of the TIDs on the page to make the bulk
> processing work

Then again it may be so much extra work that it starts to dominate
some parts of profiles.

For example, see the work that was done on improving the mini-vacuum
part, where it was actually faster to copy the data out to a separate
buffer and then back in than to shuffle it around inside the same 8k
page :)

So only testing will tell.

> -- though you might still do a precheck that is
> similar to the precheck inside lazy_tid_reaped() that was added by you
> in commit bbaf315309e.
>
> Of course it's very easy to be wrong about stuff like this. But it
> might not be that hard to prototype. You can literally copy and paste
> code from _bt_delitems_delete_check() to do this. It does the same
> basic thing already.

Also, a lot of testing would be needed to figure out which strategy
fits best for which distribution of dead tuples, and possibly their
relation to the order of the tuples to check coming from indexes.


Cheers

--
Hannu Krosing
Google Cloud - We have a long list of planned contributions and we are hiring.
Contact me if interested.


