Re: [PoC] Improve dead tuple storage for lazy vacuum - Mailing list pgsql-hackers

From:            Hannu Krosing
Subject:         Re: [PoC] Improve dead tuple storage for lazy vacuum
Msg-id:          CAMT0RQQftTGZSsTg0SeRYBejbFL65Yv-ETvpjk4sxFK02HOPsg@mail.gmail.com
In response to:  Re: [PoC] Improve dead tuple storage for lazy vacuum (Hannu Krosing <hannuk@google.com>)
List:            pgsql-hackers
Resending as I forgot to send it to the list (thanks Peter :) )

On Wed, Jul 7, 2021 at 10:24 PM Peter Geoghegan <pg@bowt.ie> wrote:
>
> The loop inside btvacuumpage() makes each loop iteration call the
> callback -- this is always a call to lazy_tid_reaped() in practice.
> And that's where we do binary searches. These binary searches are
> usually where we see a huge number of cycles spent when we look at
> profiles, including the profile that produced your flame graph. But I
> worry that that might be a bit misleading -- the way that profilers
> attribute costs is very complicated and can never be fully trusted.
> While it is true that lazy_tid_reaped() often accesses main memory,
> which will of course add a huge amount of latency and make it a huge
> bottleneck, the "big picture" is still relevant.

This is why I have mainly focused on making it possible to use SIMD
and run 4-8 binary searches in parallel, mostly 8, for AVX2.

The way I am approaching this is by separating the "page search" to
run over a (naturally) sorted array of 32-bit page pointers, and only
once the page is found are the indexes in this array used to look up
the in-page bitmaps. This lets the heavier bsearch activity run over a
smaller range of memory, hopefully reducing the cache thrashing.

There are opportunities to optimise this further for cache hits, by
collecting the tids from indexes in larger batches and then
constraining the searches in the main is-deleted bitmap to run over
sections of it, but at some point this becomes a very complex
balancing act, as the manipulation of the bits-to-check from indexes
also takes time, not to mention the need to release the index pages
and then later chase the tid pointers in case they have moved while we
were checking them.

I have not measured anything yet, but one of my concerns is that with
a very large dead tuple collection an 8-way parallel bsearch could
actually get close to saturating RAM bandwidth by reading
(8 x 32 bits x cache-line-size) bytes from main memory every few
cycles, so we may need some inner-loop level throttling similar to the
current vacuum_cost_limit for data pages.

> I think that the compiler currently has to make very conservative
> assumptions when generating the machine code used by the loop inside
> btvacuumpage(), which calls through an opaque function pointer at
> least once per loop iteration -- anything can alias, so the compiler
> must be conservative.

Definitely this! The lookup function needs to be turned into an inline
function or a #define as well, to give the compiler maximum freedom.

> The data dependencies are hard for both the
> compiler and the CPU to analyze. The cost of using a function pointer
> compared to a direct function call is usually quite low, but there are
> important exceptions -- cases where it prevents other useful
> optimizations. Maybe this is an exception.

Yes. Also, this could be a place where unrolling the loop could make a
real difference. Maybe not unrolling the full 32 iterations of a
32-bit bsearch, but something like an 8-iteration unroll to get most
of the benefit.

The 32x unroll would not really be that bad for performance if all 32
iterations were needed, but mostly we would need to jump into just the
last 10 to 20 iterations, for lookups over anything from about 1,000
to 1,000,000 pages, and I suspect this is such a weird corner case
that the compiler is unlikely to support this optimisation. Of course
I may be wrong and it is a common enough case for the optimiser.
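To make the above a little more concrete, here is a rough sketch (not
from any actual patch; block_lower_bound8() and the array layout are
made up for illustration) of what an 8-way AVX2 binary search over the
sorted array of 32-bit block numbers could look like. It assumes a
sentinel entry of INT32_MAX appended after the last block number and
block numbers small enough for signed compares to be safe; the
fixed-trip-count, branch-free inner loop is also exactly the kind of
thing the compiler is free to inline and unroll once the callback is
gone:

/* Hypothetical sketch, not actual PostgreSQL code. */
#include <immintrin.h>
#include <stdint.h>

/*
 * For 8 block numbers at once, find the index of the first entry in the
 * sorted blocks[] array that is >= the key (a vectorised lower_bound).
 * Assumes nblocks >= 1, blocks[nblocks] == INT32_MAX as a sentinel, and
 * all block numbers < INT32_MAX so signed compares behave correctly.
 */
static inline __m256i
block_lower_bound8(const int32_t *blocks, uint32_t nblocks, __m256i keys)
{
	__m256i		lo = _mm256_setzero_si256();
	__m256i		hi = _mm256_set1_epi32((int32_t) nblocks);
	__m256i		one = _mm256_set1_epi32(1);
	int			steps = 32 - __builtin_clz(nblocks);	/* enough halvings */

	for (int i = 0; i < steps; i++)
	{
		/* mid = (lo + hi) / 2, computed in all 8 lanes at once */
		__m256i		mid = _mm256_srli_epi32(_mm256_add_epi32(lo, hi), 1);
		/* one gather fetches blocks[mid] for all 8 lanes */
		__m256i		val = _mm256_i32gather_epi32((const int *) blocks, mid, 4);
		/* lanes where key > blocks[mid] move lo up, the others move hi down */
		__m256i		gt = _mm256_cmpgt_epi32(keys, val);

		lo = _mm256_blendv_epi8(lo, _mm256_add_epi32(mid, one), gt);
		hi = _mm256_blendv_epi8(mid, hi, gt);
	}

	return lo;			/* per-lane index into blocks[] / the in-page bitmaps */
}

Whether the gathers actually beat eight independent scalar searches
would of course need measuring; each gather still touches up to eight
cache lines, which is where the RAM bandwidth concern above comes from.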
> I wonder how much it would help to break up that loop into two loops.
> Make the callback into a batch operation that generates state that
> describes what to do with each and every index tuple on the leaf page.
> The first loop would build a list of TIDs, then you'd call into
> vacuumlazy.c and get it to process the TIDs, and finally the second
> loop would physically delete the TIDs that need to be deleted. This
> would mean that there would be only one call per leaf page per
> btbulkdelete(). This would reduce the number of calls to the callback
> by at least 100x, and maybe more than 1000x.

While it may make sense to have different bitmap encodings for
different distributions, it likely would not be good for optimisations
if all of these were used at the same time.

This is why I propose that the first, bitmap-collecting, phase collect
into a file and then - when reading it into memory for the lookup
phase - possibly rewrite the initial structure to something else if it
sees that that would be more efficient. Like, for example, when the
first half of the file consists of only empty pages.

> This approach would make btbulkdelete() similar to
> _bt_simpledel_pass() + _bt_delitems_delete_check(). This is not really
> an independent idea to your ideas -- I imagine that this would work
> far better when combined with a more compact data structure, which is
> naturally more capable of batch processing than a simple array of
> TIDs. Maybe this will help the compiler and the CPU to fully
> understand the *natural* data dependencies, so that they can be as
> effective as possible in making the code run fast. It's possible that
> a modern CPU will be able to *hide* the latency more intelligently
> than what we have today. The latency is such a big problem that we may
> be able to justify "wasting" other CPU resources, just because it
> sometimes helps with hiding the latency. For example, it might
> actually be okay to sort all of the TIDs on the page to make the bulk
> processing work

Then again, it may be so much extra work that it starts to dominate
some parts of profiles. For example, see the work that was done on
improving the mini-vacuum part, where it was actually faster to copy
the data out to a separate buffer and then back in than to shuffle it
around inside the same 8kB page :)

So only testing will tell.

> -- though you might still do a precheck that is
> similar to the precheck inside lazy_tid_reaped() that was added by you
> in commit bbaf315309e.
>
> Of course it's very easy to be wrong about stuff like this. But it
> might not be that hard to prototype. You can literally copy and paste
> code from _bt_delitems_delete_check() to do this. It does the same
> basic thing already.

Also, a lot of testing would be needed to figure out which strategy
fits best for which distribution of dead tuples, and possibly for its
relation to the order of the tuples to check coming from the indexes.
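For whatever it is worth, here is a minimal stand-in sketch of the
two-loops-plus-one-batched-callback shape described above. None of it
is actual nbtree code: the stub types, MAX_TUPLES_PER_LEAF,
BatchReapedCallback and vacuum_leaf_page() are all invented for
illustration only.

/* Hypothetical sketch, not actual PostgreSQL code. */
#include <stdbool.h>
#include <stdint.h>

#define MAX_TUPLES_PER_LEAF 512		/* stand-in bound, not the real constant */

typedef struct ItemPointerStub
{
	uint32_t	block;
	uint16_t	offset;
} ItemPointerStub;

typedef struct IndexTupleStub
{
	ItemPointerStub htid;			/* heap TID this index tuple points to */
} IndexTupleStub;

typedef struct LeafPageStub
{
	int				ntuples;
	IndexTupleStub	tuples[MAX_TUPLES_PER_LEAF];
} LeafPageStub;

/* batched replacement for the per-TID lazy_tid_reaped() callback */
typedef void (*BatchReapedCallback) (const ItemPointerStub *tids, int ntids,
									 bool *isdead, void *state);

static int
vacuum_leaf_page(const LeafPageStub *page, BatchReapedCallback cb,
				 void *state, int *deletable /* out: tuple indexes to delete */ )
{
	ItemPointerStub tids[MAX_TUPLES_PER_LEAF];
	bool		isdead[MAX_TUPLES_PER_LEAF];
	int			ndeletable = 0;

	/* Loop 1: build the array of heap TIDs referenced by this leaf page */
	for (int i = 0; i < page->ntuples; i++)
		tids[i] = page->tuples[i].htid;

	/* One batched callback per leaf page instead of one call per TID */
	cb(tids, page->ntuples, isdead, state);

	/* Loop 2: collect the index tuples whose heap TIDs turned out dead */
	for (int i = 0; i < page->ntuples; i++)
	{
		if (isdead[i])
			deletable[ndeletable++] = i;
	}

	return ndeletable;
}

The batched callback is where the block-number bsearch plus in-page
bitmap lookup from earlier would live, and since it sees all TIDs for
the page at once it is also free to sort or group them before probing.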
Cheers

--
Hannu Krosing
Google Cloud - We have a long list of planned contributions and we are
hiring. Contact me if interested.