Re: [PoC] Improve dead tuple storage for lazy vacuum - Mailing list pgsql-hackers

From Masahiko Sawada
Subject Re: [PoC] Improve dead tuple storage for lazy vacuum
Date
Msg-id CAD21AoB4K0pAjB9bkawdFas0Fa68jtM20TBMz-rj1gBkEyLhTQ@mail.gmail.com
Whole thread Raw
In response to Re: [PoC] Improve dead tuple storage for lazy vacuum  (Peter Geoghegan <pg@bowt.ie>)
Responses Re: [PoC] Improve dead tuple storage for lazy vacuum  (Hannu Krosing <hannuk@google.com>)
Re: [PoC] Improve dead tuple storage for lazy vacuum  (Peter Geoghegan <pg@bowt.ie>)
List pgsql-hackers
On Thu, Jul 8, 2021 at 5:24 AM Peter Geoghegan <pg@bowt.ie> wrote:
>
> On Wed, Jul 7, 2021 at 4:47 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> > Currently, the TIDs of dead tuples are stored in an array that is
> > collectively allocated at the start of lazy vacuum and TID lookup uses
> > bsearch(). There are the following challenges and limitations:
> >
> > 1. Don't allocate more than 1GB. There was a discussion to eliminate
> > this limitation by using MemoryContextAllocHuge() but there were
> > concerns about point 2[1].
>
> I think that the main problem with the 1GB limitation is that it is
> surprising -- it can cause disruption when we first exceed the magical
> limit of ~174 million TIDs. This can cause us to dirty index pages a
> second time when we might have been able to just do it once with
> sufficient memory for TIDs. OTOH there are actually cases where having
> less memory for TIDs makes performance *better* because of locality
> effects. This perverse behavior with memory sizing isn't a rare case
> that we can safely ignore -- unfortunately it's fairly common.
>
> My point is that we should be careful to choose the correct goal.
> Obviously memory use matters. But it might be more helpful to think of
> memory use as just a proxy for what truly matters, not a goal in
> itself. It's hard to know what this means (what is the "real goal"?),
> and hard to measure it even if you know for sure. It could still be
> useful to think of it like this.

As I wrote in the first email, I think there are two important factors
in index vacuuming performance: the performance to check if heap TID
that an index tuple points to is dead, and the number of times to
perform index bulk-deletion. The flame graph I attached in the first
mail shows CPU spent much time on lazy_tid_reaped() but vacuum is a
disk-intensive operation in practice. Given that most index AM's
bulk-deletion does a full index scan and a table could have multiple
indexes, reducing the number of times to perform index bulk-deletion
really contributes to reducing the execution time, especially for
large tables. I think that a more compact data structure for dead
tuple TIDs is one of the ways to achieve that.

>
> > A run container is selected in this test case, using 4 bytes for each block.
> >
> >           Execution Time  Memory Usage
> > array       8,883.03        600,008,248
> > intset       7,358.23        100,671,488
> > tbm            758.81         100,671,544
> > rtbm           764.33          29,384,816
> >
> > Overall, 'rtbm' has a much better lookup performance and good memory
> > usage especially if there are relatively many dead tuples. However, in
> > some cases, 'intset' and 'array' have a better memory usage.
>
> This seems very promising.
>
> I wonder how much you have thought about the index AM side. It makes
> sense to initially evaluate these techniques using this approach of
> separating the data structure from how it is used by VACUUM -- I think
> that that was a good idea. But at the same time there may be certain
> important theoretical questions that cannot be answered this way --
> questions about how everything "fits together" in a real VACUUM might
> matter a lot. You've probably thought about this at least a little
> already. Curious to hear how you think it "fits together" with the
> work that you've done already.

Yeah, that definitely needs to be considered. Currently, what we need
for the dead tuple storage for lazy vacuum are store, lookup, and
iteration. And given the parallel vacuum, it has to be able to be
allocated on DSM or DSA. While implementing the PoC code, I'm trying
to integrate it with the current lazy vacuum code. As far as I've seen
so far, the integration is not hard, at least with the *current* lazy
vacuum code and index AMs code.

>
> The loop inside btvacuumpage() makes each loop iteration call the
> callback -- this is always a call to lazy_tid_reaped() in practice.
> And that's where we do binary searches. These binary searches are
> usually where we see a huge number of cycles spent when we look at
> profiles, including the profile that produced your flame graph. But I
> worry that that might be a bit misleading -- the way that profilers
> attribute costs is very complicated and can never be fully trusted.
> While it is true that lazy_tid_reaped() often accesses main memory,
> which will of course add a huge amount of latency and make it a huge
> bottleneck, the "big picture" is still relevant.
>
> I think that the compiler currently has to make very conservative
> assumptions when generating the machine code used by the loop inside
> btvacuumpage(), which calls through an opaque function pointer at
> least once per loop iteration -- anything can alias, so the compiler
> must be conservative. The data dependencies are hard for both the
> compiler and the CPU to analyze. The cost of using a function pointer
> compared to a direct function call is usually quite low, but there are
> important exceptions -- cases where it prevents other useful
> optimizations. Maybe this is an exception.
>
> I wonder how much it would help to break up that loop into two loops.
> Make the callback into a batch operation that generates state that
> describes what to do with each and every index tuple on the leaf page.
> The first loop would build a list of TIDs, then you'd call into
> vacuumlazy.c and get it to process the TIDs, and finally the second
> loop would physically delete the TIDs that need to be deleted. This
> would mean that there would be only one call per leaf page per
> btbulkdelete(). This would reduce the number of calls to the callback
> by at least 100x, and maybe more than 1000x.
>
> This approach would make btbulkdelete() similar to
> _bt_simpledel_pass() + _bt_delitems_delete_check(). This is not really
> an independent idea to your ideas -- I imagine that this would work
> far better when combined with a more compact data structure, which is
> naturally more capable of batch processing than a simple array of
> TIDs. Maybe this will help the compiler and the CPU to fully
> understand the *natural* data dependencies, so that they can be as
> effective as possible in making the code run fast. It's possible that
> a modern CPU will be able to *hide* the latency more intelligently
> than what we have today. The latency is such a big problem that we may
> be able to justify "wasting" other CPU resources, just because it
> sometimes helps with hiding the latency. For example, it might
> actually be okay to sort all of the TIDs on the page to make the bulk
> processing work -- though you might still do a precheck that is
> similar to the precheck inside lazy_tid_reaped() that was added by you
> in commit bbaf315309e.

Interesting idea. I remember you mentioned this idea somewhere and
I've considered this idea too while implementing the PoC code. It's
definitely worth trying. Maybe we can write a patch for this as a
separate patch? It will change index AM and could improve also the
current bulk-deletion. We can consider a better data structure on top
of this idea.

Regards,

--
Masahiko Sawada
EDB:  https://www.enterprisedb.com/



pgsql-hackers by date:

Previous
From: Kyotaro Horiguchi
Date:
Subject: Re: Incorrect usage of strtol, atoi for non-numeric junk inputs
Next
From: Dean Rasheed
Date:
Subject: Re: rand48 replacement