Re: [PoC] Improve dead tuple storage for lazy vacuum - Mailing list pgsql-hackers

From: Hannu Krosing
Subject: Re: [PoC] Improve dead tuple storage for lazy vacuum
Date:
Msg-id: CAMT0RQSwsVnA3JVr9VxHc-HN656egyBK=K=taK0d8TbVTKXnWA@mail.gmail.com
In response to: Re: [PoC] Improve dead tuple storage for lazy vacuum (Masahiko Sawada <sawada.mshk@gmail.com>)
Responses: Re: [PoC] Improve dead tuple storage for lazy vacuum
           Re: [PoC] Improve dead tuple storage for lazy vacuum
List: pgsql-hackers

Very nice results.

I have been working on the same problem but with a slightly different
solution: a mix of binary search for (sub)pages and 32-bit bitmaps for
the tids within a page.

Even with the current allocation heuristics (room for 291 tids per
page), it initially needs much less space: instead of the current
291*6 = 1746 bytes per page it needs to allocate only 80 bytes.
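
To make that concrete, here is a minimal sketch (my illustration only;
the field names and per-entry size are assumptions, and the 80-byte
figure above presumably includes bookkeeping beyond the 44 bytes shown
here): one entry per heap block, found by binary search on the block
number, with ten 32-bit words covering MaxHeapTuplesPerPage = 291
offsets.

#include <stdint.h>
#include <stddef.h>

/*
 * Illustrative only: one entry per heap block, found by binary search
 * on blkno, with a fixed bitmap covering MaxHeapTuplesPerPage = 291
 * offsets (10 x 32-bit words = 40 bytes).  Field names and sizes are
 * assumptions, not the actual layout from the proposal.
 */
#define OFFSET_BITMAP_WORDS 10          /* ceil(291 / 32) */

typedef struct DeadTidPageEntry
{
    uint32_t blkno;                         /* heap block number */
    uint32_t bitmap[OFFSET_BITMAP_WORDS];   /* bit (offnum-1) set => dead */
} DeadTidPageEntry;

/* Binary search for the entry of a given block; NULL if absent. */
static const DeadTidPageEntry *
find_page(const DeadTidPageEntry *entries, size_t nentries, uint32_t blkno)
{
    size_t lo = 0, hi = nentries;

    while (lo < hi)
    {
        size_t mid = lo + (hi - lo) / 2;

        if (entries[mid].blkno < blkno)
            lo = mid + 1;
        else
            hi = mid;
    }
    return (lo < nentries && entries[lo].blkno == blkno) ? &entries[lo] : NULL;
}

/* Is (blkno, offnum) recorded as dead?  offnum is 1-based, as in heap tids. */
static int
tid_is_dead(const DeadTidPageEntry *entries, size_t nentries,
            uint32_t blkno, uint16_t offnum)
{
    const DeadTidPageEntry *e = find_page(entries, nentries, blkno);

    return e && ((e->bitmap[(offnum - 1) / 32] >> ((offnum - 1) % 32)) & 1);
}

With this shape a lookup is one binary search over block numbers plus a
single bit test, instead of a binary search over every individual
6-byte tid.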

Also, it can be laid out so that it is SIMD-friendly, allowing up to 8
tid lookups to be done in parallel.
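
As a hypothetical illustration of the SIMD-friendly part (this is my
reading, not necessarily the intended layout): with block keys stored
densely, AVX2 can compare a probe block number against eight packed
32-bit keys in a single instruction.

#include <immintrin.h>
#include <stdint.h>

/*
 * Hypothetical illustration (compile with -mavx2): compare one probe
 * block number against eight packed 32-bit block keys at once.
 * Returns the matching lane (0-7), or -1 if the block is not present.
 */
static int
probe_block_avx2(const uint32_t *keys /* 8 packed block numbers */,
                 uint32_t target)
{
    __m256i cand = _mm256_loadu_si256((const __m256i *) keys);
    __m256i key  = _mm256_set1_epi32((int) target);
    __m256i eq   = _mm256_cmpeq_epi32(cand, key);
    int     mask = _mm256_movemask_ps(_mm256_castsi256_ps(eq));

    return mask ? __builtin_ctz((unsigned) mask) : -1;
}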

That said, for allocating the tid array the best solution is to
postpone the allocation as much as possible and to do the initial
collection into a file, which

1) postpones the memory allocation to the beginning of index cleanup,

2) lets you select the correct size and structure, as you know more
about the distribution at that time, and

3) lets you do the first heap pass in one go and then advance
frozenxmin *before* index cleanup.

Also, collecting dead tids into a file makes it trivial (well, almost
:) ) to parallelize the initial heap scan, so more resources can be
thrown at it if available.
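
A rough sketch of that flow (plain stdio purely for illustration; an
actual patch would presumably go through BufFile / shared temp files,
which is also what would make parallel writers from a parallel heap
scan practical):

#include <stdio.h>
#include <stdint.h>

/*
 * Hypothetical sketch: spool dead tids to a file during the heap pass,
 * and size the lookup structure only once the pass is done.
 */
typedef struct SpooledTid
{
    uint32_t    blkno;          /* heap block number */
    uint16_t    offnum;         /* 1-based offset within the block */
} SpooledTid;

/* Heap pass: append each dead tid to the spool file. */
static void
spool_dead_tid(FILE *spool, uint32_t blkno, uint16_t offnum)
{
    SpooledTid  t = { blkno, offnum };

    fwrite(&t, sizeof(t), 1, spool);
}

/*
 * Before index cleanup: the spool size tells us exactly how many dead
 * tids there are, so the in-memory structure can be sized (and shaped)
 * to fit instead of being guessed at vacuum start.
 */
static long
spooled_tid_count(FILE *spool)
{
    long        bytes;

    fseek(spool, 0, SEEK_END);
    bytes = ftell(spool);
    rewind(spool);
    return bytes / (long) sizeof(SpooledTid);
}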

Cheers
-----
Hannu Krosing
Google Cloud - We have a long list of planned contributions and we are hiring.
Contact me if interested.




On Thu, Jul 8, 2021 at 10:48 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
> On Thu, Jul 8, 2021 at 5:24 AM Peter Geoghegan <pg@bowt.ie> wrote:
> >
> > On Wed, Jul 7, 2021 at 4:47 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> > > Currently, the TIDs of dead tuples are stored in an array that is
> > > collectively allocated at the start of lazy vacuum and TID lookup uses
> > > bsearch(). There are the following challenges and limitations:
> > >
> > > 1. Don't allocate more than 1GB. There was a discussion to eliminate
> > > this limitation by using MemoryContextAllocHuge() but there were
> > > concerns about point 2[1].
> >
> > I think that the main problem with the 1GB limitation is that it is
> > surprising -- it can cause disruption when we first exceed the magical
> > limit of ~174 million TIDs. This can cause us to dirty index pages a
> > second time when we might have been able to just do it once with
> > sufficient memory for TIDs. OTOH there are actually cases where having
> > less memory for TIDs makes performance *better* because of locality
> > effects. This perverse behavior with memory sizing isn't a rare case
> > that we can safely ignore -- unfortunately it's fairly common.
> >
> > My point is that we should be careful to choose the correct goal.
> > Obviously memory use matters. But it might be more helpful to think of
> > memory use as just a proxy for what truly matters, not a goal in
> > itself. It's hard to know what this means (what is the "real goal"?),
> > and hard to measure it even if you know for sure. It could still be
> > useful to think of it like this.
>
> As I wrote in the first email, I think there are two important factors
> in index vacuuming performance: how quickly we can check whether the
> heap TID that an index tuple points to is dead, and how many times we
> have to perform index bulk-deletion. The flame graph I attached in the
> first mail shows that much CPU time is spent in lazy_tid_reaped(), but
> vacuum is a disk-intensive operation in practice. Given that most index
> AMs' bulk-deletion does a full index scan and a table can have multiple
> indexes, reducing the number of index bulk-deletion passes really
> contributes to reducing the execution time, especially for large
> tables. I think that a more compact data structure for dead tuple TIDs
> is one of the ways to achieve that.
>
> >
> > > A run container is selected in this test case, using 4 bytes for each block.
> > >
> > >            Execution Time   Memory Usage
> > > array            8,883.03    600,008,248
> > > intset           7,358.23    100,671,488
> > > tbm                758.81    100,671,544
> > > rtbm               764.33     29,384,816
> > >
> > > Overall, 'rtbm' has a much better lookup performance and good memory
> > > usage especially if there are relatively many dead tuples. However, in
> > > some cases, 'intset' and 'array' have a better memory usage.
> >
> > This seems very promising.
> >
> > I wonder how much you have thought about the index AM side. It makes
> > sense to initially evaluate these techniques using this approach of
> > separating the data structure from how it is used by VACUUM -- I think
> > that that was a good idea. But at the same time there may be certain
> > important theoretical questions that cannot be answered this way --
> > questions about how everything "fits together" in a real VACUUM might
> > matter a lot. You've probably thought about this at least a little
> > already. Curious to hear how you think it "fits together" with the
> > work that you've done already.
>
> Yeah, that definitely needs to be considered. Currently, what we need
> from the dead tuple storage for lazy vacuum is store, lookup, and
> iteration. And given parallel vacuum, it has to be allocatable in DSM
> or DSA. While implementing the PoC code, I'm trying to integrate it
> with the current lazy vacuum code. As far as I've seen so far, the
> integration is not hard, at least with the *current* lazy vacuum code
> and index AM code.
>
> >
> > The loop inside btvacuumpage() makes each loop iteration call the
> > callback -- this is always a call to lazy_tid_reaped() in practice.
> > And that's where we do binary searches. These binary searches are
> > usually where we see a huge number of cycles spent when we look at
> > profiles, including the profile that produced your flame graph. But I
> > worry that that might be a bit misleading -- the way that profilers
> > attribute costs is very complicated and can never be fully trusted.
> > While it is true that lazy_tid_reaped() often accesses main memory,
> > which will of course add a huge amount of latency and make it a huge
> > bottleneck, the "big picture" is still relevant.
> >
> > I think that the compiler currently has to make very conservative
> > assumptions when generating the machine code used by the loop inside
> > btvacuumpage(), which calls through an opaque function pointer at
> > least once per loop iteration -- anything can alias, so the compiler
> > must be conservative. The data dependencies are hard for both the
> > compiler and the CPU to analyze. The cost of using a function pointer
> > compared to a direct function call is usually quite low, but there are
> > important exceptions -- cases where it prevents other useful
> > optimizations. Maybe this is an exception.
> >
> > I wonder how much it would help to break up that loop into two loops.
> > Make the callback into a batch operation that generates state that
> > describes what to do with each and every index tuple on the leaf page.
> > The first loop would build a list of TIDs, then you'd call into
> > vacuumlazy.c and get it to process the TIDs, and finally the second
> > loop would physically delete the TIDs that need to be deleted. This
> > would mean that there would be only one call per leaf page per
> > btbulkdelete(). This would reduce the number of calls to the callback
> > by at least 100x, and maybe more than 1000x.
> >
> > This approach would make btbulkdelete() similar to
> > _bt_simpledel_pass() + _bt_delitems_delete_check(). This is not really
> > an independent idea to your ideas -- I imagine that this would work
> > far better when combined with a more compact data structure, which is
> > naturally more capable of batch processing than a simple array of
> > TIDs. Maybe this will help the compiler and the CPU to fully
> > understand the *natural* data dependencies, so that they can be as
> > effective as possible in making the code run fast. It's possible that
> > a modern CPU will be able to *hide* the latency more intelligently
> > than what we have today. The latency is such a big problem that we may
> > be able to justify "wasting" other CPU resources, just because it
> > sometimes helps with hiding the latency. For example, it might
> > actually be okay to sort all of the TIDs on the page to make the bulk
> > processing work -- though you might still do a precheck that is
> > similar to the precheck inside lazy_tid_reaped() that was added by you
> > in commit bbaf315309e.
>
> Interesting idea. I remember you mentioned this somewhere, and I
> considered it too while implementing the PoC code. It's definitely
> worth trying. Maybe we can pursue this as a separate patch? It will
> change the index AM interface and could also improve the current
> bulk-deletion. We can consider a better data structure on top of this
> idea.
>
> Regards,
>
> --
> Masahiko Sawada
> EDB:  https://www.enterprisedb.com/
>
>


