From: Masahiko Sawada
Subject: Re: [PoC] Improve dead tuple storage for lazy vacuum
Msg-id: CAD21AoDwERrrmH4hZu4JZrUHUt4VDnmoAd=tsuWNk0FjagjGZA@mail.gmail.com
In response to: Re: [PoC] Improve dead tuple storage for lazy vacuum (Andres Freund <andres@anarazel.de>)
List: pgsql-hackers
Sorry for the late reply.

On Sat, Jul 10, 2021 at 11:55 AM Andres Freund <andres@anarazel.de> wrote:
>
> Hi,
>
> On 2021-07-09 10:17:49 -0700, Andres Freund wrote:
> > On 2021-07-07 20:46:38 +0900, Masahiko Sawada wrote:
> > > Currently, the TIDs of dead tuples are stored in an array that is
> > > collectively allocated at the start of lazy vacuum and TID lookup uses
> > > bsearch(). There are the following challenges and limitations:
> >
> > > So I prototyped a new data structure dedicated to storing dead tuples
> > > during lazy vacuum while borrowing the idea from Roaring Bitmap[2].
> > > The authors provide an implementation of Roaring Bitmap[3]  (Apache
> > > 2.0 license). But I've implemented this idea from scratch because we
> > > need to integrate it with Dynamic Shared Memory/Area to support
> > > parallel vacuum and need to support ItemPointerData, a 6-byte
> > > integer in total, whereas their implementation supports only 4-byte
> > > integers. Also, when it comes to vacuum, we don't need to compute
> > > the intersection, union, or difference between sets; we only need
> > > an existence check.
> > >
> > > The data structure is somewhat similar to TIDBitmap. It consists of
> > > the hash table and the container area; the hash table has entries per
> > > block and each block entry allocates its memory space, called a
> > > container, in the container area to store its offset numbers. The
> > > container area is actually an array of bytes and can be enlarged as
> > > needed. In the container area, the data representation of offset
> > > numbers varies depending on their cardinality. It has three container
> > > types: array, bitmap, and run.
> >
> > How are you thinking of implementing iteration efficiently for rtbm? The
> > second heap pass needs that obviously... I think the only option would
> > be to qsort the whole thing?
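
To recap the layout I have in mind for rtbm (a simplified sketch to
help relate it to the radix tree discussion below, not the exact
structs from my PoC patch):

/* one hash table entry per heap block */
typedef struct DtEntry
{
    BlockNumber blkno;      /* hash key */
    uint8       type;       /* container type: array, bitmap or run */
    uint16      len;        /* container length in bytes */
    uint32      offset;     /* start of the container in the container area */
} DtEntry;

/*
 * The containers themselves live in one big byte array (the "container
 * area") that is enlarged as needed:
 *   - array:  sorted array of OffsetNumbers
 *   - bitmap: one bit per possible offset on the page
 *   - run:    list of [start, end] offset ranges
 * A lookup hashes the block number and then checks the container
 * according to its type.
 */

So, as you say, iterating the whole thing in block order would need the
entries sorted first, since the hash table itself is unordered.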
>
> I experimented further, trying to use an old radix tree implementation I
> had lying around to store dead tuples. With a bit of trickery that seems
> to work well.

Thank you for experimenting with another approach.

>
> The radix tree implementation I have basically maps an int64 to another
> int64. Each level of the radix tree stores 6 bits of the key, and uses
> those 6 bits to index a 1<<6 long array leading to the next level.
>
> My first idea was to use itemptr_encode() to convert tids into an int64
> and store the lower 6 bits in the value part of the radix tree. That
> turned out to work well performance wise, but awfully memory usage
> wise. The problem is that we at most use 9 bits for offsets, but reserve
> 16 bits for it in the ItemPointerData. Which means that there's often a
> lot of empty "tree levels" for those 0 bits, making it hard to get to a
> decent memory usage.
>
> The simplest way to address that was to simply compress out those
> guaranteed-to-be-zero bits. That results in memory usage that's quite
> good - nearly always beating array, occasionally beating rtbm. It's an
> ordered datastructure, so the latter isn't too surprising. For lookup
> performance the radix approach is commonly among the best, if not the
> best.
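
Just to check that I understand the key encoding: itemptr_encode()
reserves 16 bits for the offset, so I imagine the compressed variant
looks roughly like the following (my guess, not your actual code),
assuming offsets fit into 9 bits with the default 8kB block size:

#define KEY_OFFSET_BITS 9       /* MaxHeapTuplesPerPage is 291 on 8kB pages */

/* itemptr_encode()-style key: 16 bits reserved for the offset */
static inline uint64
tid_to_key_padded(ItemPointer tid)
{
    return ((uint64) ItemPointerGetBlockNumber(tid) << 16) |
        ItemPointerGetOffsetNumber(tid);
}

/* the same key with the guaranteed-zero high offset bits compressed out */
static inline uint64
tid_to_key_compressed(ItemPointer tid)
{
    return ((uint64) ItemPointerGetBlockNumber(tid) << KEY_OFFSET_BITS) |
        ItemPointerGetOffsetNumber(tid);
}

With the padded form the seven always-zero bits just make the key space
sparser, which I suppose is where the poor memory usage comes from.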

How do its lookup performance and memory usage compare to intset? I
guess the performance trends of the two approaches are similar since
both consist of a tree. Intset encodes uint64 values with Simple-8b
encoding, so I'm also interested in the comparison in terms of memory
usage.
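
To illustrate what I mean (a sketch, reusing the compressed key from
above): intset requires values to be added in ascending order and
stores the deltas between consecutive values, packing many small deltas
into each 64-bit Simple-8b codeword.

/* dead tuples at offsets 1..3 of blocks 100 and 101 */
uint64 keys[] = {
    (100UL << 9) | 1, (100UL << 9) | 2, (100UL << 9) | 3,
    (101UL << 9) | 1, (101UL << 9) | 2, (101UL << 9) | 3,
};
/*
 * The deltas after the first key are 1, 1, 510, 1, 1, so densely
 * located dead tuples should compress very well, while widely
 * scattered TIDs produce large deltas and pack much less densely.
 */

So I suspect intset is very compact for dense workloads, and I'm
curious where the radix tree lands relative to it.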

>
> A variation of the storage approach is to just use the block number as
> the index, and store the tids as the value. Even with the absolutely
> naive approach of just using a Bitmapset that reduces memory usage
> substantially - at a small cost to search performance. Of course it'd be
> better to use an adaptive approach like you did for rtbm, I just thought
> this is good enough.
>
>
> This largely works well, except when there are a large number of evenly
> spread out dead tuples. I don't think that's a particularly common
> situation, but it's worth considering anyway.
>
> The reason the memory usage can be larger for sparse workloads is,
> obviously, that they can lead to tree nodes with only one child. As
> those nodes are quite large (1<<6 pointers to further children), that
> can then lead to a large increase in memory usage.

Interesting. How big was it in such workloads compared to the other
data structures?
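
If I understand the per-block variant correctly, it has roughly the
following shape (radix_lookup()/radix_insert() are placeholder names
here for a tree mapping a block number to a Bitmapset pointer, not your
actual API):

/* record a dead tuple: the key is the block number, the value is a
 * Bitmapset of its dead offset numbers */
static void
record_dead_tuple(radix_tree *tree, ItemPointer tid)
{
    BlockNumber blkno = ItemPointerGetBlockNumber(tid);
    Bitmapset  *offsets = radix_lookup(tree, blkno);

    offsets = bms_add_member(offsets, ItemPointerGetOffsetNumber(tid));
    radix_insert(tree, blkno, offsets);
}

/* existence check used during the second heap pass */
static bool
tid_is_dead(radix_tree *tree, ItemPointer tid)
{
    Bitmapset  *offsets = radix_lookup(tree, ItemPointerGetBlockNumber(tid));

    return offsets != NULL &&
        bms_is_member(ItemPointerGetOffsetNumber(tid), offsets);
}

That keeps one tree entry per heap block rather than per TID, which I
guess is where most of the memory saving comes from.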

I personally like adaptive approaches, especially in the context of
vacuum improvements. We know common patterns of dead tuple
distribution, but they don't necessarily hold, since the actual
distribution depends on the data, the timing of autovacuum, etc., even
for the same workload. A new approach might work well in 95% of use
cases, but if things get worse than before in the other 5%, I don't
think it is a good approach. Ideally, it should be better in common
cases and at least no worse than before in the other cases.

BTW, is the implementation of the radix tree approach available
somewhere? If so, I'd like to experiment with it too.

>
> I have toyed with implementing adaptively large radix nodes like
> proposed in https://db.in.tum.de/~leis/papers/ART.pdf - but haven't
> gotten it quite working.

That seems like a promising approach.
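
My (possibly wrong) reading of the ART paper is that inner nodes come
in a few sizes and are grown as they fill up, roughly:

/* sketch of ART-style adaptive inner nodes, following the paper */
typedef enum artnode_kind
{
    ART_NODE4,      /* <= 4 children: small parallel key/child arrays */
    ART_NODE16,     /* <= 16 children: same layout, searched with SIMD */
    ART_NODE48,     /* <= 48 children: 256-entry byte index into child slots */
    ART_NODE256     /* <= 256 children: child array indexed by the key byte */
} artnode_kind;

typedef struct artnode4
{
    artnode_kind kind;
    uint8        nchildren;
    uint8        keys[4];        /* key bytes, kept sorted */
    void        *children[4];    /* corresponding child pointers */
} artnode4;

A node starts as the smallest kind and is converted to the next larger
kind when it overflows, so a sparse key distribution no longer forces
every inner node to hold a full fan-out worth of child pointers. That
seems like it would directly address the evenly-spread dead tuple case
above.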

Regards,

--
Masahiko Sawada
EDB:  https://www.enterprisedb.com/


