From: Andres Freund
Subject: Re: [PoC] Improve dead tuple storage for lazy vacuum
Msg-id: 20210710025543.37sizjvgybemkdus@alap3.anarazel.de
In response to: Re: [PoC] Improve dead tuple storage for lazy vacuum (Andres Freund <andres@anarazel.de>)
Responses: Re: [PoC] Improve dead tuple storage for lazy vacuum (Masahiko Sawada <sawada.mshk@gmail.com>)
List: pgsql-hackers

Hi,

On 2021-07-09 10:17:49 -0700, Andres Freund wrote:
> On 2021-07-07 20:46:38 +0900, Masahiko Sawada wrote:
> > Currently, the TIDs of dead tuples are stored in an array that is
> > collectively allocated at the start of lazy vacuum and TID lookup uses
> > bsearch(). There are the following challenges and limitations:
> 
> > So I prototyped a new data structure dedicated to storing dead tuples
> > during lazy vacuum while borrowing the idea from Roaring Bitmap[2].
> > The authors provide an implementation of Roaring Bitmap[3]  (Apache
> > 2.0 license). But I've implemented this idea from scratch because we
> > need to integrate it with Dynamic Shared Memory/Area to support
> > parallel vacuum and need to support ItemPointerData, 6-bytes integer
> > in total, whereas the implementation supports only 4-bytes integers.
> > Also, when it comes to vacuum, we neither need to compute the
> > intersection, the union, nor the difference between sets, but need
> > only an existence check.
> > 
> > The data structure is somewhat similar to TIDBitmap. It consists of
> > the hash table and the container area; the hash table has entries per
> > block and each block entry allocates its memory space, called a
> > container, in the container area to store its offset numbers. The
> > container area is actually an array of bytes and can be enlarged as
> > needed. In the container area, the data representation of offset
> > numbers varies depending on their cardinality. It has three container
> > types: array, bitmap, and run.
> 
> How are you thinking of implementing iteration efficiently for rtbm? The
> second heap pass needs that obviously... I think the only option would
> be to qsort the whole thing?

I experimented further, trying to use an old radix tree implementation I
had lying around to store dead tuples. With a bit of trickery that seems
to work well.

The radix tree implementation I have basically maps an int64 to another
int64. Each level of the radix tree stores 6 bits of the key, and uses
those 6 bits to index a 1<<6 (i.e. 64 entry) array leading to the next level.
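
To illustrate, here is a minimal sketch of that kind of tree (hypothetical
names and a deliberately naive, non-adaptive node layout, not the actual
code I have):

/*
 * Minimal sketch of a radix tree mapping uint64 keys to uint64 values,
 * consuming 6 key bits per level.  Assumes 64-bit pointers, so a value
 * fits into a child slot at the leaf level.
 */
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>

#define RT_BITS_PER_LEVEL	6
#define RT_FANOUT			(1 << RT_BITS_PER_LEVEL)	/* 64 children */
#define RT_CHUNK_MASK		(RT_FANOUT - 1)
#define RT_LEVELS			((64 + RT_BITS_PER_LEVEL - 1) / RT_BITS_PER_LEVEL)

typedef struct rt_node
{
	/* interior levels: child pointers; leaf level: value + 1 (0 = empty) */
	void	   *slots[RT_FANOUT];
} rt_node;

static rt_node *
rt_new_node(void)
{
	return calloc(1, sizeof(rt_node));
}

static void
rt_insert(rt_node *root, uint64_t key, uint64_t value)
{
	rt_node    *node = root;

	for (int level = RT_LEVELS - 1; level > 0; level--)
	{
		int		chunk = (key >> (level * RT_BITS_PER_LEVEL)) & RT_CHUNK_MASK;

		if (node->slots[chunk] == NULL)
			node->slots[chunk] = rt_new_node();
		node = node->slots[chunk];
	}
	/* store value + 1 so that an all-zeroes slot means "not present" */
	node->slots[key & RT_CHUNK_MASK] = (void *) (uintptr_t) (value + 1);
}

static bool
rt_lookup(rt_node *root, uint64_t key, uint64_t *value)
{
	rt_node    *node = root;

	for (int level = RT_LEVELS - 1; level > 0; level--)
	{
		int		chunk = (key >> (level * RT_BITS_PER_LEVEL)) & RT_CHUNK_MASK;

		if (node->slots[chunk] == NULL)
			return false;
		node = node->slots[chunk];
	}
	if (node->slots[key & RT_CHUNK_MASK] == NULL)
		return false;
	*value = (uint64_t) (uintptr_t) node->slots[key & RT_CHUNK_MASK] - 1;
	return true;
}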

My first idea was to use itemptr_encode() to convert tids into an int64
and store the lower 6 bits in the value part of the radix tree. That
turned out to work well performance wise, but poorly memory usage
wise. The problem is that we use at most 9 bits for offsets, but reserve
16 bits for them in ItemPointerData. Which means that there's often a
lot of empty "tree levels" for those 0 bits, making it hard to get to a
decent memory usage.
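
Concretely, the key layout is along these lines (my own helper for
illustration, doing what itemptr_encode() does):

#include <stdint.h>

/*
 * Hypothetical helper mirroring itemptr_encode(): block number in bits
 * 16..47, offset number in the low 16 bits.  With 8kB pages,
 * MaxHeapTuplesPerPage is 291, so only 9 of those 16 offset bits can
 * ever be set - the rest is dead weight in the key space.
 */
static inline uint64_t
tid_key_naive(uint32_t block, uint16_t offset)
{
	return ((uint64_t) block << 16) | offset;
}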

The simplest way to address that was to simply compress out those
guaranteed-to-be-zero bits. That results in memory usage that's quite
good - nearly always beating array, occasionally beating rtbm. It's an
ordered data structure, so the latter isn't too surprising. For lookup
performance the radix approach is commonly among the best, if not the
best.
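
I.e. the key becomes something like this (again a hypothetical helper;
9 offset bits are enough since MaxHeapTuplesPerPage is 291 with 8kB
pages):

#include <stdint.h>

/* Offsets need at most 9 bits, so don't waste key space on the other 7. */
#define KEY_OFFSET_BITS 9

static inline uint64_t
tid_key_compressed(uint32_t block, uint16_t offset)
{
	return ((uint64_t) block << KEY_OFFSET_BITS) | offset;
}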

A variation of the storage approach is to just use the block number as
the index, and store the offsets of the dead tuples in that block as the
value. Even with the absolutely naive approach of just using a Bitmapset,
that reduces memory usage substantially - at a small cost to search
performance. Of course it'd be better to use an adaptive approach like
you did for rtbm; I just thought this was good enough for now.
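
Roughly like this, reusing rt_insert()/rt_lookup() from the sketch above
and the stock Bitmapset (hypothetical wrapper names, backend context
assumed):

#include "postgres.h"

#include "nodes/bitmapset.h"
#include "storage/itemptr.h"

/*
 * Hypothetical wrappers: the radix tree key is the block number, the
 * value is a pointer to a Bitmapset of dead offset numbers.  rt_node,
 * rt_insert() and rt_lookup() are from the earlier sketch.
 */
static void
deadtuples_add(rt_node *root, ItemPointer tid)
{
	BlockNumber blkno = ItemPointerGetBlockNumber(tid);
	OffsetNumber offnum = ItemPointerGetOffsetNumber(tid);
	uint64_t	val;
	Bitmapset  *offsets = NULL;

	if (rt_lookup(root, blkno, &val))
		offsets = (Bitmapset *) (uintptr_t) val;

	offsets = bms_add_member(offsets, offnum);

	/* bms_add_member() may repalloc, so always store the (new) pointer */
	rt_insert(root, blkno, (uint64_t) (uintptr_t) offsets);
}

static bool
deadtuples_exists(rt_node *root, ItemPointer tid)
{
	uint64_t	val;

	if (!rt_lookup(root, ItemPointerGetBlockNumber(tid), &val))
		return false;

	return bms_is_member(ItemPointerGetOffsetNumber(tid),
						 (Bitmapset *) (uintptr_t) val);
}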


This largely works well, except when there are a large number of evenly
spread out dead tuples. I don't think that's a particularly common
situation, but it's worth considering anyway.


The reason the memory usage can be larger for sparse workloads is that
sparsely distributed keys can lead to tree nodes with only one child. As
the nodes are quite large (1<<6 pointers to further children, i.e. 512
bytes with 8 byte pointers), a chain of mostly single-child nodes can
lead to a large increase in memory usage.

I have toyed with implementing adaptively sized radix nodes, as proposed
in https://db.in.tum.de/~leis/papers/ART.pdf - but haven't gotten it
quite working yet.
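
For reference, the node kinds from that paper look roughly like the
following (a sketch of the layout only, sized for 8-bit key chunks as in
the paper; with 6-bit chunks the largest node would only need 64 slots):

#include <stdint.h>

/*
 * Adaptive node layouts along the lines of the ART paper.  A node starts
 * out as the smallest kind and is grown as children are added.
 */
typedef struct art_node_base
{
	uint8_t		kind;			/* which of the four layouts below */
	uint8_t		nchildren;		/* number of children currently stored */
} art_node_base;

/* up to 4 children: linear search over the key bytes */
typedef struct art_node_4
{
	art_node_base base;
	uint8_t		keys[4];
	void	   *children[4];
} art_node_4;

/* up to 16 children: search over sorted key bytes (SIMD-friendly) */
typedef struct art_node_16
{
	art_node_base base;
	uint8_t		keys[16];
	void	   *children[16];
} art_node_16;

/* up to 48 children: 256-entry index into a dense child array */
typedef struct art_node_48
{
	art_node_base base;
	uint8_t		child_index[256];	/* 0xFF means "no child" */
	void	   *children[48];
} art_node_48;

/* fully expanded: one slot per possible key byte */
typedef struct art_node_256
{
	art_node_base base;
	void	   *children[256];
} art_node_256;

The win for the sparse case is that a chain of single-child nodes then
only costs a few dozen bytes per level instead of 512.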

Greetings,

Andres Freund


