Re: [PoC] Improve dead tuple storage for lazy vacuum - Mailing list pgsql-hackers

From Masahiko Sawada
Subject Re: [PoC] Improve dead tuple storage for lazy vacuum
Msg-id CAD21AoBdz6XW5bMSN7EVhtyXOLj32YAt6+evJLPMEZvaajKB-Q@mail.gmail.com
In response to Re: [PoC] Improve dead tuple storage for lazy vacuum  (Andres Freund <andres@anarazel.de>)
List pgsql-hackers
On Fri, Jul 9, 2021 at 12:53 PM Andres Freund <andres@anarazel.de> wrote:
>
> Hi,
>
>
> On 2021-07-07 20:46:38 +0900, Masahiko Sawada wrote:
> > 1. It can't allocate more than 1GB. There was a discussion about
> > eliminating this limitation by using MemoryContextAllocHuge(), but
> > there were concerns about point 2 [1].
> >
> > 2. It allocates the whole memory space at once.
> >
> > 3. Lookup performance is slow (O(logN)).
> >
> > I’ve done some experiments in this area and would like to share the
> > results and discuss ideas.
>
> Yea, this is a serious issue.
>
>
> 3) could possibly be addressed to a decent degree without changing the
> fundamental datastructure too much. There are some sizable and trivial
> wins to be had just by changing vac_cmp_itemptr() to compare int64s and
> by using an open-coded bsearch().
>
> The big problem with bsearch isn't imo the O(log(n)) complexity - it's
> that it has abominably bad cache locality. And that can be addressed,
> see https://arxiv.org/ftp/arxiv/papers/1509/1509.05053.pdf
>
> Imo 2) isn't really that hard a problem to improve, even if we were to
> stay with the current bsearch approach. Reallocation with an aggressive
> growth factor or such isn't that bad.
>
>
> That's not to say we ought to stay with binary search...
>
>
>
> > Problems and Solutions
> > ======================
> >
> > Firstly, I've considered using existing data structures:
> > IntegerSet (src/backend/lib/integerset.c) and
> > TIDBitmap (src/backend/nodes/tidbitmap.c). Each addresses point 1, but
> > only one of points 2 and 3. IntegerSet uses less memory thanks to
> > Simple-8b encoding, but lookup is still slow, O(logN), since it's a
> > tree structure. On the other hand, TIDBitmap has good lookup
> > performance, O(1), but can use unnecessarily large amounts of memory
> > in some cases since it always allocates enough bitmap space to cover
> > all possible offsets. With 8kB blocks, the maximum number of line
> > pointers in a heap page is 291 (cf. MaxHeapTuplesPerPage), so the
> > bitmap is 40 bytes long and we always need 46 bytes in total per
> > block, including other meta information.
>
> Imo tidbitmap isn't particularly good, even in the current use cases -
> it's constraining in what we can store (a problem for other AMs), not
> actually that dense, the lossy mode doesn't choose well what information
> to lose, etc.
>
> It'd be nice if we came up with a datastructure that could also replace
> the bitmap scan cases.

Agreed.
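
BTW, coming back to the int64 comparison idea above, I understand it as
packing the block number and offset into a single 64-bit key so that the
comparator becomes a single integer compare. A minimal, untested sketch
(simplified types, not the real ItemPointerData/vac_cmp_itemptr() code):

/*
 * Simplified sketch only: encode a TID into a single uint64 so that
 * comparisons become one integer compare instead of comparing the block
 * number and the offset separately.  The struct below is illustrative,
 * not the real ItemPointerData layout.
 */
#include <stdint.h>

typedef struct
{
    uint32_t    block;      /* heap block number */
    uint16_t    offset;     /* line pointer offset within the block */
} SimpleTid;

static inline uint64_t
tid_encode(const SimpleTid *tid)
{
    /* block number in the high bits, offset in the low 16 bits */
    return ((uint64_t) tid->block << 16) | tid->offset;
}

/* comparator over pre-encoded keys, usable with qsort()/bsearch() */
static int
cmp_encoded_tid(const void *a, const void *b)
{
    uint64_t    lhs = *(const uint64_t *) a;
    uint64_t    rhs = *(const uint64_t *) b;

    return (lhs > rhs) - (lhs < rhs);
}

If the keys are encoded once while the dead tuples are collected, each
lookup is a single 64-bit compare, and an open-coded bsearch() or the
cache-friendlier layout from the paper above becomes easier to
experiment with.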

>
>
> > The data structure is somewhat similar to TIDBitmap. It consists of a
> > hash table and a container area; the hash table has one entry per
> > block, and each block entry allocates its own memory space, called a
> > container, in the container area to store its offset numbers. The
> > container area is simply an array of bytes and can be enlarged as
> > needed. Within the container area, the representation of the offset
> > numbers varies depending on their cardinality. There are three
> > container types: array, bitmap, and run.
>
> Not a huge fan of encoding this much knowledge about the tid layout...
>
>
> > For example, if there are two dead tuples at offsets 1 and 150, it
> > uses an array container holding two 2-byte integers representing 1
> > and 150, using 4 bytes in total. If we used the bitmap container in
> > this case, we would need 20 bytes instead. On the other hand, if
> > there are 20 consecutive dead tuples from offset 1 to 20, it uses a
> > run container, which holds an array of pairs of 2-byte integers: the
> > first value in each pair represents a starting offset number, whereas
> > the second represents the run's length. Therefore, in this case, the
> > run container uses only 4 bytes in total. Finally, if there are dead
> > tuples at every other offset from 1 to 100, it uses a bitmap
> > container holding an uncompressed bitmap, using 13 bytes. We also
> > need 16 bytes per block for the hash table entry.
> >
> > The lookup complexity of a bitmap container is O(1), whereas that of
> > an array or run container is O(N) or O(logN); but since the number of
> > elements in those two containers should not be large, it would not be
> > a problem.
>
> Hm. Why is O(N) not an issue? Consider e.g. the case of a table in which
> many tuples have been deleted. In cases where the "run" storage is
> cheaper (e.g. because there are high offset numbers due to HOT pruning),
> we could end up regularly scanning a few hundred entries for a
> match. That's not cheap anymore.

With 8kB blocks, the maximum size of a bitmap container is 37 bytes.
IOW, the other two container types are always smaller than 37 bytes.
Since the run container uses 4 bytes per run, the number of runs in a
run container can never be more than 9. Even with 32kB blocks, we don't
have more than 37 runs. So I think N is small enough in this case.
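
To make that concrete, the containers I have in mind look roughly like
this (an untested sketch; names and the exact layout are provisional,
alignment of the container area is ignored, and the sizes assume 8kB
blocks, i.e. at most 291 offsets per block):

#include <stdbool.h>
#include <stdint.h>

#define MAX_OFFSETS_PER_BLOCK   291     /* MaxHeapTuplesPerPage with 8kB blocks */
#define BITMAP_CONTAINER_BYTES  ((MAX_OFFSETS_PER_BLOCK + 7) / 8)  /* 37 bytes */

typedef enum
{
    CONTAINER_ARRAY,            /* sorted array of 2-byte offset numbers */
    CONTAINER_BITMAP,           /* uncompressed bitmap, one bit per offset */
    CONTAINER_RUN               /* (start, length) pairs of 2-byte integers */
} ContainerType;

/* Hash table entry per block; 'container' points into the container area. */
typedef struct
{
    uint32_t        blkno;
    ContainerType   type;
    uint16_t        nelems;     /* #offsets or #runs; unused for bitmap */
    uint8_t        *container;
} BlockEntry;

/*
 * Membership test.  The array and run containers are scanned linearly,
 * but any container that would grow beyond BITMAP_CONTAINER_BYTES is
 * stored as a bitmap instead, so an array has at most 18 offsets and a
 * run container at most 9 runs -- N stays small.
 */
static bool
block_entry_contains(const BlockEntry *entry, uint16_t off)
{
    switch (entry->type)
    {
        case CONTAINER_ARRAY:
            {
                const uint16_t *offsets = (const uint16_t *) entry->container;

                for (int i = 0; i < entry->nelems; i++)
                    if (offsets[i] == off)
                        return true;
                return false;
            }
        case CONTAINER_BITMAP:
            /* offset numbers are 1-based */
            return (entry->container[(off - 1) / 8] >> ((off - 1) % 8)) & 1;
        case CONTAINER_RUN:
            {
                const uint16_t *runs = (const uint16_t *) entry->container;

                for (int i = 0; i < entry->nelems; i++)
                {
                    uint16_t    start = runs[2 * i];
                    uint16_t    len = runs[2 * i + 1];

                    if (off >= start && off < start + len)
                        return true;
                }
                return false;
            }
    }
    return false;
}

The real entry layout and the conversion rules between container types
are of course still open; the point is just that the linear scans are
bounded by the 37-byte ceiling.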

>
>
> > Evaluation
> > ========
> >
> > Before implementing this idea and integrating it with lazy vacuum
> > code, I've implemented a benchmark tool dedicated to evaluating
> > lazy_tid_reaped() performance[4].
>
> Good idea!
>
>
> > In all test cases, I simulated a table with 1,000,000 blocks where
> > every block has at least one dead tuple.
>
> That doesn't strike me as a particularly common scenario? I think it's
> quite rare for dead tuples to be spread so evenly but sparsely. In
> particular, it's very common for there to be long runs of dead tuples
> separated by long ranges of no dead tuples at all...

Agreed. I'll test with such scenarios.
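For example, something along these lines to generate the dead tuples
(the run lengths and per-block density below are arbitrary placeholders,
just to get the clustered shape):

#include <stdint.h>

#define TOTAL_BLOCKS            1000000
#define DEAD_RUN_BLOCKS         1000    /* consecutive blocks containing dead tuples */
#define CLEAN_RUN_BLOCKS        9000    /* consecutive blocks with no dead tuples */
#define DEAD_OFFSETS_PER_BLOCK  50      /* offsets 1..50 are dead in each dead block */

/* Feed clustered dead tuples into the storage under test via 'record_dead'. */
static void
generate_clustered_dead_tuples(void (*record_dead) (uint32_t blkno, uint16_t off))
{
    uint32_t    blkno = 0;

    while (blkno < TOTAL_BLOCKS)
    {
        /* a long run of blocks where every block has dead line pointers */
        for (uint32_t i = 0; i < DEAD_RUN_BLOCKS && blkno < TOTAL_BLOCKS; i++, blkno++)
            for (uint16_t off = 1; off <= DEAD_OFFSETS_PER_BLOCK; off++)
                record_dead(blkno, off);

        /* followed by a long range of blocks with no dead tuples at all */
        blkno += CLEAN_RUN_BLOCKS;
    }
}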

>
>
> > The benchmark scenario is that for
> > each virtual heap tuple we check whether its TID is in the dead
> > tuple storage. Here are the results of execution time in milliseconds
> > and memory usage in bytes:
>
> In which order are the dead tuples checked? Looks like in sequential
> order? In the case of an index over a column that's not correlated with
> the heap order, the lookups are often much more random - which can
> influence lookup performance drastically, due to differences in cache
> locality, and will make some structures look better or worse than
> others.

Good point. It's sequential order, which is not realistic. I'll test
again after shuffling the virtual index tuples.
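Probably a plain Fisher-Yates shuffle of the virtual TIDs before probing
is enough to make the lookup order random; a quick sketch (the element
type here is a uint64-encoded TID, but that detail doesn't matter):

#include <stdint.h>
#include <stdlib.h>

/*
 * Shuffle the virtual index TIDs so that the lazy_tid_reaped() probes
 * arrive in random heap order rather than sequential order.  rand() is
 * good enough for a benchmark; modulo bias doesn't matter here.
 */
static void
shuffle_tids(uint64_t *tids, size_t ntids)
{
    if (ntids < 2)
        return;

    for (size_t i = ntids - 1; i > 0; i--)
    {
        size_t      j = (size_t) rand() % (i + 1);
        uint64_t    tmp = tids[i];

        tids[i] = tids[j];
        tids[j] = tmp;
    }
}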

Regards,

--
Masahiko Sawada
EDB:  https://www.enterprisedb.com/


