Re: [PoC] Improve dead tuple storage for lazy vacuum - Mailing list pgsql-hackers

From Andres Freund
Subject Re: [PoC] Improve dead tuple storage for lazy vacuum
Date
Msg-id 20210709035332.nfnzbekfsceyqdbl@alap3.anarazel.de
In response to [PoC] Improve dead tuple storage for lazy vacuum  (Masahiko Sawada <sawada.mshk@gmail.com>)
Responses Re: [PoC] Improve dead tuple storage for lazy vacuum  (Andres Freund <andres@anarazel.de>)
Re: [PoC] Improve dead tuple storage for lazy vacuum  (Masahiko Sawada <sawada.mshk@gmail.com>)
List pgsql-hackers
Hi,


On 2021-07-07 20:46:38 +0900, Masahiko Sawada wrote:
> 1. It cannot allocate more than 1GB. There was a discussion about
> eliminating this limitation by using MemoryContextAllocHuge(), but
> there were concerns about point 2 [1].
>
> 2. It allocates the whole memory space at once.
>
> 3. Slow lookup performance (O(logN)).
>
> I’ve done some experiments in this area and would like to share the
> results and discuss ideas.

Yea, this is a serious issue.


3) could possibly be addressed to a decent degree without changing the
fundamental datastructure too much. There are sizable and fairly
trivial wins to be had just by changing vac_cmp_itemptr() to compare
int64s and by using an open-coded bsearch().
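
E.g. something along these lines (untested sketch; the encoding helper
is made up, the point being that the comparator boils down to a single
integer comparison):

/* Pack block and offset into one integer; not the actual patch. */
static inline uint64
itemptr_to_uint64(ItemPointer itemptr)
{
	return ((uint64) ItemPointerGetBlockNumber(itemptr) << 16) |
		(uint64) ItemPointerGetOffsetNumber(itemptr);
}

static int
vac_cmp_itemptr(const void *left, const void *right)
{
	uint64		l = itemptr_to_uint64((ItemPointer) left);
	uint64		r = itemptr_to_uint64((ItemPointer) right);

	if (l < r)
		return -1;
	if (l > r)
		return 1;
	return 0;
}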

The big problem with bsearch isn't imo the O(log(n)) complexity - it's
that it has abominably bad cache locality. And that can be addressed,
see e.g. https://arxiv.org/ftp/arxiv/papers/1509/1509.05053.pdf
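
Very roughly, the approach from that paper applied here would look
something like this (untested sketch; assumes the dead TIDs have been
encoded as uint64s and rearranged once, after collection, into
Eytzinger/BFS order, where slot k's children live at 2k and 2k + 1):

static bool
eytzinger_member(const uint64 *tids, int ntids, uint64 key)
{
	/* tids[] is 1-based here, i.e. tids[1..ntids] hold the keys */
	int			k = 1;

	while (k <= ntids)
		k = 2 * k + (tids[k] < key);	/* branch-free descent */

	/* strip the trailing right-turns to land on the lower bound */
	k >>= __builtin_ffs(~k);

	return k != 0 && tids[k] == key;
}

The top levels of the implicit tree all sit in the first few
cachelines, so they stay cached across lookups, and the access pattern
is regular enough to prefetch - that's where the win over a plain
bsearch() comes from.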

Imo 2) isn't really that hard a problem to improve, even if we were to
stay with the current bsearch approach. Reallocating with an aggressive
growth factor or the like isn't that bad.
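
I.e. something like the following instead of sizing the array for
maintenance_work_mem up front (sketch only; the struct is a stand-in
for LVDeadTuples, which really uses a flexible array member):

typedef struct DeadTupleArray
{
	int64		max_tuples;		/* allocated slots */
	int64		num_tuples;		/* used slots */
	ItemPointerData *itemptrs;
} DeadTupleArray;

static void
dead_tuple_array_grow(DeadTupleArray *dt, int64 hard_cap)
{
	if (dt->num_tuples < dt->max_tuples)
		return;

	/* double each time, but never beyond what m_w_m allows */
	dt->max_tuples = Min(dt->max_tuples * 2, hard_cap);
	dt->itemptrs = (ItemPointerData *)
		repalloc_huge(dt->itemptrs,
					  sizeof(ItemPointerData) * (Size) dt->max_tuples);
}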


That's not to say we ought to stay with binary search...



> Problems and Solutions
> ===============
>
> Firstly, I've considered using existing data structures:
> IntegerSet (src/backend/lib/integerset.c) and
> TIDBitmap (src/backend/nodes/tidbitmap.c). Both address point 1, but
> each addresses only one of points 2 and 3. IntegerSet uses less
> memory thanks to Simple-8b encoding but is slow at lookup, still
> O(logN), since it's a tree structure. On the other hand, TIDBitmap
> has good lookup performance, O(1), but can use unnecessarily large
> amounts of memory in some cases since it always allocates enough
> bitmap space to cover all possible offsets. With 8kB blocks, the
> maximum number of line pointers in a heap page is 291
> (cf. MaxHeapTuplesPerPage), so the bitmap is 40 bytes long and we
> always need 46 bytes in total per block, including other meta
> information.

Imo tidbitmap isn't particularly good, even in the current use cases -
it's constraining in what we can store (a problem for other AMs), it's
not actually that dense, the lossy mode doesn't choose well what
information to lose, etc.

It'd be nice if we came up with a datastructure that could also replace
the bitmap scan cases.


> The data structure is somewhat similar to TIDBitmap. It consists of a
> hash table and a container area; the hash table has one entry per
> block, and each block entry allocates its own memory space, called a
> container, in the container area to store its offset numbers. The
> container area is actually an array of bytes and can be enlarged as
> needed. In the container area, the data representation of the offset
> numbers varies depending on their cardinality. There are three
> container types: array, bitmap, and run.

Not a huge fan of encoding this much knowledge about the tid layout...


> For example, if there are two dead tuples at offsets 1 and 150, it
> uses the array container, which holds an array of two 2-byte integers
> representing 1 and 150, using 4 bytes in total. If we used the bitmap
> container in this case, we would need 20 bytes instead. On the other
> hand, if there are 20 consecutive dead tuples at offsets 1 to 20, it
> uses the run container, which holds an array of pairs of 2-byte
> integers: the first value in each pair is a starting offset number
> and the second is the run length. Therefore, in this case, the run
> container uses only 4 bytes in total. Finally, if there are dead
> tuples at every other offset from 1 to 100, it uses the bitmap
> container, which holds an uncompressed bitmap, using 13 bytes. On top
> of that, we need another 16 bytes per block for the hash table entry.
>
> The lookup complexity of a bitmap container is O(1), whereas that of
> an array or run container is O(N) or O(logN); but since the number of
> elements in those two containers should not be large, it would not be
> a problem.

Hm. Why is O(N) not an issue? Consider e.g. the case of a table in which
many tuples have been deleted. In cases where the "run" storage is
cheaper (e.g. because there are high offset numbers due to HOT pruning),
we could end up regularly scanning a few hundred entries for a
match. That's not cheap anymore.
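
To make that concrete, here's roughly the shape of what's being
described, as I understand it (names and layout invented here, not
taken from your patch):

typedef enum DeadTupleContainerType
{
	CONTAINER_ARRAY,			/* sorted array of OffsetNumbers */
	CONTAINER_BITMAP,			/* uncompressed bitmap of offsets */
	CONTAINER_RUN				/* (start, length) OffsetNumber pairs */
} DeadTupleContainerType;

typedef struct DeadTupleBlockEntry
{
	BlockNumber blkno;			/* hash key */
	DeadTupleContainerType type;
	uint32		container_off;	/* byte offset into the container area */
	uint16		nitems;			/* entries (or pairs) in the container */
} DeadTupleBlockEntry;

/* run container lookup - this linear scan is the O(N) I'm worried about */
static bool
run_container_contains(const OffsetNumber *pairs, int npairs,
					   OffsetNumber off)
{
	for (int i = 0; i < npairs; i++)
	{
		OffsetNumber start = pairs[2 * i];
		OffsetNumber len = pairs[2 * i + 1];

		if (off < start)
			break;				/* runs are sorted; no later run can match */
		if (off < start + len)
			return true;
	}
	return false;
}

lazy_tid_reaped() does one of these lookups per index tuple, so that
loop body gets executed a lot during index vacuuming.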


> Evaluation
> ========
>
> Before implementing this idea and integrating it with lazy vacuum
> code, I've implemented a benchmark tool dedicated to evaluating
> lazy_tid_reaped() performance[4].

Good idea!


> In all test cases, I simulated that the table has 1,000,000 blocks and
> every block has at least one dead tuple.

That doesn't strike me as a particularly common scenario? I think it's
quite rare for dead tuples to be spread that evenly but sparsely. In
particular, it's very common for there to be long runs of dead tuples
separated by long ranges with no dead tuples at all...
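
E.g. it'd be good if the benchmark also covered a clustered
distribution - runs of entirely dead blocks separated by long clean
gaps. Something like this (hypothetical helper for your benchmark
tool, parameter names made up):

/* The caller is assumed to have allocated tids[] large enough. */
static size_t
generate_clustered_dead_tuples(ItemPointerData *tids, BlockNumber nblocks,
							   BlockNumber run_len, BlockNumber gap_len,
							   OffsetNumber dead_per_block)
{
	size_t		n = 0;

	for (BlockNumber blk = 0; blk < nblocks; blk += run_len + gap_len)
	{
		/* run_len consecutive blocks with dead tuples ... */
		for (BlockNumber b = blk; b < blk + run_len && b < nblocks; b++)
		{
			for (OffsetNumber off = FirstOffsetNumber;
				 off <= dead_per_block; off++)
				ItemPointerSet(&tids[n++], b, off);
		}
		/* ... followed by gap_len blocks with none */
	}

	return n;
}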


> The benchmark scenario is that for each virtual heap tuple we check
> whether its TID is in the dead tuple storage. Here are the results of
> execution time in milliseconds and memory usage in bytes:

In which order are the dead tuples checked? Looks like in sequential
order? In the case of an index over a column that's not correlated with
the heap order, the lookups are often much more random - which can
influence lookup performance drastically, due to differences in cache
locality. Which will make some structures look worse/better than
others.
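
For that it'd be good if the benchmark could also probe in a shuffled
order, e.g. (hypothetical helper; random() from stdlib and its modulo
bias are fine for a benchmark):

/* Fisher-Yates shuffle of the probe order, to approximate lookups
 * coming from an index whose order isn't correlated with the heap. */
static void
shuffle_lookup_order(ItemPointerData *tids, size_t ntids)
{
	if (ntids < 2)
		return;

	for (size_t i = ntids - 1; i > 0; i--)
	{
		size_t		j = (size_t) random() % (i + 1);
		ItemPointerData tmp = tids[i];

		tids[i] = tids[j];
		tids[j] = tmp;
	}
}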


Greetings,

Andres Freund


