From: Yura Sokolov
Subject: Re: [PoC] Improve dead tuple storage for lazy vacuum
Msg-id: 5a884fd4189793279ecc33edefb938ad@postgrespro.ru
In response to: Re: [PoC] Improve dead tuple storage for lazy vacuum (Robert Haas <robertmhaas@gmail.com>)
List: pgsql-hackers

Robert Haas wrote on 2021-07-29 20:15:
> On Thu, Jul 29, 2021 at 5:11 AM Masahiko Sawada <sawada.mshk@gmail.com> 
> wrote:
>> Indeed. Given that the radix tree itself has other use cases, I have
>> no concern about using radix tree for vacuum's dead tuples storage. It
>> will be better to have one that can be generally used and has some
>> optimizations that are helpful also for vacuum's use case, rather than
>> having one that is very optimized only for vacuum's use case.
> 
> What I'm about to say might be a really stupid idea, especially since
> I haven't looked at any of the code already posted, but what I'm
> wondering about is whether we need a full radix tree or maybe just a
> radix-like lookup aid. For example, suppose that for a relation <= 8MB
> in size, we create an array of 1024 elements indexed by block number.
> Each element of the array stores an offset into the dead TID array.
> When you need to probe for a TID, you look up blkno and blkno + 1 in
> the array and then bsearch only between those two offsets. For bigger
> relations, a two or three level structure could be built, or it could
> always be 3 levels. This could even be done on demand, so you
> initialize all of the elements to some special value that means "not
> computed yet" and then fill them the first time they're needed,
> perhaps with another special value that means "no TIDs in that block".
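For concreteness, a probe under that scheme might look roughly like the
sketch below. This is purely illustrative (the names and the omitted
lazy-fill handling are mine, not from any posted patch): lookup[] holds,
for each block, the index of the first dead TID for that block within
the sorted dead-TID array, so the bsearch is confined to one block's
slice.

#include "postgres.h"
#include "storage/itemptr.h"

static bool
tid_is_dead(BlockNumber blkno, OffsetNumber offnum,
            const uint32 *lookup,       /* one entry per block, plus one */
            ItemPointerData *dead_tids) /* sorted by (block, offset) */
{
    uint32      lo = lookup[blkno];
    uint32      hi = lookup[blkno + 1];

    /* the "not computed yet" lazy filling is omitted for brevity */
    while (lo < hi)
    {
        uint32      mid = lo + (hi - lo) / 2;
        OffsetNumber midoff = ItemPointerGetOffsetNumber(&dead_tids[mid]);

        if (midoff == offnum)
            return true;
        else if (midoff < offnum)
            lo = mid + 1;
        else
            hi = mid;
    }
    return false;
}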

An 8MB relation is not a problem, IMO. There is no need to do anything
special to handle an 8MB relation.

The problem is a 2TB relation. It has 256M pages and, let's suppose, 3G
dead tuples.

Then the per-page offset array will be 2GB (an 8-byte offset into the
tuple array for each of 256M pages) and the tuple offset array will be
6GB (a 2-byte offset per dead tuple): 8GB in total.

We could instead build the offset array over only the high 3 bytes of
the block number. Then we have a 1M-entry offset array (2^20 entries for
256M pages) weighing 8MB, plus an array of 3-byte tuple pointers (the
remaining low byte of the block number and the 2 bytes of the tuple
offset) weighing 9GB.
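
A sketch of that pointer layout, just to illustrate (the names are mine,
not from the patch): the high bytes of the block number select the slice
via the offset array, and each entry in the tuple pointer array packs
the remaining low byte of the block number with the 2-byte offset
number.

#include "postgres.h"
#include "storage/block.h"
#include "storage/off.h"

/* 3 bytes per dead tuple: the high bytes of the block number are
 * implied by which slice of this array the offset array points to. */
typedef struct PackedTid
{
    uint8       blk_low;    /* low byte of BlockNumber */
    uint8       off_high;   /* high byte of OffsetNumber */
    uint8       off_low;    /* low byte of OffsetNumber */
} PackedTid;

static PackedTid
pack_tid(BlockNumber blkno, OffsetNumber offnum)
{
    PackedTid   p;

    p.blk_low = (uint8) (blkno & 0xFF);
    p.off_high = (uint8) (offnum >> 8);
    p.off_low = (uint8) (offnum & 0xFF);
    return p;
}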

But using per-batch compression schemes, the amortized cost could be
4 bytes per page and 1 byte per tuple: 1GB + 3GB = 4GB of memory.
Granted, that is not as guaranteed as with the plain array approach, but
95% of the time consumption is that low or even lower. Better still, the
more tuples are dead, the better the compression works: a page with all
of its tuples dead could be encoded in as little as 5 bytes. Therefore,
overall memory consumption is more stable and predictable.
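(As an illustration only, since the actual batch encoding may differ in
detail: an all-dead page could be stored as roughly a 4-byte block
number, or an even shorter varint delta from the previous block, plus a
single byte marking "every offset on this page is dead". That is where a
figure like 5 bytes comes from.)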

Lower memory consumption for dead tuple storage means less chance that
indexes have to be scanned two or more times. That gives users a more
predictable experience.

> I don't know if this is better, but I do kind of like the fact that
> the basic representation is just an array. It makes it really easy to
> predict how much memory will be needed for a given number of dead
> TIDs, and it's very DSM-friendly as well.

The whole thing could be encoded in one single array of bytes. Just pass
"pointer-to-array" + "array-size" to the constructor and use a "bump
allocator" inside. A complex logical structure doesn't imply
"DSM-unfriendliness", provided it is suitably designed.
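
A minimal sketch of what I mean, assuming the caller hands in a buffer
that may live in a DSM segment (names are illustrative, not the actual
code):

#include "postgres.h"

/* All bookkeeping lives alongside the caller-supplied array; there are
 * no per-allocation chunk headers and nothing is ever reallocated. */
typedef struct BumpAllocator
{
    char       *base;       /* caller-supplied array */
    Size        size;       /* its total size */
    Size        used;       /* bytes handed out so far */
} BumpAllocator;

static void
bump_init(BumpAllocator *a, void *buf, Size size)
{
    a->base = (char *) buf;
    a->size = size;
    a->used = 0;
}

static void *
bump_alloc(BumpAllocator *a, Size nbytes)
{
    void       *p;

    nbytes = MAXALIGN(nbytes);
    if (a->used + nbytes > a->size)
        return NULL;        /* out of space; caller decides what to do */
    p = a->base + a->used;
    a->used += nbytes;
    return p;
}

(In a real DSM setting "base" would be an offset rather than a raw
pointer, but the point stands.)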

In fact, my code uses a bump allocator internally to avoid the
per-allocation overhead of "aset", "slab", or "generational". And the
IntegerSet2 version even uses it for all allocations, since it has no
reallocatable parts.

Granted, if a data structure has reallocatable parts, it could be less
friendly to DSM.

regards,

---
Yura Sokolov
y.sokolov@postgrespro.ru
funny.falcon@gmail.com


