Re: [PoC] Improve dead tuple storage for lazy vacuum - Mailing list pgsql-hackers
From: Yura Sokolov
Subject: Re: [PoC] Improve dead tuple storage for lazy vacuum
Msg-id: 5a884fd4189793279ecc33edefb938ad@postgrespro.ru
In response to: Re: [PoC] Improve dead tuple storage for lazy vacuum (Robert Haas <robertmhaas@gmail.com>)
Responses: Re: [PoC] Improve dead tuple storage for lazy vacuum
List: pgsql-hackers
Robert Haas wrote 2021-07-29 20:15:
> On Thu, Jul 29, 2021 at 5:11 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>> Indeed. Given that the radix tree itself has other use cases, I have
>> no concern about using radix tree for vacuum's dead tuples storage. It
>> will be better to have one that can be generally used and has some
>> optimizations that are helpful also for vacuum's use case, rather than
>> having one that is very optimized only for vacuum's use case.
>
> What I'm about to say might be a really stupid idea, especially since
> I haven't looked at any of the code already posted, but what I'm
> wondering about is whether we need a full radix tree or maybe just a
> radix-like lookup aid. For example, suppose that for a relation <= 8MB
> in size, we create an array of 1024 elements indexed by block number.
> Each element of the array stores an offset into the dead TID array.
> When you need to probe for a TID, you look up blkno and blkno + 1 in
> the array and then bsearch only between those two offsets. For bigger
> relations, a two or three level structure could be built, or it could
> always be 3 levels. This could even be done on demand, so you
> initialize all of the elements to some special value that means "not
> computed yet" and then fill them the first time they're needed,
> perhaps with another special value that means "no TIDs in that block".

An 8MB relation is not a problem, imo; there is no need to do anything special to handle it. The problem is a 2TB relation. It has 256M pages and, let's suppose, 3G dead tuples. Then the offset array will be 2GB and the tuple offset array will be 6GB (a 2-byte offset per tuple), 8GB in total.

We could build the offset array only over the high 3 bytes of the block number. Then we'd have a 1M-entry offset array weighing 8MB, plus an array of 3-byte tuple pointers (1 remaining byte of the block number and 2 bytes of tuple offset) weighing 9GB.
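To make the arithmetic above concrete, here is a back-of-the-envelope sketch. The entry widths (8-byte offsets into the TID array, 2-byte tuple offsets, 3-byte packed pointers) are my assumptions chosen to reproduce the figures quoted in the text, not anything taken from the posted patches:

```python
# Rough memory estimates for dead-TID storage on a 2TB relation
# (8KB pages), matching the figures discussed above. Entry widths
# are assumptions that reproduce the quoted numbers.

GB = 1024 ** 3
M = 1024 ** 2

pages = 256 * M                 # 2TB / 8KB page size = 256M pages
dead_tuples = 3 * 1024 * M      # suppose ~3G dead tuples

# Plain approach: one offset-array entry per page, one 2-byte
# offset per dead tuple.
offset_array = pages * 8        # 8-byte offset per page  -> 2GB
tuple_array = dead_tuples * 2   # 2-byte offset per tuple -> 6GB

# Variant: index only the high 3 bytes of the block number, and
# pack the remaining byte of the block number with the 2-byte
# tuple offset into a 3-byte pointer.
groups = pages // 256           # 1M groups (low byte dropped)
group_array = groups * 8        # -> 8MB
pointers = dead_tuples * 3      # 3-byte pointer per tuple -> 9GB

print("plain: %.1fGB + %.1fGB" % (offset_array / GB, tuple_array / GB))
print("grouped: %dMB + %.1fGB" % (group_array / M, pointers / GB))
```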
But with per-batch compression schemes, the cost could be amortized to 4 bytes per page and 1 byte per tuple: 1GB + 3GB = 4GB of memory. Yes, it is not as guaranteed as with the plain array approach, but 95% of the time it is that low or even lower. Better still: the more tuples are dead, the better the compression works. A page with all its tuples dead could be encoded in as little as 5 bytes. Overall memory consumption is therefore more stable and predictable. And lower memory consumption for tuple storage means less chance that indexes have to be scanned two or more times, which gives a more predictable user experience.

> I don't know if this is better, but I do kind of like the fact that
> the basic representation is just an array. It makes it really easy to
> predict how much memory will be needed for a given number of dead
> TIDs, and it's very DSM-friendly as well.

The whole thing could be encoded in one single array of bytes: just give "pointer-to-array" + "array-size" to the constructor, and use a "bump allocator" inside. A complex logical structure doesn't imply "DSM-unfriendliness". Hmm... I mean, if it is suitably designed. In fact, my code uses a bump allocator internally to avoid the per-allocation overhead of "aset", "slab" or "generational". And the IntegerSet2 version even uses it for all allocations, since it has no reallocatable parts. Well, if a data structure has reallocatable parts, it could be less friendly to DSM.

regards,

---
Yura Sokolov
y.sokolov@postgrespro.ru
funny.falcon@gmail.com
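As an illustration of how a per-page encoding can hit those sizes, here is a toy sketch (my own layout, not PostgreSQL's or the patch's actual format): each page contributes a 4-byte block number plus a marker byte, so an all-dead page costs exactly 5 bytes, while a page with listed tuples pays roughly one extra byte per dead tuple by storing deltas between sorted offsets:

```python
# Toy per-page dead-tuple encoding (illustrative only; names and
# layout are assumptions, not the actual patch format).
#
# Layout per page: 4-byte little-endian block number, then either
#   - the ALL_DEAD marker byte (5 bytes total for a fully dead page), or
#   - a count byte followed by one delta byte per dead tuple.

ALL_DEAD = 0xFF  # marker: every tuple on the page is dead

def encode_page(blkno, offsets, all_dead=False):
    out = bytearray(blkno.to_bytes(4, "little"))
    if all_dead:
        out.append(ALL_DEAD)          # whole page dead: 5 bytes total
        return bytes(out)
    # Sketch limitation: assumes < 255 dead tuples per page and small
    # gaps between consecutive offsets, so each delta fits in one byte.
    assert len(offsets) < ALL_DEAD
    out.append(len(offsets))
    prev = 0
    for off in sorted(offsets):
        delta = off - prev
        assert 0 < delta < 0xFF
        out.append(delta)
        prev = off
    return bytes(out)
```

A real scheme would need an escape for large deltas (e.g. varints) and would likely batch pages together, but the amortized cost per tuple stays close to one byte when most deltas are small.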