Re: [PoC] Improve dead tuple storage for lazy vacuum - Mailing list pgsql-hackers

From:           Yura Sokolov
Subject:        Re: [PoC] Improve dead tuple storage for lazy vacuum
Date:
Msg-id:         c5a74d6832924e02c418e28573607f0b@postgrespro.ru
In response to: Re: [PoC] Improve dead tuple storage for lazy vacuum (Andres Freund <andres@anarazel.de>)
List:           pgsql-hackers
Hi,

I've dreamed of writing a more compact structure for vacuum for three
years, but life never gave me the time. Let me join the friendly
competition.

I've bet on the HATM approach: popcount-ing bitmaps of non-empty elements.

Novelties:

- 32 consecutive pages are stored together in a single sparse array
  (its elements are called "chunks"). A chunk contains:
  - its number,
  - a 4-byte bitmap of non-empty pages,
  - an array of non-empty page headers, 2 bytes each. A page header
    holds the offset of the page's bitmap in the bitmaps container
    (except when the page has just one dead tuple; then the tuple's
    offset is written into the header itself),
  - the container of concatenated bitmaps.
  I.e. page metadata overhead varies from 2.4 bytes per page (32 pages
  in a single chunk) to 18 bytes per page (1 page in a single chunk).
- If a page's bitmap is sparse, i.e. contains a lot of "all-zero" bytes,
  it is compressed by removing the zero bytes and indexing them with a
  two-level bitmap index (zero bytes in the first level are in turn
  removed using the second level). This mostly matters for 32kB pages,
  but I left it in since it is almost free.
- If a page's bitmap contains a lot of "all-one" bytes, it is inverted
  and then encoded as sparse.
- Chunks are allocated with a custom "allocator" that has no
  per-allocation overhead. That is possible because individual chunks
  never need to be freed: the whole allocator is released at once.
- The array of pointers to chunks is bitmap-indexed as well. This saves
  CPU time when not every group of 32 consecutive pages has at least one
  dead tuple, but costs time otherwise, so an additional optimization
  quickly skips the lookup for the first non-empty run of chunks.

(Ahhh, I believe this explanation is awful.)
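To make the description above a bit more concrete, here is a simplified
C sketch of a chunk and of the popcount-based page lookup. The names and
field widths (SVTChunk, pagemask, svt_chunk_page_slot, ...) are made up
for this mail rather than taken from the attached patch, and
__builtin_popcount stands in for whatever popcount primitive is actually
used:

#include <stdint.h>

#define PAGES_PER_CHUNK 32

/* One chunk covers 32 consecutive heap pages. */
typedef struct SVTChunk
{
    uint32_t    chunkno;    /* chunk number: blkno / PAGES_PER_CHUNK */
    uint32_t    pagemask;   /* bit i set => page i of the chunk is non-empty */

    /*
     * One 2-byte header per set bit in pagemask, in bit order.  A header
     * normally holds the offset of the page's tuple bitmap inside the
     * concatenated-bitmaps container; if a page has a single dead tuple,
     * its offset number is stored inline in the header instead.
     */
    uint16_t    pagehdr[];
} SVTChunk;

/*
 * Find the header slot for a page inside its chunk, or -1 if the page
 * has no dead tuples.  Headers are packed, so the slot index is simply
 * the number of pagemask bits set below the page's bit -- one popcount.
 */
static int
svt_chunk_page_slot(const SVTChunk *chunk, uint32_t blkno)
{
    uint32_t    bit = UINT32_C(1) << (blkno % PAGES_PER_CHUNK);

    if ((chunk->pagemask & bit) == 0)
        return -1;
    return __builtin_popcount(chunk->pagemask & (bit - 1));
}

The custom allocator mentioned above then only needs to hand out the
next free bytes of a large block (a bump/arena allocator): since chunks
are never freed individually, the whole arena can be released at once
when the map is destroyed.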
Andres Freund wrote 2021-07-20 02:49:
> Hi,
>
> On 2021-07-19 15:20:54 +0900, Masahiko Sawada wrote:
>> BTW is the implementation of the radix tree approach available
>> somewhere? If so I'd like to experiment with that too.
>>
>> > I have toyed with implementing adaptively large radix nodes like
>> > proposed in https://db.in.tum.de/~leis/papers/ART.pdf - but haven't
>> > gotten it quite working.
>>
>> That seems promising approach.
>
> I've since implemented some, but not all of the ideas of that paper
> (adaptive node sizes, but not the tree compression pieces).
>
> E.g. for
>
> select prepare(
>     1000000, -- max block
>     20,      -- # of dead tuples per page
>     10,      -- dead tuples interval within a page
>     1        -- page inteval
> );
>
>          attach     size       shuffled    ordered
> array    69 ms      120 MB     84.87 s     8.66 s
> intset   173 ms     65 MB      68.82 s     11.75 s
> rtbm     201 ms     67 MB      11.54 s     1.35 s
> tbm      232 ms     100 MB     8.33 s      1.26 s
> vtbm     162 ms     58 MB      10.01 s     1.22 s
> radix    88 ms      42 MB      11.49 s     1.67 s
>
> and for
>
> select prepare(
>     1000000, -- max block
>     10,      -- # of dead tuples per page
>     1,       -- dead tuples interval within a page
>     1        -- page inteval
> );
>
>          attach     size       shuffled    ordered
> array    24 ms      60 MB      3.74 s      1.02 s
> intset   97 ms      49 MB      3.14 s      0.75 s
> rtbm     138 ms     36 MB      0.41 s      0.14 s
> tbm      198 ms     101 MB     0.41 s      0.14 s
> vtbm     118 ms     27 MB      0.39 s      0.12 s
> radix    33 ms      10 MB      0.28 s      0.10 s
>
> (this is an almost unfairly good case for radix)
>
> Running out of time to format the results of the other testcases before
> I have to run, unfortunately. radix uses 42MB both in test case 3 and
> 4.

My results (Ubuntu 20.04, Intel Core i7-1165G7):

Test1.

select prepare(1000000, 10, 20, 1); -- original

         attach     size       shuffled
array    29 ms      60 MB      93.99 s
intset   93 ms      49 MB      80.94 s
rtbm     171 ms     67 MB      14.05 s
tbm      238 ms     100 MB     8.36 s
vtbm     148 ms     59 MB      9.12 s
radix    100 ms     42 MB      11.81 s
svtm     75 ms      29 MB      8.90 s

select prepare(1000000, 20, 10, 1); -- Andres's variant

         attach     size       shuffled
array    61 ms      120 MB     111.91 s
intset   163 ms     66 MB      85.00 s
rtbm     236 ms     67 MB      10.72 s
tbm      290 ms     100 MB     8.40 s
vtbm     190 ms     59 MB      9.28 s
radix    117 ms     42 MB      12.00 s
svtm     98 ms      29 MB      8.77 s

Test2.

select prepare(1000000, 10, 1, 1);

         attach     size       shuffled
array    31 ms      60 MB      4.68 s
intset   97 ms      49 MB      4.03 s
rtbm     163 ms     36 MB      0.42 s
tbm      240 ms     100 MB     0.42 s
vtbm     136 ms     27 MB      0.36 s
radix    60 ms      10 MB      0.72 s
svtm     39 ms      6 MB       0.19 s

(The bad radix result is probably due to the smaller cache in the
notebook's CPU?)

Test3.

select prepare(1000000, 2, 100, 1);

         attach     size       shuffled
array    6 ms       12 MB      53.42 s
intset   23 ms      16 MB      54.99 s
rtbm     115 ms     38 MB      8.19 s
tbm      186 ms     100 MB     8.37 s
vtbm     105 ms     59 MB      9.08 s
radix    64 ms      42 MB      10.41 s
svtm     73 ms      10 MB      7.49 s

Test4.

select prepare(1000000, 100, 1, 1);

         attach     size       shuffled
array    304 ms     600 MB     75.12 s
intset   775 ms     98 MB      47.49 s
rtbm     356 ms     38 MB      4.11 s
tbm      539 ms     100 MB     4.20 s
vtbm     493 ms     42 MB      4.44 s
radix    263 ms     42 MB      6.05 s
svtm     360 ms     8 MB       3.49 s

Therefore the Specialized Vacuum Tid Map (svtm) always consumes the
least amount of memory and is usually faster. (I applied Andres's patch
for the slab allocator before testing.)

The attached patch is against commit 6753911a444e12e4b55 of your
pgtools, with Andres's patches for the radix method applied. I've also
pushed it to github:
https://github.com/funny-falcon/pgtools/tree/svtm/bdbench

regards,
Yura Sokolov