
From: Yura Sokolov
Subject: Re: [PoC] Improve dead tuple storage for lazy vacuum
Msg-id: c5a74d6832924e02c418e28573607f0b@postgrespro.ru
In response to: Re: [PoC] Improve dead tuple storage for lazy vacuum (Andres Freund <andres@anarazel.de>)
List: pgsql-hackers
Hi,

I've dreamed of writing a more compact structure for vacuum for three
years, but life didn't give me the time.

Let me join the friendly competition.

I've bet on the HATM approach: popcount-ing bitmaps for non-empty elements.

Novelties:
- 32 consecutive pages are stored together in a single sparse array
   (called a "chunk"). A chunk contains:
   - its number,
   - a 4-byte bitmap of non-empty pages,
   - an array of non-empty page headers, 2 bytes each.
     A page header contains the offset of the page's bitmap in the
     bitmaps container. (Except when there is just one dead tuple in a
     page; then it is written into the header itself.)
   - a container of concatenated bitmaps.
   (A sketch of this layout follows the list below.)

   I.e., page metadata overhead varies from 2.4 bytes per page (32 pages
   in a single chunk) to 18 bytes per page (1 page in a single chunk).

- If a page's bitmap is sparse, i.e. contains a lot of all-zero bytes,
   it is compressed by removing the zero bytes and indexing the remaining
   ones with a two-level bitmap index (zero bytes in the first level are
   in turn removed using the second level). This is mostly done for 32kB
   pages, but let it stay since it is almost free.

- If a page's bitmap contains a lot of all-one bytes, it is inverted
   and then encoded as sparse.

- Chunks are allocated with a custom "allocator" that has no
   per-allocation overhead. That is possible because there is no need
   to "free" individual chunks: the allocator is freed as a whole at once.

- The array of pointers to chunks is also bitmap-indexed. That saves CPU
   time when not every run of 32 consecutive pages has at least one dead
   tuple, but costs time otherwise. Therefore an additional optimization
   is added to quickly skip the lookup for the first non-empty run of
   chunks. (Ahhh, I believe this explanation is awful.)
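
To make that a bit more concrete, here is a minimal C sketch of how I
picture a chunk, the popcount-based header lookup, and the sparse
encoding of a page bitmap. All names (SvtmChunk, CHUNK_PAGES,
chunk_page_header_index, sparse_encode) and the exact field widths are
illustrative assumptions of mine, not the definitions from the attached
patch:

#include <stdint.h>
#include <string.h>

#define CHUNK_PAGES 32              /* pages covered by one chunk */

/* Illustrative layout only; the real struct in the patch may differ. */
typedef struct SvtmChunk
{
    uint32_t    chunk_no;           /* blkno / CHUNK_PAGES */
    uint32_t    nonempty_pages;     /* bit i set => page i has dead tuples */
    uint16_t    page_headers[];     /* one entry per set bit above: the
                                     * offset of the page's bitmap inside
                                     * the concatenated-bitmaps container
                                     * (which follows the headers in the
                                     * same allocation), or the single
                                     * dead tuple offset if the page has
                                     * only one dead tuple */
} SvtmChunk;

/*
 * Index of blkno's page header inside the chunk, or -1 if the page has
 * no dead tuples.  The index is just the number of non-empty pages that
 * precede it, i.e. a popcount of the lower bits of the page bitmap.
 */
static inline int
chunk_page_header_index(const SvtmChunk *chunk, uint32_t blkno)
{
    uint32_t    bit = blkno % CHUNK_PAGES;
    uint32_t    mask = UINT32_C(1) << bit;

    if ((chunk->nonempty_pages & mask) == 0)
        return -1;

    return __builtin_popcount(chunk->nonempty_pages & (mask - 1));
}

/*
 * Sparse encoding of a page bitmap: store only the non-zero bytes plus a
 * first-level bitmap recording which bytes were kept.  (The second level
 * over this index, and the "invert and encode as sparse" trick for
 * mostly-all-one bitmaps, are omitted here.)  Returns the number of
 * bytes kept.
 */
static int
sparse_encode(const uint8_t *bitmap, int nbytes,
              uint8_t *index_out, uint8_t *bytes_out)
{
    int         nkept = 0;

    memset(index_out, 0, (nbytes + 7) / 8);
    for (int i = 0; i < nbytes; i++)
    {
        if (bitmap[i] != 0)
        {
            index_out[i / 8] |= (uint8_t) (1 << (i % 8));
            bytes_out[nkept++] = bitmap[i];
        }
    }
    return nkept;
}

The same popcount trick applies one level up: the bitmap index over the
array of chunk pointers lets a lookup jump straight to the right chunk
instead of scanning empty slots.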

Andres Freund wrote 2021-07-20 02:49:
> Hi,
> 
> On 2021-07-19 15:20:54 +0900, Masahiko Sawada wrote:
>> BTW is the implementation of the radix tree approach available
>> somewhere? If so I'd like to experiment with that too.
>> 
>> >
>> > I have toyed with implementing adaptively large radix nodes like
>> > proposed in https://db.in.tum.de/~leis/papers/ART.pdf - but haven't
>> > gotten it quite working.
>> 
>> That seems promising approach.
> 
> I've since implemented some, but not all of the ideas of that paper
> (adaptive node sizes, but not the tree compression pieces).
> 
> E.g. for
> 
> select prepare(
> 1000000, -- max block
> 20, -- # of dead tuples per page
> 10, -- dead tuples interval within a page
> 1 -- page inteval
> );
>         attach   size    shuffled  ordered
> array    69 ms  120 MB   84.87 s   8.66 s
> intset  173 ms   65 MB   68.82 s  11.75 s
> rtbm    201 ms   67 MB   11.54 s   1.35 s
> tbm     232 ms  100 MB    8.33 s   1.26 s
> vtbm    162 ms   58 MB   10.01 s   1.22 s
> radix    88 ms   42 MB   11.49 s   1.67 s
> 
> and for
> select prepare(
> 1000000, -- max block
> 10, -- # of dead tuples per page
> 1, -- dead tuples interval within a page
> 1 -- page inteval
> );
> 
>         attach   size    shuffled  ordered
> array    24 ms   60 MB    3.74 s   1.02 s
> intset   97 ms   49 MB    3.14 s   0.75 s
> rtbm    138 ms   36 MB    0.41 s   0.14 s
> tbm     198 ms  101 MB    0.41 s   0.14 s
> vtbm    118 ms   27 MB    0.39 s   0.12 s
> radix    33 ms   10 MB    0.28 s   0.10 s
> 
> (this is an almost unfairly good case for radix)
> 
> Running out of time to format the results of the other testcases before
> I have to run, unfortunately. radix uses 42MB both in test case 3 and
> 4.

My results (Ubuntu 20.04 Intel Core i7-1165G7):

Test1.

select prepare(1000000, 10, 20, 1); -- original

        attach  size   shuffled
array   29ms    60MB   93.99s
intset  93ms    49MB   80.94s
rtbm   171ms    67MB   14.05s
tbm    238ms   100MB    8.36s
vtbm   148ms    59MB    9.12s
radix  100ms    42MB   11.81s
svtm    75ms    29MB    8.90s

select prepare(1000000, 20, 10, 1); -- Andres's variant

        attach  size   shuffled
array   61ms   120MB  111.91s
intset 163ms    66MB   85.00s
rtbm   236ms    67MB   10.72s
tbm    290ms   100MB    8.40s
vtbm   190ms    59MB    9.28s
radix  117ms    42MB   12.00s
svtm    98ms    29MB    8.77s

Test2.

select prepare(1000000, 10, 1, 1);

        attach  size   shuffled
array   31ms    60MB    4.68s
intset  97ms    49MB    4.03s
rtbm   163ms    36MB    0.42s
tbm    240ms   100MB    0.42s
vtbm   136ms    27MB    0.36s
radix   60ms    10MB    0.72s
svtm    39ms     6MB    0.19s

(The worse radix result here is probably due to the smaller cache in the
notebook's CPU?)

Test3.

select prepare(1000000, 2, 100, 1);

        attach  size   shuffled
array    6ms    12MB   53.42s
intset  23ms    16MB   54.99s
rtbm   115ms    38MB    8.19s
tbm    186ms   100MB    8.37s
vtbm   105ms    59MB    9.08s
radix   64ms    42MB   10.41s
svtm    73ms    10MB    7.49s

Test4.

select prepare(1000000, 100, 1, 1);

        attach  size   shuffled
array  304ms   600MB   75.12s
intset 775ms    98MB   47.49s
rtbm   356ms    38MB    4.11s
tbm    539ms   100MB    4.20s
vtbm   493ms    42MB    4.44s
radix  263ms    42MB    6.05s
svtm   360ms     8MB    3.49s

Therefore the Specialized Vacuum Tid Map (svtm) always consumes the least
memory and is usually faster.


(I applied Andres's slab allocator patch before testing.)

The attached patch is against commit 6753911a444e12e4b55 of your pgtools,
with Andres's patches for the radix method applied.

I've also pushed it to github:
https://github.com/funny-falcon/pgtools/tree/svtm/bdbench

regards,
Yura Sokolov
