Re: [PoC] Improve dead tuple storage for lazy vacuum - Mailing list pgsql-hackers

From Masahiko Sawada
Subject Re: [PoC] Improve dead tuple storage for lazy vacuum
Date
Msg-id CAD21AoDy2rmw7dGaGoFqkzxsd+kFg2ttz-bV1D8i=epCXrfWWA@mail.gmail.com
In response to Re: [PoC] Improve dead tuple storage for lazy vacuum  (Yura Sokolov <y.sokolov@postgrespro.ru>)
Responses Re: [PoC] Improve dead tuple storage for lazy vacuum  (Masahiko Sawada <sawada.mshk@gmail.com>)
List pgsql-hackers
On Mon, Jul 26, 2021 at 1:07 AM Yura Sokolov <y.sokolov@postgrespro.ru> wrote:
>
> Hi,
>
> I've dreamed of writing a more compact structure for vacuum for three
> years, but life didn't give me the time to.
>
> Let me join the friendly competition.
>
> I've bet on the HATM approach: popcount-ing bitmaps for non-empty elements.

Thank you for proposing the new idea!

>
> Novelties:
> - 32 consecutive pages are stored together in a single sparse array
>    (called a "chunk").
>    A chunk contains:
>    - its number,
>    - a 4-byte bitmap of non-empty pages,
>    - an array of non-empty page headers, 2 bytes each.
>      A page header contains the offset of the page's bitmap in the
>      bitmaps container. (Except if there is just one dead tuple in the
>      page; then it is written into the header itself.)
>    - a container of concatenated bitmaps.
>
>    I.e., per-page metadata overhead varies from 2.4 bytes (32 pages in
>    a single chunk) to 18 bytes (1 page in a single chunk).
>
> - If a page's bitmap is sparse, i.e. contains a lot of "all-zero"
>    bytes, it is compressed by removing the zero bytes and indexing it
>    with a two-level bitmap index: zero bytes in the first-level bitmap
>    are in turn removed using the second level. This is mostly done for
>    32kb pages, but let it stay since it is almost free.
>
> - If a page's bitmap contains a lot of "all-one" bytes, it is inverted
>    and then encoded as sparse.
>
> - Chunks are allocated with a custom "allocator" that has no
>    per-allocation overhead. This is possible because there is no need
>    to "free" individual chunks: the allocator is freed as a whole at once.
>
> - The array of pointers to chunks is also bitmap-indexed. This saves
>    CPU time when not every run of 32 consecutive pages has at least one
>    dead tuple, but costs time otherwise. Therefore an additional
>    optimization is added to quickly skip the lookup for the first
>    non-empty run of chunks.
>    (Ahhh, I believe this explanation is awful).

It sounds better than my proposal.
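
To make sure I understand the chunk layout, here is a rough C sketch of
how I read your description (the field names and the popcount lookup
are my guesses, not your actual code):

#include <stdint.h>

#define PAGES_PER_CHUNK 32

typedef struct Chunk
{
    uint32_t    chunkno;        /* block number / PAGES_PER_CHUNK */
    uint32_t    pagebitmap;     /* bit i set => page i has dead tuples */

    /*
     * One 2-byte header per set bit in pagebitmap.  A header holds the
     * offset of the page's bitmap in the container that follows this
     * struct in memory, or the lone tuple offset if the page has
     * exactly one dead tuple.
     */
    uint16_t    pagehdr[];      /* concatenated bitmaps follow */
} Chunk;

/*
 * Find the header slot for page 'i' (0..31) within a chunk, or -1 if
 * the page has no dead tuples: count the set bits below bit i.
 * (__builtin_popcount is the GCC/Clang builtin; PostgreSQL code would
 * use pg_popcount32.)
 */
static inline int
page_slot(const Chunk *c, int i)
{
    if ((c->pagebitmap & (UINT32_C(1) << i)) == 0)
        return -1;
    return __builtin_popcount(c->pagebitmap & ((UINT32_C(1) << i) - 1));
}

Is that roughly right?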
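
Similarly, my reading of the sparse-bitmap encoding is that only the
non-zero bytes of a page's bitmap are kept, plus a first-level bitmap
recording which bytes survived; something like the sketch below (the
second level, which compresses the first-level bitmap the same way, is
omitted, and again the names are mine):

#include <stdint.h>
#include <string.h>

/*
 * Compress 'bitmap' (nbytes long) by dropping all-zero bytes.  On
 * return, 'level1' has bit i set iff byte i was kept, and 'out' holds
 * the kept bytes; returns the number of bytes written to 'out'.
 */
static size_t
compress_sparse(const uint8_t *bitmap, size_t nbytes,
                uint8_t *level1, uint8_t *out)
{
    size_t      nout = 0;

    memset(level1, 0, (nbytes + 7) / 8);
    for (size_t i = 0; i < nbytes; i++)
    {
        if (bitmap[i] != 0)
        {
            level1[i / 8] |= (uint8_t) (1 << (i % 8));
            out[nout++] = bitmap[i];
        }
    }
    return nout;
}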
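
And the custom allocator sounds like a plain bump allocator: allocation
just advances a pointer, and the whole arena is released in one step,
so there is no per-allocation header at all. A minimal sketch, assuming
a single fixed-size arena (real code would grow or chain blocks):

#include <stdbool.h>
#include <stdlib.h>

typedef struct Arena
{
    char       *base;
    size_t      used;
    size_t      size;
} Arena;

static bool
arena_init(Arena *a, size_t size)
{
    a->base = malloc(size);
    a->used = 0;
    a->size = a->base ? size : 0;
    return a->base != NULL;
}

static void *
arena_alloc(Arena *a, size_t n)
{
    void       *p;

    n = (n + 7) & ~(size_t) 7;      /* keep 8-byte alignment */
    if (a->used + n > a->size)
        return NULL;                /* out of space in this sketch */
    p = a->base + a->used;
    a->used += n;
    return p;
}

static void
arena_free_all(Arena *a)
{
    free(a->base);
    a->base = NULL;
    a->used = a->size = 0;
}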

>
> Andres Freund wrote 2021-07-20 02:49:
> > Hi,
> >
> > On 2021-07-19 15:20:54 +0900, Masahiko Sawada wrote:
> >> BTW is the implementation of the radix tree approach available
> >> somewhere? If so I'd like to experiment with that too.
> >>
> >> >
> >> > I have toyed with implementing adaptively large radix nodes like
> >> > proposed in https://db.in.tum.de/~leis/papers/ART.pdf - but haven't
> >> > gotten it quite working.
> >>
> >> That seems like a promising approach.
> >
> > I've since implemented some, but not all of the ideas of that paper
> > (adaptive node sizes, but not the tree compression pieces).
> >
> > E.g. for
> >
> > select prepare(
> > 1000000, -- max block
> > 20, -- # of dead tuples per page
> > 10, -- dead tuples interval within a page
> > 1 -- page interval
> > );
> >         attach   size     shuffled   ordered
> > array    69 ms   120 MB   84.87 s     8.66 s
> > intset  173 ms    65 MB   68.82 s    11.75 s
> > rtbm    201 ms    67 MB   11.54 s     1.35 s
> > tbm     232 ms   100 MB    8.33 s     1.26 s
> > vtbm    162 ms    58 MB   10.01 s     1.22 s
> > radix    88 ms    42 MB   11.49 s     1.67 s
> >
> > and for
> > select prepare(
> > 1000000, -- max block
> > 10, -- # of dead tuples per page
> > 1, -- dead tuples interval within a page
> > 1 -- page interval
> > );
> >
> >         attach   size     shuffled   ordered
> > array    24 ms    60 MB    3.74 s     1.02 s
> > intset   97 ms    49 MB    3.14 s     0.75 s
> > rtbm    138 ms    36 MB    0.41 s     0.14 s
> > tbm     198 ms   101 MB    0.41 s     0.14 s
> > vtbm    118 ms    27 MB    0.39 s     0.12 s
> > radix    33 ms    10 MB    0.28 s     0.10 s
> >
> > (this is an almost unfairly good case for radix)
> >
> > Running out of time to format the results of the other test cases
> > before I have to run, unfortunately. radix uses 42MB in both test
> > cases 3 and 4.
>
> My results (Ubuntu 20.04, Intel Core i7-1165G7):
>
> Test1.
>
> select prepare(1000000, 10, 20, 1); -- original
>
>         attach  size   shuffled
> array   29ms    60MB   93.99s
> intset  93ms    49MB   80.94s
> rtbm   171ms    67MB   14.05s
> tbm    238ms   100MB    8.36s
> vtbm   148ms    59MB    9.12s
> radix  100ms    42MB   11.81s
> svtm    75ms    29MB    8.90s
>
> select prepare(1000000, 20, 10, 1); -- Andres's variant
>
>         attach  size   shuffled
> array   61ms   120MB  111.91s
> intset 163ms    66MB   85.00s
> rtbm   236ms    67MB   10.72s
> tbm    290ms   100MB    8.40s
> vtbm   190ms    59MB    9.28s
> radix  117ms    42MB   12.00s
> svtm    98ms    29MB    8.77s
>
> Test2.
>
> select prepare(1000000, 10, 1, 1);
>
>         attach  size   shuffled
> array   31ms    60MB    4.68s
> intset  97ms    49MB    4.03s
> rtbm   163ms    36MB    0.42s
> tbm    240ms   100MB    0.42s
> vtbm   136ms    27MB    0.36s
> radix   60ms    10MB    0.72s
> svtm    39ms     6MB    0.19s
>
> (The bad radix result is probably due to the smaller cache in the
> notebook's CPU?)
>
> Test3
>
> select prepare(1000000, 2, 100, 1);
>
>         attach  size   shuffled
> array    6ms    12MB   53.42s
> intset  23ms    16MB   54.99s
> rtbm   115ms    38MB    8.19s
> tbm    186ms   100MB    8.37s
> vtbm   105ms    59MB    9.08s
> radix   64ms    42MB   10.41s
> svtm    73ms    10MB    7.49s
>
> Test4
>
> select prepare(1000000, 100, 1, 1);
>
>         attach  size   shuffled
> array  304ms   600MB   75.12s
> intset 775ms    98MB   47.49s
> rtbm   356ms    38MB    4.11s
> tbm    539ms   100MB    4.20s
> vtbm   493ms    42MB    4.44s
> radix  263ms    42MB    6.05s
> svtm   360ms     8MB    3.49s
>
> Therefore the Specialized Vacuum Tid Map (svtm) always consumes the
> least amount of memory and is usually faster.

I'll experiment with the proposed ideas, including this one, in more
scenarios and share the results tomorrow.
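
For anyone who hasn't read the ART paper Andres referenced, the
"adaptive node sizes" idea is that an inner node is grown through
fixed size classes as it fills up, so sparse nodes stay small. A
minimal sketch of the four node types from the paper (my naming, not
Andres's implementation):

#include <stdint.h>

typedef enum NodeType { NODE4, NODE16, NODE48, NODE256 } NodeType;

typedef struct Node
{
    NodeType    type;
    uint16_t    nchildren;
} Node;

typedef struct Node4            /* <= 4 children, linear key search */
{
    Node        header;
    uint8_t     keys[4];
    Node       *children[4];
} Node4;

typedef struct Node16           /* <= 16 children, SIMD-friendly search */
{
    Node        header;
    uint8_t     keys[16];
    Node       *children[16];
} Node16;

typedef struct Node48           /* 256-entry index into 48 child slots */
{
    Node        header;
    uint8_t     childindex[256];    /* 0xFF means empty */
    Node       *children[48];
} Node48;

typedef struct Node256          /* direct array indexed by the key byte */
{
    Node        header;
    Node       *children[256];
} Node256;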

Regards,

-- 
Masahiko Sawada
EDB:  https://www.enterprisedb.com/


