Re: [PoC] Improve dead tuple storage for lazy vacuum - Mailing list pgsql-hackers

From Masahiko Sawada
Subject Re: [PoC] Improve dead tuple storage for lazy vacuum
Date
Msg-id CAD21AoDap240WDDdUDE0JMpCmuMMnGajrKrkCRxM7zn9Xk3JRA@mail.gmail.com
Whole thread Raw
In response to Re: [PoC] Improve dead tuple storage for lazy vacuum  (Masahiko Sawada <sawada.mshk@gmail.com>)
List pgsql-hackers
On Fri, Oct 14, 2022 at 4:12 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
> Hi,
>
> On Mon, Oct 10, 2022 at 2:16 PM John Naylor
> <john.naylor@enterprisedb.com> wrote:
> >
> > The following is not quite a full review, but has plenty to think about. There is too much to cover at once, and I
haveto start somewhere... 
> >
> > My main concerns are that internal APIs:
> >
> > 1. are difficult to follow
> > 2. lead to poor branch prediction and too many function calls
> >
> > Some of the measurements are picking on the SIMD search code, but I go into details in order to demonstrate how a
regressionthere can go completely unnoticed. Hopefully the broader themes are informative. 
> >
> > On Fri, Oct 7, 2022 at 3:09 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> > > [fixed benchmarks]
> >
> > Thanks for that! Now I can show clear results on some aspects in a simple way. The attached patches (apply on top
ofv6) are not intended to be incorporated as-is quite yet, but do point the way to some reorganization that I think is
necessary.I've done some testing on loading, but will leave it out for now in the interest of length. 
> >
> >
> > 0001-0003 are your performance test fix and and some small conveniences for testing. Binary search is turned off,
forexample, because we know it already. And the sleep call is so I can run perf in a different shell session, on only
thesearch portion. 
> >
> > Note the v6 test loads all block numbers in the range. Since the test item ids are all below 64 (reasonable), there
arealways 32 leaf chunks, so all the leaves are node32 and completely full. This had the effect of never taking the
byte-wiseloop in the proposed pg_lsearch function. These two aspects make this an easy case for the branch predictor: 
> >
> > john=# select * from bench_seq_search(0, 1*1000*1000);
> > NOTICE:  num_keys = 1000000, height = 2, n4 = 0, n16 = 0, n32 = 31251, n128 = 1, n256 = 122
> > NOTICE:  sleeping for 2 seconds...
> >   nkeys  | rt_mem_allocated | array_mem_allocated | rt_load_ms | array_load_ms | rt_search_ms | array_serach_ms
> > ---------+------------------+---------------------+------------+---------------+--------------+-----------------
> >  1000000 |         10199040 |           180000000 |        167 |             0 |          822 |               0
> >
> >      1,470,141,841      branches:u
> >             63,693      branch-misses:u           #    0.00% of all branches
> >
> > john=# select * from bench_shuffle_search(0, 1*1000*1000);
> > NOTICE:  num_keys = 1000000, height = 2, n4 = 0, n16 = 0, n32 = 31251, n128 = 1, n256 = 122
> > NOTICE:  sleeping for 2 seconds...
> >   nkeys  | rt_mem_allocated | array_mem_allocated | rt_load_ms | array_load_ms | rt_search_ms | array_serach_ms
> > ---------+------------------+---------------------+------------+---------------+--------------+-----------------
> >  1000000 |         10199040 |           180000000 |        168 |             0 |         2174 |               0
> >
> >      1,470,142,569      branches:u
> >         15,023,983      branch-misses:u           #    1.02% of all branches
> >
> >
> > 0004 randomizes block selection in the load part of the search test so that each block has a 50% chance of being
loaded. Note that now we have many node16s where we had none before. Although node 16 and node32 appear to share the
samepath in the switch statement of rt_node_search(), the chunk comparison and node_get_values() calls each must go
throughdifferent branches. The shuffle case is most affected, but even the sequential case slows down. (The leaves are
lessfull -> there are more of them, so memory use is larger, but it shouldn't matter much, in the sequential case at
least)
> >
> > john=# select * from bench_seq_search(0, 2*1000*1000);
> > NOTICE:  num_keys = 999654, height = 2, n4 = 1, n16 = 35610, n32 = 26889, n128 = 1, n256 = 245
> > NOTICE:  sleeping for 2 seconds...
> >  nkeys  | rt_mem_allocated | array_mem_allocated | rt_load_ms | array_load_ms | rt_search_ms | array_serach_ms
> > --------+------------------+---------------------+------------+---------------+--------------+-----------------
> >  999654 |         14893056 |           179937720 |        173 |             0 |          907 |               0
> >
> >      1,684,114,926      branches:u
> >          1,989,901      branch-misses:u           #    0.12% of all branches
> >
> > john=# select * from bench_shuffle_search(0, 2*1000*1000);
> > NOTICE:  num_keys = 999654, height = 2, n4 = 1, n16 = 35610, n32 = 26889, n128 = 1, n256 = 245
> > NOTICE:  sleeping for 2 seconds...
> >  nkeys  | rt_mem_allocated | array_mem_allocated | rt_load_ms | array_load_ms | rt_search_ms | array_serach_ms
> > --------+------------------+---------------------+------------+---------------+--------------+-----------------
> >  999654 |         14893056 |           179937720 |        173 |             0 |         2890 |               0
> >
> >      1,684,115,844      branches:u
> >         34,215,740      branch-misses:u           #    2.03% of all branches
> >
> >
> > 0005 replaces pg_lsearch with a branch-free SIMD search. Note that it retains full portability and gains
predictableperformance. For demonstration, it's used on all three linear-search types. Although I'm sure it'd be way
tooslow for node4, this benchmark hardly has any so it's ok. 
> >
> > john=# select * from bench_seq_search(0, 2*1000*1000);
> > NOTICE:  num_keys = 999654, height = 2, n4 = 1, n16 = 35610, n32 = 26889, n128 = 1, n256 = 245
> > NOTICE:  sleeping for 2 seconds...
> >  nkeys  | rt_mem_allocated | array_mem_allocated | rt_load_ms | array_load_ms | rt_search_ms | array_serach_ms
> > --------+------------------+---------------------+------------+---------------+--------------+-----------------
> >  999654 |         14893056 |           179937720 |        176 |             0 |          867 |               0
> >
> >      1,469,540,357      branches:u
> >             96,678      branch-misses:u           #    0.01% of all branches
> >
> > john=# select * from bench_shuffle_search(0, 2*1000*1000);
> > NOTICE:  num_keys = 999654, height = 2, n4 = 1, n16 = 35610, n32 = 26889, n128 = 1, n256 = 245
> > NOTICE:  sleeping for 2 seconds...
> >  nkeys  | rt_mem_allocated | array_mem_allocated | rt_load_ms | array_load_ms | rt_search_ms | array_serach_ms
> > --------+------------------+---------------------+------------+---------------+--------------+-----------------
> >  999654 |         14893056 |           179937720 |        171 |             0 |         2530 |               0
> >
> >      1,469,540,533      branches:u
> >         15,019,975      branch-misses:u           #    1.02% of all branches
> >
> >
> > 0006 removes node16, and 0007 avoids a function call to introspect node type. 0006 is really to make 0007 simpler
tocode. The crucial point here is that calling out to rt_node_get_values/children() to figure out what type we are is
costly.With these patches, searching an unevenly populated load is the same or faster than the original sequential
load,despite taking twice as much memory. (And, as I've noted before, decoupling size class from node kind would win
thememory back.) 
> >
> > john=# select * from bench_seq_search(0, 2*1000*1000);
> > NOTICE:  num_keys = 999654, height = 2, n4 = 1, n32 = 62499, n128 = 1, n256 = 245
> > NOTICE:  sleeping for 2 seconds...
> >  nkeys  | rt_mem_allocated | array_mem_allocated | rt_load_ms | array_load_ms | rt_search_ms | array_serach_ms
> > --------+------------------+---------------------+------------+---------------+--------------+-----------------
> >  999654 |         20381696 |           179937720 |        171 |             0 |          717 |               0
> >
> >      1,349,614,294      branches:u
> >              1,313      branch-misses:u           #    0.00% of all branches
> >
> > john=# select * from bench_shuffle_search(0, 2*1000*1000);
> > NOTICE:  num_keys = 999654, height = 2, n4 = 1, n32 = 62499, n128 = 1, n256 = 245
> > NOTICE:  sleeping for 2 seconds...
> >  nkeys  | rt_mem_allocated | array_mem_allocated | rt_load_ms | array_load_ms | rt_search_ms | array_serach_ms
> > --------+------------------+---------------------+------------+---------------+--------------+-----------------
> >  999654 |         20381696 |           179937720 |        172 |             0 |         2202 |               0
> >
> >      1,349,614,741      branches:u
> >             30,592      branch-misses:u           #    0.00% of all branches
> >
> > Expanding this point, once a path branches based on node kind, there should be no reason to ever forget the kind.
Therabstractions in v6 have disadvantages. I understand the reasoning -- to reduce duplication of code. However, done
thisway, less code in the text editor leads to *more* code (i.e. costly function calls and branches) on the machine
level.
>
> Right. When updating the patch from v4 to v5, I've eliminated the
> duplication of code between each node type as much as possible, which
> in turn produced more code on the machine level. The resulst of your
> experiment clearly showed the bad side of this work. FWIW I've also
> confirmed your changes in my environment (I've added the third
> argument to turn on and off the randomizes block selection proposed in
> 0004 patch):
>
> * w/o patches
> postgres(1:361692)=# select * from bench_seq_search(0, 1 * 1000 * 1000, false);
> 2022-10-14 11:33:15.460 JST [361692] LOG:  num_keys = 1000000, height
> = 2, n4 = 0, n16 = 0, n32 = 31251, n128 = 1, n256 = 122
> NOTICE:  sleeping for 2 seconds...
>   nkeys  | rt_mem_allocated | array_mem_allocated | rt_load_ms |
> array_load_ms | rt_search_ms | array_serach_ms
> ---------+------------------+---------------------+------------+---------------+--------------+-----------------
>  1000000 |         10199040 |           180000000 |         87 |
>         |          462 |
> (1 row)
>
> 1590104944      branches:u                #    3.430 G/sec
>           65957      branch-misses:u           #    0.00% of all branches
>
> postgres(1:361692)=# select * from bench_seq_search(0, 2 * 1000 * 1000, true);
> 2022-10-14 11:33:28.934 JST [361692] LOG:  num_keys = 999654, height =
> 2, n4 = 1, n16 = 35610, n32 = 26889, n128 = 1, n256 = 245
> NOTICE:  sleeping for 2 seconds...
>  nkeys  | rt_mem_allocated | array_mem_allocated | rt_load_ms |
> array_load_ms | rt_search_ms | array_serach_ms
> --------+------------------+---------------------+------------+---------------+--------------+-----------------
>  999654 |         14893056 |           179937720 |         91 |
>        |          497 |
> (1 row)
>
> 1748249456      branches:u                #    3.506 G/sec
>         481074      branch-misses:u           #    0.03% of all branches
>
> postgres(1:361692)=# select * from bench_shuffle_search(0, 1 * 1000 *
> 1000, false);
> 2022-10-14 11:33:38.378 JST [361692] LOG:  num_keys = 1000000, height
> = 2, n4 = 0, n16 = 0, n32 = 31251, n128 = 1, n256 = 122
> NOTICE:  sleeping for 2 seconds...
>   nkeys  | rt_mem_allocated | array_mem_allocated | rt_load_ms |
> array_load_ms | rt_search_ms | array_serach_ms
> ---------+------------------+---------------------+------------+---------------+--------------+-----------------
>  1000000 |         10199040 |           180000000 |         86 |
>         |         1290 |
> (1 row)
>
> 1590105370      branches:u                #    1.231 G/sec
>    15039443      branch-misses:u           #    0.95% of all branches
>
> Time: 4166.346 ms (00:04.166)
> postgres(1:361692)=# select * from bench_shuffle_search(0, 2 * 1000 *
> 1000, true);
> 2022-10-14 11:33:51.556 JST [361692] LOG:  num_keys = 999654, height =
> 2, n4 = 1, n16 = 35610, n32 = 26889, n128 = 1, n256 = 245
> NOTICE:  sleeping for 2 seconds...
>  nkeys  | rt_mem_allocated | array_mem_allocated | rt_load_ms |
> array_load_ms | rt_search_ms | array_serach_ms
> --------+------------------+---------------------+------------+---------------+--------------+-----------------
>  999654 |         14893056 |           179937720 |         90 |
>        |         1536 |
> (1 row)
>
> 1748250497      branches:u                #    1.137 G/sec
>     28125016      branch-misses:u           #    1.61% of all branches
>
> * w/ all patches
> postgres(1:360358)=# select * from bench_seq_search(0, 1 * 1000 * 1000, false);
> 2022-10-14 11:29:27.232 JST [360358] LOG:  num_keys = 1000000, height
> = 2, n4 = 0, n32 = 31251, n128 = 1, n256 = 122
> NOTICE:  sleeping for 2 seconds...
>   nkeys  | rt_mem_allocated | array_mem_allocated | rt_load_ms |
> array_load_ms | rt_search_ms | array_serach_ms
> ---------+------------------+---------------------+------------+---------------+--------------+-----------------
>  1000000 |         10199040 |           180000000 |         81 |
>         |          432 |
> (1 row)
>
> 1380062209      branches:u                #    3.185 G/sec
>             1066      branch-misses:u           #    0.00% of all branches
>
> postgres(1:360358)=# select * from bench_seq_search(0, 2 * 1000 * 1000, true);
> 2022-10-14 11:29:46.380 JST [360358] LOG:  num_keys = 999654, height =
> 2, n4 = 1, n32 = 62499, n128 = 1, n256 = 245
> NOTICE:  sleeping for 2 seconds...
>  nkeys  | rt_mem_allocated | array_mem_allocated | rt_load_ms |
> array_load_ms | rt_search_ms | array_serach_ms
> --------+------------------+---------------------+------------+---------------+--------------+-----------------
>  999654 |         20381696 |           179937720 |         88 |
>        |          438 |
> (1 row)
>
> 1379640815      branches:u                #    3.133 G/sec
>            1332      branch-misses:u           #    0.00% of all branches
>
> postgres(1:360358)=# select * from bench_shuffle_search(0, 1 * 1000 *
> 1000, false);
> 2022-10-14 11:30:00.943 JST [360358] LOG:  num_keys = 1000000, height
> = 2, n4 = 0, n32 = 31251, n128 = 1, n256 = 122
> NOTICE:  sleeping for 2 seconds...
>   nkeys  | rt_mem_allocated | array_mem_allocated | rt_load_ms |
> array_load_ms | rt_search_ms | array_serach_ms
> ---------+------------------+---------------------+------------+---------------+--------------+-----------------
>  1000000 |         10199040 |           180000000 |         81 |
>         |          994 |
> (1 row)
>
> 1380062386      branches:u                #    1.386 G/sec
>           18368      branch-misses:u           #    0.00% of all branches
>
> postgres(1:360358)=# select * from bench_shuffle_search(0, 2 * 1000 *
> 1000, true);
> 2022-10-14 11:30:15.944 JST [360358] LOG:  num_keys = 999654, height =
> 2, n4 = 1, n32 = 62499, n128 = 1, n256 = 245
> NOTICE:  sleeping for 2 seconds...
>  nkeys  | rt_mem_allocated | array_mem_allocated | rt_load_ms |
> array_load_ms | rt_search_ms | array_serach_ms
> --------+------------------+---------------------+------------+---------------+--------------+-----------------
>  999654 |         20381696 |           179937720 |         88 |
>        |         1098 |
> (1 row)
>
> 1379641503      branches:u                #    1.254 G/sec
>           18973      branch-misses:u           #    0.00% of all branches
>
> > I haven't looked at insert/load performance carefully, but it's clear it suffers from the same amnesia.
prepare_node_for_insert()branches based on the kind. If it must call rt_node_grow(), that function has no idea where it
camefrom and must branch again. When prepare_node_for_insert() returns we again have no idea what the kind is, so must
branchagain. And if we are one of the three linear-search nodes, we later do another function call, where we encounter
a5-way jump table because the caller could be anything at all. 
> >
> > Some of this could be worked around with always-inline functions to which we pass a const node kind, and let the
compilerget rid of the branches etc. But many cases are probably not even worth doing that. For example, I don't think
prepare_node_for_insert()is a useful abstraction to begin with. It returns an index, but only for linear nodes. Lookup
nodesget a return value of zero. There is not enough commonality here. 
>
> Agreed.
>
> >
> > Along the same lines, there are a number of places that have branches as a consequence of treating inner nodes and
leaveswith the same api: 
> >
> > rt_node_iterate_next
> > chunk_array_node_get_slot
> > node_128/256_get_slot
> > rt_node_search
> >
> > I'm leaning towards splitting these out into specialized functions for each inner and leaf. This is a bit painful
forthe last one, but perhaps if we are resigned to templating the shared-mem case, maybe we can template some of the
inner/leafstuff. Something to think about for later, but for now I believe we have to accept some code duplication as a
prerequisitefor decent performance as well as readability. 
>
> Agreed.
>
> >
> > For the next steps, we need to proceed cautiously because there is a lot in the air at the moment. Here are some
aspectsI would find desirable. If there are impracticalities I haven't thought of, we can discuss further. I don't
pretendto know the practical consequences of every change I mention. 
> >
> > - If you have started coding the shared memory case, I'd advise to continue so we can see what that looks like. If
thathas not gotten beyond the design stage, I'd like to first see an attempt at tearing down some of the clumsier
abstractionsin the current patch. 
> > - As a "smoke test", there should ideally be nothing as general as rt_node_get_children/values(). We should ideally
alwaysknow what kind we are if we found out earlier. 
> > - For distinguishing between linear nodes, perhaps some always-inline functions can help hide details. But at the
sametime, trying to treat them the same is not always worthwhile. 
> > - Start to separate treatment of inner/leaves and see how it goes.
>
> Since I've not started coding the shared memory case seriously, I'm
> going to start with eliminating abstractions and splitting the
> treatment of inner and leaf nodes.

I've attached updated PoC patches for discussion and cfbot. From the
previous version, I mainly changed the following things:

* Separate treatment of inner and leaf nodes
* Pack both the node kind and node count to an uint16 value.

I've also made a change in functions in bench_radix_tree test module:
the third argument of bench_seq/shuffle_search() is a flag to turn on
and off the randomizes block selection. The results of performance
tests in my environment are:

postgres(1:1665989)=# select * from bench_seq_search(0, 1* 1000 * 1000, false);
2022-10-24 14:29:40.705 JST [1665989] LOG:  num_keys = 1000000, height
= 2, n4 = 0, n32 = 31251, n128 = 1, n256 = 122
  nkeys  | rt_mem_allocated | array_mem_allocated | rt_load_ms |
array_load_ms | rt_search_ms | array_serach_ms
---------+------------------+---------------------+------------+---------------+--------------+-----------------
 1000000 |          9871104 |           180000000 |         65 |
        |          248 |
(1 row)

postgres(1:1665989)=# select * from bench_seq_search(0, 2* 1000 * 1000, true);
2022-10-24 14:29:47.999 JST [1665989] LOG:  num_keys = 999654, height
= 2, n4 = 1, n32 = 62499, n128 = 1, n256 = 245
 nkeys  | rt_mem_allocated | array_mem_allocated | rt_load_ms |
array_load_ms | rt_search_ms | array_serach_ms
--------+------------------+---------------------+------------+---------------+--------------+-----------------
 999654 |         19680736 |           179937720 |         71 |
       |          237 |
(1 row)

postgres(1:1665989)=# select * from bench_shuffle_search(0, 1 * 1000 *
1000, false);
2022-10-24 14:29:55.955 JST [1665989] LOG:  num_keys = 1000000, height
= 2, n4 = 0, n32 = 31251, n128 = 1, n256 = 122
  nkeys  | rt_mem_allocated | array_mem_allocated | rt_load_ms |
array_load_ms | rt_search_ms | array_serach_ms
---------+------------------+---------------------+------------+---------------+--------------+-----------------
 1000000 |          9871104 |           180000000 |         65 |
        |          641 |
(1 row)

postgres(1:1665989)=# select * from bench_shuffle_search(0, 2 * 1000 *
1000, true);
2022-10-24 14:30:04.140 JST [1665989] LOG:  num_keys = 999654, height
= 2, n4 = 1, n32 = 62499, n128 = 1, n256 = 245
 nkeys  | rt_mem_allocated | array_mem_allocated | rt_load_ms |
array_load_ms | rt_search_ms | array_serach_ms
--------+------------------+---------------------+------------+---------------+--------------+-----------------
 999654 |         19680736 |           179937720 |         71 |
       |          654 |
(1 row)

I've not done SIMD part seriously yet. But overall the performance
seems good so far. If we agree with the current approach, I think we
can proceed with the verification of decoupling node sizes from node
kind. And I'll investigate DSA support.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com

Attachment

pgsql-hackers by date:

Previous
From: Julien Rouhaud
Date:
Subject: Re: Allow file inclusion in pg_hba and pg_ident files
Next
From: Peter Smith
Date:
Subject: Re: Perform streaming logical transactions by background workers and parallel apply