Re: [PoC] Improve dead tuple storage for lazy vacuum - Mailing list pgsql-hackers

From Masahiko Sawada
Subject Re: [PoC] Improve dead tuple storage for lazy vacuum
Date
Msg-id CAD21AoCiiKz_sxNFAvKY_--7-O6ioOwr6Rf0aL71cQvccipiPQ@mail.gmail.com
Whole thread Raw
In response to Re: [PoC] Improve dead tuple storage for lazy vacuum  (John Naylor <john.naylor@enterprisedb.com>)
Responses Re: [PoC] Improve dead tuple storage for lazy vacuum
Re: [PoC] Improve dead tuple storage for lazy vacuum
List pgsql-hackers
On Mon, Nov 21, 2022 at 6:30 PM John Naylor
<john.naylor@enterprisedb.com> wrote:
>
> On Mon, Nov 21, 2022 at 3:43 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> >
> > On Mon, Nov 21, 2022 at 4:20 PM John Naylor
> > <john.naylor@enterprisedb.com> wrote:
>
> > > Assuming the smallest node is fixed size (i.e. fanout/capacity member not part of the common set, so only part of
variable-sizednodes), 3 has a nice property: no wasted padding space: 
> > >
> > > node4: 5 + 4+(7) + 4*8 = 48 bytes
> > > node3: 5 + 3     + 3*8 = 32
> >
> > IIUC if we store the fanout member only in variable-sized nodes,
> > rt_node has only count, shift, and chunk, so 4 bytes in total. If so,
> > the size of node3 (ie. fixed-sized node) is (4 + 3 + (1) + 3*8)? The
> > size doesn't change but there is 1 byte padding space.
>
> I forgot to mention I'm assuming no pointer-tagging for this exercise. You've demonstrated it can be done in a small
amountof code, and I hope we can demonstrate a speedup in search. Just in case there is some issue with portability,
valgrind,or some other obstacle, I'm being pessimistic in my calculations. 
>
> > Also, even if we have the node3 a variable-sized node, size class 1
> > for node3 could be a good choice since it also doesn't need padding
> > space and could be a good alternative to path compression.
> >
> > node3         :  5 + 3 + 3*8 = 32 bytes
> > size class 1 : 5 + 3 + 1*8 = 16 bytes
>
> Precisely! I have that scenario in my notes as well -- it's quite compelling.

So it seems that there are two candidates of rt_node structure: (1)
all nodes except for node256 are variable-size nodes and use pointer
tagging, and (2) node32 and node128 are variable-sized nodes and do
not use pointer tagging (fanout is in part of only these two nodes).
rt_node can be 5 bytes in both cases. But before going to this step, I
started to verify the idea of variable-size nodes by using 6-bytes
rt_node. We can adjust the node kinds and node classes later.

In this verification, I have all nodes except for node256
variable-sized nodes, and the sizes are:

radix tree node 1     :  6 + 4 + (6) + 1*8 = 24 bytes
radix tree node 4     :  6 + 4 + (6) + 4*8 = 48
radix tree node 15   :  6 + 32 + (2) + 15*8 = 160
radix tree node 32   :  6 + 32 + (2) + 32*8 = 296
radix tree node 61   :  inner 6 + 256 + (2) + 61*8 = 752,  leaf 6 +
256 + (2) + 16 + 61*8 = 768
radix tree node 128 :  inner 6 + 256 + (2) + 128*8 = 1288, leaf 6 +
256 + (2) + 16 + 128*8 = 1304
radix tree node 256 :  inner 6 + (2) + 256*8 = 2056, leaf 6 + (2) + 32
+ 256*8 = 2088

I did some performance tests against two radix trees: a radix tree
supporting only fixed-size nodes (i.e. applying up to 0003 patch), and
a radix tree supporting variable-size nodes (i.e. applying all
attached patches). Also, I changed bench_search_random_nodes()
function so that we can specify the filter via a function argument.
Here are results:

Here are results:

* Query
select * from bench_seq_search(0, 1*1000*1000, false)

* Fixed-size
NOTICE:  num_keys = 1000000, height = 2, n4 = 0, n32 = 31251, n128 =
1, n256 = 122
  nkeys  | rt_mem_allocated | array_mem_allocated | rt_load_ms |
array_load_ms | rt_search_ms | array_serach_ms
---------+------------------+---------------------+------------+---------------+--------------+-----------------
 1000000 |          9871216 |                     |         67 |
        |          212 |
(1 row)

* Variable-size
NOTICE:  num_keys = 1000000, height = 2, n1 = 0, n4 = 0, n15 = 0, n32
= 31251, n61 = 0, n128 = 1, n256 = 122
  nkeys  | rt_mem_allocated | array_mem_allocated | rt_load_ms |
array_load_ms | rt_search_ms | array_serach_ms
---------+------------------+---------------------+------------+---------------+--------------+-----------------
 1000000 |          9871280 |                     |         74 |
        |          212 |
(1 row)

---
* Query
select * from bench_seq_search(0, 2*1000*1000, true)
NOTICE:  num_keys = 999654, height = 2, n4 = 1, n32 = 62499, n128 = 1,
n256 = 245
* Fixed-size
 nkeys  | rt_mem_allocated | array_mem_allocated | rt_load_ms |
array_load_ms | rt_search_ms | array_serach_ms
--------+------------------+---------------------+------------+---------------+--------------+-----------------
 999654 |         19680848 |                     |         74 |
       |          201 |
(1 row)

* Variable-size
NOTICE:  num_keys = 999654, height = 2, n1 = 0, n4 = 1, n15 = 26951,
n32 = 35548, n61 = 1, n128 = 0, n256 = 245
 nkeys  | rt_mem_allocated | array_mem_allocated | rt_load_ms |
array_load_ms | rt_search_ms | array_serach_ms
--------+------------------+---------------------+------------+---------------+--------------+-----------------
 999654 |         16009040 |                     |         85 |
       |          201 |
(1 row)

---
* Query
select * from bench_search_random_nodes(10 * 1000 * 1000, '0x7F07FF00FF')

* Fixed-size
NOTICE:  num_keys = 9291812, height = 4, n4 = 262144, n32 = 79603,
n128 = 182670, n256 = 1024
 mem_allocated | search_ms
---------------+-----------
     343001456 |      1151
(1 row)

* Variable-size
NOTICE:  num_keys = 9291812, height = 4, n1 = 262144, n4 = 0, n15 =
138, n32 = 79465, n61 = 182665, n128 = 5, n256 = 1024
 mem_allocated | search_ms
---------------+-----------
     230504328 |      1077
(1 row)

---
* Query
select * from bench_search_random_nodes(10 * 1000 * 1000, '0xFFFF0000003F')
* Fixed-size
NOTICE:  num_keys = 3807650, height = 5, n4 = 196608, n32 = 0, n128 =
65536, n256 = 257
 mem_allocated | search_ms
---------------+-----------
      99911920 |       632
(1 row)
* Variable-size
NOTICE:  num_keys = 3807650, height = 5, n1 = 196608, n4 = 0, n15 = 0,
n32 = 0, n61 = 61747, n128 = 3789, n256 = 257
 mem_allocated | search_ms
---------------+-----------
      64045688 |       554
(1 row)

Overall, the idea of variable-sized nodes is good, smaller size
without losing search performance. I'm going to check the load
performance as well.

I've attached the patches I used for the verification. I don't include
patches for pointer tagging, DSA support, and vacuum integration since
I'm investigating the issue on cfbot that Andres reported. Also, I've
modified tests to improve the test coverage.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com

Attachment

pgsql-hackers by date:

Previous
From: Daniel Gustafsson
Date:
Subject: Re: TAP output format in pg_regress
Next
From: Thomas Kellerer
Date:
Subject: Re: Patch: Global Unique Index