Re: [PoC] Improve dead tuple storage for lazy vacuum - Mailing list pgsql-hackers

From Masahiko Sawada
Subject Re: [PoC] Improve dead tuple storage for lazy vacuum
Date
Msg-id CAD21AoBW0xjZYLURLCOUSUyOVnR+7uX2U8gKeGdm2RsJOpZ4Mw@mail.gmail.com
Whole thread Raw
In response to Re: [PoC] Improve dead tuple storage for lazy vacuum  (John Naylor <john.naylor@enterprisedb.com>)
Responses Re: [PoC] Improve dead tuple storage for lazy vacuum  (John Naylor <john.naylor@enterprisedb.com>)
List pgsql-hackers
On Wed, Nov 30, 2022 at 2:51 PM John Naylor
<john.naylor@enterprisedb.com> wrote:
>
> There are a few things up in the air, so I'm coming back to this list to summarize and add a recent update:
>
> On Mon, Nov 14, 2022 at 7:59 PM John Naylor <john.naylor@enterprisedb.com> wrote:
> >
> > - See how much performance we actually gain from tagging the node kind.
>
> Needs a benchmark that has enough branch mispredicts and L2/3 misses to show a benefit. Otherwise either neutral or
worsein its current form, depending on compiler(?). Put off for later. 
>
> > - Try additional size classes while keeping the node kinds to only four.
>
> This is relatively simple and effective. If only one additional size class (total 5) is coded as a placeholder, I
imagineit will be easier to rebase shared memory logic than using this technique everywhere possible. 
>
> > - Optimize node128 insert.
>
> I've attached a rough start at this. The basic idea is borrowed from our bitmapset nodes, so we can iterate over and
operateon word-sized (32- or 64-bit) types at a time, rather than bytes. 

Thanks! I think this is a good idea.

> To make this easier, I've moved some of the lower-level macros and types from bitmapset.h/.c to pg_bitutils.h. That's
probablygoing to need a separate email thread to resolve the coding style clash this causes, so that can be put off for
later.

Agreed. Since tidbitmap.c also has WORDNUM(x) and BITNUM(x), we can
use it if we move from bitmapset.h.

> This is not meant to be included in the next patchset.  For demonstration purposes, I get these results with a
functionthat repeatedly deletes the last value from a mostly-full node128 leaf and re-inserts it: 
>
> select * from bench_node128_load(120);
>
> v11
>
> NOTICE:  num_keys = 14400, height = 1, n1 = 0, n4 = 0, n15 = 0, n32 = 0, n61 = 0, n128 = 121, n256 = 0
>  fanout | nkeys | rt_mem_allocated | rt_sparseload_ms
> --------+-------+------------------+------------------
>     120 | 14400 |           208304 |               56
>
> v11 + 0006 addendum
>
> NOTICE:  num_keys = 14400, height = 1, n1 = 0, n4 = 0, n15 = 0, n32 = 0, n61 = 0, n128 = 121, n256 = 0
>  fanout | nkeys | rt_mem_allocated | rt_sparseload_ms
> --------+-------+------------------+------------------
>     120 | 14400 |           208816 |               34
>
> I didn't test inner nodes, but I imagine the difference is bigger. This bitmap style should also be used for the
node256-leafisset array simply to be consistent and avoid needing single-use macros, but that has not been done yet. It
won'tmake a difference for performance because there is no iteration there. 


After updating the patch set according to recent comments, I've also
done the same test in my environment and got similar good results.

w/o 0006 addendum patch

NOTICE:  num_keys = 14400, height = 1, n4 = 0, n15 = 0, n32 = 0, n125
= 121, n256 = 0
 fanout | nkeys | rt_mem_allocated | rt_sparseload_ms
--------+-------+------------------+------------------
    120 | 14400 |           204424 |               29
(1 row)

w/ 0006 addendum patch

NOTICE:  num_keys = 14400, height = 1, n4 = 0, n15 = 0, n32 = 0, n125
= 121, n256 = 0
 fanout | nkeys | rt_mem_allocated | rt_sparseload_ms
--------+-------+------------------+------------------
    120 | 14400 |           204936 |               18
(1 row)

> > - Try templating out the differences between local and shared memory.
>
> I hope to start this sometime after the crashes on 32-bit are resolved.

I've attached updated patches that incorporated all comments I got so
far as well as fixes for compiler warnings. I included your bitmapword
patch as 0004 for benchmarking. Also I reverted the change around
pg_atomic_u64 since we don't support any locking as you mentioned and
if we have a single lwlock to protect the radix tree, we don't need to
use pg_atomic_u64 only for max_val and num_keys.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com

Attachment

pgsql-hackers by date:

Previous
From: gkokolatos@pm.me
Date:
Subject: Re: Add LZ4 compression in pg_dump
Next
From: Andres Freund
Date:
Subject: Re: Failed Assert in pgstat_assoc_relation