Re: [PoC] Improve dead tuple storage for lazy vacuum - Mailing list pgsql-hackers

From John Naylor
Subject Re: [PoC] Improve dead tuple storage for lazy vacuum
Date
Msg-id CANWCAZYqWibTRCWs5mV57mLj1A0nbKX-eV5G+d-KmBOGHTVY-w@mail.gmail.com
In response to Re: [PoC] Improve dead tuple storage for lazy vacuum  (Masahiko Sawada <sawada.mshk@gmail.com>)
List pgsql-hackers
On Tue, Feb 6, 2024 at 9:58 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
> On Fri, Feb 2, 2024 at 8:47 PM John Naylor <johncnaylorls@gmail.com> wrote:

> > My todo:
> > - benchmark tid store / vacuum again, since we haven't since varlen
> > types and removing unnecessary locks.

I ran a vacuum benchmark similar to the one in [1] (using unlogged
tables for reproducibility), but with smaller tables (100 million
records), deleting only the last 20% of the table, and including a
parallel vacuum test. Scripts attached.

monotonically ordered int column index:

master:
system usage: CPU: user: 4.27 s, system: 0.41 s, elapsed: 4.70 s
system usage: CPU: user: 4.23 s, system: 0.44 s, elapsed: 4.69 s
system usage: CPU: user: 4.26 s, system: 0.39 s, elapsed: 4.66 s

v59:
system usage: CPU: user: 3.10 s, system: 0.44 s, elapsed: 3.56 s
system usage: CPU: user: 3.07 s, system: 0.35 s, elapsed: 3.43 s
system usage: CPU: user: 3.07 s, system: 0.36 s, elapsed: 3.44 s

uuid column index:

master:
system usage: CPU: user: 18.22 s, system: 1.70 s, elapsed: 20.01 s
system usage: CPU: user: 17.70 s, system: 1.70 s, elapsed: 19.48 s
system usage: CPU: user: 18.48 s, system: 1.59 s, elapsed: 20.43 s

v59:
system usage: CPU: user: 5.18 s, system: 1.18 s, elapsed: 6.45 s
system usage: CPU: user: 6.56 s, system: 1.39 s, elapsed: 7.99 s
system usage: CPU: user: 6.51 s, system: 1.44 s, elapsed: 8.05 s

int & uuid indexes in parallel:

master:
system usage: CPU: user: 4.53 s, system: 1.22 s, elapsed: 20.43 s
system usage: CPU: user: 4.49 s, system: 1.29 s, elapsed: 20.98 s
system usage: CPU: user: 4.46 s, system: 1.33 s, elapsed: 20.50 s

v59:
system usage: CPU: user: 2.09 s, system: 0.32 s, elapsed: 4.86 s
system usage: CPU: user: 3.76 s, system: 0.51 s, elapsed: 8.92 s
system usage: CPU: user: 3.83 s, system: 0.54 s, elapsed: 9.09 s

Overall, I'm pleased with these results, although I'm confused about
why, with the patch, the first run sometimes reports running faster
than the later ones. I'm curious what others get. Traversing a tree
that lives in DSA has some overhead, as expected, but it still comes
out way ahead of master.

There are still some micro-benchmarks we could do on tidstore, and
it'd be good to find out the worst-case memory use (one dead tuple
each on spread-out pages), but this is a decent demonstration.

> > I'm not sure what the test_node_types_* functions are testing that
> > test_basic doesn't. They have a different, and confusing, way to stop
> > at every size class and check the keys/values. It seems we can replace
> > all that with two more calls (asc/desc) to test_basic, with the
> > maximum level.

v58-0008:

+ /* borrowed from RT_MAX_SHIFT */
+ const int max_shift = (pg_leftmost_one_pos64(UINT64_MAX) /
+ BITS_PER_BYTE) * BITS_PER_BYTE;

This is harder to read than "64 - 8", and doesn't really help
maintainability either.
Maybe "(sizeof(uint64) - 1) * BITS_PER_BYTE" is a good compromise.

+ /* leaf nodes */
+ test_basic(test_info, 0);

+ /* internal nodes */
+ test_basic(test_info, 8);
+
+ /* max-level nodes */
+ test_basic(test_info, max_shift);

This three-way terminology is not very informative. How about:

+       /* a tree with one level, i.e. a single node under the root node. */
 ...
+       /* a tree with two levels */
 ...
+       /* a tree with the maximum number of levels */

+static void
+test_basic(rt_node_class_test_elem *test_info, int shift)
+{
+ elog(NOTICE, "testing node %s with shift %d", test_info->class_name, shift);
+
+ /* Test nodes while changing the key insertion order */
+ do_test_basic(test_info->nkeys, shift, false);
+ do_test_basic(test_info->nkeys, shift, true);

Adding a level of indirection makes this harder to read, and do we
still know whether a test failed with ascending or descending keys?
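
If we keep the wrapper, maybe something like this would preserve that
information (an untested sketch; I'm assuming do_test_basic()'s third
argument selects descending insertion order):

static void
test_basic(rt_node_class_test_elem *test_info, int shift)
{
	/* test both insertion orders, and say which one in the output */
	elog(NOTICE, "testing node %s with shift %d, ascending keys",
		 test_info->class_name, shift);
	do_test_basic(test_info->nkeys, shift, false);

	elog(NOTICE, "testing node %s with shift %d, descending keys",
		 test_info->class_name, shift);
	do_test_basic(test_info->nkeys, shift, true);
}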

> > My earlier opinion was that "handle" was a nicer variable name, but
> > this brings back the typedef and also keeps the variable name I
> > didn't like, just pushed down into the function. I'm a bit confused,
> > so I've kept these not-squashed for now.
>
> I misunderstood your comment. I've changed it to use the variable
> name rt_handle and removed the TidStoreHandle type. See the 0013 patch.

(diff against an earlier version)
-       pvs->shared->dead_items_handle = TidStoreGetHandle(dead_items);
+       pvs->shared->dead_items_dp = TidStoreGetHandle(dead_items);

Shall we use "handle" in vacuum_parallel.c as well?
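
That is, reverting to something like:

-       pvs->shared->dead_items_dp = TidStoreGetHandle(dead_items);
+       pvs->shared->dead_items_handle = TidStoreGetHandle(dead_items);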

> > I'm pretty sure there's an
> > accidental memset call that crept in there, but I'm running out of
> > steam today.

I have just a little bit of work to add for v59:

v59-0009 - set_offset_bitmap_at() will call memset if it needs to zero
any bitmapwords. That can only happen if, for example, there is an
offset > 128 and none between 64 and 128, so it's not a huge deal, but
I think it's a bit nicer in this patch.
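
To illustrate the case I mean, here's a sketch of the idea only (not
the patch's actual code; the function name and signature are made up,
but bitmapword and BITS_PER_BITMAPWORD are the ones from
nodes/bitmapset.h):

static void
set_offset_in_bitmap(bitmapword *words, int *nwords, int off)
{
	int			wordnum = off / BITS_PER_BITMAPWORD;

	if (wordnum >= *nwords)
	{
		/*
		 * Zero any words we skipped over, e.g. the word covering
		 * offsets 64..127 when jumping from offset 10 to offset 130.
		 */
		memset(&words[*nwords], 0,
			   (wordnum - *nwords + 1) * sizeof(bitmapword));
		*nwords = wordnum + 1;
	}
	words[wordnum] |= ((bitmapword) 1) << (off % BITS_PER_BITMAPWORD);
}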

> > >  * WIP: notes about traditional radix tree trading off span vs height...
> > >
> > > Are you going to write it?
> >
> > Yes, when I draft a rough commit message, (for next time).

I haven't gotten to the commit message, but:

v59-0004 - I did some rewriting of the top header comment to explain
ART concepts for new readers, made small comment changes, and tidied
up some indentation that pgindent won't touch
v59-0005 - re-pgindent'ed


[1] https://www.postgresql.org/message-id/CAFBsxsHUxmXYy0y4RrhMcNe-R11Bm099Xe-wUdb78pOu0%2BPT2Q%40mail.gmail.com
