Re: [PoC] Improve dead tuple storage for lazy vacuum - Mailing list pgsql-hackers

From Masahiko Sawada
Subject Re: [PoC] Improve dead tuple storage for lazy vacuum
Date
Msg-id CAD21AoBAROsGaRn846j6KfHO3_0xRsb3a_LxLwCjWJG8Hjw9wg@mail.gmail.com
Whole thread Raw
In response to Re: [PoC] Improve dead tuple storage for lazy vacuum  (John Naylor <johncnaylorls@gmail.com>)
Responses Re: [PoC] Improve dead tuple storage for lazy vacuum  (John Naylor <johncnaylorls@gmail.com>)
List pgsql-hackers
On Tue, Jan 30, 2024 at 7:20 PM John Naylor <johncnaylorls@gmail.com> wrote:
>
> On Tue, Jan 30, 2024 at 7:56 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> >
> > On Mon, Jan 29, 2024 at 8:48 PM John Naylor <johncnaylorls@gmail.com> wrote:
>
> > > I meant the macro could probably be
> > >
> > > Max(SLAB_DEFAULT_BLOCK_SIZE, (size) * N)
> > >
> > > (Right now N=32). I also realize I didn't answer your question earlier
> > > about block sizes being powers of two. I was talking about PG in
> > > general -- I was thinking all block sizes were powers of two. If
> > > that's true, I'm not sure if it's because programmers find the macro
> > > calculations easy to reason about, or if there was an implementation
> > > reason for it (e.g. libc behavior). 32*2088 bytes is about 65kB, or
> > > just above a power of two, so if we did  round that up it would be
> > > 128kB.
> >
> > Thank you for your explanation. It might be better to follow other
> > codes. Does the calculation below make sense to you?
> >
> > RT_SIZE_CLASS_ELEM size_class = RT_SIZE_CLASS_INFO[i];
> > Size inner_blocksize = SLAB_DEFAULT_BLOCK_SIZE;
> > while (inner_blocksize < 32 * size_class.allocsize)
> >      inner_blocksize <<= 1;
>
> It does make sense, but we can do it more simply:
>
> Max(SLAB_DEFAULT_BLOCK_SIZE, pg_nextpower2_32(size * 32))

Thanks!

I've attached the new patch set (v56). I've squashed previous updates
and addressed review comments on v55 in separate patches. Here are the
update summary:

0004: fix compiler warning caught by ci test.
0005-0008: address review comments on radix tree codes.
0009: cleanup #define and #undef
0010: use TEST_SHARED_RT macro for shared radix tree test. RT_SHMEM is
undefined after including radixtree.h so we should not use it in test
code.
0013-0015: address review comments on tidstore codes.
0017-0018: address review comments on vacuum integration codes.

Looking at overall changes, there are still XXX and TODO comments in
radixtree.h:

---
 * XXX There are 4 node kinds, and this should never be increased,
 * for several reasons:
 * 1. With 5 or more kinds, gcc tends to use a jump table for switch
 *    statements.
 * 2. The 4 kinds can be represented with 2 bits, so we have the option
 *    in the future to tag the node pointer with the kind, even on
 *    platforms with 32-bit pointers. This might speed up node traversal
 *    in trees with highly random node kinds.
 * 3. We can have multiple size classes per node kind.

Can we just remove "XXX"?

---
 * WIP: notes about traditional radix tree trading off span vs height...

Are you going to write it?

---
#ifdef RT_SHMEM
/*  WIP: do we really need this? */
typedef dsa_pointer RT_HANDLE;
#endif

I think it's worth having it.

---
 * WIP: The paper uses at most 64 for this node kind. "isset" happens to fit
 * inside a single bitmapword on most platforms, so it's a good starting
 * point. We can make it higher if we need to.
 */
#define RT_FANOUT_48_MAX (RT_NODE_MAX_SLOTS / 4)

Are you going to work something on this?

---
    /* WIP: We could go first to the higher node16 size class */
    newnode = RT_ALLOC_NODE(tree, RT_NODE_KIND_16, RT_CLASS_16_LO);

Does it mean to go to RT_CLASS_16_HI and then further go to
RT_CLASS_16_LO upon further deletion?

---
 * TODO: The current locking mechanism is not optimized for high concurrency
 * with mixed read-write workloads. In the future it might be worthwhile
 * to replace it with the Optimistic Lock Coupling or ROWEX mentioned in
 * the paper "The ART of Practical Synchronization" by the same authors as
 * the ART paper, 2016.

I think it's not TODO for now, but a future improvement. We can remove it.

---
/* TODO: consider 5 with subclass 1 or 2. */
#define RT_FANOUT_4     4

Is there something we need to do here?

---
/*
 * Return index of the chunk and slot arrays for inserting into the node,
 * such that the chunk array remains ordered.
 * TODO: Improve performance for non-SIMD platforms.
 */

Are you going to work on this?

---
/* Delete the element at 'idx' */
/*  TODO: replace slow memmove's */

Are you going to work on this?

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com

Attachment

pgsql-hackers by date:

Previous
From: "David G. Johnston"
Date:
Subject: Re: Possibility to disable `ALTER SYSTEM`
Next
From: vignesh C
Date:
Subject: Re: src/bin/pg_upgrade/t/004_subscription.pl test comment fix