Re: [PoC] Improve dead tuple storage for lazy vacuum - Mailing list pgsql-hackers

From Masahiko Sawada
Subject Re: [PoC] Improve dead tuple storage for lazy vacuum
Date
Msg-id CAD21AoDKRu+fmsHiG0Nkikm2-RM=czpXuiXaJLKd6UHwKdJUvg@mail.gmail.com
Whole thread Raw
In response to Re: [PoC] Improve dead tuple storage for lazy vacuum  (Masahiko Sawada <sawada.mshk@gmail.com>)
List pgsql-hackers
On Mon, Jul 4, 2022 at 2:07 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
> On Tue, Jun 28, 2022 at 10:10 PM John Naylor
> <john.naylor@enterprisedb.com> wrote:
> >
> > On Tue, Jun 28, 2022 at 1:24 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> > >
> > > > I
> > > > suspect other optimizations would be worth a lot more than using AVX2:
> > > > - collapsing inner nodes
> > > > - taking care when constructing the key (more on this when we
> > > > integrate with VACUUM)
> > > > ...and a couple Andres mentioned:
> > > > - memory management: in
> > > > https://www.postgresql.org/message-id/flat/20210717194333.mr5io3zup3kxahfm%40alap3.anarazel.de
> > > > - node dispatch:
> > > > https://www.postgresql.org/message-id/20210728184139.qhvx6nbwdcvo63m6%40alap3.anarazel.de
> > > >
> > > > Therefore, I would suggest that we use SSE2 only, because:
> > > > - portability is very easy
> > > > - to avoid a performance hit from indirecting through a function pointer
> > >
> > > Okay, I'll try these optimizations and see if the performance becomes better.
> >
> > FWIW, I think it's fine if we delay these until after committing a
> > good-enough version. The exception is key construction and I think
> > that deserves some attention now (more on this below).
>
> Agreed.
>
> >
> > > I've done benchmark tests while changing the node types. The code base
> > > is v3 patch that doesn't have the optimization you mentioned below
> > > (memory management and node dispatch) but I added the code to use SSE2
> > > for node-16 and node-32.
> >
> > Great, this is helpful to visualize what's going on!
> >
> > > * sse2_4_16_48_256
> > >     * nkeys = 90910000, height = 3, n4 = 0, n16 = 0, n48 = 512, n256 = 916433
> > >     * nkeys = 20000, height = 3, n4 = 20000, n16 = 0, n48 = 207, n256 = 50
> > >
> > > * sse2_4_32_128_256
> > >     * nkeys = 90910000, height = 3, n4 = 0, n32 = 285, n128 = 916629, n256 = 31
> > >     * nkeys = 20000, height = 3, n4 = 20000, n32 = 48, n128 = 208, n256 = 1
> >
> > > Observations are:
> > >
> > > In both test cases, There is not much difference between using AVX2
> > > and SSE2. The more mode types, the more time it takes for loading the
> > > data (see sse2_4_16_32_128_256).
> >
> > Good to know. And as Andres mentioned in his PoC, more node types
> > would be a barrier for pointer tagging, since 32-bit platforms only
> > have two spare bits in the pointer.
> >
> > > In dense case, since most nodes have around 100 children, the radix
> > > tree that has node-128 had a good figure in terms of memory usage. On
> >
> > Looking at the node stats, and then your benchmark code, I think key
> > construction is a major influence, maybe more than node type. The
> > key/value scheme tested now makes sense:
> >
> > blockhi || blocklo || 9 bits of item offset
> >
> > (with the leaf nodes containing a bit map of the lowest few bits of
> > this whole thing)
> >
> > We want the lower fanout nodes at the top of the tree and higher
> > fanout ones at the bottom.
>
> So more inner nodes can fit in CPU cache, right?
>
> >
> > Note some consequences: If the table has enough columns such that much
> > fewer than 100 tuples fit on a page (maybe 30 or 40), then in the
> > dense case the nodes above the leaves will have lower fanout (maybe
> > they will fit in a node32). Also, the bitmap values in the leaves will
> > be more empty. In other words, many tables in the wild *resemble* the
> > sparse case a bit, even if truly all tuples on the page are dead.
> >
> > Note also that the dense case in the benchmark above has ~4500 times
> > more keys than the sparse case, and uses about ~1000 times more
> > memory. But the runtime is only 2-3 times longer. That's interesting
> > to me.
> >
> > To optimize for the sparse case, it seems to me that the key/value would be
> >
> > blockhi || 9 bits of item offset || blocklo
> >
> > I believe that would make the leaf nodes more dense, with fewer inner
> > nodes, and could drastically speed up the sparse case, and maybe many
> > realistic dense cases.
>
> Does it have an effect on the number of inner nodes?
>
> >  I'm curious to hear your thoughts.
>
> Thank you for your analysis. It's worth trying. We use 9 bits for item
> offset but most pages don't use all bits in practice. So probably it
> might be better to move the most significant bit of item offset to the
> left of blockhi. Or more simply:
>
> 9 bits of item offset || blockhi || blocklo
>
> >
> > > the other hand, the radix tree that doesn't have node-128 has a better
> > > number in terms of insertion performance. This is probably because we
> > > need to iterate over 'isset' flags from the beginning of the array in
> > > order to find an empty slot when inserting new data. We do the same
> > > thing also for node-48 but it was better than node-128 as it's up to
> > > 48.
> >
> > I mentioned in my diff, but for those following along, I think we can
> > improve that by iterating over the bytes and if it's 0xFF all 8 bits
> > are set already so keep looking...
>
> Right. Using 0xFF also makes the code readable so I'll change that.
>
> >
> > > In terms of lookup performance, the results vary but I could not find
> > > any common pattern that makes the performance better or worse. Getting
> > > more statistics such as the number of each node type per tree level
> > > might help me.
> >
> > I think that's a sign that the choice of node types might not be
> > terribly important for these two cases. That's good if that's true in
> > general -- a future performance-critical use of this code might tweak
> > things for itself without upsetting vacuum.
>
> Agreed.
>

I've attached an updated patch that incorporated comments from John.
Here are some comments I could not address and the reason:

+// bitfield is uint32, so we don't need UINT64_C
  bitfield &= ((UINT64_C(1) << node->n.count) - 1);

Since node->n.count could be 32, I think we need to use UINT64CONST() here.

 /* Macros for radix tree nodes */
+// not sure why are we doing casts here?
 #define IS_LEAF_NODE(n) (((radix_tree_node *) (n))->shift == 0)
 #define IS_EMPTY_NODE(n) (((radix_tree_node *) (n))->count == 0)

I've left the casts as I use IS_LEAF_NODE for rt_node_4/16/32/128/256.

Also, I've dropped the configure script support for AVX2, and support
for SSE2 is missing. I'll update it later.

I've not addressed the comments I got from Andres yet so I'll update
the patch according to the discussion but the current patch would be
more readable than the previous one thanks to the comments from John.

Regards,

-- 
Masahiko Sawada
EDB:  https://www.enterprisedb.com/

Attachment

pgsql-hackers by date:

Previous
From: Thomas Munro
Date:
Subject: Re: pgsql: dshash: Add sequential scan support.
Next
From: Michael Paquier
Date:
Subject: Linking issue with libldap on morepork (OpenBSD 6.9)