Re: [PoC] Improve dead tuple storage for lazy vacuum - Mailing list pgsql-hackers

From: Masahiko Sawada
Subject: Re: [PoC] Improve dead tuple storage for lazy vacuum
Msg-id: CAD21AoAT0p0crCjiY+UwXbkQG2bRNFbe0UF0Ba5EfC_w458JBQ@mail.gmail.com
In response to: Re: [PoC] Improve dead tuple storage for lazy vacuum (John Naylor <john.naylor@enterprisedb.com>)
Responses: Re: [PoC] Improve dead tuple storage for lazy vacuum (John Naylor <john.naylor@enterprisedb.com>)
List: pgsql-hackers
On Thu, Oct 27, 2022 at 12:21 PM John Naylor
<john.naylor@enterprisedb.com> wrote:
>
> On Thu, Oct 27, 2022 at 9:11 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> >
> > True. I'm going to start with 6 bytes and will consider reducing it to
> > 5 bytes.
>
> Okay, let's plan on 6 for now, so we have the worst-case sizes up front. As discussed, I will attempt the size class
> decoupling after v8 and see how it goes.
>
> > Encoding the kind in a pointer tag could be tricky given DSA
>
> If it turns out to be unworkable, that's life. If it's just tricky, that can certainly be put off for future work. I
> hope to at least test it out with local memory.
>
> > support so currently I'm thinking to pack the node kind and node
> > capacity classes to uint8.
>
> That won't work, if we need 128 for capacity, leaving no bits left. I want the capacity to be a number we can
> directly compare with the count (we won't ever need to store 256 because that node will never grow). Also, further to my
> last message, we need to access the kind quickly, without more cycles.

Understood.

>
> > I've made some progress on investigating DSA support. I've written
> > draft patch for that and regression tests passed. I'll share it as a
> > separate patch for discussion with v8 radix tree patch.
>
> Great!
>
> > While implementing DSA support, I realized that we may not need to use
> > pointer tagging to distinguish between backend-local address or
> > dsa_pointer. In order to get a backend-local address from dsa_pointer,
> > we need to pass dsa_area like:
>
> I was not clear -- when I see how much code changes to accommodate DSA pointers, I imagine I will pretty much know
> the places that would be affected by tagging the pointer with the node kind.
>
> Speaking of tests, there is currently no Meson support, but tests pass because this library is not used anywhere in
> the backend yet, and apparently the CI Meson builds don't know to run the regression test? That will need to be done
> too. However, it's okay to keep the benchmarking module in autoconf, since it won't be committed.

Updated to support Meson.

>
> > > +static inline void
> > > +chunk_children_array_copy(uint8 *src_chunks, rt_node **src_children,
> > > +             uint8 *dst_chunks, rt_node **dst_children, int count)
> > > +{
> > > + memcpy(dst_chunks, src_chunks, sizeof(uint8) * count);
> > > + memcpy(dst_children, src_children, sizeof(rt_node *) * count);
> > > +}
> > >
> > > gcc generates better code with something like this (but not hard-coded) at the top:
> > >
> > >     if (count > 4)
> > >         pg_unreachable();
>
> Actually it just now occurred to me there's a bigger issue here: *We* know this code can only get here iff count==4,
> so why doesn't the compiler know that? I believe it boils down to
>
> static rt_node_kind_info_elem rt_node_kind_info[RT_NODE_KIND_COUNT] = {
>
> In the assembly, I see it checks if there is room in the node by doing a runtime lookup in this array, which is not
> constant. This might not be important just yet, because I want to base the check on the proposed node capacity instead,
> but I mention it as a reminder to us to make sure we take all opportunities for the compiler to propagate constants.
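
For reference, applying that hint to the copy function would look like
the following (the bound is hard-coded here only for illustration; as
you say, the real check should be derived from the node's capacity
class so that the compiler can see the constant):

static inline void
chunk_children_array_copy(uint8 *src_chunks, rt_node **src_children,
              uint8 *dst_chunks, rt_node **dst_children, int count)
{
    /* tell the compiler that callers never exceed this node's fanout */
    if (count > 4)
        pg_unreachable();

    memcpy(dst_chunks, src_chunks, sizeof(uint8) * count);
    memcpy(dst_children, src_children, sizeof(rt_node *) * count);
}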

I've attached the v8 patches. The 0001, 0002, and 0003 patches
incorporate the comments I got so far. The 0004 patch adds DSA support
as a PoC.

In the 0004 patch, the basic idea is to use rt_node_ptr for all child
pointers in inner nodes, and to interpret an rt_node_ptr as either an
rt_node * or a dsa_pointer depending on whether the radix tree is
shared or not (ie, by checking radix_tree->dsa == NULL).
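
To illustrate the idea, here is a rough sketch (not the code in the
patch; the union layout and the helper name are made up):

typedef union rt_node_ptr
{
    rt_node    *local;          /* valid when radix_tree->dsa == NULL */
    dsa_pointer dsap;           /* valid when the tree is shared */
} rt_node_ptr;

/* Resolve an rt_node_ptr to a backend-local address. */
static inline rt_node *
node_ptr_get_local(radix_tree *tree, rt_node_ptr ptr)
{
    /* sketch only -- the actual representation in the patch may differ */
    if (tree->dsa == NULL)
        return ptr.local;

    return (rt_node *) dsa_get_address(tree->dsa, ptr.dsap);
}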
Regarding the performance, I've added another boolean argument to
bench_seq/shuffle_search(), specifying whether to use the shared radix
tree. Here are the benchmark results in my environment:

select * from bench_seq_search(0, 1* 1000 * 1000, false, false);
  nkeys  | rt_mem_allocated | array_mem_allocated | rt_load_ms | array_load_ms | rt_search_ms | array_serach_ms
---------+------------------+---------------------+------------+---------------+--------------+-----------------
 1000000 |          9871240 |           180000000 |         67 |               |          241 |
(1 row)

select * from bench_seq_search(0, 1* 1000 * 1000, false, true);
  nkeys  | rt_mem_allocated | array_mem_allocated | rt_load_ms | array_load_ms | rt_search_ms | array_serach_ms
---------+------------------+---------------------+------------+---------------+--------------+-----------------
 1000000 |         14680064 |           180000000 |         81 |               |          483 |
(1 row)

select * from bench_seq_search(0, 2* 1000 * 1000, true, false);
 nkeys  | rt_mem_allocated | array_mem_allocated | rt_load_ms | array_load_ms | rt_search_ms | array_serach_ms
--------+------------------+---------------------+------------+---------------+--------------+-----------------
 999654 |         19680872 |           179937720 |         74 |               |          235 |
(1 row)

select * from bench_seq_search(0, 2* 1000 * 1000, true, true);
 nkeys  | rt_mem_allocated | array_mem_allocated | rt_load_ms | array_load_ms | rt_search_ms | array_serach_ms
--------+------------------+---------------------+------------+---------------+--------------+-----------------
 999654 |         23068672 |           179937720 |         86 |               |          445 |
(1 row)

select * from bench_shuffle_search(0, 1* 1000 * 1000, false, false);
  nkeys  | rt_mem_allocated | array_mem_allocated | rt_load_ms | array_load_ms | rt_search_ms | array_serach_ms
---------+------------------+---------------------+------------+---------------+--------------+-----------------
 1000000 |          9871240 |           180000000 |         67 |               |          640 |
(1 row)

select * from bench_shuffle_search(0, 1* 1000 * 1000, false, true);
  nkeys  | rt_mem_allocated | array_mem_allocated | rt_load_ms | array_load_ms | rt_search_ms | array_serach_ms
---------+------------------+---------------------+------------+---------------+--------------+-----------------
 1000000 |         14680064 |           180000000 |         81 |               |         1002 |
(1 row)

select * from bench_shuffle_search(0, 2* 1000 * 1000, true, false);
 nkeys  | rt_mem_allocated | array_mem_allocated | rt_load_ms | array_load_ms | rt_search_ms | array_serach_ms
--------+------------------+---------------------+------------+---------------+--------------+-----------------
 999654 |         19680872 |           179937720 |         74 |               |          697 |
(1 row)

select * from bench_shuffle_search(0, 2* 1000 * 1000, true, true);
 nkeys  | rt_mem_allocated | array_mem_allocated | rt_load_ms | array_load_ms | rt_search_ms | array_serach_ms
--------+------------------+---------------------+------------+---------------+--------------+-----------------
 999654 |         23068672 |           179937720 |         86 |               |         1030 |
(1 row)

In the non-shared radix tree cases (the fourth argument is false), I
don't see any visible performance degradation. On the other hand, in
the shared radix tree cases (the fourth argument is true), there is a
visible overhead because of dsa_get_address().

Please note that the current shared radix tree implementation doesn't
support any locking, so it cannot be read while another process is
writing to it. Also, only one process can iterate over the shared radix
tree at a time. These are not restrictions for parallel vacuum, though:
the leader process writes the radix tree while scanning the heap, the
radix tree is only read by multiple processes while vacuuming indexes,
and only the leader process vacuums the heap by iterating over the
key-value pairs in the radix tree. If we want to use it for other use
cases too, we would need to support locking, RCU, or something similar.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com
