Re: [PoC] Improve dead tuple storage for lazy vacuum - Mailing list pgsql-hackers

From Masahiko Sawada
Subject Re: [PoC] Improve dead tuple storage for lazy vacuum
Msg-id CAD21AoDpe=egYjkBS=VSg0i4PnY9EXX_YpikyXxaU=s_oM2K7Q@mail.gmail.com
In response to Re: [PoC] Improve dead tuple storage for lazy vacuum  (John Naylor <john.naylor@enterprisedb.com>)
Responses Re: [PoC] Improve dead tuple storage for lazy vacuum  (John Naylor <john.naylor@enterprisedb.com>)
List pgsql-hackers
On Sun, Mar 12, 2023 at 12:54 AM John Naylor
<john.naylor@enterprisedb.com> wrote:
>
> On Fri, Mar 10, 2023 at 9:30 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> >
> > On Fri, Mar 10, 2023 at 3:42 PM John Naylor
> > <john.naylor@enterprisedb.com> wrote:
>
> > > I'd suggest sharing your todo list in the meanwhile, it'd be good to discuss what's worth doing and what is not.
> >
> > Apart from more rounds of reviews and tests, my todo items that need
> > discussion and possibly implementation are:
>
> Quick thoughts on these:
>
> > * The memory measurement in radix trees and the memory limit in
> > tidstores. I've implemented it in v30-0007 through 0009 but we need to
> > review it. This is the highest priority for me.
>
> Agreed.
>
> > * Additional size classes. It's important for an alternative of path
> > compression as well as supporting our decoupling approach. Middle
> > priority.
>
> I'm going to push back a bit and claim this doesn't bring much gain, while it does have a complexity cost. The node1
> from Andres's prototype is 32 bytes in size, same as our node3, so it's roughly equivalent as a way to ameliorate
> the lack of path compression.

But does it mean that our node1 would reduce memory further, since
our base node type (i.e. RT_NODE) is smaller than the base node type
of Andres's prototype? The result I shared before showed 1.2GB vs.
1.9GB.
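
To illustrate what I mean, here is a rough sketch (the field names
are simplified assumptions, not the exact v30 code): a node1 would
share the node3 kind and search loop, with only the allocation size
differing.

/* Illustrative sketch only -- simplified, not the v30 patch code. */
typedef struct RT_NODE
{
    uint8 kind;     /* physical node kind, shared across size classes */
    uint8 fanout;   /* capacity of this size class */
    uint8 count;    /* number of children currently stored */
} RT_NODE;

/* node3 layout; a node1 size class would reuse the same kind and
 * search loop but be allocated with 1-element arrays, so the saving
 * comes purely from the smaller allocation. */
typedef struct RT_NODE_3
{
    RT_NODE         base;
    uint8           chunks[3];      /* key byte for each child slot */
    struct RT_NODE *children[3];
} RT_NODE_3;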

> I say "roughly" because the loop in node3 is probably noticeably slower. A new size class will by definition still
> use that loop.

I've evaluated the performance of node1, but the result seems to show
the opposite. I used the test query:

select * from bench_search_random_nodes(100 * 1000 * 1000,
'0xFF000000000000FF');

which makes a radix tree that has node1 nodes:

max_val = 18446744073709551615
num_keys = 65536
height = 7, n1 = 1536, n3 = 0, n15 = 0, n32 = 0, n61 = 0, n256 = 257

All internal nodes except for the root node are node1. The radix tree
that doesn't have node1 is:

max_val = 18446744073709551615
num_keys = 65536
height = 7, n3 = 1536, n15 = 0, n32 = 0, n125 = 0, n256 = 257

Here is the result:

* w/ node1
 mem_allocated | load_ms | search_ms
---------------+---------+-----------
        573448 |    1848 |      1707
(1 row)

* w/o node1
 mem_allocated | load_ms | search_ms
---------------+---------+-----------
        598024 |    2014 |      1825
(1 row)

Am I missing something?

>
> About a smaller node125-type class: I'm actually not even sure we need to have any sub-max node bigger than about 64
> (node size 768 bytes). I'd just let 65+ go to the max node -- there won't be many of them, at least in synthetic
> workloads we've seen so far.

Makes sense to me.
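
For example, the grow path could route anything above 64 children
straight to the max node, roughly like this (the class names and
boundaries below are illustrative assumptions, not the patch's
actual values):

/* Illustrative sketch: choose the next size class when a node
 * overflows. Class boundaries here are assumptions. */
static inline int
next_size_class(int needed_fanout)
{
    if (needed_fanout <= 3)
        return RT_CLASS_3;
    if (needed_fanout <= 32)
        return RT_CLASS_32;
    if (needed_fanout <= 64)
        return RT_CLASS_64;
    return RT_CLASS_256;    /* 65+ children skip the intermediate class */
}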

>
> > * Node shrinking support. Low priority.
>
> This is an architectural wart that's been neglected since the tid store doesn't perform deletion. We'll need it
> sometime. If we're not going to make this work, why ship a deletion API at all?
>
> I took a look at this a couple weeks ago, and fixing it wouldn't be that hard. I even had an idea of how to detect
> when to shrink size class within a node kind, while keeping the header at 5 bytes. I'd be willing to put effort into
> that, but to have a chance of succeeding, I'm unwilling to make it more difficult by adding more size classes at this
> point.

I think that the deletion (and locking support) doesn't have use
cases in the core (i.e. tidstore) but is implemented so that external
extensions can use it. There might not be such extensions yet. Given
the lack of use cases in the core (and the limited time remaining), I
think it's okay even if the implementation of such APIs is minimal
and not fully optimized. For instance, the implementation of dshash.c
is minimalist and doesn't have resizing. We can improve them in the
future if extensions or other core features need it.
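
Concretely, a minimal delete could just remove the slot and leave the
node at its current size class, e.g. (a sketch building on the
simplified node3 layout above, not patch code; memmove is from
<string.h>):

/* Illustrative sketch: remove the child at 'slot' without shrinking
 * the node to a smaller size class. */
static void
rt_node_3_delete(RT_NODE_3 *node, int slot)
{
    int n = node->base.count;

    memmove(&node->chunks[slot], &node->chunks[slot + 1],
            sizeof(uint8) * (n - slot - 1));
    memmove(&node->children[slot], &node->children[slot + 1],
            sizeof(RT_NODE *) * (n - slot - 1));
    node->base.count--;
    /* Shrinking (e.g. when count falls below the next-smaller class's
     * fanout) could be added here later without changing the API. */
}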

Personally, I think we should focus on addressing the feedback we get
and improving the existing use cases in the remaining time. That's
why considering the min-max size classes has a higher priority than
node shrinking support in my todo list.

FYI, I ran the TPC-C workload over the weekend and didn't get any
failures of the assertion verifying that the tidstore and the current
tid lookup return the same results.
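
The check is essentially a cross-check of the two lookup paths during
the index scan, along these lines (the names below are hypothetical,
not the exact code in the patch):

#ifdef USE_ASSERT_CHECKING
    /* Hypothetical names: both lookup paths must agree on whether
     * the TID was recorded as dead. */
    Assert(TidStoreIsMember(dead_items_store, itemptr) ==
           tid_reaped_old(itemptr, dead_items_array));
#endif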

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com


