Re: [PoC] Improve dead tuple storage for lazy vacuum - Mailing list pgsql-hackers

From John Naylor
Subject Re: [PoC] Improve dead tuple storage for lazy vacuum
Date
Msg-id CAFBsxsHbBm_M22gLBO+AZT4mfMq3L_oX3wdKZxjeNnT7fHsYMQ@mail.gmail.com
Whole thread Raw
In response to Re: [PoC] Improve dead tuple storage for lazy vacuum  (Masahiko Sawada <sawada.mshk@gmail.com>)
Responses Re: [PoC] Improve dead tuple storage for lazy vacuum  (John Naylor <john.naylor@enterprisedb.com>)
Re: [PoC] Improve dead tuple storage for lazy vacuum  (Masahiko Sawada <sawada.mshk@gmail.com>)
List pgsql-hackers

On Tue, Jan 31, 2023 at 9:43 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:

> I've attached v24 patches. The locking support patch is separated
> (0005 patch). Also I kept the updates for TidStore and the vacuum
> integration from v23 separate.

Okay, that's a lot more simple, and closer to what I imagined. For v25, I squashed v24's additions and added a couple of my own. I've kept the CF status at "needs review" because no specific action is required at the moment.

I did start to review the TID store some more, but that's on hold because something else came up: On a lark I decided to re-run some benchmarks to see if anything got lost in converting to a template, and that led me down a rabbit hole -- some good and bad news on that below.

0001:

I removed the uint64 case, as discussed. There is now a brief commit message, but needs to be fleshed out a bit. I took another look at the Arm optimization that Nathan found some month ago, for forming the highbit mask, but that doesn't play nicely with how node32 uses it, so I decided against it. I added a comment to describe the reasoning in case someone else gets a similar idea.

I briefly looked into "separate-commit TODO: move non-SIMD fallbacks to their own header to clean up the #ifdef maze.", but decided it wasn't such a clear win to justify starting the work now. It's still in the back of my mind, but I removed the reminder from the commit message.

0003:

The template now requires the value to be passed as a pointer. That was a pretty trivial change, but affected multiple other patches, so not sent separately. Also adds a forgotten RT_ prefix to the bitmap macros and adds a top comment to the *_impl.h headers. There are some comment fixes. The changes were either trivial or discussed earlier, so also not sent separately.

0004/5: I wanted to measure the load time as well as search time in bench_search_random_nodes(). That's kept separate to make it easier to test other patch versions.

The bad news is that the speed of loading TIDs in bench_seq/shuffle_search() has regressed noticeably. I can't reproduce this in any other bench function and was the reason for writing 0005 to begin with. More confusingly, my efforts to fix this improved *other* functions, but the former didn't budge at all. First the patches:

0006 adds and removes some "inline" declarations (where it made sense), and added some for "pg_noinline" based on Andres' advice some months ago.

0007 removes some dead code. RT_NODE_INSERT_INNER is only called during RT_SET_EXTEND, so it can't possibly find an existing key. This kind of change is much easier with the inner/node cases handled together in a template, as far as being sure of how those cases are different. I thought about trying the search in assert builds and verifying it doesn't exist, but thought yet another #ifdef would be too messy.

v25-addendum-try-no-maintain-order.txt -- It makes optional keeping the key chunks in order for the linear-search nodes. I believe the TID store no longer cares about the ordering, but this is a text file for now because I don't want to clutter the CI with a behavior change. Also, the second ART paper (on concurrency) mentioned that some locking schemes don't allow these arrays to be shifted. So it might make sense to give up entirely on guaranteeing ordered iteration, or at least make it optional as in the patch.

Now for some numbers:

========================================
psql -c "select * from bench_search_random_nodes(10*1000*1000)"
(min load time of three)

v15:
 mem_allocated | load_ms | search_ms
---------------+---------+-----------
     334182184 |    3352 |      2073

v25-0005:
 mem_allocated | load_ms | search_ms
---------------+---------+-----------
     331987008 |    3426 |      2126

v25-0006 (inlining or not):
 mem_allocated | load_ms | search_ms
---------------+---------+-----------
     331987008 |    3327 |      2035

v25-0007 (remove dead code):
 mem_allocated | load_ms | search_ms
---------------+---------+-----------
     331987008 |    3313 |      2037

v25-addendum...txt (no ordering):
 mem_allocated | load_ms | search_ms
---------------+---------+-----------
     331987008 |    2762 |      2042

Allowing unordered inserts helps a lot here in loading. That's expected because there are a lot of inserts into the linear nodes. 0006 might help a little.

========================================
psql -c "select avg(load_ms) from generate_series(1,30) x(x), lateral (select * from bench_load_random_int(500 * 1000 * (1+x-x))) a"

v15:
         avg          
----------------------
 207.3000000000000000

v25-0005:
         avg          
----------------------
 190.6000000000000000

v25-0006 (inlining or not):
         avg          
----------------------
 189.3333333333333333

v25-0007 (remove dead code):
         avg          
----------------------
 186.4666666666666667

v25-addendum...txt (no ordering):
         avg          
----------------------
 179.7000000000000000

Most of the improvement from v15 to v25 probably comes from the change from node4 to node3, and this test stresses that node the most. That shows in the total memory used: it goes from 152MB to 132MB. Allowing unordered inserts helps some, the others are not convincing.

========================================
psql -c "select rt_load_ms, rt_search_ms from bench_seq_search(0, 1 * 1000 * 1000)"
(min load time of three)

v15:
 rt_load_ms | rt_search_ms
------------+--------------
        113 |          455

v25-0005:
 rt_load_ms | rt_search_ms
------------+--------------
        135 |          456

v25-0006 (inlining or not):
 rt_load_ms | rt_search_ms
------------+--------------
        136 |          455

v25-0007 (remove dead code):
 rt_load_ms | rt_search_ms
------------+--------------
        135 |          455

v25-addendum...txt (no ordering):
 rt_load_ms | rt_search_ms
------------+--------------
        134 |          455

Note: The regression seems to have started in v17, which is the first with a full template.

Nothing so far has helped here, and previous experience has shown that trying to profile 100ms will not be useful. Instead of putting more effort into diving deeper, it seems a better use of time to write a benchmark that calls the tid store itself. That's more realistic, since this function was intended to test load and search of tids, but the tid store doesn't quite operate so simply anymore. What do you think, Masahiko?

I'm inclined to keep 0006, because it might give a slight boost, and 0007 because it's never a bad idea to remove dead code.
--
John Naylor
EDB: http://www.enterprisedb.com
Attachment

pgsql-hackers by date:

Previous
From: chandan kunal
Date:
Subject: Regarding TPCC benchmarking of postgresql for streaming
Next
From: Amit Kapila
Date:
Subject: Re: Time delayed LR (WAS Re: logical replication restrictions)