Re: Parallel heap vacuum - Mailing list pgsql-hackers

From John Naylor
Subject Re: Parallel heap vacuum
Msg-id CANWCAZb5CmC9iu3Z1sWBtMN6PZ5x-99=hQYeJodYSp6WQZ14SA@mail.gmail.com
In response to Re: Parallel heap vacuum  (Masahiko Sawada <sawada.mshk@gmail.com>)
List pgsql-hackers
On Tue, Feb 18, 2025 at 1:11 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
> Right. I'm inclined to support only the second heap pass as the first
> step. If we support parallelism only for the second pass, it cannot
> help speed up freezing the entire table in emergency situations, but
> it would be beneficial for cases where a big table has a large amount
> of spread garbage.

I started looking at the most recent patch set for this, and while
looking back over the thread, a couple random thoughts occurred to me:

On Sat, Oct 26, 2024 at 2:26 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> - Patched
> parallel 0: 53468.734 (15990, 28296, 9465)
> parallel 1: 35712.613 ( 8254, 23569, 3700)

These measurements were done before phases I and III were
stream-ified, but even here those phases were already the quickest
ones for a modest number of indexes and a single worker. I think the
difference would be even bigger when only a small percentage of pages
need scanning/vacuuming, because phase II still has to read the
entire index. It's a bit unfortunate that the simplest heap phase to
parallelize is also the quickest to begin with.

Pre-existing issue: We don't get anywhere near 2x improvement in phase
II for 2 parallel index scans. We've known for a while that the shared
memory TID store has more overhead than private memory, and here that
overhead is about the same as the entirety of phase III with a single
worker! It may be time to look into mitigations there, independently
of this thread.

The same commit that made the parallel scanning patch more difficult
should also reduce the risk of having a large amount of freezing work
at once in the first place. (There are still plenty of systemic things
that can go wrong, of course, and it's always good if unpleasant work
finishes faster.)

I seem to recall a proposal from David Rowley to (something like)
batch the gathering of xids for visibility checks during executor
scans, but if so, I can't find it in the archives. It's possible some
similar work might speed up heap scanning in a more localized fashion.

On Tue, Nov 12, 2024 at 1:16 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
> On Mon, Nov 11, 2024 at 5:08 AM Hayato Kuroda (Fujitsu)
> <kuroda.hayato@fujitsu.com> wrote:

> > I wanted to know whether TidStoreBeginIterateShared() was needed. IIUC, the pre-existing API,
> > TidStoreBeginIterate(), already accepts a shared TidStore. The only difference
> > is whether elog(ERROR) exists, but I wonder if it benefits others. Is there another
> > reason that lazy_vacuum_heap_rel() uses TidStoreBeginIterateShared()?
>
> TidStoreBeginIterateShared() is designed for multiple parallel workers
> to iterate a shared TidStore. During an iteration, parallel workers
> share the iteration state and iterate the underlying radix tree while
> taking appropriate locks. Therefore, it's available only for a shared
> TidStore. This is required to implement the parallel heap vacuum,
> where multiple parallel workers do the iteration on the shared
> TidStore.
>
> On the other hand, TidStoreBeginIterate() is designed for a single
> process to iterate a TidStore. It accepts even a shared TidStore as
> you mentioned, but during an iteration there is no inter-process
> coordination such as locking. When it comes to parallel vacuum,
> supporting TidStoreBeginIterate() on a shared TidStore is necessary to
> cover the case where we use only parallel index vacuum but not
> parallel heap scan/vacuum. In this case, we need to store dead tuple
> TIDs on the shared TidStore during heap scan so parallel workers can
> use it during index vacuum. But it's not necessary to use
> TidStoreBeginIterateShared() because only one (leader) process does
> heap vacuum.

The takeaway I got is that the word "shared" is used for two different
concepts. Future readers are not going to think to look back at this
email for explanation. At the least there needs to be a different word
for the new concept.
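
To make the distinction concrete, here's roughly how I read the two
modes -- just a sketch, not code from the patch. I'm assuming the
patch's TidStoreBeginIterateShared() plugs into the same
TidStoreIterateNext()/TidStoreEndIterate() loop as the existing API,
and the per-page work is elided:

#include "access/tidstore.h"

/* Parallel index vacuum only: one (leader) process walks the dead-TID
 * store, whether it lives in local or shared memory, with no
 * inter-process coordination. */
static void
leader_only_heap_vacuum(TidStore *dead_items)
{
    TidStoreIter *iter = TidStoreBeginIterate(dead_items);
    TidStoreIterResult *result;

    while ((result = TidStoreIterateNext(iter)) != NULL)
    {
        /* ... vacuum the heap page at result->blkno ... */
    }
    TidStoreEndIterate(iter);
}

/* Parallel heap vacuum: every worker walks the same shared store, and
 * the shared iteration state hands out blocks under the appropriate
 * locks (the patch's proposed API). */
static void
parallel_worker_heap_vacuum(TidStore *dead_items)
{
    TidStoreIter *iter = TidStoreBeginIterateShared(dead_items);
    TidStoreIterResult *result;

    while ((result = TidStoreIterateNext(iter)) != NULL)
    {
        /* ... vacuum the heap page at result->blkno ... */
    }
    TidStoreEndIterate(iter);
}

In both cases the store itself can be "shared", which is exactly the
naming collision I mean.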

There are quite a lot of additions to radix_tree.h and tidstore.c to
make it work this way, including a new "control object", new locks, and
new atomic variables. I'm not sure it's really necessary. Is it
possible to just store the "most recent block heap-vacuumed" in the
shared vacuum state? Then each backend would lock the TID store,
iterate starting at the next block, and unlock the store when it has
enough blocks. Sometimes my brainstorms are unworkable for some reason
I failed to think about, but this way seems 95% simpler -- we would
only need to teach the existing iteration machinery to take a "start
key".

--
John Naylor
Amazon Web Services


