Re: Parallel heap vacuum - Mailing list pgsql-hackers
From | John Naylor |
---|---|
Subject | Re: Parallel heap vacuum |
Date | |
Msg-id | CANWCAZb5CmC9iu3Z1sWBtMN6PZ5x-99=hQYeJodYSp6WQZ14SA@mail.gmail.com |
In response to | Re: Parallel heap vacuum (Masahiko Sawada <sawada.mshk@gmail.com>) |
List | pgsql-hackers |
On Tue, Feb 18, 2025 at 1:11 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
> Right. I'm inclined to support only the second heap pass as the first
> step. If we support parallelism only for the second pass, it cannot
> help speed up freezing the entire table in emergency situations, but
> it would be beneficial for cases where a big table have a large amount
> of spread garbage.

I started looking at the most recent patch set for this, and while
looking back over the thread, a couple of random thoughts occurred to me:

On Sat, Oct 26, 2024 at 2:26 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:

> - Patched
> parallel 0: 53468.734 (15990, 28296, 9465)
> parallel 1: 35712.613 ( 8254, 23569, 3700)

These measurements were done before phases I and III were stream-ified,
but even here those phases were already the quickest ones for a modest
number of indexes and a single worker. I think the gap would be even
bigger when only a small percentage of pages need scanning/vacuuming,
because index vacuuming still has to read the entire index. It's a bit
unfortunate that the simplest heap phase to parallelize is also the
quickest to begin with.

Pre-existing issue: We don't get anywhere near a 2x improvement in
phase II for 2 parallel index scans. We've known for a while that the
shared-memory TID store has more overhead than the private-memory one,
and here that overhead is about the same as the entirety of phase III
with a single worker! It may be time to look into mitigations there,
independently of this thread.

The same commit that made the parallel scanning patch more difficult
should also reduce the risk of having a large amount of freezing work
at once in the first place. (There are still plenty of systemic things
that can go wrong, of course, and it's always good if unpleasant work
finishes faster.)

I seem to recall a proposal from David Rowley to (something like) batch
gathering xids for visibility checks during executor scans, but if so I
can't find it in the archives. It's possible some similar work might
speed up heap scanning in a more localized fashion.

On Tue, Nov 12, 2024 at 1:16 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
> On Mon, Nov 11, 2024 at 5:08 AM Hayato Kuroda (Fujitsu)
> <kuroda.hayato@fujitsu.com> wrote:
> > I wanted to know whether TidStoreBeginIterateShared() was needed. IIUC, pre-existing API,
> > TidStoreBeginIterate(), has already accepted the shared TidStore. The only difference
> > is whether elog(ERROR) exists, but I wonder if it benefits others. Is there another
> > reason that lazy_vacuum_heap_rel() uses TidStoreBeginIterateShared()?
>
> TidStoreBeginIterateShared() is designed for multiple parallel workers
> to iterate a shared TidStore. During an iteration, parallel workers
> share the iteration state and iterate the underlying radix tree while
> taking appropriate locks. Therefore, it's available only for a shared
> TidStore. This is required to implement the parallel heap vacuum,
> where multiple parallel workers do the iteration on the shared
> TidStore.
>
> On the other hand, TidStoreBeginIterate() is designed for a single
> process to iterate a TidStore. It accepts even a shared TidStore as
> you mentioned, but during an iteration there is no inter-process
> coordination such as locking. When it comes to parallel vacuum,
> supporting TidStoreBeginIterate() on a shared TidStore is necessary to
> cover the case where we use only parallel index vacuum but not
> parallel heap scan/vacuum. In this case, we need to store dead tuple
> TIDs on the shared TidStore during heap scan so parallel workers can
> use it during index vacuum. But it's not necessary to use
> TidStoreBeginIterateShared() because only one (leader) process does
> heap vacuum.

The takeaway I got is that the word "shared" is used for two different
concepts. Future readers are not going to think to look back at this
email for an explanation. At the very least, there needs to be a
different word for the new concept.

There are quite a lot of additions to radix_tree.h and tidstore.c to
make it work this way, including a new "control object", new locks, and
new atomic variables. I'm not sure it's really necessary. Is it
possible to just store the "most recent block heap-vacuumed" in the
shared vacuum state? Then each backend would lock the TID store,
iterate starting at the next block, and unlock the store when it has
enough blocks. Sometimes my brainstorms are unworkable for some reason
I failed to think about, but this way seems 95% simpler -- we would
only need to teach the existing iteration machinery to take a "start
key".
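To make that last idea concrete, here is a minimal, self-contained C sketch of the claiming scheme, using plain pthreads and a sorted array of block numbers as stand-ins for the real shared vacuum state and shared TID store. Every name in it (the shared struct, first_index_at(), vacuum_worker(), BATCH_SIZE) is illustrative only and not taken from the patch set or the tree:

```c
/*
 * Toy model of the scheme above: each worker takes the lock, reads the
 * shared "first block not yet claimed", walks the sorted dead-block list
 * from that point (the "start key") until it has a small batch, publishes
 * the block after the last one claimed, then unlocks and does the per-page
 * work.  The sorted array stands in for the shared TID store; nothing here
 * is PostgreSQL code.
 */
#include <pthread.h>
#include <stdio.h>

#define NDEAD      1000     /* heap blocks with dead items */
#define BATCH_SIZE 16       /* blocks claimed per round */
#define NWORKERS   4

static unsigned dead_blocks[NDEAD];   /* sorted, like TID store iteration order */

static struct
{
    pthread_mutex_t lock;
    unsigned next_block;              /* first block not yet handed out */
} shared = {PTHREAD_MUTEX_INITIALIZER, 0};

/* Index of the first entry whose block number is >= start. */
static int
first_index_at(unsigned start)
{
    int lo = 0, hi = NDEAD;

    while (lo < hi)
    {
        int mid = (lo + hi) / 2;

        if (dead_blocks[mid] < start)
            lo = mid + 1;
        else
            hi = mid;
    }
    return lo;
}

static void *
vacuum_worker(void *arg)
{
    int worker_id = *(int *) arg;

    for (;;)
    {
        unsigned batch[BATCH_SIZE];
        int nclaimed = 0, idx;

        /* Claim a batch and advance the shared position under the lock. */
        pthread_mutex_lock(&shared.lock);
        idx = first_index_at(shared.next_block);
        while (idx < NDEAD && nclaimed < BATCH_SIZE)
            batch[nclaimed++] = dead_blocks[idx++];
        if (nclaimed > 0)
            shared.next_block = batch[nclaimed - 1] + 1;
        pthread_mutex_unlock(&shared.lock);

        if (nclaimed == 0)
            break;                    /* nothing left to vacuum */

        /* The per-page work (phase III equivalent) happens outside the lock. */
        printf("worker %d claimed %d blocks ending at %u\n",
               worker_id, nclaimed, batch[nclaimed - 1]);
    }
    return NULL;
}

int
main(void)
{
    pthread_t workers[NWORKERS];
    int ids[NWORKERS];

    for (int i = 0; i < NDEAD; i++)
        dead_blocks[i] = (unsigned) i * 3;   /* every third block is "dead" */

    for (int i = 0; i < NWORKERS; i++)
    {
        ids[i] = i;
        pthread_create(&workers[i], NULL, vacuum_worker, &ids[i]);
    }
    for (int i = 0; i < NWORKERS; i++)
        pthread_join(workers[i], NULL);
    return 0;
}
```

The lock is held only while a worker claims its batch and advances the shared position; the per-page work happens outside the lock, which seems to be the point of unlocking the store once a backend has enough blocks.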
--
John Naylor
Amazon Web Services