Re: [HACKERS] Block level parallel vacuum - Mailing list pgsql-hackers
From | Peter Geoghegan |
---|---|
Subject | Re: [HACKERS] Block level parallel vacuum |
Date | |
Msg-id | CAH2-WznCY7aQxw6_+1OmD-=b11YEAkqB+rwXcqhQQWVX7xwgPA@mail.gmail.com |
In response to | Re: [HACKERS] Block level parallel vacuum (Amit Kapila <amit.kapila16@gmail.com>) |
Responses | Re: [HACKERS] Block level parallel vacuum |
List | pgsql-hackers |
On Fri, Jan 17, 2020 at 1:18 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
> Thanks for doing this test again. In the attached patch, I have
> addressed all the comments and modified a few comments.

I am in favor of the general idea of parallel VACUUM that parallelizes the processing of each index (I haven't looked at the patch, though). I observed something during a recent benchmark of the deduplication patch that seems like it might be relevant to parallel VACUUM. This happened during a recreation of the original WARM benchmark, which is described here:

https://www.postgresql.org/message-id/CABOikdMNy6yowA%2BwTGK9RVd8iw%2BCzqHeQSGpW7Yka_4RSZ_LOQ%40mail.gmail.com

(There is an extra pgbench_accounts index on abalance, plus 4 indexes on large text columns with filler MD5 hashes, all of which are random.)

On the master branch, I can clearly observe that the "filler" MD5 indexes are bloated to a degree that is affected by the order of their original creation/pg_class OID order. These are all indexes that become bloated purely due to "version churn" -- or what I like to call "unnecessary" page splits. The keys used in each pgbench_accounts logical row never change, except in the case of the extra abalance index (the idea is to prevent all HOT updates without ever updating most indexed columns).

I noticed that pgb_a_filler1 is a bit less bloated than pgb_a_filler2, which is a little less bloated than pgb_a_filler3, which is a little less bloated than pgb_a_filler4. Even after 4 hours, and even though the "shape" of each index is identical.

This demonstrates an important general principle about vacuuming indexes: timeliness can matter a lot.
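To make the ordering effect concrete, here is a toy model (not taken from the patch or from the benchmark) of when cleanup would begin for each index. It assumes that serial processing handles indexes strictly one after another, and that a parallel scheme gives every index its own worker starting at the same moment. The per-index durations are invented for illustration only, and the function names are hypothetical.

```python
def serial_start_times(durations):
    """Start time of each index's cleanup when indexes are processed in order."""
    starts = []
    elapsed = 0.0
    for d in durations:
        # An index cannot start until every earlier index has finished.
        starts.append(elapsed)
        elapsed += d
    return starts


def parallel_start_times(durations):
    """Start time of each index's cleanup, assuming one worker per index."""
    # Assumption: all workers launch together, so nobody waits.
    return [0.0 for _ in durations]


if __name__ == "__main__":
    # Index names come from the benchmark; the durations (seconds) are made up.
    names = ["pgb_a_filler1", "pgb_a_filler2", "pgb_a_filler3", "pgb_a_filler4"]
    durations = [60.0, 60.0, 60.0, 60.0]
    serial = serial_start_times(durations)
    parallel = parallel_start_times(durations)
    for name, s, p in zip(names, serial, parallel):
        print(f"{name}: serial start {s:.0f}s, parallel start {p:.0f}s")
```

In this model the later an index sits in the processing order, the longer version churn can accumulate in it before cleanup begins, which is at least consistent with pgb_a_filler4 being the most bloated of the four.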
In general, a big benefit of the deduplication patch is that it "buys time" for VACUUM to run before "unnecessary" page splits can occur -- that is why the deduplication patch prevents *all* page splits in these "filler" indexes, whereas on the master branch the filler indexes are about 2x larger (the exact amount varies based on VACUUM processing order, at least earlier on).

For tables with several indexes, giving each index its own VACUUM worker process will prevent "unnecessary" page splits caused by version churn, simply because VACUUM will start to clean each index sooner than it would with serial processing (except for the "lucky" first index). There is no "lucky" first index that gets preferential treatment -- presumably VACUUM will start processing each index at the same time with this patch, making each index equally "lucky".

I think that there may even be a *complementary* effect with parallel VACUUM, though I haven't tested that theory. Deduplication "buys time" for VACUUM to run, while at the same time VACUUM takes less time to show up and prevent "unnecessary" page splits. My guess is that these two seemingly unrelated patches may actually address this "unnecessary page split" problem from two completely different angles, with an overall effect that is greater than the sum of its parts.

While the difference in size of each filler index on the master branch wasn't that significant on its own, it's still interesting. It's probably quite workload dependent.

--
Peter Geoghegan