Re: POC: Parallel processing of indexes in autovacuum - Mailing list pgsql-hackers
| From | SATYANARAYANA NARLAPURAM |
|---|---|
| Subject | Re: POC: Parallel processing of indexes in autovacuum |
| Date | |
| Msg-id | CAHg+QDehxaJEd1Yp1MpW8UO71xmbasy7t2GZGvqOYwkr0md8DQ@mail.gmail.com |
| In response to | Re: POC: Parallel processing of indexes in autovacuum (Daniil Davydov <3danissimo@gmail.com>) |
| Responses | Re: POC: Parallel processing of indexes in autovacuum |
| List | pgsql-hackers |
Hi
On Sat, Mar 28, 2026 at 4:11 AM Daniil Davydov <3danissimo@gmail.com> wrote:
Hi,
On Thu, Mar 26, 2026 at 5:43 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
> On Wed, Mar 25, 2026 at 12:45 AM Daniil Davydov <3danissimo@gmail.com> wrote:
> >
> > Searching for arguments in
> > favor of opt-in style, I asked for help from another person who has been
> > managing the setup of highload systems for decades. He promised to share his
> > opinion next week.
>
> Given that we have one and half weeks before the feature freeze, I
> think it's better to complete the project first before waiting for
> his/her comments next week. Even if we finish this feature with the
> opt-out style, we can hear more opinions on it and change the default
> behavior as the change would be trivial. What do you think?
>
Sure, if we can change the default value after the feature freeze, I don't
mind leaving our parameter in opt-out style for now.
> I've squashed all patches except for the documentation patch as I
> assume you're working on it. The attached fixup patch contains several
> changes: using opt-out style, comment improvements, and fixing typos
> etc.
>
Thank you very much for the proposed fixes!
I like the way you have changed the nparallel_workers calculation (autovacuum.c).
Forcing parallel workers to always read the shared cost params the first time
is a good decision. All comment changes also LGTM.
The only place that I have changed is reloptions.c:
As you have explained, it is not appropriate to use the "overrides" wording
in the reloption's description, so I decided to revert to the old one.
On Fri, Mar 27, 2026 at 10:54 AM Bharath Rupireddy
<bharath.rupireddyforpostgres@gmail.com> wrote:
>
> Hi,
>
> On Wed, Mar 25, 2026 at 3:43 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> >
> > Given that we have one and half weeks before the feature freeze, I
> > think it's better to complete the project first before waiting for
> > his/her comments next week. Even if we finish this feature with the
> > opt-out style, we can hear more opinions on it and change the default
> > behavior as the change would be trivial. What do you think?
> >
> > I've squashed all patches except for the documentation patch as I
> > assume you're working on it. The attached fixup patch contains several
> > changes: using opt-out style, comment improvements, and fixing typos
> > etc.
>
> +1 for enabling this feature by default. When enough CPU is available,
> vacuuming multiple indexes of a table in parallel in autovacuum
> definitely speeds things up.
Yes, for sure. But I have concerns that enabling parallel a/v for everyone
will cause a shortage of parallel workers while processing the largest
tables.
> Thank you for sending the latest patches. I quickly reviewed the v31
> patches. Here are some comments.
>
> 1/ + {"autovacuum_parallel_workers", RELOPT_TYPE_INT,
>
> I haven't looked at the whole thread, but do we all think we need this
> as a relopt? IMHO, we can wait for field experience and introduce this
> later.
I think that we should keep both the reloption and the config parameter.
Getting rid of the reloption would greatly reduce users' ability to tune
this feature. I'm afraid that this may lead to people not using parallel
autovacuum.
> I'm having a hard time finding a use-case where one wants to
> disable the indexes at the table level. If there was already an
> agreement, I agree to commit to that decision.
You can read the discussion from [1] up to the current message in order to
dive into the question.
To make a long story short, I think that the most common use case for this
feature is allowing parallel a/v for 2-3 tables, each of which has ~100
indexes. The rest of the tables do not require parallel processing (at least
it's a much lower priority for them).
At the same time, Masahiko-san thinks that only the system should decide which
tables will be processed in parallel. The system's decision should be based on
the number of indexes and a few other config parameters (e.g.
min_parallel_index_scan_size). Thus, potentially many tables could be
processed in parallel.
(Both opinions are pretty simplified.)
>
> 2/ + /*
> + * If 'true' then we are running parallel autovacuum. Otherwise, we are
> + * running parallel maintenance VACUUM.
> + */
> + bool is_autovacuum;
> +
>
> The variable name looks a bit confusing. How about we rely on
> AmAutoVacuumWorkerProcess() and avoid the bool in shared memory?
This variable is needed for parallel workers, which are taken from the
bgworkers pool, i.e. AmAutovacuumWorker() will return 'false' for them.
We need the "is_autovacuum" variable in order to understand exactly what the
process was started for (VACUUM PARALLEL or parallel autovacuum).
Thanks everyone for the review!
Please see an updated set of patches.
As I promised, I created a dedicated chapter for the parallel vacuum
description. Both maintenance VACUUM and autovacuum now refer to this chapter.
I am pretty inexperienced in documentation writing, so forgive me if
something is out of style.
[1] https://www.postgresql.org/message-id/CAJDiXggH1bW%3D4n%2B55CGLvs_sRU4SYNXwYLZ37wvJ5H_3yURSPw%40mail.gmail.com
Thank you for working on this very useful feature. Sharing a few thoughts:
1. Shouldn't we also cap by max_parallel_workers in parallel_vacuum_compute_workers, to avoid wasting DSM resources?
2. Is it intentional that other autovacuum workers do not yield cost limits to the parallel autovacuum workers? Cost limits are first distributed equally among the autovacuum workers,
and then each worker's share is split with its parallel workers. Therefore, the parallel workers will be heavily throttled. IIUC, this problem doesn't exist with manual vacuum.
If we don't fix this, we should at least document it.
3. Additionally, is there a point where, based on the cost limits, launching additional workers becomes counterproductive compared to running fewer workers, and should we prevent it?
4. Would it make sense to add a table-level override to disable parallelism or set the parallel worker count?
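For what it's worth, the throttling concern in point 2 can be sketched with a back-of-the-envelope model (this is an illustrative calculation only, not the actual PostgreSQL cost-accounting code):

```python
def per_process_limit(cost_limit, n_av_workers, n_parallel):
    # The global cost limit is first split evenly among the active
    # autovacuum workers ...
    leader_share = cost_limit / n_av_workers
    # ... and the leader's share is then split again among the leader
    # and its parallel workers.
    return leader_share / (n_parallel + 1)

# cost_limit=200 with 3 autovacuum workers and 7 parallel workers leaves
# each process a budget of 200 / 3 / 8, i.e. roughly 8.3.
print(per_process_limit(200, 3, 7))
```

Under this model, a manual parallel VACUUM with the same settings would skip the per-autovacuum-worker split and leave each process roughly three times that budget.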
I ran some perf tests to show the improvements with parallel vacuum; results are shared below.
System Configuration
--------------------
Hardware:
CPU: 16 cores
RAM: 128 GB
Storage: NVMe SSDs
OS: Ubuntu Linux
Workload Description
--------------------
Table: avtest
- 5,000,000 rows
- 9 columns: id (bigint PK), col1-col5 (int), col6 (text), col7 (timestamp),
padding (text, 50 bytes)
- 8 indexes:
avtest_pkey (col: id) 107 MB
idx_av_col7 (col: col7) 107 MB
idx_av_col2 (col: col2) 56 MB
idx_av_col4 (col: col4) 56 MB
idx_av_col5 (col: col5) 56 MB
idx_av_col1 (col: col1) 56 MB
idx_av_col3 (col: col3) 56 MB
idx_av_col6 (col: col6) 35 MB
- Total size: 1171 MB
Each test iteration:
1. Delete 2,000,000 rows (40%) using: DELETE WHERE id % 5 IN (1, 2)
2. CHECKPOINT to flush dirty pages
3. Trigger autovacuum by setting autovacuum_vacuum_threshold = 100 and
autovacuum_vacuum_scale_factor = 0 on the table
4. Wait for autovacuum to complete (detected via server log)
5. Re-insert the deleted rows and VACUUM to restore the table for the next run
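As a side note, the DELETE predicate removes exactly two of every five ids, which is where the 40% figure comes from; a quick check:

```python
# id runs 1..5,000,000; id % 5 IN (1, 2) matches 2 out of every 5 rows.
n_rows = 5_000_000
deleted = sum(1 for i in range(1, n_rows + 1) if i % 5 in (1, 2))
print(deleted, deleted / n_rows)  # 2000000 0.4
```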
Test Methodology
----------------
Worker configurations tested: 0, 2, 4, 7 parallel workers
(7 is the maximum: nindexes - 1, since the leader always handles one index)
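The worker-count ceiling above can be sketched as follows (an illustrative model only; the real logic lives in parallel_vacuum_compute_workers and, per point 1 earlier, may also need a max_parallel_workers cap):

```python
def compute_workers(requested, nindexes, max_parallel_workers):
    # Never more than nindexes - 1 (the leader always handles one index),
    # never more than the requested number, and -- as suggested above --
    # never more than the max_parallel_workers budget.
    return max(0, min(requested, nindexes - 1, max_parallel_workers))

print(compute_workers(7, 8, 16))  # 7, the nindexes - 1 ceiling for 8 indexes
```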
Two experiments were run with different cost-based vacuum delay settings:
Experiment A: cost_limit=200, cost_delay=2ms
Experiment B: cost_limit=60, cost_delay=2ms
Common server settings for both experiments:
shared_buffers = 120 GB (entire dataset fits in shared buffers)
maintenance_work_mem = 1 GB
max_wal_size = 100 GB (prevents checkpoints during vacuum)
min_wal_size = 10 GB
checkpoint_timeout = 1 hour (prevents time-based checkpoints)
wal_buffers = 128 MB
max_parallel_workers = 16
max_worker_processes = 24
autovacuum_naptime = 1s
Between every single run:
1. PostgreSQL server is fully stopped (pg_ctl stop -m fast)
2. OS page cache is dropped (echo 3 > /proc/sys/vm/drop_caches)
3. Server is restarted with a clean log file
4. After DELETE and CHECKPOINT, the server is stopped again, OS caches
dropped again, and the server restarted -- so vacuum starts fully cold
5. The autovacuum_max_parallel_workers GUC is reloaded via pg_ctl reload
Each configuration was tested for 5 iterations.
Timing is extracted from the PostgreSQL server log "system usage" line that
autovacuum emits at completion. This reports elapsed wall-clock time and CPU
time for the autovacuum worker leader process.
Results: Experiment A (cost_limit=200, cost_delay=2ms)
------------------------------------------------------
Workers Iter1 Iter2 Iter3 Iter4 Iter5 Avg(s) Speedup
------- ------ ------ ------ ------ ------ ------ -------
0 66.21 79.11 66.27 77.11 66.30 71.00 1.00x
2 66.55 53.27 52.66 55.74 55.71 56.78 1.25x
4 51.50 51.74 65.07 52.06 70.25 58.12 1.22x
7 50.05 50.35 50.04 50.12 50.07 50.12 1.41x
CPU usage (leader process only):
Workers Avg CPU user Avg CPU sys
------- ----------- ----------
0 3.04s 1.70s
2 1.24s 1.50s
4 0.78s 1.49s
7 0.79s 1.48s
Results: Experiment B (cost_limit=60, cost_delay=2ms)
-----------------------------------------------------
Workers Iter1 Iter2 Iter3 Iter4 Iter5 Avg(s) Speedup
------- ------ ------ ------ ------ ------ ------ -------
0 199.00 195.26 191.44 191.90 191.67 193.85 1.00x
2 160.68 181.33 176.85 167.84 159.47 169.23 1.14x
4 154.02 165.02 174.33 164.16 156.53 162.81 1.19x
7 148.49 158.68 160.66 154.37 149.20 154.28 1.25x
CPU usage (leader process only):
Workers Avg CPU user Avg CPU sys
------- ----------- ----------
0 3.06s 1.90s
2 1.28s 1.72s
4 0.80s 1.69s
7 0.82s 1.68s
Observations:
1. Parallel autovacuum provides consistent speedup. With cost_limit=200 and
7 workers, vacuum completes 1.41x faster (71s -> 50s). With cost_limit=60,
the speedup is 1.25x (194s -> 154s).
2. The benefit comes from parallelizing index vacuuming. With 8 indexes totaling
   ~530 MB, parallel workers scan indexes concurrently instead of the leader
   scanning them one by one. The leader's CPU user time drops from ~3s to
   ~0.8s as index work is offloaded.
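The averages and speedups in the Experiment A table can be re-derived directly from the per-iteration timings; a quick consistency check:

```python
# Per-iteration wall-clock times (seconds) from Experiment A above.
exp_a = {
    0: [66.21, 79.11, 66.27, 77.11, 66.30],
    2: [66.55, 53.27, 52.66, 55.74, 55.71],
    4: [51.50, 51.74, 65.07, 52.06, 70.25],
    7: [50.05, 50.35, 50.04, 50.12, 50.07],
}
avg = {w: sum(t) / len(t) for w, t in exp_a.items()}
speedup = {w: avg[0] / avg[w] for w in avg}
for w in sorted(exp_a):
    print(f"{w} workers: avg={avg[w]:.2f}s, speedup={speedup[w]:.2f}x")
```

Last-digit differences against the table are just rounding vs truncation. The 7-worker run is also notably stable (all five iterations within ~0.3s of each other), which matches the table.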
Thanks,
Satya