Thread: Prolonged truncation phase during vacuum on toast table with repeated interruptions by lock waiters and a proposed POC patch
Prolonged truncation phase during vacuum on toast table with repeated interruptions by lock waiters and a proposed POC patch
From
Shayon Mukherjee
Date:
Hello hackers,
I'd like to discuss an issue we observed where the truncation phase of autovacuum can become significantly prolonged on TOAST tables of busy systems due to repeated interruptions by lock waiters, leading to extended `AccessExclusiveLock` contention and impacting overall database and application availability.
Recap of the production incident
We experienced this issue directly: a `VACUUM` on a TOAST table took approximately 5 hours. The following log shows the scale of pages eventually removed and the duration:
```
automatic vacuum of table "tines.pg_toast.pg_toast_21059": index scans: 1
pages: 26096516 removed, 1421195 remain, 0 skipped due to pins, 26090236 skipped frozen
tuples: 4575083 removed, 26717710 remain, 21498 are dead but not yet removable, oldest xmin: 173810717
index scan needed: 1151751 pages from table (4.19% of total) had 4629490 dead item identifiers removed
index "pg_toast_21059_index": pages: 303420 in total, 14038 newly deleted, 298668 currently deleted, 298418 reusable
I/O timings: read: 18428758.270 ms, write: 0.000 ms
avg read rate: 10.730 MB/s, avg write rate: 0.000 MB/s
buffer usage: 5132686 hits, 25714763 misses, 0 dirtied
WAL usage: 0 records, 0 full page images, 0 bytes
system usage: CPU: user: 327.79 s, system: 251.11 s, elapsed: 18722.21 s
```
During this 5-hour window, read queries against primary and replica databases (on AWS Aurora) were frequently failing due to lock timeouts (lock_timeout was set to 1s), consistent with the `AccessExclusiveLock` being held and released repeatedly by the truncation part of the autovacuum process.
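As a side note for anyone watching this live, the lock queue is visible in the catalogs. A query along these lines (using the TOAST relation from the log above; this is ordinary catalog querying, not something from the patch) shows the autovacuum worker holding the `AccessExclusiveLock` and the sessions queued behind it:
```
-- Who holds or waits for locks on the TOAST table being truncated.
-- The relation name is taken from the log above; substitute your own.
SELECT l.pid,
       l.mode,
       l.granted,
       a.backend_type,
       a.wait_event_type,
       a.wait_event,
       left(a.query, 60) AS query
FROM pg_locks l
JOIN pg_stat_activity a ON a.pid = l.pid
WHERE l.relation = 'pg_toast.pg_toast_21059'::regclass
ORDER BY l.granted DESC, l.pid;
```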
This led me to dig deeper into the problem and the order of events, which brings me to:
Understanding the problem
My understanding is that the PostgreSQL autovacuum process, specifically its lazy variant (`lazy_vacuum_heap`), performs a few passes to reclaim space. One critical phase, if deemed beneficial, is the truncation of empty pages from the end of a relation, handled by the `lazy_truncate_heap` function.
As I traced down the code, the interaction occurs as follows:
- To physically truncate a relation, `VACUUM` must acquire an `AccessExclusiveLock`.
- After acquiring the `AccessExclusiveLock`, `VACUUM` performs a verification step by scanning the relation backwards from its apparent end (`count_nondeletable_pages`). This is necessary because other transactions might have added data to these trailing pages.
- While holding the `AccessExclusiveLock` and performing this backward scan, `VACUUM` periodically checks if any other database sessions are waiting for locks on the same relation using `LockHasWaitersRelation(vacrel->rel, AccessExclusiveLock)`. This check occurs at intervals defined by `VACUUM_TRUNCATE_LOCK_CHECK_INTERVAL` (20ms) and roughly every 32 blocks scanned.
- Interruption and Retry Mechanism:
- If `LockHasWaitersRelation` returns `true` during the backward scan, `count_nondeletable_pages` sets a flag (`lock_waiter_detected = true`) and returns, effectively "suspending" its scan to allow waiting sessions to proceed.
- Back in `lazy_truncate_heap`, if `lock_waiter_detected` is true, `VACUUM` releases the `AccessExclusiveLock`.
- `lazy_truncate_heap` contains a `do...while` loop. If `lock_waiter_detected` was true AND there are still potentially truncatable pages (`new_rel_pages > vacrel->nonempty_pages`), this loop causes `VACUUM` to retry the entire truncation sequence: attempt to re-acquire the `AccessExclusiveLock` (with its own timeout of `VACUUM_TRUNCATE_LOCK_TIMEOUT`, default 5s), restart the `count_nondeletable_pages` backward scan, and so on.
- Next comes the "stuck in a loop" problem, and this is where things get interesting:
- On a very busy system with constant concurrent access, `VACUUM` can repeatedly acquire the `AccessExclusiveLock`, start the backward scan, detect waiters, release the lock, and retry. Each retry of `count_nondeletable_pages` involves re-reading potentially many blocks from disk while holding the exclusive lock, which aggravates the lock contention.
- This cycle leads to prolonged `VACUUM` operations, significant repeated I/O, and extended lock contention, reducing database availability and causing query timeouts for other sessions (e.g., `canceling statement due to lock timeout`, `LockNotAvailable`). This is particularly acute for the TOAST tables in my case, which sadly happen to be very large (for now). (A query to spot a vacuum stuck in this state is sketched below.)
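For what it's worth, while a vacuum is going around this loop it should be reporting the `truncating heap` phase in `pg_stat_progress_vacuum`, so something like the following (standard monitoring, nothing patch-specific) shows how long such a vacuum has been running:
```
-- Vacuums currently in the truncation phase and how long they have been running.
SELECT p.pid,
       p.relid::regclass AS relation,
       p.phase,
       now() - a.xact_start AS running_for
FROM pg_stat_progress_vacuum p
JOIN pg_stat_activity a ON a.pid = p.pid
WHERE p.phase = 'truncating heap';
```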
Testing the theory and patch
To validate this understanding, I was able to reproduce a similar scenario by loading a test table with a significant amount of TOAST-able data and then running a busy read workload against it concurrently with a `VACUUM` operation. I have attached a SQL script to load the data and a Ruby script to simulate the concurrent workload, if anyone is curious. The tests demonstrated how the truncation phase could indeed enter extended periods of repeatedly attempting to acquire the `AccessExclusiveLock` and performing the backward scan, only to be interrupted by the simulated concurrent workload.
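For readers who don't want to open the attachments, the shape of the setup is roughly the sketch below. This is an illustrative approximation of what I described above, not the attached load_data.sql itself, and the row counts are arbitrary:
```
-- Rough sketch of the repro setup (not the attached script).
CREATE TABLE toast_target_table (id bigserial PRIMARY KEY, payload text);

-- Insert rows whose payload is large enough to be pushed out-of-line
-- into the table's TOAST relation (~9 kB of hex per row here).
INSERT INTO toast_target_table (payload)
SELECT string_agg(md5(i::text || '-' || j::text), '')
FROM generate_series(1, 100000) AS s(i)
CROSS JOIN generate_series(1, 300) AS c(j)
GROUP BY i;

-- Delete the tail of the table so a later VACUUM finds a long run of
-- empty pages at the end of the heap and its TOAST table to truncate.
DELETE FROM toast_target_table WHERE id > 1000;

-- Run this while a concurrent read workload (e.g. the Ruby script)
-- keeps requesting locks on the table.
VACUUM (VERBOSE) toast_target_table;
```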
Considering the potential for the truncation phase to cause prolonged lock contention under these conditions, I am wondering: would it be beneficial to explicitly cap the number of retries within the `do...while` loop in `lazy_truncate_heap` when interruptions by lock waiters occur?
Because I was already too deep in the code, I decided to write a small patch to introduce a limit to the retries and observe the behavior. Specifically, it adds a counter `truncate_interruption_retry_count` and checks it against a new constant `VACUUM_TRUNCATE_INTERRUPTION_MAX_RETRIES` (which for now I have set to 3). If the backward scan in `count_nondeletable_pages` is interrupted by lock waiters this many times, `VACUUM` will cease further truncation attempts for the current `VACUUM` cycle on that table.
I was able to perform the test on a similar setup and the results are as follows (notice the `stopping truncate after 3 retries due to repeated conflicting lock requests` messages coming from the patch):
```
VACUUM (VERBOSE) toast_target_table;
INFO: vacuuming "postgres.public.toast_target_table"
INFO: table "toast_target_table": suspending truncate due to conflicting lock request
INFO: table "toast_target_table": truncated 144032 to 143744 pages
INFO: table "toast_target_table": suspending truncate due to conflicting lock request
INFO: table "toast_target_table": truncated 143744 to 143456 pages
INFO: table "toast_target_table": suspending truncate due to conflicting lock request
INFO: table "toast_target_table": truncated 143456 to 143168 pages
INFO: table "toast_target_table": stopping truncate after 3 retries due to repeated conflicting lock requests
INFO: finished vacuuming "postgres.public.toast_target_table": index scans: 0
pages: 864 removed, 143168 remain, 1 scanned (0.00% of total), 0 eagerly scanned
tuples: 0 removed, 7544 remain, 0 are dead but not yet removable
removable cutoff: 2985896, which was 1 XIDs old when operation ended
frozen: 0 pages from table (0.00% of total) had 0 tuples frozen
visibility map: 0 pages set all-visible, 0 pages set all-frozen (0 were all-visible)
index scan not needed: 0 pages from table (0.00% of total) had 0 dead item identifiers removed
index "toast_target_table_pkey": pages: 28132 in total, 0 newly deleted, 27646 currently deleted, 27646 reusable
avg read rate: 13.423 MB/s, avg write rate: 0.006 MB/s
buffer usage: 28172 hits, 28571 reads, 12 dirtied
WAL usage: 22 records, 15 full page images, 120289 bytes, 0 buffers full
system usage: CPU: user: 0.13 s, system: 3.55 s, elapsed: 16.62 s
INFO: vacuuming "postgres.pg_toast.pg_toast_649469"
INFO: table "pg_toast_649469": suspending truncate due to conflicting lock request
INFO: table "pg_toast_649469": truncated 2553888 to 2552416 pages
INFO: table "pg_toast_649469": suspending truncate due to conflicting lock request
INFO: table "pg_toast_649469": truncated 2552416 to 2550560 pages
INFO: table "pg_toast_649469": suspending truncate due to conflicting lock request
INFO: table "pg_toast_649469": truncated 2550560 to 2548832 pages
INFO: table "pg_toast_649469": stopping truncate after 3 retries due to repeated conflicting lock requests
INFO: finished vacuuming "postgres.pg_toast.pg_toast_649469": index scans: 0
pages: 5056 removed, 2548832 remain, 1 scanned (0.00% of total), 0 eagerly scanned
tuples: 0 removed, 14008 remain, 0 are dead but not yet removable
removable cutoff: 2985897, which was 1 XIDs old when operation ended
new relfrozenxid: 2985897, which is 2 XIDs ahead of previous value
frozen: 0 pages from table (0.00% of total) had 0 tuples frozen
visibility map: 0 pages set all-visible, 0 pages set all-frozen (0 were all-visible)
index scan not needed: 0 pages from table (0.00% of total) had 0 dead item identifiers removed
index "pg_toast_649469_index": pages: 56259 in total, 0 newly deleted, 55293 currently deleted, 55293 reusable
avg read rate: 5.067 MB/s, avg write rate: 0.002 MB/s
buffer usage: 56371 hits, 61075 reads, 19 dirtied
WAL usage: 28 records, 20 full page images, 165414 bytes, 0 buffers full
system usage: CPU: user: 1.67 s, system: 59.10 s, elapsed: 94.16 s
```
All that said, I wanted to bring some visibility into this part of the problem and also learn from the community about potential ways to solve it. One option is capping the number of retries, as in the attached patch. I feel something like this is beneficial because it prioritizes overall system availability by preventing the lazy truncate phase from repeatedly acquiring the `AccessExclusiveLock` and holding it for extended periods of time.
The patch is definitely more of a proof of concept for now (it lacks tests and maybe some configurable parameters), and more importantly I am curious to hear from the community whether this feels like a worthwhile change or approach (assuming my understanding of the problem makes sense), or whether there are other ways to solve this problem.
Thank you
Shayon Mukherjee
Attachment
Re: Prolonged truncation phase during vacuum on toast table with repeated interruptions by lock waiters and a proposed POC patch
From
Shayon Mukherjee
Date:
Hello,
Attached are the updated files to help reproduce the state mentioned in the previous email more easily.
- The load_data.sql has the commands to insert a lot of dummy data that will overflow to TOAST
- That data is then deleted (via commands in the same SQL file) to create a situation where a truncation will happen during a future vacuum
- Updated the Ruby script.rb to make simple select calls with high concurrency
- Renamed the POC to be `v1`
Thank you
Shayon
Attachment
Re: Prolonged truncation phase during vacuum on toast table with repeated interruptions by lock waiters and a proposed POC patch
From
Robert Treat
Date:
On Thu, May 8, 2025 at 12:10 PM Sami Imseih <samimseih@gmail.com> wrote:
> I think it's better to simply disable the truncate work and perform it
> at a later time than introduce some new limit to how many times the
> truncation can be suspended. In the type of workload you are referring
> to, it is likely that the truncation will end up not completing, so why
> even try at all?

Yeah, I pretty much had the same thought; if you bail out after some kind of cap, there will be workloads that will never complete the truncate heap phase, in which case you should probably be using truncate off. But the one thing I was a bit sympathetic on was how to actually determine you are in this situation, or how bad the situation was, in order to know that setting truncate off would help? To that end, I wonder if it'd be worth adding the counter and printing that information (and maybe elapsed time in this phase?) in vacuum verbose output?

Robert Treat
https://xzilla.net
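For reference, the "truncate off" setting mentioned above corresponds to these existing knobs (the table name below is just a placeholder):
```
-- Disable the truncation phase for a table and its TOAST table via the
-- vacuum_truncate storage parameters (available since PostgreSQL 12):
ALTER TABLE some_table SET (vacuum_truncate = off);
ALTER TABLE some_table SET (toast.vacuum_truncate = off);

-- Or skip truncation for a single manual run:
VACUUM (TRUNCATE FALSE, VERBOSE) some_table;
```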
Re: Prolonged truncation phase during vacuum on toast table with repeated interruptions by lock waiters and a proposed POC patch
From
Sami Imseih
Date:
> actually determine you are in this situation, or how bad the situation
> was, in order to know that setting truncate off would help? To that

Adding counters for this area is not a bad idea in general, as this hits customers particularly hard on hot standbys when the truncate does actually occur on the primary.

What about adding cumulative counters (per table and pg_stat_database) such as:

pages_vac_truncated - # of pages truncated by vacuum
vac_truncate_conflicts - # of times truncate was skipped due to conflict
vac_truncate_suspended - # of times the truncate was suspended

The difference between conflict and suspended is that a conflict causes vacuum to skip the truncate phase, while suspended causes vacuum to retry the phase continually.

The argument against adding these counters is that the views pg_stat_all_tables / pg_stat_all_database are becoming super-wide, so maybe we need to think about inventing a new view for vacuum-related counter metrics. This seems like a good discussion for v19.

--
Sami
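For context on the gap these proposed counters would fill, the per-table vacuum accounting that exists today looks roughly like this (none of it covers truncation activity):
```
-- Existing per-table vacuum counters; nothing here records truncation
-- attempts, conflicts, or suspensions.
SELECT schemaname,
       relname,
       last_vacuum,
       last_autovacuum,
       vacuum_count,
       autovacuum_count,
       n_dead_tup
FROM pg_stat_all_tables
WHERE schemaname = 'pg_toast'
ORDER BY n_dead_tup DESC
LIMIT 10;
```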
Re: Prolonged truncation phase during vacuum on toast table with repeated interruptions by lock waiters and a proposed POC patch
From
Shayon Mukherjee
Date:
Thank you for engaging in the conversation and sharing your thoughts, Sami and Robert.
On Thu, May 8, 2025 at 2:57 PM Sami Imseih <samimseih@gmail.com> wrote:
> actually determine you are in this situation, or how bad the situation
> was, in order to know that setting truncate off would help? To that
That's definitely a sound idea, and the way I was able to replicate and learn this myself was by writing a patch that simply printed the number of attempts it was making. That said, just to add more color: I was coming at this from the point of view that truncation would be on a "best effort" basis, and the trade-off would be less availability risk at the cost of missed truncations. My first instinct was that having vacuum_truncate set to ON by default on busy systems with a lot of churn can catch a lot of developers (and PostgreSQL) by surprise (it sure did me :D), so a best-effort, "fail fast" behavior could help. At the same time, I agree that it doesn't solve either problem really well. Another idea I had was to expose an attribute/GUC/setting like VACUUM_TRUNCATE_INTERRUPTION_MAX_RETRIES with a default of `1` or so that users could optionally configure; perhaps that could be a middle ground, but I am not fully sold on that idea either.
> Adding counters for this area is not a bad idea in general, as this hits
> customers particularly hard on hot standbys when the truncate does actually
> occur on the primary.
>
> What about adding cumulative counters (per table and pg_stat_database)
> such as:
>
> pages_vac_truncated - # of pages truncated by vacuum
> vac_truncate_conflicts - # of times truncate was skipped due to conflict
> vac_truncate_suspended - # of times the truncate was suspended
>
> The difference between conflict and suspended is that a conflict causes
> vacuum to skip the truncate phase, while suspended causes vacuum to retry
> the phase continually.
Would vac_truncate_retries or something similar be more immediately clear? If so, maybe something like:
- vac_truncate_interruptions - # of times truncate was skipped due to conflict and not retried
- vac_truncate_retries - # of times the truncate was suspended and retried
> The argument against adding these counters is that the views
> pg_stat_all_tables / pg_stat_all_database are becoming super-wide, so
> maybe we need to think about inventing a new view for vacuum-related
> counter metrics. This seems like a good discussion for v19.
+1 for this. I'd even be happy to just start showing the attempts as part of INFO or VERBOSE output on conflicts and suspensions, without keeping track of them cumulatively, though it is useful statistical data.
Thanks
Shayon