Re: Prolonged truncation phase during vacuum on toast table with repeated interruptions by lock waiters and a proposed POC patch - Mailing list pgsql-hackers

From Shayon Mukherjee
Subject Re: Prolonged truncation phase during vacuum on toast table with repeated interruptions by lock waiters and a proposed POC patch
Date
Msg-id CANqtF-rSHv4z9N5PczB7hsmZAO3=J3u7KPTc+-BWe4UUP5p=5w@mail.gmail.com
In response to Prolonged truncation phase during vacuum on toast table with repeated interruptions by lock waiters and a proposed POC patch  (Shayon Mukherjee <shayonj@gmail.com>)
List pgsql-hackers
Hello,

Attached are the updated files to help reproduce the state mentioned in the previous email more easily. 

- The load_data.sql file has the commands to insert a lot of dummy data that will overflow into TOAST
  - That data is then deleted (via the commands in the same SQL file) to create a situation where a truncate will happen during a future vacuum
- Updated the Ruby script.rb to make simple SELECT calls with high concurrency
- Renamed the POC patch to `v1`

Thank you
Shayon 

On Tue, May 6, 2025 at 6:05 PM Shayon Mukherjee <shayonj@gmail.com> wrote:
Hello hackers,

I'd like to discuss an issue we observed where the truncation phase of autovacuum can become significantly prolonged on toast tables of busy systems due to repeated interruptions by lock waiters, leading to extended `AccessExclusiveLock` contention and impacting overall database & application availability.

Recap of the production incident

We experienced this issue directly where a `VACUUM` on a TOAST table took approximately 5 hours. The following shows the scale of pages eventually removed and the duration:

```
automatic vacuum of table "tines.pg_toast.pg_toast_21059": index scans: 1
pages: 26096516 removed, 1421195 remain, 0 skipped due to pins, 26090236 skipped frozen
tuples: 4575083 removed, 26717710 remain, 21498 are dead but not yet removable, oldest xmin: 173810717
index scan needed: 1151751 pages from table (4.19% of total) had 4629490 dead item identifiers removed
index "pg_toast_21059_index": pages: 303420 in total, 14038 newly deleted, 298668 currently deleted, 298418 reusable
I/O timings: read: 18428758.270 ms, write: 0.000 ms
avg read rate: 10.730 MB/s, avg write rate: 0.000 MB/s
buffer usage: 5132686 hits, 25714763 misses, 0 dirtied
WAL usage: 0 records, 0 full page images, 0 bytes
system usage: CPU: user: 327.79 s, system: 251.11 s, elapsed: 18722.21 s
```

During this 5-hour window, read queries against primary and replica databases (on AWS Aurora) were frequently failing due to lock timeouts (lock_timeout was set to 1s), consistent with the `AccessExclusiveLock` being held and released repeatedly by the truncation part of the autovacuum process.

This led me to dig deeper into the problem and the order of events, which brings me to:

Understanding the problem
My understanding is that the PostgreSQL autovacuum process, specifically its lazy variant (`lazy_vacuum_heap`), performs several passes to reclaim space. One critical phase, if deemed beneficial, is the truncation of empty pages from the end of the relation, handled by the `lazy_truncate_heap` function.
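As a side note, whether truncation is "deemed beneficial" comes down to a simple page-count heuristic. Here is a standalone sketch of that heuristic, modeled on `should_attempt_truncation()` in vacuumlazy.c; treat the constant values as an assumption rather than an authoritative copy of the source:

```c
#include <stdbool.h>

/* Sketch of the heuristic behind should_attempt_truncation() in
 * src/backend/access/heap/vacuumlazy.c.  The constants below are
 * assumptions modeled on the source, not a verbatim copy. */
#define REL_TRUNCATE_MINIMUM  1000   /* always worth trying above this... */
#define REL_TRUNCATE_FRACTION 16     /* ...or above 1/16th of the table */

typedef unsigned int BlockNumber;

bool
truncation_looks_worthwhile(BlockNumber rel_pages, BlockNumber nonempty_pages)
{
    /* trailing pages with no live tuples that could be given back to the OS */
    BlockNumber possibly_freeable = rel_pages - nonempty_pages;

    return possibly_freeable > 0 &&
           (possibly_freeable >= REL_TRUNCATE_MINIMUM ||
            possibly_freeable >= rel_pages / REL_TRUNCATE_FRACTION);
}
```

With tens of millions of freeable trailing pages, as in the log above, this check trivially passes, so every qualifying VACUUM cycle will attempt the truncation.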

As I traced down the code, the interaction occurs as follows:

- To physically truncate a relation, `VACUUM` must acquire an `AccessExclusiveLock`.
- After acquiring the `AccessExclusiveLock`, `VACUUM` performs a verification step by scanning the relation backwards from its apparent end (`count_nondeletable_pages`). This is necessary because other transactions might have added data to these trailing pages.
- While holding the `AccessExclusiveLock` and performing this backward scan, `VACUUM` periodically checks if any other database sessions are waiting for locks on the same relation using `LockHasWaitersRelation(vacrel->rel, AccessExclusiveLock)`. This check occurs at intervals defined by `VACUUM_TRUNCATE_LOCK_CHECK_INTERVAL` (20ms) and roughly every 32 blocks scanned.
- Interruption and Retry Mechanism:
    - If `LockHasWaitersRelation` returns `true` during the backward scan, `count_nondeletable_pages` sets a flag (`lock_waiter_detected = true`) and returns, effectively "suspending" its scan to allow waiting sessions to proceed.
    - Back in `lazy_truncate_heap`, if `lock_waiter_detected` is true, `VACUUM` releases the `AccessExclusiveLock`.
    - `lazy_truncate_heap` contains a `do...while` loop. If `lock_waiter_detected` was true AND there are still potentially truncatable pages (`new_rel_pages > vacrel->nonempty_pages`), this loop causes `VACUUM` to retry the entire truncation sequence: attempt to re-acquire the `AccessExclusiveLock` (with its own timeout of `VACUUM_TRUNCATE_LOCK_TIMEOUT`, default 5s), restart the `count_nondeletable_pages` backward scan, and so on.

- Next comes the "stuck in a loop" problem and this is where things get interesting:
    - On a very busy system with constant concurrent access, `VACUUM` can repeatedly acquire the `AccessExclusiveLock`, start the backward scan, detect waiters, release the lock, and retry. Each retry of `count_nondeletable_pages` involves re-reading potentially many blocks from disk while holding the exclusive lock which aggravates the lock contention.
    - This cycle leads to prolonged `VACUUM` operations, significant repeated I/O, and extended lock contention, reducing database availability and causing query timeouts for other sessions (e.g., `canceling statement due to lock timeout`, `LockNotAvailable`). This is particularly acute for the TOAST tables in my case which sadly happens to be very large (for now).
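To make the failure mode concrete, here is a toy, self-contained model of that retry cycle (plain C for illustration, not PostgreSQL code): when the table stays busy, each attempt at the backward scan is interrupted after only a small chunk of trailing pages, the lock is dropped, and the whole sequence is retried.

```c
#include <stdbool.h>

/* Toy model of the lazy_truncate_heap() retry cycle -- illustrative only.
 * Under constant concurrent access, each attempt verifies and truncates
 * only a small chunk of trailing pages before a lock waiter is detected
 * and the AccessExclusiveLock is released; the do...while loop then
 * retries, so the number of lock acquisitions (and re-scans) grows with
 * the number of truncatable pages. */
int
count_truncation_attempts(unsigned truncatable_pages,
                          unsigned pages_per_attempt,
                          bool table_always_busy)
{
    int attempts = 0;

    while (truncatable_pages > 0)
    {
        attempts++;     /* acquire AccessExclusiveLock, start backward scan */

        if (table_always_busy && truncatable_pages > pages_per_attempt)
            truncatable_pages -= pages_per_attempt; /* interrupted: partial truncate */
        else
            truncatable_pages = 0;  /* scan finished: truncate the rest */
    }
    return attempts;
}
```

In the log excerpt further below, each interrupted pass reclaimed roughly 288 pages; with millions of truncatable pages, an uncapped loop of this shape translates into thousands of lock acquisitions and repeated backward scans.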

Testing the theory and patch

To validate this understanding, I was able to reproduce a similar scenario by loading a test table with a significant amount of TOAST-able data and then simulating a busy read workload against it concurrently with a `VACUUM` operation. I have attached an SQL script to load the data and a Ruby script to simulate the concurrent workload if anyone is curious. The tests demonstrated how the truncation phase could indeed enter extended periods of repeatedly attempting to acquire the `AccessExclusiveLock` and performing the backward scan, only to be interrupted by the simulated concurrent workload.

Considering the potential for the truncation phase to cause prolonged lock contention under these conditions, I am wondering if it would be beneficial to explicitly cap the number of retries for the truncation process within the `do...while` loop in `lazy_truncate_heap` when interruptions by lock waiters occur?

Because I was already too deep in the code, I decided to write a small patch to introduce a limit to the retries and observe the behavior. Specifically, it adds a counter `truncate_interruption_retry_count` and checks it against a new constant `VACUUM_TRUNCATE_INTERRUPTION_MAX_RETRIES` (which for now I have set to 3). If the backward scan in `count_nondeletable_pages` is interrupted by lock waiters this many times, `VACUUM` will cease further truncation attempts for the current `VACUUM` cycle on that table.
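In sketch form, the POC behaves as follows (the names `truncate_interruption_retry_count` and `VACUUM_TRUNCATE_INTERRUPTION_MAX_RETRIES` are taken from the patch description above; the surrounding loop is the same simplified standalone model, not the actual vacuumlazy.c code):

```c
#include <stdbool.h>

/* Simplified standalone model of the POC: the same toy retry loop, but
 * VACUUM gives up truncation for this cycle after a fixed number of
 * interruptions by lock waiters.  Remaining trailing pages are simply
 * left for a future VACUUM to reclaim. */
#define VACUUM_TRUNCATE_INTERRUPTION_MAX_RETRIES 3

unsigned
pages_left_after_capped_truncation(unsigned truncatable_pages,
                                   unsigned pages_per_attempt,
                                   bool table_always_busy)
{
    int truncate_interruption_retry_count = 0;

    while (truncatable_pages > 0)
    {
        if (table_always_busy && truncatable_pages > pages_per_attempt)
        {
            truncatable_pages -= pages_per_attempt; /* interrupted: partial truncate */
            if (++truncate_interruption_retry_count >=
                VACUUM_TRUNCATE_INTERRUPTION_MAX_RETRIES)
                break;      /* stop trying; leave the rest for the next cycle */
        }
        else
            truncatable_pages = 0;  /* uninterrupted: truncate everything left */
    }
    return truncatable_pages;       /* pages left untruncated this cycle */
}
```

The trade-off is deliberate: the cap bounds how long VACUUM keeps contending for the `AccessExclusiveLock`, at the cost of deferring some space reclamation to a later cycle.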

I was able to perform the test on a similar setup and the results are as follows (note the `stopping truncate after 3 retries due to repeated conflicting lock requests` messages coming from the patch):

```
VACUUM (VERBOSE) toast_target_table;
INFO:  vacuuming "postgres.public.toast_target_table"
INFO:  table "toast_target_table": suspending truncate due to conflicting lock request
INFO:  table "toast_target_table": truncated 144032 to 143744 pages
INFO:  table "toast_target_table": suspending truncate due to conflicting lock request
INFO:  table "toast_target_table": truncated 143744 to 143456 pages
INFO:  table "toast_target_table": suspending truncate due to conflicting lock request
INFO:  table "toast_target_table": truncated 143456 to 143168 pages
INFO:  table "toast_target_table": stopping truncate after 3 retries due to repeated conflicting lock requests
INFO:  finished vacuuming "postgres.public.toast_target_table": index scans: 0
pages: 864 removed, 143168 remain, 1 scanned (0.00% of total), 0 eagerly scanned
tuples: 0 removed, 7544 remain, 0 are dead but not yet removable
removable cutoff: 2985896, which was 1 XIDs old when operation ended
frozen: 0 pages from table (0.00% of total) had 0 tuples frozen
visibility map: 0 pages set all-visible, 0 pages set all-frozen (0 were all-visible)
index scan not needed: 0 pages from table (0.00% of total) had 0 dead item identifiers removed
index "toast_target_table_pkey": pages: 28132 in total, 0 newly deleted, 27646 currently deleted, 27646 reusable
avg read rate: 13.423 MB/s, avg write rate: 0.006 MB/s
buffer usage: 28172 hits, 28571 reads, 12 dirtied
WAL usage: 22 records, 15 full page images, 120289 bytes, 0 buffers full
system usage: CPU: user: 0.13 s, system: 3.55 s, elapsed: 16.62 s
INFO:  vacuuming "postgres.pg_toast.pg_toast_649469"
INFO:  table "pg_toast_649469": suspending truncate due to conflicting lock request
INFO:  table "pg_toast_649469": truncated 2553888 to 2552416 pages
INFO:  table "pg_toast_649469": suspending truncate due to conflicting lock request
INFO:  table "pg_toast_649469": truncated 2552416 to 2550560 pages
INFO:  table "pg_toast_649469": suspending truncate due to conflicting lock request
INFO:  table "pg_toast_649469": truncated 2550560 to 2548832 pages
INFO:  table "pg_toast_649469": stopping truncate after 3 retries due to repeated conflicting lock requests
INFO:  finished vacuuming "postgres.pg_toast.pg_toast_649469": index scans: 0
pages: 5056 removed, 2548832 remain, 1 scanned (0.00% of total), 0 eagerly scanned
tuples: 0 removed, 14008 remain, 0 are dead but not yet removable
removable cutoff: 2985897, which was 1 XIDs old when operation ended
new relfrozenxid: 2985897, which is 2 XIDs ahead of previous value
frozen: 0 pages from table (0.00% of total) had 0 tuples frozen
visibility map: 0 pages set all-visible, 0 pages set all-frozen (0 were all-visible)
index scan not needed: 0 pages from table (0.00% of total) had 0 dead item identifiers removed
index "pg_toast_649469_index": pages: 56259 in total, 0 newly deleted, 55293 currently deleted, 55293 reusable
avg read rate: 5.067 MB/s, avg write rate: 0.002 MB/s
buffer usage: 56371 hits, 61075 reads, 19 dirtied
WAL usage: 28 records, 20 full page images, 165414 bytes, 0 buffers full
system usage: CPU: user: 1.67 s, system: 59.10 s, elapsed: 94.16 s
```

All that said, I wanted to bring some visibility into this part of the problem and also learn from the community about potential ways we could solve it. One way is to cap the number of retries, as in the attached patch. I believe a cap like this is beneficial because it prioritizes overall system availability by preventing the lazy truncate from repeatedly acquiring the `AccessExclusiveLock` and holding it for extended periods of time.

The patch is definitely more of a proof of concept for now (it lacks some tests and maybe even some configurable params?) and more importantly I am also curious to hear from the community if this feels like a worthwhile change or approach (assuming my understanding of the problem makes sense). Or perhaps there are other ways to solve this problem as well?

Thank you
Shayon Mukherjee
Attachment
