
From: Ian Axelrod
Subject: A thousand pg_commit_ts truncate attempts per second, two restarting autovacuum processes, and an explosive replication lag. Oh my.
Msg-id: CA+B3G4QvCy_a2+SUbbeQU5Q0TzUrH2iUYKg-XBezeJDMfVoq7Q@mail.gmail.com
List: pgsql-hackers
Hi,

I help administer a set of reasonably large PostgreSQL* database
servers at my current place of employment. Late last week, there was a
serious failure on one of the replicas for my employer's largest
database server. I have not seen a failure like this in 10 years of
working heavily with PostgreSQL. I am hoping someone here has
encountered this situation before and can tell me:
1. Whether what I saw was a bug or a misconfiguration of the server.
2. Whether there are any actions I can take, beyond those I have already
taken or intend to take (more on that at the end of this post), to
eliminate the issue or at least reduce its frequency and/or severity.

*This issue occurred on an AWS RDS Aurora PostgreSQL installation. It
is definitely possible that this is an Aurora-specific issue. I will
be reaching out to AWS as well, but getting in touch with an internal
team that can provide real answers and do deep investigation has
proven to be very difficult. My hope is that this might be an issue
that is not specific to Aurora.

-------
The Database

PostgreSQL 15.10 on AWS Aurora

Primary & affected read replica: r6gd.4xlarge (16 vCPU, 128 GB memory,
Aurora's 'Optimized Reads' feature enabled)
Fallback replica: r7g.4xlarge (16 vCPU, 128 GB memory, no Optimized
Reads feature)

Potentially relevant settings:
autovacuum_max_workers = 3 (Aurora default, determined by instance size)
autovacuum_naptime = 5 (Aurora default)
autovacuum_vacuum_cost_delay = 5 (Aurora default)
autovacuum_vacuum_cost_limit = 1800 (Aurora default, determined by
instance size)
autovacuum_work_mem = 4074383kB (Aurora default, determined by instance size)
rds.logical_replication = 1
track_commit_timestamp = 1

Logical replication in use? Yes. Did not observe significant logical
replication slot lag during the event.
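
For reference, the relevant settings can be confirmed from the database
itself with something along these lines (a standard pg_settings query,
nothing Aurora-specific):

SELECT name, setting, unit, source
FROM pg_settings
WHERE name IN ('autovacuum_max_workers', 'autovacuum_naptime',
               'autovacuum_vacuum_cost_delay', 'autovacuum_vacuum_cost_limit',
               'autovacuum_work_mem', 'track_commit_timestamp',
               'rds.logical_replication')  -- this last one is RDS/Aurora-specific
ORDER BY name;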

-------
The Problem

The read replica that serves all production traffic began cancelling
queries with the following messages:
"""
ERROR: canceling statement due to conflict with recovery
DETAIL: User query may not have access to page data due to replica disconnect.
HINT: When the replica reconnects you will be able to repeat your command.
"""

Initial diagnosis: I checked the database metrics and confirmed that
replication lag had quickly shot up from 20-30 milliseconds to 60,000+
milliseconds, at which point Aurora disconnected the read replica and
reconnected to try to fix the issue. This did not fix the issue:
replica lag would increase to 60k+ ms several more times until I took
corrective action. No other replica appeared to be having the same
issue, at least not to the extent of restarting replication.
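
(Aurora reports replica lag through its own metrics; on community
PostgreSQL I would sanity-check the same thing roughly like this, for
anyone trying to reproduce outside of Aurora:)

-- On the standby: how far behind is WAL replay?
SELECT now() - pg_last_xact_replay_timestamp() AS replay_lag;

-- On the primary: per-standby lag as seen by the walsenders.
SELECT application_name, write_lag, flush_lag, replay_lag
FROM pg_stat_replication;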

Corrective action: Switched over traffic to the fallback replica.
Result: Queries were no longer cancelled; however, replication lag on the
primary read replica was still extremely high (10k+ ms) and decreased
very slowly. Moreover, replication lag on the fallback replica was
volatile: it was jumping between 30 ms and 100 ms rapidly, which is very
unusual for Aurora.

Corrective action: Restarted primary read replica.
Result: Replica lag went back to normal after the restart. I was able
to switch traffic back to the primary read replica. The fallback
replica was still experiencing unusually jittery replica lag for 30
minutes or so after. It then returned to normal.

-------
The Investigation

This issue occurred when three major operations were taking place on
the primary. (I failed to recreate these exact conditions outside of
production, unfortunately.)

1. Aggressive autovacuum of a very large table. This table is
approximately 2 TB with 4.7 billion rows. It is append-only: a batch
job adds to it once per month over the course of a few days.
Otherwise, it is read-only. Anyway, the autovacuum ran for about 16
hours, and after completion, it output the following log:
LOG: automatic aggressive vacuum to prevent wraparound of table
"REDACTED": index scans: 0
pages: 0 removed, 169701283 remain, 1 scanned (0.00% of total)
tuples: 0 removed, 4530232832 remain, 0 are dead but not yet removable
removable cutoff: 1225281859, which was 31289976 XIDs old when operation ended
new relfrozenxid: 1225281859, which is 199999042 XIDs ahead of previous value
new relminmxid: 2075086366, which is 23596562 MXIDs ahead of previous value
index scan not needed: 0 pages from table (0.00% of total) had 0 dead
item identifiers removed
index "REDACTED": pages: 63503449 in total, 0 newly deleted, 0
currently deleted, 0 reusable
I/O timings: read: 56733479.933 ms, write: 0.000 ms
avg read rate: 8.701 MB/s, avg write rate: 0.000 MB/s
buffer usage: 5327 hits, 63555559 misses, 0 dirtied
WAL usage: 0 records, 0 full page images, 0 bytes
system usage: CPU: user: 980.36 s, system: 845.00 s, elapsed: 57068.70 s
2. Batch job that was writing at a high rate (100+ UPDATEs / second)
to two small (< 25,000 row) tables. This job started near the tail-end
of (1), which had already been running for nearly 14 hours.
3. Seemingly endless, concurrent autovacuuming of the two tables written
to by (2), running alongside that batch job until the aggressive
autovacuum in (1) completed. These autovacuums, which should have taken
a few seconds each, ran for 2.5 hours and were dominated by CPU usage
and lock contention on datfrozenxid advancement (in RDS: the
'Lock:frozenid' wait event).

Observations:
(1) clearly had to do a massive amount of I/O over a long period, as
nearly nothing was in cache. I was surprised that only 1 page was
scanned, considering the table's size and how infrequently vacuum runs
on it.
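
My working assumption is that nearly every heap page was already marked
all-frozen in the visibility map and was therefore skipped; something
like the following should confirm that (sketch only; the table name is a
placeholder and I have not verified that the pg_visibility extension is
available on Aurora):

CREATE EXTENSION IF NOT EXISTS pg_visibility;
-- Counts of all-visible and all-frozen pages according to the visibility map.
SELECT * FROM pg_visibility_map_summary('redacted_big_table'::regclass);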

(3) seemed extremely odd to me. It is my understanding that a
non-aggressive autovacuum rarely advances datfrozenxid. In the case
where it does, I would not expect it to keep advancing it for every
run. I looked into the logs on the primary and found tens of thousands
of lines like this:
2025-04-30 21:11:11 UTC::@:[9128]:LOG: could not truncate directory
"pg_commit_ts": apparent wraparound
2025-04-30 21:11:11 UTC::@:[9129]:LOG: could not truncate directory
"pg_commit_ts": apparent wraparound
2025-04-30 21:11:11 UTC::@:[9130]:LOG: could not truncate directory
"pg_commit_ts": apparent wraparound
2025-04-30 21:11:12 UTC::@:[9131]:LOG: could not truncate directory
"pg_commit_ts": apparent wraparound
2025-04-30 21:11:12 UTC::@:[9132]:LOG: could not truncate directory
"pg_commit_ts": apparent wraparound
2025-04-30 21:11:12 UTC::@:[9136]:LOG: could not truncate directory
"pg_commit_ts": apparent wraparound
2025-04-30 21:11:12 UTC::@:[9138]:LOG: could not truncate directory
"pg_commit_ts": apparent wraparound
2025-04-30 21:11:12 UTC::@:[9141]:LOG: could not truncate directory
"pg_commit_ts": apparent wraparound
2025-04-30 21:11:12 UTC::@:[9147]:LOG: could not truncate directory
"pg_commit_ts": apparent wraparound
....

The datfrozenxid age was only a little over 250,000,000 when the issue
occurred, and there were no other logs concerning transaction ID
wraparound, so I am confident that this was not an indication of a real
txid wraparound. Looking at the code, it appears this message is emitted
as part of pg_commit_ts SLRU truncation. From my understanding, that can
happen as part of vacuum and should only happen during frozenxid
advancement. Please correct me if I am wrong.
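
For reference, the kind of catalog query that yields these age figures:

-- Database-wide frozen-XID horizon.
SELECT datname, age(datfrozenxid) AS datfrozenxid_age
FROM pg_database
ORDER BY 2 DESC;

-- Tables holding datfrozenxid back the most.
SELECT oid::regclass AS table_name, age(relfrozenxid) AS relfrozenxid_age
FROM pg_class
WHERE relkind IN ('r', 'm', 't')
ORDER BY 2 DESC
LIMIT 10;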

I looked at the SLRU truncate metrics timeline for pg_commit_ts. Sure
enough, there were around 20 truncate operations on the "committs" SLRU
per second. This led me to believe that autovacuum was crashing, despite
no log message indicating so. There were three autovacuum processes
running concurrently, with two of them potentially restarting. If they
were rerunning due to the high rate of updates, I would expect them to
adhere to the autovacuum_naptime of 5 seconds. Based on the number of
truncate operations and the fact that multiple "could not truncate
directory" messages were occurring within the same second from different
PIDs, this does not seem to be the case.
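
(The Aurora metric presumably maps onto the community statistics view;
on PostgreSQL 15 the equivalent check would be something like this,
where the commit-timestamp SLRU is reported under the name 'CommitTs':)

SELECT name, blks_read, blks_written, truncates
FROM pg_stat_slru
ORDER BY truncates DESC;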

I found that one of these autovacuums had run a few hours earlier with
similarly high CPU usage, but it ran for less time (~30 minutes) and did
not show contention on the frozenid advancement lock. There was no
unusual lag on the replicas during that time. However, the 'could not
truncate directory "pg_commit_ts": apparent wraparound' message appeared
during that autovacuum at a far greater frequency than when it was not
running.

When I manually tried vacuuming one of the tables that had the
CPU-intensive autovacuum loop, it succeeded but could not clean up all
dead tuples. I tried this multiple times.
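
I have not pinned down what was holding the dead-tuple horizon back;
next time I plan to check the usual suspects with something like this
(sketch, standard views only):

-- Old snapshots held by backends.
SELECT pid, state, age(backend_xmin) AS xmin_age, left(query, 60) AS query
FROM pg_stat_activity
WHERE backend_xmin IS NOT NULL
ORDER BY xmin_age DESC;

-- Replication slots pinning xmin / catalog_xmin.
SELECT slot_name, slot_type, active,
       age(xmin) AS xmin_age, age(catalog_xmin) AS catalog_xmin_age
FROM pg_replication_slots;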

-------
My thoughts

Pretty stumped. What I do not think this is:

- Long-running transaction on the replica holding up WAL (or Aurora's
buffer page cache replication, in this case) application, thus
increasing replication lag.

The two tables involved are not actively queried on the primary read
replica. And though the table undergoing the aggressive autovacuum is
queried, I would have expected log_lock_waits to log something
(deadlock_timeout = 1s) and/or query cancellations due to recovery
conflicts from a lock held too long. Neither appeared in the logs. I did
not see any long idle_in_transaction sessions either... at least not
until the replica lag had already climbed extremely high.
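
For anyone following along, long-lived transactions on the replica would
show up with a check along these lines:

-- Run on the replica: long-lived transactions / idle-in-transaction sessions.
SELECT pid, state,
       now() - xact_start   AS xact_age,
       now() - state_change AS state_age
FROM pg_stat_activity
WHERE xact_start IS NOT NULL
ORDER BY xact_start;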

- AccessExclusive lock held by one of the restarting/rerunning
autovacuum processes trying and failing to truncate empty pages at the
physical end of the table, blocking SELECTs on the replica for short
periods of time.

As mentioned, the two tables involved in the regular (non-aggressive)
autovacuums are not queried on the read replica, so an AccessExclusive
lock on them would not cause conflicts. Also, had this somehow been the
case, I would have expected the issue to persist after I switched
replicas.
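
(A replayed AccessExclusive lock would be visible on the replica with
something like the following, held by the startup process:)

-- Run on the replica: AccessExclusive locks replayed from the primary.
SELECT locktype, relation::regclass AS relation, mode, granted, pid
FROM pg_locks
WHERE mode = 'AccessExclusiveLock';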

- Poorly configured autovacuum

None of the tables involved have custom autovacuum settings. They are
using the Aurora defaults. And although the aggressive autovacuum did
take a long time to finish, I do not think this should be an issue. I
am fine with it taking a while on a table this large. I imagine trying
to speed things up would only exacerbate the problem.

What I do believe thus far:

I believe the size of the table combined with its complete disuse on
the primary (hence all the buffer cache misses) broke something, but I
am not really sure what. I think this is a legitimate bug rather than
a poorly configured server. That's really all I've got other than
many, many questions.

1. Why would a struggling, or even failing, autovacuum cause any
replica lag in the absence of long-running queries?
2. Why would the problem be fixed by switching replicas? The issue on
the primary remained after the switch. The queries on the replicas
would not be different. Just very specific timing that irreversibly
(requiring restart) broke one of the replicas?
3. Why weren't the re-running autovacuums respecting naptime? Or were
they, and SLRU truncation can happen many times within a single
vacuum? Or from some other action?
4. Why did pg_commit_ts encounter an 'apparent wraparound' in the first place?
5. Is the aggressive autovacuum a necessary and sufficient condition
for this issue, with the other autovacuums just along for the ride, or
did they play an active role? I should note that this table sees an
aggressive autovacuum every week or so. This is the first time there
has been an issue since I migrated the company to Aurora several
months ago. Granted, that is not a long time.

-------
Mitigations

Current:
1. Created a pg_cron job to VACUUM (FREEZE, INDEX_CLEANUP OFF) the two
largest, mostly static tables during periods of low traffic (rough
sketch below).
2. Turned off track_commit_timestamp
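
The job from (1) looks roughly like this (job name, schedule, and table
name below are placeholders):

SELECT cron.schedule(
    'freeze-big-static-table',
    '0 3 * * 0',   -- weekly, during a low-traffic window
    $$VACUUM (FREEZE, INDEX_CLEANUP OFF) public.big_static_table$$
);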

Planned:
1. Partition the tables to reduce the workload for any individual
autovacuum process (rough shape sketched below).
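
The rough shape I have in mind for the largest table is monthly range
partitions keyed on the batch-load date, so autovacuum deals with one
much smaller relation at a time (table and column names here are
hypothetical):

CREATE TABLE big_static_table (
    id        bigint NOT NULL,
    loaded_at date   NOT NULL,
    payload   jsonb
) PARTITION BY RANGE (loaded_at);

-- One partition per monthly batch load.
CREATE TABLE big_static_table_2025_05
    PARTITION OF big_static_table
    FOR VALUES FROM ('2025-05-01') TO ('2025-06-01');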

-------
Conclusion

If you have read my brain dump, thank you! I am looking for any
insight whatsoever.
Has anyone encountered this situation before on vanilla PostgreSQL? On Aurora?
Should I be posting this to pgsql-bugs? (I don't think so.... at least
not yet?) I saw a few similar threads on pgsql-bugs, but they did not
look to be exactly what I experienced. I am not yet positive this is a
bug, and if it is, whether it is a PG or Aurora bug.
Should I just DROP the database while laughing maniacally?

Cheers,

- Ian


