From | Ian Axelrod
Subject | A thousand pg_commit_ts truncate attempts per second, two restarting autovacuum processes, and explosive replication lag. Oh my.
Date |
Msg-id | CA+B3G4QvCy_a2+SUbbeQU5Q0TzUrH2iUYKg-XBezeJDMfVoq7Q@mail.gmail.com
Responses | Re: A thousand pg_commit_ts truncate attempts per second, two restarting autovacuum processes, and explosive replication lag. Oh my.
List | pgsql-hackers
Hi,

I help administer a set of reasonably large PostgreSQL* database servers at my current place of employment. Late last week, there was a serious failure on one of the replicas for my employer's largest database server. I have not seen a failure like this in 10 years of working heavily with PostgreSQL. I am hoping someone here has encountered the situation before and can tell me:

1. Whether what I saw was a bug or a misconfiguration of the server.
2. Whether there are any actions I can take beyond those I have already taken or intend to take (more on that at the end of this post) to eliminate the issue, or at least reduce its frequency and/or severity.

*This issue occurred on an AWS RDS Aurora PostgreSQL installation. It is definitely possible that this is an Aurora-specific issue. I will be reaching out to AWS as well, but getting in touch with an internal team that can provide real answers and do a deep investigation has proven to be very difficult. My hope is that this might be an issue that is not specific to Aurora.

-------
The Database

PostgreSQL 15.10 on AWS Aurora

Primary & affected read replica: r6gd.4xlarge (16 vCPU, 128 GB memory, Aurora's 'Optimized Reads' feature enabled)
Fallback replica: r7g.4xlarge (16 vCPU, 128 GB memory, no Optimized Reads feature)

Potentially relevant settings:

autovacuum_max_workers = 3 (Aurora default, determined by instance size)
autovacuum_naptime = 5 (Aurora default)
autovacuum_vacuum_cost_delay = 5 (Aurora default)
autovacuum_vacuum_cost_limit = 1800 (Aurora default, determined by instance size)
autovacuum_work_mem = 4074383kB (Aurora default, determined by instance size)
rds.logical_replication = 1
track_commit_timestamp = 1

Logical replication in use? Yes. I did not observe significant logical replication slot lag during the event.

-------
The Problem

The read replica that serves all production traffic began cancelling queries with the following messages:

"""
ERROR: canceling statement due to conflict with recovery
DETAIL: User query may not have access to page data due to replica disconnect.
HINT: When the replica reconnects you will be able to repeat your command.
"""

Initial diagnosis: I checked the database metrics and confirmed that replication lag had quickly shot up from 20-30 milliseconds to 60,000+ milliseconds, at which point Aurora disconnected the read replica and reconnected it to try to fix the issue. This did not fix the issue: replica lag climbed to 60k+ ms several more times until I took corrective action. No other replica appeared to be having the same issue, at least not to the extent of restarting replication.

Corrective action: Switched traffic over to the fallback replica.
Result: Queries were no longer cancelled; however, replication lag on the primary read replica was still extremely high (10k+ ms) and decreased very slowly. Moreover, replication lag on the fallback replica was volatile: it was jumping from 30 ms to 100 ms rapidly, which is very unusual for Aurora.

Corrective action: Restarted the primary read replica.
Result: Replica lag went back to normal after the restart, and I was able to switch traffic back to the primary read replica. The fallback replica was still experiencing unusually jittery replica lag for 30 minutes or so afterward; it then returned to normal.
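As an aside for anyone trying to map this onto vanilla PostgreSQL: the lag and disconnect figures above come from Aurora's own metrics, since Aurora replicas do not replay WAL the way a streaming-replication standby does. On a vanilla standby the rough equivalent would be something like the sketch below (generic catalog queries, nothing specific to our setup, and not something I expect to apply directly to Aurora):

-- Approximate replay delay on a standby, plus a breakdown of which kinds of
-- recovery conflicts have been cancelling queries. Vanilla PostgreSQL only.
SELECT pg_is_in_recovery()                     AS in_recovery,
       now() - pg_last_xact_replay_timestamp() AS approx_replay_delay;

SELECT datname, confl_snapshot, confl_lock, confl_bufferpin, confl_deadlock
FROM pg_stat_database_conflicts;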
-------
The Investigation

This issue occurred while three major operations were taking place on the primary. (I failed to recreate these exact conditions outside of production, unfortunately.)

1. Aggressive autovacuum of a very large table.

This table is approximately 2 TB with 4.7 billion rows. It is append-only: a batch job adds to it once per month over the course of a few days. Otherwise, it is read-only. Anyway, the autovacuum ran for about 16 hours, and after completion it output the following log:

LOG: automatic aggressive vacuum to prevent wraparound of table "REDACTED": index scans: 0
pages: 0 removed, 169701283 remain, 1 scanned (0.00% of total)
tuples: 0 removed, 4530232832 remain, 0 are dead but not yet removable
removable cutoff: 1225281859, which was 31289976 XIDs old when operation ended
new relfrozenxid: 1225281859, which is 199999042 XIDs ahead of previous value
new relminmxid: 2075086366, which is 23596562 MXIDs ahead of previous value
index scan not needed: 0 pages from table (0.00% of total) had 0 dead item identifiers removed
index "REDACTED": pages: 63503449 in total, 0 newly deleted, 0 currently deleted, 0 reusable
I/O timings: read: 56733479.933 ms, write: 0.000 ms
avg read rate: 8.701 MB/s, avg write rate: 0.000 MB/s
buffer usage: 5327 hits, 63555559 misses, 0 dirtied
WAL usage: 0 records, 0 full page images, 0 bytes
system usage: CPU: user: 980.36 s, system: 845.00 s, elapsed: 57068.70 s

2. A batch job writing at a high rate (100+ UPDATEs/second) to two small (< 25,000 row) tables.

This job started near the tail end of (1), which had already been running for nearly 14 hours.

3. Seemingly endless, concurrent autovacuuming of the tables written to by (2) until the aggressive autovacuum of (1) completed.

These autovacuums coincided with the batch job from (2). They should have taken a few seconds, but they continued for 2.5 hours and were dominated by CPU usage and lock contention to advance datfrozenxid (in RDS: the 'Lock:frozenid' wait event).

Observations:

(1) clearly had to do a massive amount of I/O over a long period, as nearly nothing was in cache. I was surprised that only 1 page was scanned, considering the table's size and how infrequently vacuum runs on it.

(3) seemed extremely odd to me. It is my understanding that a non-aggressive autovacuum rarely advances datfrozenxid, and in the cases where it does, I would not expect it to keep advancing it on every run.

I looked into the logs on the primary and found tens of thousands of lines like this:

2025-04-30 21:11:11 UTC::@:[9128]:LOG: could not truncate directory "pg_commit_ts": apparent wraparound
2025-04-30 21:11:11 UTC::@:[9129]:LOG: could not truncate directory "pg_commit_ts": apparent wraparound
2025-04-30 21:11:11 UTC::@:[9130]:LOG: could not truncate directory "pg_commit_ts": apparent wraparound
2025-04-30 21:11:12 UTC::@:[9131]:LOG: could not truncate directory "pg_commit_ts": apparent wraparound
2025-04-30 21:11:12 UTC::@:[9132]:LOG: could not truncate directory "pg_commit_ts": apparent wraparound
2025-04-30 21:11:12 UTC::@:[9136]:LOG: could not truncate directory "pg_commit_ts": apparent wraparound
2025-04-30 21:11:12 UTC::@:[9138]:LOG: could not truncate directory "pg_commit_ts": apparent wraparound
2025-04-30 21:11:12 UTC::@:[9141]:LOG: could not truncate directory "pg_commit_ts": apparent wraparound
2025-04-30 21:11:12 UTC::@:[9147]:LOG: could not truncate directory "pg_commit_ts": apparent wraparound
....

datfrozenxid age was only a little over 250,000,000 when the issue occurred, and there were no other logs concerning transaction ID wraparound, so I am confident that this was not an indication of a real txid wraparound. Looking at the code, it looks like this message is emitted as part of pg_commit_ts SLRU truncation. This can happen as part of vacuum, from my understanding, and should only happen during frozenxid advancement. Please correct me if I am wrong.
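For reference, the truncate counts I describe below came from RDS's SLRU metrics; on vanilla PostgreSQL 13+ I believe the same counters are exposed in pg_stat_slru, and the freeze horizon that the pg_commit_ts truncation cutoff is derived from can be read out of pg_database. A rough sketch (again, nothing schema-specific):

-- Cumulative truncate count for the commit-timestamp SLRU.
SELECT name, blks_hit, blks_read, truncates
FROM pg_stat_slru
WHERE name = 'CommitTs';

-- Per-database freeze horizons; as I understand it, the oldest datfrozenxid is
-- what vacuum works from when deciding how far pg_commit_ts can be truncated.
SELECT datname,
       age(datfrozenxid)    AS datfrozenxid_age,
       mxid_age(datminmxid) AS datminmxid_age
FROM pg_database
ORDER BY age(datfrozenxid) DESC;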
I looked at the SLRU truncate metrics timeline for pg_commit_ts. Sure enough, there were around 20 truncate operations per second on the "committs" SLRU. This led me to believe that autovacuum was crashing, despite no log message indicating so. There were three autovacuum processes running concurrently, with two of them potentially restarting. If they were rerunning due to the high rate of updates, I would expect them to adhere to the autovacuum_naptime of 5 seconds. Based on the number of truncate operations, and the fact that multiple "could not truncate directory" messages were occurring within the same second from different PIDs, that does not seem to be the case.

I found that one of the autovacuums had run a few hours earlier with similarly high CPU usage, but it ran for less time (~30 minutes) and did not have contention on the frozenid advancement lock. There was no unusual lag on the replicas during this time. However, the 'could not truncate directory "pg_commit_ts": apparent wraparound' message was present for this autovacuum as well, at a far greater frequency than without it.

When I manually tried vacuuming one of the tables caught in the CPU-intensive autovacuum loop, it succeeded but could not clean up all dead tuples. I tried this multiple times.
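A sanity check that seems relevant here (a sketch only; generic catalog queries, arbitrary LIMIT) is to look at what could be pinning the xmin horizon and keeping those tuples "dead but not yet removable":

-- Backends holding the oldest snapshots.
SELECT pid, state, xact_start, backend_xmin
FROM pg_stat_activity
WHERE backend_xmin IS NOT NULL
ORDER BY age(backend_xmin) DESC
LIMIT 10;

-- Replication slots (we do use logical replication) and prepared transactions
-- can hold the horizon back as well.
SELECT slot_name, slot_type, active, xmin, catalog_xmin
FROM pg_replication_slots;

SELECT gid, prepared, owner, database
FROM pg_prepared_xacts;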
-------
My thoughts

Pretty stumped. What I do not think this is:

- A long-running transaction on the replica holding up WAL application (or, in this case, Aurora's buffer page cache replication), thus increasing replication lag. The two tables involved are not actively queried on the primary read replica. And though the table that was undergoing aggressive autovacuum is queried, I would have expected log_lock_waits to log something (deadlock_timeout = 1s) and/or query cancellations due to recovery conflicts from a lock held too long. There were neither of these in the logs. I did not see any long idle-in-transaction queries either... at least not until the replica lag had already climbed extremely high.

- An AccessExclusive lock held by one of the restarting/rerunning autovacuum processes trying and failing to truncate dead tuples at the physical end of the table, blocking SELECTs on the replica for short periods of time. As mentioned, the two tables involved in the regular (non-aggressive) autovacuums are not queried on the read replica, so there would be no conflicts even in the presence of an AccessExclusive lock. Also, had this somehow been the case, I would have expected the issue to persist after I switched replicas.

- Poorly configured autovacuum. None of the tables involved have custom autovacuum settings; they use the Aurora defaults. And although the aggressive autovacuum did take a long time to finish, I do not think that should be an issue. I am fine with it taking a while on a table this large; I imagine trying to speed things up would only exacerbate the problem.

What I do believe thus far: I believe the size of the table combined with its complete disuse on the primary (hence all the buffer cache misses) broke something, but I am not really sure what. I think this is a legitimate bug rather than a poorly configured server.

That's really all I've got, other than many, many questions:

1. Why would a struggling, or even failing, autovacuum cause any replica lag in the absence of long-running queries?

2. Why would the problem be fixed by switching replicas? The issue on the primary remained after the switch, and the queries on the replicas would not be different. Was it just very specific timing that irreversibly (requiring a restart) broke one of the replicas?

3. Why weren't the re-running autovacuums respecting naptime? Or were they, and SLRU truncation can happen many times within a single vacuum, or from some other action entirely?

4. Why did pg_commit_ts encounter an 'apparent wraparound' in the first place?

5. Is the aggressive autovacuum a necessary and sufficient condition for this issue, with the other vacuums just along for the ride, or did they play an active role?

I should note that this table sees an aggressive autovacuum every week or so, and this is the first time there has been an issue since I migrated the company to Aurora several months ago. Granted, that is not a long time.

-------
Mitigations

Current:
1. Created a pg_cron job to VACUUM (FREEZE, INDEX_CLEANUP OFF) the two largest, mostly static tables during periods of low traffic.
2. Turned off track_commit_timestamp.

Planned:
1. Partition the tables to reduce the workload for any individual autovacuum process.

-------
Conclusion

If you have read my brain dump, thank you! I am looking for any insight whatsoever. Has anyone encountered this situation before on vanilla PostgreSQL? On Aurora? Should I be posting this to pgsql-bugs? (I don't think so... at least not yet?) I saw a few similar threads on pgsql-bugs, but they did not look to be exactly what I experienced. I am not yet positive this is a bug, and if it is, whether it is a PG or Aurora bug.

Should I just DROP the database while laughing maniacally?

Cheers,
- Ian