PostgreSQL version and HA extension in use
- PostgreSQL 13.10
- pg_auto_failover 2.0
CPU usage and system load were rising because of a heavy workload.
A failover was triggered while a large number of WALWrite events were occurring on the primary DB.
I confirmed that the secondary's failure to be promoted automatically was a pg_auto_failover issue.
I promoted the secondary manually.
I then tried to turn the old primary DB into a new secondary using the archived WAL files, but a WAL record seemed to be missing.
So I opened the WAL file with pg_waldump and confirmed that a record was missing.
It was not a DB server crash.
Is it possible for records to be missing from the WAL file when a failover is performed under high load?
I'm wondering if this could be considered a bug or if it was a situation where WAL records could be lost.
I am sharing the information I confirmed from the DB log and pg_waldump.
I'll share some DB settings too.
hot_standby_feedback = on
hot_standby = on
synchronous_commit = on
wal_writer_flush_after = 1MB
wal_sync_method = fdatasync
wal_writer_delay = 200ms
wal_buffers = 16MB
wal_segment_size = 16MB
[When the first failover occurs]
- WAL apply DB log
- Check the wal record using pg_waldump
I verified that there are no missing LSNs in 0000000300005015000000A6 and 0000000300005015000000A7.
However, the prev LSN shown in the first record of 0000000300005015000000A8 is not found in 0000000300005015000000A7.
- The last LSN of 0000000300005015000000A7 is 5015/A6003778.
- The prev LSN of the first record of 0000000300005015000000A8 is 5015/A7FFED78.
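For reference, the way an LSN maps to a WAL segment file name can be sketched in Python. This is a minimal sketch, assuming the 16MB wal_segment_size listed above and timeline 3 (taken from the 00000003 prefix of the file names); the helper names parse_lsn and wal_file_name are hypothetical and not part of any PostgreSQL tool.

```python
# Assumption: segment size matches wal_segment_size = 16MB from the settings above.
WAL_SEGMENT_SIZE = 16 * 1024 * 1024


def parse_lsn(text):
    """Convert an LSN written as 'HI/LO' (hex) into a 64-bit integer."""
    hi, lo = text.split("/")
    return (int(hi, 16) << 32) | int(lo, 16)


def wal_file_name(timeline, lsn_text, segment_size=WAL_SEGMENT_SIZE):
    """Return the 24-character WAL segment file name that contains the LSN."""
    segno = parse_lsn(lsn_text) // segment_size
    # Number of segments that fit in one 4GB "xlogid" range.
    segs_per_xlogid = 0x100000000 // segment_size
    return "%08X%08X%08X" % (
        timeline,
        segno // segs_per_xlogid,
        segno % segs_per_xlogid,
    )


if __name__ == "__main__":
    # prev LSN of the first record in ...A8 (first failover)
    print(wal_file_name(3, "5015/A7FFED78"))
    # prev LSN of the first record in ...8F (second failover)
    print(wal_file_name(3, "501E/8EFFEEC8"))
```

Under these assumptions, both prev LSNs fall in the segment immediately before the file that references them, which is where the missing record would be expected.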
[When the second failover occurs]
- DB log
- Check the wal record using pg_waldump
The last LSN of 000000030000501E0000008E is 501E/8EFFCED8.
The prev LSN of the first record in 000000030000501E0000008F is 501E/8EFFEEC8.
Because of the large gap between these two LSNs, the records in between appear to have been lost.
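The size of each gap can be computed from the LSN pairs reported above. This is a minimal Python sketch; the helper names parse_lsn and lsn_gap are hypothetical. Note that the difference is measured between record start positions, so it includes the length of the last record that was found.

```python
def parse_lsn(text):
    """Convert an LSN written as 'HI/LO' (hex) into a 64-bit integer."""
    hi, lo = text.split("/")
    return (int(hi, 16) << 32) | int(lo, 16)


def lsn_gap(last_seen, expected_prev):
    """Bytes between the last record found and the prev LSN that was expected."""
    return parse_lsn(expected_prev) - parse_lsn(last_seen)


if __name__ == "__main__":
    # First failover: last LSN found vs. prev LSN referenced by ...A8
    print(lsn_gap("5015/A6003778", "5015/A7FFED78"))  # 33535488 bytes
    # Second failover: last LSN found vs. prev LSN referenced by ...8F
    print(lsn_gap("501E/8EFFCED8", "501E/8EFFEEC8"))  # 8176 bytes
```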