Thread: incorrect wal removal due to max_slot_wal_keep_size
I was testing logical replication over my (remarkably bad) wifi network to see what kind of throughput and lag I would get. I was using pgbench default transaction as the workload generator with all 4 tables being replicated. I had synchronous replication configured by synchronous_standby_names, except at the time it was not actually in use due to synchronous_commit being set to 'local' on the benchmarking connections.
The master was shut down cleanly with a 'smart shutdown request' (in a state where substantial lag had accumulated--I don't know exactly how much, but at least 20,000 transactions had replayed after replication restarted before it stalled) when I got distracted by other things and decided to reboot the Ubuntu machine it was running on.
When I restarted the master PostgreSQL server, the replica started to catch up, but then eventually stalled.
On the master, I had this log, which occurred right after the first checkpoint (since the server restart) began:
4790 00000 2024-10-09 12:03:12.819 EDT LOG: invalidating obsolete replication slot "sub"
4790 00000 2024-10-09 12:03:12.819 EDT DETAIL: The slot's restart_lsn 1/84C5B510 exceeds the limit by 37374704 bytes.
4790 00000 2024-10-09 12:03:12.819 EDT HINT: You might need to increase "max_slot_wal_keep_size".
4790 00000 2024-10-09 12:03:12.819 EDT DETAIL: The slot's restart_lsn 1/84C5B510 exceeds the limit by 37374704 bytes.
4790 00000 2024-10-09 12:03:12.819 EDT HINT: You might need to increase "max_slot_wal_keep_size".
But max_slot_wal_keep_size was set to -1 and had never been set to anything other than that!
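For anyone reading along, the arithmetic behind the DETAIL line can be sketched as follows. This is only an illustration. It assumes the reported excess is the distance from the slot's restart_lsn to the oldest LSN the checkpoint intended to keep, and that the WAL segment size is the default 16 MB. The helper names parse_lsn and format_lsn are made up for this example and are not PostgreSQL functions.

```python
# Sketch: reconstruct the LSN "limit" implied by the DETAIL message.
# Assumptions (not confirmed in this thread): excess = limit - restart_lsn,
# and wal_segment_size is the default 16 MB.

WAL_SEGMENT_SIZE = 16 * 1024 * 1024  # assumed default segment size in bytes


def parse_lsn(text: str) -> int:
    """Convert an LSN written as 'X/Y' (two hex halves) to a 64-bit integer."""
    high, low = text.split("/")
    return (int(high, 16) << 32) | int(low, 16)


def format_lsn(value: int) -> str:
    """Convert a 64-bit integer back to the 'X/Y' hex notation."""
    return f"{value >> 32:X}/{value & 0xFFFFFFFF:X}"


restart_lsn = parse_lsn("1/84C5B510")  # from the DETAIL line
excess_bytes = 37374704                # from the DETAIL line

limit_lsn = restart_lsn + excess_bytes

print("implied limit:", format_lsn(limit_lsn))
print("on segment boundary:", limit_lsn % WAL_SEGMENT_SIZE == 0)
```

Under those assumptions the implied limit falls exactly on a segment boundary, which would be consistent with the limit being derived from a segment number rather than from a configured byte budget. That is an observation, not a diagnosis.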
The master was running 18devel-d94cf5ca7f. Not for any particular reason, but just because that is what I happened to have on when I started mucking around with this. I don't recall running this particular test in this manner before, and have no reason to think it is only broken in 18dev.
I'm going to try to reproduce this on 17.0, but in the meantime any other suggestions for investigating this?
I have noticed some previous similar complaints about max_slot_wal_keep_size being incorrectly invoked, but it didn't look like they were ever resolved.
Cheers,
Jeff
Dear Jeff,
Thanks for reporting the issue. I've tried to reproduce it (by adding a delay on the worker side and doing an immediate shutdown), but have not succeeded yet. If possible, could you please share a script to reproduce it? That would help with the analysis.
> I'm going to try to reproduce this on 17.0, but in the meantime any other suggestions for investigating this?
It would be very helpful to check the contents of the pg_stat_replication_slots view and the pg_wal directory of the server once you succeed in reproducing the problem. Also, please set log_min_messages = DEBUG2 so that the logs from RemoveOldXlogFiles() and RemoveXlogFile() are captured. I would like to see that log when you can reproduce it.
These suggestions are inspired by [1]. I am not sure whether that thread and yours are the same issue.
[1]: https://www.postgresql.org/message-id/flat/Yz2hivgyjS1RfMKs%40depesz.com
Best regards,
Hayato Kuroda
FUJITSU LIMITED