incorrect wal removal due to max_slot_wal_keep_size - Mailing list pgsql-hackers

From Jeff Janes
Subject incorrect wal removal due to max_slot_wal_keep_size
Msg-id CAMkU=1zvU1HjCighsRu3Xqo4tQBsyWhj0NySsx7D0i6zLsyomA@mail.gmail.com
Responses RE: incorrect wal removal due to max_slot_wal_keep_size
List pgsql-hackers
I was testing logical replication over my (remarkably bad) wifi network to see what kind of throughput and lag I would get.  I was using the default pgbench transaction as the workload generator, with all 4 tables being replicated.  I had synchronous replication configured via synchronous_standby_names, except that at the time it was not actually in use because synchronous_commit was set to 'local' on the benchmarking connections.

The master was shut down cleanly with a 'smart shutdown request' when I got distracted by other things and decided to reboot the Ubuntu machine it was running on.  At that point substantial lag had accumulated.  I don't know exactly how much, but at least 20,000 transactions were replayed after replication restarted, before it stalled.

When I restarted the master PostgreSQL server, the replica started to catch up, but then eventually stalled.

On the master, I had this log, which occurred right after the first checkpoint (since the server restart) began:

4790   00000 2024-10-09 12:03:12.819 EDT LOG:  invalidating obsolete replication slot "sub"
4790   00000 2024-10-09 12:03:12.819 EDT DETAIL:  The slot's restart_lsn 1/84C5B510 exceeds the limit by 37374704 bytes.
4790   00000 2024-10-09 12:03:12.819 EDT HINT:  You might need to increase "max_slot_wal_keep_size".
  
But max_slot_wal_keep_size was set to -1 and had never been set to anything other than that!
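
As a rough aid to reading the DETAIL line above, the sketch below converts the reported restart_lsn to a 64-bit byte position and adds the reported overshoot. It assumes that the byte count in the message is the distance from restart_lsn to the oldest WAL position the server intended to keep, and that the WAL segment size is the default 16 MB; neither is stated in the log itself. The helper names parse_lsn and format_lsn are made up for this illustration.

```python
# Sketch only: interprets the DETAIL line under the assumptions stated above.

WAL_SEGMENT_SIZE = 16 * 1024 * 1024  # assumed default segment size (16 MB)


def parse_lsn(text: str) -> int:
    """Convert an LSN printed as 'hi/lo' (two hex numbers) to a 64-bit integer."""
    hi, lo = text.split("/")
    return (int(hi, 16) << 32) | int(lo, 16)


def format_lsn(lsn: int) -> str:
    """Format a 64-bit integer back into the 'hi/lo' hex notation."""
    return f"{lsn >> 32:X}/{lsn & 0xFFFFFFFF:X}"


restart_lsn = parse_lsn("1/84C5B510")  # from the DETAIL line
overshoot = 37374704                   # "exceeds the limit by ... bytes"

# Assumed meaning: limit = restart_lsn + overshoot
oldest_kept = restart_lsn + overshoot

print(format_lsn(oldest_kept))              # 1/87000000
print(oldest_kept % WAL_SEGMENT_SIZE == 0)  # True
```

Under those assumptions the limit comes out at 1/87000000, which sits exactly on a 16 MB segment boundary. That would be consistent with the cutoff being derived from a whole WAL segment rather than from a configured byte limit, although the log alone does not prove it.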

The master was running 18devel-d94cf5ca7f, not for any particular reason, but just because that is what I happened to have running when I started mucking around with this.  I don't recall running this particular test in this manner before, and I have no reason to think it is only broken in 18devel.

I'm going to try to reproduce this on 17.0, but in the meantime, does anyone have other suggestions for investigating this?

I have noticed some previous similar complaints about max_slot_wal_keep_size being incorrectly invoked, but it didn't look like they were ever resolved.

Cheers,

Jeff
