Re: replication primary writting infinite number of WAL files - Mailing list pgsql-general

From Sándor Daku
Subject Re: replication primary writting infinite number of WAL files
Msg-id CAKyoTgb80GwiGJdR92rtAh7DhU40r7SY3KeNep3nzrsjeBHp3Q@mail.gmail.com
In response to Re: replication primary writting infinite number of WAL files  (Ron Johnson <ronljohnsonjr@gmail.com>)
List pgsql-general


On Fri, 24 Nov 2023, 17:12 Ron Johnson, <ronljohnsonjr@gmail.com> wrote:
On Fri, Nov 24, 2023 at 11:00 AM Les <nagylzs@gmail.com> wrote:
[snip] 
Writing of WAL files continued after we shut down all clients, and restarted the primary PostgreSQL server.

The order was:

1. shut down all clients
2. stop the primary
3. start the primary
4. primary started to write like mad again
5. removed replication slot
6. primary stopped madness and deleted all WAL files (except for a few)

How can the primary server generate more and more WAL files (writes) after all clients have been shut down and the server was restarted? My only bet was the autovacuum. But I ruled that out, because removing a replication slot has no effect on the autovacuum (am I wrong?). Now you are saying that this looks like a huge rollback. Does rolling back changes require even more data to be written to the WAL after server restart? As far as I know, if something was not written to the WAL, then it is not something that can be rolled back. Does removing a replication slot lessen the amount of data needed to be written for a rollback (or for anything else)? It is a fact that the primary stopped writing at 1.5GB/sec the moment we removed the slot.

I'm not saying that you are wrong. Maybe there was a crazy application. I'm just saying that a crazy application cannot be the whole picture. It cannot explain this behaviour as a whole. Or maybe I have a deep misunderstanding about how WAL files work. On the second occasion, the primary had been running for a few minutes when pg_wal started to grow. We noticed that early, shut down all clients, then restarted the primary server. After the restart, the primary kept writing out WAL files for many more minutes, until we dropped the slot again. That is, it wrote much more data after the restart than before it, and it only stopped (exactly) when we removed the slot.
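
Since the behaviour described above hinges on the replication slot, one way to see how much WAL each slot is holding back is to compare its restart_lsn with the current WAL position. The following is only a sketch: the function name slot_retained_wal is made up for illustration, and it assumes PostgreSQL 10 or later, that it is run against the primary, and that psql can connect through the usual PG* environment variables.

```shell
#!/bin/sh
# Hypothetical helper (name is an assumption, not from the thread):
# lists every replication slot with the amount of WAL between its
# restart_lsn and the current write position of the primary.
slot_retained_wal() {
  # -X: ignore ~/.psqlrc, -A: unaligned output, -t: rows only, -F: field separator
  psql -X -A -t -F ' | ' -c "
    SELECT slot_name,
           slot_type,
           active,
           restart_lsn,
           pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)) AS retained_wal
    FROM pg_replication_slots
    ORDER BY restart_lsn;"
}
```

Calling slot_retained_wal before dropping a slot would show whether that slot is the one pinning the old segments. It only reports what is retained; it does not say what generated the WAL in the first place.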

pg_stat_activity will tell you something about what's happening even after you think "all clients have been shut down".
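
A minimal sketch of such a check, assuming PostgreSQL 10 or later (for the backend_type column) and a psql that can connect through the usual PG* environment variables. The function name active_backends is made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper (name is an assumption): shows every backend except
# the one running this query, including background processes such as
# walsenders and autovacuum workers, oldest transaction first.
active_backends() {
  # -X: ignore ~/.psqlrc, -x: expanded output, one column per line
  psql -X -x -c "
    SELECT pid,
           backend_type,
           usename,
           application_name,
           client_addr,
           state,
           xact_start,
           backend_xmin,
           left(query, 200) AS query
    FROM pg_stat_activity
    WHERE pid <> pg_backend_pid()
    ORDER BY xact_start NULLS LAST;"
}
```

Running active_backends right after the restart, with the clients supposedly stopped, would show whether anything other than background processes is still connected.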

I'd crank up the logging to at least:
log_error_verbosity = verbose
log_statement = all
track_activity_query_size = 10240
client_min_messages = notice
log_line_prefix = '%m\t%r\t%u\t%d\t%p\t%i\t%a\t%e\t'

I don't know if it makes any sense, but is there a relatively painless way to look into the produced WAL files to see what they are filled with? It might give some pointers to the source of the issue.
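
One candidate for this is pg_waldump, which ships with PostgreSQL and decodes WAL segments into readable records. The wrappers below are only a sketch: the function names wal_summary and wal_peek are made up for illustration, and they assume a pg_waldump of the same major version as the server, run by a user who can read the segment files.

```shell
#!/bin/sh
# Hypothetical helpers (names are assumptions, not from the thread).

# Per-record-type statistics for one segment, or for a range of segments.
#   $1 = path to the first WAL segment file
#   $2 = optional path to the last WAL segment file
wal_summary() {
  if [ -n "${2:-}" ]; then
    pg_waldump --stats=record "$1" "$2"
  else
    pg_waldump --stats=record "$1"
  fi
}

# Print the first N decoded records of a segment (default 50).
#   $1 = path to the WAL segment file
#   $2 = optional record limit
wal_peek() {
  pg_waldump -n "${2:-50}" "$1"
}
```

For example, wal_summary could be pointed at one of the recently written files in pg_wal. The statistics show which resource managers and record types account for most of the bytes, which may help tell heap changes, full-page images, and other record types apart.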

Regards,
Sándor
