replication primary writing infinite number of WAL files - Mailing list pgsql-general
From: Les
Subject: replication primary writing infinite number of WAL files
Date:
Msg-id: CAKXe9UDZKGevV6t+Y0adUgy2-k-1gMdRUi4Baz=4BMKX1Oig_A@mail.gmail.com
Responses:
Re: replication primary writing infinite number of WAL files
Re: replication primary writing infinite number of WAL files
Re: replication primary writing infinite number of WAL files
List: pgsql-general
Hello,
Yesterday we were faced with a problem that we do not understand and cannot solve ourselves. We have a PostgreSQL cluster managed with repmgr, which has three members. All instances currently run "PostgreSQL 14.5 (Debian 14.5-2.pgdg110+2) on x86_64-pc-linux-gnu, compiled by gcc (Debian 10.2.1-6) 10.2.1 20210110, 64-bit". Under normal circumstances the write load is relatively low, averaging 4-5 MB/s of total writes on the disk that holds the data directory.

Yesterday the primary suddenly started writing to the pg_wal directory at a crazy pace, 1.5 GB/s, sometimes peaking above 3 GB/s. pg_wal kept growing and did not stop until the disk ran out of space. It happened so fast that we had no time to react. We stopped all applications (PostgreSQL clients) because we thought one of them was causing the problem, then increased the disk size and restarted the primary. The primary finished recovery and worked "normally" for a little while, but then started filling the disk again.

In the meantime we found that WAL was not getting through to one of the standbys. Originally the streaming replication connection was lost due to a network error; later it slowed down and the standby could not keep up with the primary. We then unregistered and disconnected this standby and deleted its replication slot. That let the primary remove the no-longer-needed files from pg_wal, and it started working normally again.

On further analysis of the database we did not find mass data changes in any of the tables. The only exception is a sequence whose value was moved millions of steps within a single minute. What is particularly interesting is that the sequence value did not increase afterwards; yet even though we restarted the primary (without connecting the normal clients), it kept writing endless WAL files until we deleted that standby's replication slot.

It is also interesting that there were two standbys, and dropping one of them "solved" the problem. The other standby could keep up with the writes, and it was also writing about 1.5 GB/s to its disk. Since we needed the third standby, we created a new one (from a fresh pg_basebackup) after completely deleting the old one. The new instance worked for about 12 hours. This morning the error occurred again, in the same form. Based on our earlier experience we immediately deleted the standby and its replication slot, and the problem resolved itself (except that the standby had to be deleted again). Nothing else was rebooted or restarted. I managed to save a small part of pg_wal before deleting the slot. We looked into it and saw something like this (see the attachment).
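In case it is useful to anyone looking at this, below is roughly how the slot situation can be inspected on the primary, and how the stuck slot was removed; the slot name is only an example, not our real one.

-- Run on the primary: which slots are pinning WAL, and how much is retained.
-- wal_status and safe_wal_size are available from PostgreSQL 13 on.
SELECT slot_name,
       slot_type,
       active,
       wal_status,
       safe_wal_size,
       pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)) AS retained_wal
FROM pg_replication_slots;

-- Drop the slot of the standby that has already been unregistered
-- (illustrative slot name):
SELECT pg_drop_replication_slot('standby3_slot');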
We looked at the PostgreSQL release history, and there are some bug fixes in version 14.7 that might have something to do with this:
https://www.postgresql.org/docs/release/14.7/
> Ignore invalidated logical-replication slots while determining oldest catalog xmin (Sirisha Chamarthi)
>
> A replication slot could prevent cleanup of dead tuples in the system catalogs even after it becomes invalidated due to exceeding max_slot_wal_keep_size. Thus, failure of a replication consumer could lead to indefinitely-large catalog bloat.
This might not be our problem after all. We do not have enough knowledge and experience to determine the cause. It is a production system, and we need to somehow ensure that this will not happen again. We would also like to add another standby, but we do not want to do that right now, because adding one seems to trigger the problem (or is at least strongly correlated with it).
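One mitigation we are looking at, so that a lagging standby cannot fill the disk again, is max_slot_wal_keep_size. A minimal sketch follows, assuming we accept that a slot falling too far behind gets invalidated; the 50GB value is only an example and is not tuned for our workload.

-- Cap how much WAL any single replication slot may retain (run on the primary).
-- Once a slot falls further behind than this, it is invalidated instead of
-- forcing pg_wal to grow without bound.
ALTER SYSTEM SET max_slot_wal_keep_size = '50GB';
SELECT pg_reload_conf();

-- An invalidated slot then shows wal_status = 'lost' in pg_replication_slots,
-- and the corresponding standby has to be rebuilt or resynchronized.

Given the 14.7 note quoted above, it probably only makes sense to rely on slot invalidation after updating to at least that minor release.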
Thank you,
Laszlo