Thread: Logical replication and wal segment retention
Hello, folks,

Yesterday, I had a small file system fill up, due to some logical replication testing we had been performing. We had been testing IBM’s IIDR system and apparently it had built a logical replication slot on my server. When the test was completed, nobody removed the slot, so WAL segments stopped being dropped. Now I can understand the difficulty separating what physical versus logical replication needs from the WAL segments, but as logical replication is database specific, not cluster wide, this behavior was a little unexpected, since the WAL segments are cluster wide. Are WAL segments going to pile up whenever something drops a logical replication connection? I’ve seen it, but it seems like this could be a bad thing.

- Jay

Sent from my iPhone
Hi Jay,

On Wed, Feb 27, 2019 at 07:40:26AM -0500, John Scalia wrote:
> Hello, folks,
>
> Yesterday, I had a small file system fill up, due to some logical
> replication testing we had been performing. We had been testing IBM’s IIDR
> system and apparently it had built a logical replication slot on my server.
> When the test was completed, nobody removed the slot, so WAL segments
> stopped being dropped. Now I can understand the difficulty separating what
> physical versus logical replication needs from the WAL segments, but as
> logical replication is database specific not cluster wide, this behavior was
> a little unexpected, since the WAL segments are cluster wide. Are WAL
> segments going to pile up whenever something drops a logical replication
> connection? I’ve seen it, but it seems like this could be a bad thing.

Since logical replication is piggybacked on physical replication, you cannot use the former without the latter. And yes, what you experienced is one of the dangers of using replication slots with a busy database (i.e. one producing lots of WAL) and a filesystem with little excess space. Under these circumstances, it is imperative to monitor (and alert on) anything going awry with your replication slot consumers, and/or the size of your wal/xlog directory. It's a feature of replication slots to work that way - but one that may end up biting you.

--
with best regards:
- Johannes Truschnigg ( johannes@truschnigg.info )

www:   https://johannes.truschnigg.info/
phone: +43 650 2 133337
xmpp:  johannes@truschnigg.info

Please do not bother me with HTML-email or attachments. Thank you.
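A monitoring query along these lines will show how much WAL each slot is holding back - a sketch assuming PostgreSQL 10 or later (on 9.x the equivalent functions are pg_current_xlog_location() and pg_xlog_location_diff()):

```sql
-- List every replication slot and the amount of WAL it is pinning.
-- A slot with active = false and a large retained_wal value is the
-- kind of orphan that eventually fills the WAL filesystem.
SELECT slot_name,
       slot_type,
       database,          -- NULL for physical slots, set for logical ones
       active,
       pg_size_pretty(
         pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)
       ) AS retained_wal
FROM pg_replication_slots
ORDER BY pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn) DESC;
```

Feeding this into the existing disk monitor would catch a stuck slot long before the filesystem fills.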
On 27/2/19 2:52 p.m., Johannes Truschnigg wrote:
> Hi Jay,
>
> On Wed, Feb 27, 2019 at 07:40:26AM -0500, John Scalia wrote:
>> Hello, folks,
>>
>> Yesterday, I had a small file system fill up, due to some logical
>> replication testing we had been performing. We had been testing IBM’s IIDR
>> system and apparently it had built a logical replication slot on my server.
>> When the test was completed, nobody removed the slot, so WAL segments
>> stopped being dropped. Now I can understand the difficulty separating what
>> physical versus logical replication needs from the WAL segments, but as
>> logical replication is database specific not cluster wide, this behavior was
>> a little unexpected, since the WAL segments are cluster wide. Are WAL
>> segments going to pile up whenever something drops a logical replication
>> connection? I’ve seen it, but it seems like this could be a bad thing.
> Since Logical Replication is piggybacked on Physical Replication, you cannot
> use the first without having the latter. And yes, what you experienced is one
> of the dangers of using replication slots when having a busy database (i.e.
> producing lots of WAL) and a filesystem with little excess space. Under these
> circumstances, it is imperative to monitor for (and alert on) anything going
> awry with your replication slot consumers, and/or the size of your wal/xlog
> directory. It's a feature of replication slots to work that way - but one that
> may end up biting you.

A logical approach for replication slots would be to accept a parameter specifying the maximum number of WAL files to retain, after which newer WALs would be removed and the primary server saved. Pretty much like the --archive-push-queue-max argument of pgbackrest.

--
Achilleas Mantzios
IT DEV Lead
IT DEPT
Dynacom Tankers Mgmt
I thought as much. The basic problem, however, is that I never created the logical slots. The IIDR application did that all by itself, and after it terminated, it did not bother to remove the slots. So, my disk monitor threw up when the WAL file system began to fill up. I was trying then to figure out why it did that.

Sent from my iPhone

> On Feb 27, 2019, at 7:57 AM, Achilleas Mantzios <achill@matrix.gatewaynet.com> wrote:
>
>> On 27/2/19 2:52 p.m., Johannes Truschnigg wrote:
>> Hi Jay,
>>
>>> On Wed, Feb 27, 2019 at 07:40:26AM -0500, John Scalia wrote:
>>> Hello, folks,
>>>
>>> Yesterday, I had a small file system fill up, due to some logical
>>> replication testing we had been performing. We had been testing IBM’s IIDR
>>> system and apparently it had built a logical replication slot on my server.
>>> When the test was completed, nobody removed the slot, so WAL segments
>>> stopped being dropped. Now I can understand the difficulty separating what
>>> physical versus logical replication needs from the WAL segments, but as
>>> logical replication is database specific not cluster wide, this behavior was
>>> a little unexpected, since the WAL segments are cluster wide. Are WAL
>>> segments going to pile up whenever something drops a logical replication
>>> connection? I’ve seen it, but it seems like this could be a bad thing.
>> Since Logical Replication is piggybacked on Physical Replication, you cannot
>> use the first without having the latter. And yes, what you experienced is one
>> of the dangers of using replication slots when having a busy database (i.e.
>> producing lots of WAL) and a filesystem with little excess space. Under these
>> circumstances, it is imperative to monitor for (and alert on) anything going
>> awry with your replication slot consumers, and/or the size of your wal/xlog
>> directory. It's a feature of replication slots to work that way - but one that
>> may end up biting you.
>
> A logical approach for replication slots would be to accept a parameter regarding max WAL files to retain, after which newer WALs will be removed and the primary server saved. Pretty much like : --archive-push-queue-max argument of pgbackrest.
>
> --
> Achilleas Mantzios
> IT DEV Lead
> IT DEPT
> Dynacom Tankers Mgmt
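For anyone hitting the same situation: once it is certain the consumer is gone for good, an orphaned slot can be removed by hand. A sketch (the slot name 'iidr_slot' is made up for illustration; substitute the real slot_name from the first query):

```sql
-- Find slots that nothing is consuming any more.
SELECT slot_name, slot_type, database
FROM pg_replication_slots
WHERE NOT active;

-- Drop one of them; the server can then recycle the WAL it was pinning
-- at the next checkpoint.
SELECT pg_drop_replication_slot('iidr_slot');
```

Dropping a slot that a consumer still intends to resume from will force that consumer to re-sync from scratch, so verify first.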
On Wed, Feb 27, 2019 at 02:57:46PM +0200, Achilleas Mantzios wrote:
> [...]
> A logical approach for replication slots would be to accept a parameter
> regarding max WAL files to retain, after which newer WALs will be removed
> and the primary server saved. Pretty much like : --archive-push-queue-max
> argument of pgbackrest .

Before replication slots were a thing, you had to carefully balance wal_keep_segments against WAL production and/or (usually and :)) set up a proper WAL archive, so that replication could soldier on even after a WAL receiver experienced service-interrupting trouble for a while. The benefit of that was that the WAL producer remained unaffected by such calamities (unless you bungled the archiving process profoundly). To me, that was the preferred trade-off for all the use-cases of replication I personally encountered.

If it were possible to have the best of both worlds (i.e. a kind of "high water mark number of WAL segments" setting per replication slot, over which the slot would be abandoned by the producer - with a heavy heart and lots of screaming in the logs, of course), that sure would be awesome. But at this time, we are where we are :)

--
with best regards:
- Johannes Truschnigg ( johannes@truschnigg.info )

www:   https://johannes.truschnigg.info/
phone: +43 650 2 133337
xmpp:  johannes@truschnigg.info

Please do not bother me with HTML-email or attachments. Thank you.
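Lacking such a per-slot high-water mark in the server itself, the closest approximation is an alerting query with a hand-picked threshold - a sketch assuming PostgreSQL 10+, with the 10 GiB figure chosen purely as an example:

```sql
-- Raise an alert for any slot holding back more WAL than the threshold.
-- 10 GiB here is an arbitrary example value; tune it to the filesystem.
SELECT slot_name,
       active,
       pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn) AS retained_bytes
FROM pg_replication_slots
WHERE pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)
      > 10::numeric * 1024 * 1024 * 1024;
```

Any row returned means a human (or a script) should decide whether to fix the consumer or drop the slot before pg_wal fills up.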
On Wed, Feb 27, 2019 at 08:29:09AM -0500, John Scalia wrote:
> I thought as much. The basic problem, however, is that I never created the
> logical slots. The IIDR application did that all by itself, and after it
> terminated, it did not bother to remove the slots. So, my disk monitor threw
> up when the WAL file system began to fill up. I was trying then to figure
> out why it did that.

I don't know the particular product that made you experience these troubles, but it could be on purpose (if it relies on consuming the WAL continuously, like a proper streaming replication slave/secondary would, and expects to be able to continue working where it left off before terminating) - or it could be a rather dangerous usability hurdle that should, at the very least, be clearly documented.

--
with best regards:
- Johannes Truschnigg ( johannes@truschnigg.info )

www:   https://johannes.truschnigg.info/
phone: +43 650 2 133337
xmpp:  johannes@truschnigg.info

Please do not bother me with HTML-email or attachments. Thank you.
When using replication slots with standby nodes, a master node retains the necessary WAL files in pg_xlog until the standby has received them. The cost is having to monitor the space used by WAL files in pg_xlog, as the disk space those files use is no longer strictly controlled by wal_keep_segments or checkpoint_segments, but by elements (perhaps) external to the server where the master node is running.
In the case of a standby node using streaming replication without a slot, the server does not actually wait for the standby to catch up if it disconnects, and simply deletes the WAL files that are no longer needed. This has the advantage of making the disk space used by WAL files easier to manage: checkpoint_segments still applies in this case, and the amount of WAL to keep on the master side can likewise be tuned with wal_keep_segments.
On Wed, Feb 27, 2019 at 6:10 PM John Scalia <jayknowsunix@gmail.com> wrote:
Hello, folks,
Yesterday, I had a small file system fill up, due to some logical replication testing we had been performing. We had been testing IBM’s IIDR system and apparently it had built a logical replication slot on my server. When the test was completed, nobody removed the slot, so WAL segments stopped being dropped. Now I can understand the difficulty separating what physical versus logical replication needs from the WAL segments, but as logical replication is database specific not cluster wide, this behavior was a little unexpected, since the WAL segments are cluster wide. Are WAL segments going to pile up whenever something drops a logical replication connection? I’ve seen it, but it seems like this could be a bad thing.
-
Jay
Sent from my iPhone
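For reference, the slot-less retention behavior described in the last message maps to settings along these lines in postgresql.conf. The values are arbitrary examples, the archive_command is a placeholder rather than a production-ready recipe, and checkpoint_segments only exists on pre-9.5 servers (replaced there by max_wal_size):

```
# postgresql.conf -- example values only
wal_keep_segments = 64                          # keep up to 64 extra 16MB segments (~1 GiB)
archive_mode = on                               # and/or archive WAL for standbys to replay
archive_command = 'cp %p /path/to/archive/%f'   # placeholder; use a robust archiver in practice
```

With these, a disconnected standby can fall back to the archive, and the master's WAL usage stays bounded regardless of what its consumers do.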