BUG #9420: WAL-Sender repeatedly killed - Mailing list pgsql-bugs

From proels@web.de
Subject BUG #9420: WAL-Sender repeatedly killed
Date
Msg-id 20140303125342.2159.79464@wrigleys.postgresql.org
Whole thread Raw
List pgsql-bugs
The following bug has been logged on the website:

Bug reference:      9420
Logged by:          Stefan Pröls
Email address:      proels@web.de
PostgreSQL version: 9.2.7
Operating system:   opensuse 11.4
Description:

I'm unsure whether this is really a bug, but at least the log messages don't
look like expected behavior.

We are running async. streaming replication between 2 PostgreSQL 9.2.7
servers. Last night the slave failed to reboot nicely because of a
filesystem inconsistency which required remote hands. So there was a delay
of about 30 hours until the slave was up an running again. Thus, it had to
catch up quite some replication delay. It did succeed doing so -- however,
according to the log messages the WAL sender process was killed several
times during this period. From the master's log:

2014-03-03 13:02:30 CET [unbekannt] standby LOG:  breche WAL-Sender-Prozess
ab wegen Zeitüberschreitung bei der Replikation
2014-03-03 13:05:08 CET [unbekannt] standby LOG:  breche WAL-Sender-Prozess
ab wegen Zeitüberschreitung bei der Replikation
[several more times]

>From the slave's log:

2014-03-03 12:47:18 CET   FATAL:  konnte keine Daten an den WAL-Stream
senden: Server beendete die Verbindung unerwartet
                Das heißt wahrscheinlich, dass der Server abnormal beendete
                bevor oder während die Anweisung bearbeitet wurde.

2014-03-03 13:00:34 CET   LOG:  ungültige magische Zahl 0000 in Logdatei 17,
Segment 111, Offset 3538944
2014-03-03 13:00:35 CET   LOG:  Streaming-Replikation hat erfolgreich mit
primärem Server verbunden
2014-03-03 13:03:07 CET   FATAL:  konnte keine Daten an den WAL-Stream
senden: Server beendete die Verbindung unerwartet
                Das heißt wahrscheinlich, dass der Server abnormal beendete
                bevor oder während die Anweisung bearbeitet wurde.

2014-03-03 13:04:04 CET   LOG:  ungültige magische Zahl 0000 in Logdatei 17,
Segment 185, Offset 5111808
2014-03-03 13:04:04 CET   LOG:  Streaming-Replikation hat erfolgreich mit
primärem Server verbunden
[also repeated serverali more times]

It appears the slave had trouble keeping up with the amount of data sent by
the master and, therefore, the wal-sender was repeatedly killed until the
slave finally catched up. I don't know where the "invalid magic number 0000"
comes from, though.

Note that the master has much faster disks (SSDs) than the slave
(SATA-HDDs), so some congestion on the master is likely to arise when the
master streams out more data than the slave can keep up with.

pgsql-bugs by date:

Previous
From: muhammadk@chetu.com
Date:
Subject: BUG #9422: PostgreSQL Zero downtime during deployement
Next
From: Pavel Stehule
Date:
Subject: Re: BUG #9422: PostgreSQL Zero downtime during deployement