Clean switchover - Mailing list pgsql-hackers

From Fujii Masao
Subject Clean switchover
Date
Msg-id CAHGQGwHLjEROTMtSWJd=xg_VFwRe3oJWnTYsyBDUbRYa6rr0DQ@mail.gmail.com
Whole thread Raw
Responses Re: Clean switchover
Re: Clean switchover
Re: Clean switchover
List pgsql-hackers
Hi,

In streaming replication, when we shutdown the master, walsender tries to
send all the outstanding WAL records including the shutdown checkpoint
record to the standby, and then to exit. This basically means that all the
WAL records are fully synced between two servers after the clean shutdown
of the master. So, after promoting the standby to new master, we can
restart the stopped master as new standby without the need for a fresh
backup from new master.

But there is one problem: though walsender tries to send all the outstanding
WAL records, it doesn't wait for them to be replicated to the standby. IOW,
walsender closes the replication connection as soon as it sends WAL records.
Then, before receiving all the WAL records, walreceiver can detect
the closure of connection and exit. We cannot guarantee that there is no
missing WAL in the standby after clean shutdown of the master. In this case,
backup from new master is required when restarting the stopped master as
new standby. I have experienced this case several times, especially when
enabling WAL archiving.

The attached patch fixes this problem. It just changes walsender so that it
waits for all the outstanding WAL records to be replicated to the standby
before closing the replication connection.

You may be concerned the case where the standby gets stuck and the
walsender keeps waiting for the reply from that standby. In this case,
wal_sender_timeout detects such inactive standby and then walsender
ends. So even in that case, the shutdown can end.

Thought?

Regards,

--
Fujii Masao

Attachment

pgsql-hackers by date:

Previous
From: Noah Misch
Date:
Subject: Re: JSON and unicode surrogate pairs
Next
From: Andrew Dunstan
Date:
Subject: Re: JSON and unicode surrogate pairs