Re: BUG: Cascading standby fails to reconnect after falling back to archive recovery - Mailing list pgsql-hackers

From Marco Nenciarini
Subject Re: BUG: Cascading standby fails to reconnect after falling back to archive recovery
Date
Msg-id CA+nrD2dRNzWAxc227uqy5tdFEk-UmK7R5965GYL9yzLzP+g6+Q@mail.gmail.com
In response to Re: BUG: Cascading standby fails to reconnect after falling back to archive recovery  (Xuneng Zhou <xunengzhou@gmail.com>)
List pgsql-hackers
Attached is a v2 patch that implements the "handshake clamp" approach
Xuneng suggested.  Rather than tracking lastStreamedFlush in
process-local state (which doesn't survive a cascade restart, as
Fujii-san demonstrated), it uses the WAL flush position already
returned by IDENTIFY_SYSTEM.
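
For readers following along, here is a minimal sketch (not code from the
patch) of how a flush position such as the xlogpos field returned by
IDENTIFY_SYSTEM can be turned into a comparable number.  It assumes the
usual textual LSN form of two hexadecimal halves separated by a slash;
the function name parse_lsn is hypothetical and used only for
illustration.

```python
def parse_lsn(text: str) -> int:
    """Convert an LSN written as 'HI/LO' (hex halves) into a 64-bit integer.

    Hypothetical helper for illustration only; it is not part of the patch.
    """
    hi, lo = text.strip().split("/")
    # The high half carries the upper 32 bits, the low half the lower 32 bits.
    return (int(hi, 16) << 32) | int(lo, 16)


# Once parsed, two positions can be compared with ordinary integer operators.
requested = parse_lsn("0/3000060")
upstream_flush = parse_lsn("0/3000000")
requested_is_ahead = requested > upstream_flush
```

With both positions as integers, "the startpoint is ahead of the
upstream" is a plain greater-than comparison.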

The walreceiver now checks the upstream's flush position before issuing
START_REPLICATION.  If the requested startpoint is ahead of that
position (on the same timeline), it waits for
wal_retrieve_retry_interval and retries.  This
works across restarts since it queries the upstream's live position on
every connection attempt, and requires no new state variables.

When timelines differ, we let START_REPLICATION handle the timeline
negotiation as before.
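
The decision described above can be sketched roughly as follows.  This
is only an illustration of the logic as stated in this mail, not the
patch itself: the names Action and clamp_decision and all parameter
names are made up here, and positions are assumed to be already
converted to integers.

```python
from enum import Enum


class Action(Enum):
    START = "issue START_REPLICATION"
    WAIT = "sleep for wal_retrieve_retry_interval, then retry"


def clamp_decision(start_tli: int, start_lsn: int,
                   upstream_tli: int, upstream_flush_lsn: int) -> Action:
    """Decide what to do after IDENTIFY_SYSTEM (hypothetical sketch).

    start_tli / start_lsn: timeline and position the standby wants to
    stream from.  upstream_tli / upstream_flush_lsn: values reported by
    the upstream on this connection attempt.
    """
    if start_tli != upstream_tli:
        # Timelines differ: leave the negotiation to START_REPLICATION.
        return Action.START
    if start_lsn > upstream_flush_lsn:
        # Same timeline, but the upstream has not flushed that far yet.
        return Action.WAIT
    return Action.START
```

Because the inputs come from the upstream on every connection attempt,
nothing needs to be remembered across a restart of the cascading
standby.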

The patch includes a TAP test (053_cascade_reconnect.pl) that
reproduces the scenario and verifies the fix.

Attachment
