Re: Exit walsender before confirming remote flush in logical replication - Mailing list pgsql-hackers

From Andres Freund
Subject Re: Exit walsender before confirming remote flush in logical replication
Msg-id 20230204130155.h6zegoibl3k4yqb3@alap3.anarazel.de
In response to Re: Exit walsender before confirming remote flush in logical replication  (Amit Kapila <amit.kapila16@gmail.com>)
Responses Re: Exit walsender before confirming remote flush in logical replication
List pgsql-hackers
Hi,

On 2023-02-02 11:21:54 +0530, Amit Kapila wrote:
> The main problem we want to solve here is to avoid shutdown failing in
> case walreceiver/applyworker is busy waiting for some lock or for some
> other reason as shown in the email [1].

Isn't handling this part of the job of wal_sender_timeout?


I don't at all agree that it's ok to just stop replicating changes
because we're blocked on network IO. The patch justifies this with:

> Currently, at shutdown, walsender processes wait to send all pending data and
> ensure the all data is flushed in remote node. This mechanism was added by
> 985bd7 for supporting clean switch over, but such use-case cannot be supported
> for logical replication. This commit remove the blocking in the case.

and at the start of the thread with:

> In case of logical replication, however, we cannot support the use-case that
> switches the role publisher <-> subscriber. Suppose same case as above, additional
> transactions are committed while doing step2. To catch up such changes subscriber
> must receive WALs related with trans, but it cannot be done because subscriber
> cannot request WALs from the specific position. In the case, we must truncate all
> data in new subscriber once, and then create new subscription with copy_data
> = true.

But that seems too narrow a view to me. Imagine you want to decommission
the current primary, and instead start to use the logical standby as the
primary. For that you'd obviously want to replicate the last few
changes. But with the proposed change, that'd be hard to ever achieve.

Note that even disallowing any writes on the logical primary would make
it hard to be sure that everything is replicated, because autovacuum,
bgwriter, checkpointer all can continue to write WAL. Without being able
to check that the last LSN has indeed been sent out, how do you know
that you didn't miss something?
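
To make the kind of check I mean concrete, here is a minimal sketch,
assuming you can somehow obtain both the publisher's final LSN and the
LSN the subscriber has confirmed as flushed. How those two values are
obtained is outside the sketch, and the helper names (parse_lsn,
caught_up) are made up for illustration, not an existing API. The only
fact it relies on is the textual LSN format: two hexadecimal numbers
separated by a slash, the high and low 32 bits of the position.

```python
def parse_lsn(text: str) -> int:
    """Convert a textual LSN such as '0/16B3748' to a 64-bit integer."""
    hi, lo = text.split("/")
    # High 32 bits come before the slash, low 32 bits after it.
    return (int(hi, 16) << 32) | int(lo, 16)


def caught_up(final_lsn: str, confirmed_flush_lsn: str) -> bool:
    """Hypothetical check: has the subscriber confirmed everything up to
    the publisher's final LSN?"""
    return parse_lsn(confirmed_flush_lsn) >= parse_lsn(final_lsn)
```

The comparison has to be done on the numeric value, not on the strings:
'1/0' is after '0/FFFFFFFF' even though it sorts before it as text.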

Greetings,

Andres Freund
