Re: Taking into account syncrep position in flush_lsn reported by apply worker - Mailing list pgsql-hackers

From Heikki Linnakangas
Subject Re: Taking into account syncrep position in flush_lsn reported by apply worker
Msg-id f592bea4-b07d-462c-a915-bb23485d6826@iki.fi
In response to Re: Taking into account syncrep position in flush_lsn reported by apply worker  (Amit Kapila <amit.kapila16@gmail.com>)
Responses Re: Taking into account syncrep position in flush_lsn reported by apply worker
List pgsql-hackers
On 21/08/2024 09:25, Amit Kapila wrote:
> On Wed, Aug 21, 2024 at 2:25 AM Heikki Linnakangas <hlinnaka@iki.fi> wrote:
>>
>> On 14/08/2024 16:54, Arseny Sher wrote:
>>> On 8/13/24 06:35, Amit Kapila wrote:
>>>> On Mon, Aug 12, 2024 at 3:43 PM Arseny Sher <ars@neon.tech> wrote:
>>>>>
>>>>> Sorry for the poor formatting of the message above, this should be
>>>>> better:
>>>>>
>>>>> Hey. Currently synchronous_commit is disabled for the logical apply
>>>>> worker on the grounds that the reported flush_lsn includes only locally
>>>>> flushed data, so the slot (publisher) preserves everything after it, and
>>>>> so no data is lost if the subscriber restarts. However, imagine that the
>>>>> subscriber is made highly available by a standby to which synchronous
>>>>> replication is enabled. The reported flush_lsn is then ignorant of
>>>>> synchronous replication progress, and data loss may occur on failover
>>>>> if the subscriber managed to ack a flush_lsn ahead of syncrep.
>>>>
>>>> Can't the same be achieved by enabling the synchronous_commit
>>>> parameter for a subscription?
>>>
>>> Nope, because it would force a WAL flush and a wait for replication to
>>> the standby in the apply worker, slowing it down. What is currently
>>> missing is logic that does not wait for the synchronous commit, but
>>> still takes its progress into account when reporting flush_lsn.
>>
>> I think this patch makes sense. I'm not sure we've actually made any
>> promises on it, but it feels wrong that the slot's LSN might be advanced
>> past the LSN that has been acknowledged by the replica, if
>> synchronous replication is configured. I see little downside in making
>> that promise.
> 
> One possible downside of such a promise could be that the publisher
> may slow down for sync replication because it has to wait for all the
> configured sync_standbys of subscribers to acknowledge the LSN. I
> don't know how many applications can be impacted due to this if we do
> it by default but if we feel there won't be any such cases or they
> will be in the minority then it is okay to proceed with this.

It only slows down updating the flush LSN on the publisher, which is 
updated quite lazily anyway.

A more serious scenario is if the sync replica crashes or is not 
responding at all. In that case, the flush LSN on the publisher cannot 
advance, and WAL starts to accumulate. However, if a sync replica is not 
responding, that's very painful for the (subscribing) server anyway: all 
commits will hang waiting for the replica. Holding back the flush LSN on 
the publisher seems like a minor problem compared to that.
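
To make the stall scenario concrete, here is a toy Python model of how the reported flush LSN would behave under the proposed logic. It is only a sketch under stated assumptions: the function names, the use of plain integers for LSNs, and the list of (remote, local) LSN pairs are all illustrative and not taken from the patch or from the apply worker code.

```python
from typing import List, Optional, Tuple

# Hypothetical bookkeeping: one (remote_end_lsn, local_end_lsn) pair per
# applied transaction, in commit order. LSNs are plain integers here.
LsnMapping = List[Tuple[int, int]]


def durable_local_lsn(local_flush: int, sync_flush: Optional[int]) -> int:
    """Local LSN that is safe to acknowledge.

    sync_flush is the position acknowledged by the synchronous replica,
    or None when synchronous replication is not configured.
    """
    if sync_flush is None:
        return local_flush
    return min(local_flush, sync_flush)


def reported_flush_lsn(mapping: LsnMapping, local_flush: int,
                       sync_flush: Optional[int], previous: int = 0) -> int:
    """Remote LSN to report to the publisher as flushed."""
    durable = durable_local_lsn(local_flush, sync_flush)
    flushed = previous
    for remote_end, local_end in mapping:
        if local_end > durable:
            break  # this transaction is not yet durable everywhere
        flushed = remote_end
    return flushed


applied = [(100, 10), (200, 20), (300, 30)]

# No synchronous replication: only the local flush position matters.
no_syncrep = reported_flush_lsn(applied, local_flush=30, sync_flush=None)

# Sync replica lags: the report stops at what the replica has acknowledged.
lagging = reported_flush_lsn(applied, local_flush=30, sync_flush=20)

# Sync replica is stuck: the report cannot advance past the previous value,
# so the publisher keeps retaining WAL.
stuck = reported_flush_lsn(applied, local_flush=30, sync_flush=5)
```

In the stuck case the reported position stays where it was, however far the subscriber has flushed locally. That is the WAL accumulation on the publisher described above.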

It would be good to have some kind of escape hatch though. If you get 
into that situation, is there a way to advance the publisher's flush LSN 
even though the synchronous replica has crashed? You can temporarily 
turn off synchronous replication on the subscriber. That will release 
any stuck COMMITs on the subscriber too. In theory you might not want 
that, but in practice stuck COMMITs are so painful that if you are taking 
manual action, you probably do want to release them as well.
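
As a rough sketch of what that manual action could look like, the Python helper below builds the statements for turning synchronous replication off on the subscriber and restoring it later. `ALTER SYSTEM`, `synchronous_standby_names` and `pg_reload_conf()` are existing PostgreSQL features. The helper names and the `execute` callback are hypothetical: `execute` stands in for whatever client runs SQL on the subscriber, and it must run outside a transaction block because `ALTER SYSTEM` cannot be used inside one.

```python
from typing import Callable, List


def quote_literal(value: str) -> str:
    """Quote a string as a SQL literal by doubling single quotes."""
    return "'" + value.replace("'", "''") + "'"


def set_sync_standbys_sql(names: str) -> List[str]:
    """Statements that change synchronous_standby_names and reload config."""
    return [
        f"ALTER SYSTEM SET synchronous_standby_names = {quote_literal(names)}",
        "SELECT pg_reload_conf()",
    ]


def disable_sync_replication(execute: Callable[[str], None]) -> None:
    """Escape hatch: an empty list means no synchronous standbys."""
    for statement in set_sync_standbys_sql(""):
        execute(statement)


def restore_sync_replication(execute: Callable[[str], None],
                             previous_names: str) -> None:
    """Put the earlier setting back once the replica is healthy again."""
    for statement in set_sync_standbys_sql(previous_names):
        execute(statement)


# Dry run: collect the statements instead of sending them to a server.
issued: List[str] = []
disable_sync_replication(issued.append)
restore_sync_replication(issued.append, "FIRST 1 (standby1)")
```

Passing `issued.append` as the callback makes it easy to review the statements before running them against a real subscriber.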

-- 
Heikki Linnakangas
Neon (https://neon.tech)



