Re: Minimal logical decoding on standbys - Mailing list pgsql-hackers

From Drouvot, Bertrand
Subject Re: Minimal logical decoding on standbys
Date
Msg-id 69aad0bf-697a-04e1-df6c-0920ec8fa528@amazon.com
Whole thread Raw
In response to Re: Minimal logical decoding on standbys  (Andres Freund <andres@anarazel.de>)
Responses Re: Minimal logical decoding on standbys
List pgsql-hackers
Hi Andres,

On 7/27/21 7:22 PM, Andres Freund wrote:
> Hi,
>
> On 2021-07-27 09:23:48 +0200, Drouvot, Bertrand wrote:
>> Thanks for the warning, rebase done and new v21 version attached.
> Did you have a go at fixing the walsender race conditions I
> (re-)discovered? Without fixing those I don't see this patch going in...

Those new patches should be addressing all your previous code and TAP 
tests remarks, except those 2 for which I would need your input:

1. The first one is linked to your remarks:
"

 > While working on this I found a, somewhat substantial, issue:
 >
 > When the primary is idle, on the standby logical decoding via walsender
 > will typically not process the records until further WAL writes come in
 > from the primary, or until a 10s lapsed.
 >
 > The problem is that WalSndWaitForWal() waits for the *replay* LSN to
 > increase, but gets woken up by walreceiver when new WAL has been
 > flushed. Which means that typically walsenders will get woken up at the
 > same time that the startup process will be - which means that by the
 > time the logical walsender checks GetXLogReplayRecPtr() it's unlikely
 > that the startup process already replayed the record and updated
 > XLogCtl->lastReplayedEndRecPtr.
 >
 > I think fixing this would require too invasive changes at this point. I
 > think we might be able to live with 10s delay issue for one release, but
 > it sure is ugly :(.

This is indeed pretty painful. It's a lot more regularly occuring if you
either have a slot disk, or you switch around the order of
WakeupRecovery() and WalSndWakeup() XLogWalRcvFlush().

"

Is that what you are referring to as the “walsender race conditions”?
If so, do you already have in mind a way to handle this? (I thought you 
already had in mind a way to handle it so the question)

2. The second one is linked to your remark:

"There's also no test 
for a recovery conflict due to row removal"

Don't you think that the 
"vacuum full" conflict test is enough?

if not, what kind of additional 
tests would you like to see?


In the same time, I am attaching a new v22 as a rebase was needed since v21.

Thanks

Bertrand


Attachment

pgsql-hackers by date:

Previous
From: "houzj.fnst@fujitsu.com"
Date:
Subject: [BUG] wrong refresh when ALTER SUBSCRIPTION ADD/DROP PUBLICATION
Next
From: Robert Haas
Date:
Subject: Re: Background writer and checkpointer in crash recovery