
From: Andres Freund
Subject: Re: failures in t/031_recovery_conflict.pl on CI
Date:
Msg-id: 20220409220054.fqn5arvbeesmxdg5@alap3.anarazel.de
In response to: Re: failures in t/031_recovery_conflict.pl on CI (Andres Freund <andres@anarazel.de>)
Responses: Re: failures in t/031_recovery_conflict.pl on CI (Andres Freund <andres@anarazel.de>)
List: pgsql-hackers
Hi,

On 2022-04-08 22:05:01 -0700, Andres Freund wrote:
> On 2022-04-08 21:55:15 -0700, Andres Freund wrote:
> > on CI [1] the new t/031_recovery_conflict.pl is failing occasionally. Which is
> > interesting, because I ran it there dozens if not hundreds of times before
> > commit, with - I think - only cosmetic changes.
> 
> Scratch that part - I found an instance of the freebsd failure earlier, just
> didn't notice because that run failed for other reasons as well. So this might
> just have uncovered an older bug around recovery conflict handling,
> potentially platform dependent.
> 
> I guess I'll try to reproduce it on freebsd...

That failed.


Adding a bunch of debug statements, I think I might have found at least some
problems.

The code in LockBufferForCleanup() that actually interacts with other backends
is:

            /* Publish the bufid that Startup process waits on */
            SetStartupBufferPinWaitBufId(buffer - 1);
            /* Set alarm and then wait to be signaled by UnpinBuffer() */
            ResolveRecoveryConflictWithBufferPin();
            /* Reset the published bufid */
            SetStartupBufferPinWaitBufId(-1);

where ResolveRecoveryConflictWithBufferPin() sends procsignals to all other
backends and then waits with ProcWaitForSignal():

void
ProcWaitForSignal(uint32 wait_event_info)
{
    (void) WaitLatch(MyLatch, WL_LATCH_SET | WL_EXIT_ON_PM_DEATH, 0,
                     wait_event_info);
    ResetLatch(MyLatch);
    CHECK_FOR_INTERRUPTS();
}

One problem is that we pretty much immediately get a SIGALRM whenever we're in
that WaitLatch(), which does a SetLatch(), interrupting the WaitLatch(). The
startup process then proceeds to SetStartupBufferPinWaitBufId(-1).

In the unlucky cases, the backend holding the pin only processes the interrupt
(in RecoveryConflictInterrupt()) after the
SetStartupBufferPinWaitBufId(-1). The backend then sees that
HoldingBufferPinThatDelaysRecovery() returns false, and happily continues.
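
For reference, the check on the backend side is roughly this (paraphrased from
bufmgr.c, so don't trust the details too much):

bool
HoldingBufferPinThatDelaysRecovery(void)
{
    int         bufid = GetStartupBufferPinWaitBufId();

    /* nothing to conflict with if the startup process isn't waiting (anymore) */
    if (bufid < 0)
        return false;

    /* do we hold a pin on the buffer the startup process is waiting for? */
    if (GetPrivateRefCount(bufid + 1) > 0)
        return true;

    return false;
}

So once the startup process has done SetStartupBufferPinWaitBufId(-1), the
early return makes the conflict a no-op for the backend that still holds the
pin.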


But that's not the whole story, I think. It's a problem that leads to conflicts
being handled more slowly, but on its own it shouldn't leave us with timeouts
in the end.

Here's debugging output from a failing run, where I added a few debugging statements:

https://api.cirrus-ci.com/v1/artifact/task/6179111512047616/log/src/test/recovery/tmp_check/log/000_0recovery_conflict_standby.log
https://github.com/anarazel/postgres/commit/212268e753093861aa22a51657c6598c65eeb81b

Curiously, there's only
20644: received interrupt 11

Which is PROCSIG_RECOVERY_CONFLICT_STARTUP_DEADLOCK, not
PROCSIG_RECOVERY_CONFLICT_BUFFERPIN.
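
For reference, I think the ProcSignalReason numbering is as below (paraphrased
from procsignal.h, so the exact layout might differ a bit between branches):

typedef enum
{
    PROCSIG_CATCHUP_INTERRUPT,                      /* 0 */
    PROCSIG_NOTIFY_INTERRUPT,                       /* 1 */
    PROCSIG_PARALLEL_MESSAGE,                       /* 2 */
    PROCSIG_WALSND_INIT_STOPPING,                   /* 3 */
    PROCSIG_BARRIER,                                /* 4 */
    PROCSIG_LOG_MEMORY_CONTEXT,                     /* 5 */

    /* Recovery conflict reasons */
    PROCSIG_RECOVERY_CONFLICT_DATABASE,             /* 6 */
    PROCSIG_RECOVERY_CONFLICT_TABLESPACE,           /* 7 */
    PROCSIG_RECOVERY_CONFLICT_LOCK,                 /* 8 */
    PROCSIG_RECOVERY_CONFLICT_SNAPSHOT,             /* 9 */
    PROCSIG_RECOVERY_CONFLICT_BUFFERPIN,            /* 10 */
    PROCSIG_RECOVERY_CONFLICT_STARTUP_DEADLOCK,     /* 11 */

    NUM_PROCSIGNALS
} ProcSignalReason;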

I guess we've never gotten past the standby timeout.


20644: received interrupt 11
2022-04-09 21:18:23.824 UTC [20630][startup] DEBUG:  one cycle of LockBufferForCleanup() iterating in HS
2022-04-09 21:18:23.824 UTC [20630][startup] CONTEXT:  WAL redo at 0/3432800 for Heap2/PRUNE: latestRemovedXid 0 nredirected 0 ndead 1; blkref #0: rel 1663/16385/16386, blk 0

2022-04-09 21:18:23.824 UTC [20630][startup] DEBUG:  setting timeout() in 0 10000
2022-04-09 21:18:23.824 UTC [20630][startup] CONTEXT:  WAL redo at 0/3432800 for Heap2/PRUNE: latestRemovedXid 0 nredirected 0 ndead 1; blkref #0: rel 1663/16385/16386, blk 0

2022-04-09 21:18:23.835 UTC [20630][startup] DEBUG:  setting latch()
2022-04-09 21:18:23.835 UTC [20630][startup] CONTEXT:  WAL redo at 0/3432800 for Heap2/PRUNE: latestRemovedXid 0 nredirected 0 ndead 1; blkref #0: rel 1663/16385/16386, blk 0

2022-04-09 21:18:23.835 UTC [20630][startup] DEBUG:  setting timeout() in 0 3481
2022-04-09 21:18:23.835 UTC [20630][startup] CONTEXT:  WAL redo at 0/3432800 for Heap2/PRUNE: latestRemovedXid 0 nredirected 0 ndead 1; blkref #0: rel 1663/16385/16386, blk 0

20644: received interrupt 11
2022-04-09 21:23:47.975 UTC [20631][walreceiver] FATAL:  could not receive data from WAL stream: server closed the connection unexpectedly

So we sent a conflict interrupt, and then waited. And nothing happened.


What do we expect to wake the startup process up, once it has called
SendRecoveryConflictWithBufferPin()?
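
If I'm reading bufmgr.c right, the intended wakeup is the BM_PIN_COUNT_WAITER
path in UnpinBuffer(), roughly like this (paraphrased from memory, buffer
header lock handling elided, and the field / ProcSendSignal() argument details
may not match the tree exactly):

    /* Support LockBufferForCleanup() */
    if ((buf_state & BM_PIN_COUNT_WAITER) &&
        BUF_STATE_GET_REFCOUNT(buf_state) == 1)
    {
        /* we just released the last pin other than the waiter's */
        int     waiter = buf->wait_backend_pgprocno;

        buf_state &= ~BM_PIN_COUNT_WAITER;
        UnlockBufHdr(buf, buf_state);
        ProcSendSignal(waiter);     /* sets the waiting startup process's latch */
    }
    else
        UnlockBufHdr(buf, buf_state);

That only fires once the last conflicting pin is actually released, i.e. it
relies on the backend acting on the conflict interrupt.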

It's likely not the problem here, because we never seem to have even reached
that path, but afaics once we've called disable_all_timeouts() at the bottom
of ResolveRecoveryConflictWithBufferPin(), re-entered
ResolveRecoveryConflictWithBufferPin(), and gone into the "We're already
behind, so clear a path as quickly as possible." path, there's no guarantee
that any timeout is still pending?
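
To spell out what I mean, the shape of ResolveRecoveryConflictWithBufferPin()
is roughly the following (a paraphrased sketch, with the post-wait handling
elided):

void
ResolveRecoveryConflictWithBufferPin(void)
{
    TimestampTz ltime = GetStandbyLimitTime();

    if (!ltime)
    {
        /* willing to wait forever, so just arm the deadlock check */
        enable_timeout_after(STANDBY_DEADLOCK_TIMEOUT, DeadlockTimeout);
    }
    else if (GetCurrentTimestamp() >= ltime)
    {
        /* We're already behind, so clear a path as quickly as possible. */
        SendRecoveryConflictWithBufferPin(PROCSIG_RECOVERY_CONFLICT_BUFFERPIN);
    }
    else
    {
        /* wake up at ltime, and check for deadlocks after deadlock_timeout */
        EnableTimeoutParams timeouts[2];

        timeouts[0].id = STANDBY_TIMEOUT;
        timeouts[0].type = TMPARAM_AT;
        timeouts[0].fin_time = ltime;
        timeouts[1].id = STANDBY_DEADLOCK_TIMEOUT;
        timeouts[1].type = TMPARAM_AFTER;
        timeouts[1].delay_ms = DeadlockTimeout;
        enable_timeouts(timeouts, 2);
    }

    /* Wait to be signaled by UnpinBuffer() */
    ProcWaitForSignal(PG_WAIT_BUFFER_PIN);

    /* (post-wait handling elided) */

    disable_all_timeouts(false);
}

In the "already behind" branch no timeouts get armed at all before we go into
ProcWaitForSignal(), so if nobody sets our latch we just sit there.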

If there's either no backend that we're still conflicting with (an entirely
possible race condition), or if there's e.g. a snapshot or database conflict,
then afaics nobody sets the startup process's latch.

Greetings,

Andres Freund


