Thread: failures in t/031_recovery_conflict.pl on CI
Hi, on CI [1] the new t/031_recovery_conflict.pl is failing occasionally. Which is interesting, because I ran it there dozens if not hundreds of times before commit, with - I think - only cosmetic changes. I've reproduced it in a private branch, with more logging. And the results are sure interesting. https://cirrus-ci.com/task/6448492666159104 https://api.cirrus-ci.com/v1/artifact/task/6448492666159104/log/src/test/recovery/tmp_check/log/031_recovery_conflict_standby.log The primary is waiting for 0/343A000 to be applied, which requires a recovery conflict to be detected and resolved. On the standby there's the following sequence (some omitted): prerequisite for recovery conflict: 2022-04-09 04:05:31.292 UTC [35071][client backend] [031_recovery_conflict.pl][2/2:0] LOG: statement: BEGIN; 2022-04-09 04:05:31.292 UTC [35071][client backend] [031_recovery_conflict.pl][2/2:0] LOG: statement: DECLARE test_recovery_conflict_cursorCURSOR FOR SELECT b FROM test_recovery_conflict_table1; 2022-04-09 04:05:31.293 UTC [35071][client backend] [031_recovery_conflict.pl][2/2:0] LOG: statement: FETCH FORWARD FROMtest_recovery_conflict_cursor; detecting the conflict: 2022-04-09 04:05:31.382 UTC [35038][startup] LOG: recovery still waiting after 28.821 ms: recovery conflict on buffer pin 2022-04-09 04:05:31.382 UTC [35038][startup] CONTEXT: WAL redo at 0/3432800 for Heap2/PRUNE: latestRemovedXid 0 nredirected0 ndead 1; blkref #0: rel 1663/16385/16386, blk 0 and then nothing until the timeout: 2022-04-09 04:09:19.317 UTC [35035][postmaster] LOG: received immediate shutdown request 2022-04-09 04:09:19.317 UTC [35035][postmaster] DEBUG: sending signal 3 to process 35071 2022-04-09 04:09:19.320 UTC [35035][postmaster] DEBUG: reaping dead processes 2022-04-09 04:09:19.320 UTC [35035][postmaster] DEBUG: reaping dead processes 2022-04-09 04:09:19.320 UTC [35035][postmaster] DEBUG: server process (PID 35071) exited with exit code 2 Afaics that has to mean something is broken around sending, receiving or processing of recovery conflict interrupts. All the failures so far were on freebsd, from what I can see. There were other failures in other tests, but I think for reverted or fixed things. Except for not previously triggering while the shmstats patch was in development, it's hard to tell whether it's a regression or just a longstanding bug - we never had tests for recovery conflicts... I don't really see how recovery prefetching could play a role here, clearly we've been trying to replay the record. So we're elsewhere... Greetings, Andres Freund https://cirrus-ci.com/github/postgres/postgres/master
Hi,

On 2022-04-08 21:55:15 -0700, Andres Freund wrote:
> on CI [1] the new t/031_recovery_conflict.pl is failing occasionally. Which is
> interesting, because I ran it there dozens if not hundreds of times before
> commit, with - I think - only cosmetic changes.

Scratch that part - I found an instance of the freebsd failure earlier, just didn't notice because that run failed for other reasons as well. So this might just have uncovered an older bug around recovery conflict handling, potentially platform dependent.

I guess I'll try to reproduce it on freebsd...

Greetings,

Andres Freund
Hi,

On 2022-04-08 22:05:01 -0700, Andres Freund wrote:
> On 2022-04-08 21:55:15 -0700, Andres Freund wrote:
> > on CI [1] the new t/031_recovery_conflict.pl is failing occasionally. Which is
> > interesting, because I ran it there dozens if not hundreds of times before
> > commit, with - I think - only cosmetic changes.
>
> Scratch that part - I found an instance of the freebsd failure earlier, just
> didn't notice because that run failed for other reasons as well. So this might
> just have uncovered an older bug around recovery conflict handling,
> potentially platform dependent.
>
> I guess I'll try to reproduce it on freebsd...

That failed. Adding a bunch of debug statements, I think I might have found at least some problems.

The code in LockBufferForCleanup() that actually interacts with other backends is:

		/* Publish the bufid that Startup process waits on */
		SetStartupBufferPinWaitBufId(buffer - 1);

		/* Set alarm and then wait to be signaled by UnpinBuffer() */
		ResolveRecoveryConflictWithBufferPin();

		/* Reset the published bufid */
		SetStartupBufferPinWaitBufId(-1);

where ResolveRecoveryConflictWithBufferPin() sends procsignals to all other backends and then waits with ProcWaitForSignal():

	void
	ProcWaitForSignal(uint32 wait_event_info)
	{
		(void) WaitLatch(MyLatch, WL_LATCH_SET | WL_EXIT_ON_PM_DEATH, 0,
						 wait_event_info);
		ResetLatch(MyLatch);
		CHECK_FOR_INTERRUPTS();
	}

One problem is that we pretty much immediately get a SIGALRM whenever we're in that WaitLatch(). Which does a SetLatch(), interrupting the WaitLatch(). The startup process then proceeds to SetStartupBufferPinWaitBufId(-1).

In the unlucky cases, the backend holding the pin only processes the interrupt (in RecoveryConflictInterrupt()) after the SetStartupBufferPinWaitBufId(-1). The backend then sees that HoldingBufferPinThatDelaysRecovery() returns false, and happily continues.

But that's not the whole story, I think. It's a problem leading to conflicts being handled more slowly, but it shouldn't eventually result in timeouts.

Here's debugging output from a failing run, where I added a few debugging statements:

https://api.cirrus-ci.com/v1/artifact/task/6179111512047616/log/src/test/recovery/tmp_check/log/000_0recovery_conflict_standby.log
https://github.com/anarazel/postgres/commit/212268e753093861aa22a51657c6598c65eeb81b

Curiously, there's only

	20644: received interrupt 11

Which is PROCSIG_RECOVERY_CONFLICT_STARTUP_DEADLOCK, not PROCSIG_RECOVERY_CONFLICT_BUFFERPIN. I guess we've never gotten past the standby timeout.

20644: received interrupt 11
2022-04-09 21:18:23.824 UTC [20630][startup] DEBUG: one cycle of LockBufferForCleanup() iterating in HS
2022-04-09 21:18:23.824 UTC [20630][startup] CONTEXT: WAL redo at 0/3432800 for Heap2/PRUNE: latestRemovedXid 0 nredirected 0 ndead 1; blkref #0: rel 1663/16385/16386, blk 0
2022-04-09 21:18:23.824 UTC [20630][startup] DEBUG: setting timeout() in 0 10000
2022-04-09 21:18:23.824 UTC [20630][startup] CONTEXT: WAL redo at 0/3432800 for Heap2/PRUNE: latestRemovedXid 0 nredirected 0 ndead 1; blkref #0: rel 1663/16385/16386, blk 0
2022-04-09 21:18:23.835 UTC [20630][startup] DEBUG: setting latch()
2022-04-09 21:18:23.835 UTC [20630][startup] CONTEXT: WAL redo at 0/3432800 for Heap2/PRUNE: latestRemovedXid 0 nredirected 0 ndead 1; blkref #0: rel 1663/16385/16386, blk 0
2022-04-09 21:18:23.835 UTC [20630][startup] DEBUG: setting timeout() in 0 3481
2022-04-09 21:18:23.835 UTC [20630][startup] CONTEXT: WAL redo at 0/3432800 for Heap2/PRUNE: latestRemovedXid 0 nredirected 0 ndead 1; blkref #0: rel 1663/16385/16386, blk 0
20644: received interrupt 11
2022-04-09 21:23:47.975 UTC [20631][walreceiver] FATAL: could not receive data from WAL stream: server closed the connection unexpectedly

So we sent a conflict interrupt, and then waited. And nothing happened.

What are we expecting to wake the startup process up, once it does SendRecoveryConflictWithBufferPin()?

It's likely not the problem here, because we never seem to have even reached that path, but afaics once we've called disable_all_timeouts() at the bottom of ResolveRecoveryConflictWithBufferPin() and then re-entered ResolveRecoveryConflictWithBufferPin(), and go into the "We're already behind, so clear a path as quickly as possible." path, there's no guarantee for any timeout to be pending anymore?

If there's either no backend that we're still conflicting with (an entirely possible race condition), or if there's e.g. a snapshot or database conflict, there's afaics nobody setting the startup process's latch.

Greetings,

Andres Freund
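To make the race concrete, here is a rough sketch of the backend-side handling described above - paraphrased from the PROCSIG_RECOVERY_CONFLICT_BUFFERPIN case in RecoveryConflictInterrupt(), not the exact tree code, which differs in detail across branches:

	case PROCSIG_RECOVERY_CONFLICT_BUFFERPIN:
		/*
		 * HoldingBufferPinThatDelaysRecovery() compares our pinned buffers
		 * against the bufid the startup process published via
		 * SetStartupBufferPinWaitBufId().  This interrupt can be serviced
		 * an arbitrarily long time after it was sent; if the startup
		 * process has meanwhile reset the published bufid to -1 (because
		 * the SIGALRM-driven SetLatch() ended its WaitLatch() early), the
		 * check returns false and the conflict is silently dropped: the
		 * backend keeps its pin and recovery has to detect the conflict
		 * all over again.
		 */
		if (!HoldingBufferPinThatDelaysRecovery())
			return;
		MyProc->recoveryConflictPending = true;
		/* ... fall through to request query cancel at the next CFI ... */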
Hi,

On 2022-04-09 15:00:54 -0700, Andres Freund wrote:
> What are we expecting to wake the startup process up, once it does
> SendRecoveryConflictWithBufferPin()?
>
> It's likely not the problem here, because we never seem to have even reached
> that path, but afaics once we've called disable_all_timeouts() at the bottom
> of ResolveRecoveryConflictWithBufferPin() and then re-entered
> ResolveRecoveryConflictWithBufferPin(), and go into the "We're already behind,
> so clear a path as quickly as possible." path, there's no guarantee for any
> timeout to be pending anymore?
>
> If there's either no backend that we're still conflicting with (an entirely
> possible race condition), or if there's e.g. a snapshot or database conflict,
> there's afaics nobody setting the startup process's latch.

It's not that (although I still suspect it's a problem). It's a self-deadlock, because StandbyTimeoutHandler(), which ResolveRecoveryConflictWithBufferPin() *explicitly enables*, calls SendRecoveryConflictWithBufferPin(). Which does CancelDBBackends(). Which ResolveRecoveryConflictWithBufferPin() also calls, if the deadlock timeout is reached.

To make it easier to hit, I put a pg_usleep(10000) in CancelDBBackends(), and boom:

(gdb) bt
#0  __futex_abstimed_wait_common64 (futex_word=futex_word@entry=0x7fd4cb001138, expected=expected@entry=0, clockid=clockid@entry=0, abstime=abstime@entry=0x0, private=<optimized out>, cancel=cancel@entry=true) at ../sysdeps/nptl/futex-internal.c:87
#1  0x00007fd4ce5a215b in __GI___futex_abstimed_wait_cancelable64 (futex_word=futex_word@entry=0x7fd4cb001138, expected=expected@entry=0, clockid=clockid@entry=0, abstime=abstime@entry=0x0, private=<optimized out>) at ../sysdeps/nptl/futex-internal.c:123
#2  0x00007fd4ce59e44f in do_futex_wait (sem=sem@entry=0x7fd4cb001138, abstime=0x0, clockid=0) at sem_waitcommon.c:112
#3  0x00007fd4ce59e4e8 in __new_sem_wait_slow64 (sem=0x7fd4cb001138, abstime=0x0, clockid=0) at sem_waitcommon.c:184
#4  0x00007fd4cf20d823 in PGSemaphoreLock (sema=0x7fd4cb001138) at pg_sema.c:327
#5  0x00007fd4cf2dd675 in LWLockAcquire (lock=0x7fd4cb001600, mode=LW_EXCLUSIVE) at /home/andres/src/postgresql/src/backend/storage/lmgr/lwlock.c:1324
#6  0x00007fd4cf2c36e7 in CancelDBBackends (databaseid=0, sigmode=PROCSIG_RECOVERY_CONFLICT_BUFFERPIN, conflictPending=false) at /home/andres/src/postgresql/src/backend/storage/ipc/procarray.c:3638
#7  0x00007fd4cf2cc579 in SendRecoveryConflictWithBufferPin (reason=PROCSIG_RECOVERY_CONFLICT_BUFFERPIN) at /home/andres/src/postgresql/src/backend/storage/ipc/standby.c:846
#8  0x00007fd4cf2cc69d in StandbyTimeoutHandler () at /home/andres/src/postgresql/src/backend/storage/ipc/standby.c:911
#9  0x00007fd4cf4e68d7 in handle_sig_alarm (postgres_signal_arg=14) at /home/andres/src/postgresql/src/backend/utils/misc/timeout.c:421
#10 <signal handler called>
#11 0x00007fd4cddfffc4 in __GI___select (nfds=0, readfds=0x0, writefds=0x0, exceptfds=0x0, timeout=0x7ffc6e5561c0) at ../sysdeps/unix/sysv/linux/select.c:71
#12 0x00007fd4cf52ea2a in pg_usleep (microsec=10000) at /home/andres/src/postgresql/src/port/pgsleep.c:56
#13 0x00007fd4cf2c36f1 in CancelDBBackends (databaseid=0, sigmode=PROCSIG_RECOVERY_CONFLICT_STARTUP_DEADLOCK, conflictPending=false) at /home/andres/src/postgresql/src/backend/storage/ipc/procarray.c:3640
#14 0x00007fd4cf2cc579 in SendRecoveryConflictWithBufferPin (reason=PROCSIG_RECOVERY_CONFLICT_STARTUP_DEADLOCK) at /home/andres/src/postgresql/src/backend/storage/ipc/standby.c:846
#15 0x00007fd4cf2cc50f in ResolveRecoveryConflictWithBufferPin () at /home/andres/src/postgresql/src/backend/storage/ipc/standby.c:820
#16 0x00007fd4cf2a996f in LockBufferForCleanup (buffer=43) at /home/andres/src/postgresql/src/backend/storage/buffer/bufmgr.c:4336
#17 0x00007fd4ceec911c in XLogReadBufferForRedoExtended (record=0x7fd4d106a618, block_id=0 '\000', mode=RBM_NORMAL, get_cleanup_lock=true, buf=0x7ffc6e556394) at /home/andres/src/postgresql/src/backend/access/transam/xlogutils.c:427
#18 0x00007fd4cee1aa41 in heap_xlog_prune (record=0x7fd4d106a618) at /home/andres/src/postgresql/src/backend/access/heap/heapam.c:8634

It's reproducible on Linux.

I'm lacking words I dare to put in an email to describe how bad an idea it is to call CancelDBBackends() from within a timeout function, particularly when the function enabling the timeout also calls that function itself. Before disabling timers.

I ...

Greetings,

Andres Freund
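Reduced to a skeleton, the recursion visible in that backtrace looks like this (names taken from the trace; bodies elided and heavily simplified):

	/* mainline: reached from ResolveRecoveryConflictWithBufferPin()
	 * when the deadlock timeout fires (frames #15..#13) */
	void
	CancelDBBackends(Oid databaseid, ProcSignalReason sigmode,
					 bool conflictPending)
	{
		LWLockAcquire(ProcArrayLock, LW_EXCLUSIVE);

		/* ... SIGALRM can arrive anywhere in here; the pg_usleep(10000)
		 * added above merely widens the window ... */

		LWLockRelease(ProcArrayLock);
	}

	/* standby-delay timeout handler, run from handle_sig_alarm (frame #9) */
	static void
	StandbyTimeoutHandler(void)
	{
		/*
		 * Re-enters CancelDBBackends() via SendRecoveryConflictWithBufferPin()
		 * (frames #8..#6).  The interrupted mainline already holds
		 * ProcArrayLock, so the nested LWLockAcquire() blocks forever,
		 * waiting on a lock held by the very process stuck in this handler.
		 */
		SendRecoveryConflictWithBufferPin(PROCSIG_RECOVERY_CONFLICT_BUFFERPIN);
	}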
Hi,

On 2022-04-09 16:10:02 -0700, Andres Freund wrote:
> It's not that (although I still suspect it's a problem). It's a self-deadlock,
> because StandbyTimeoutHandler(), which ResolveRecoveryConflictWithBufferPin()
> *explicitly enables*, calls SendRecoveryConflictWithBufferPin(). Which does
> CancelDBBackends(). Which ResolveRecoveryConflictWithBufferPin() also calls,
> if the deadlock timeout is reached.
>
> To make it easier to hit, I put a pg_usleep(10000) in CancelDBBackends(), and boom:
>
> [... backtrace ... ]
>
> It's reproducible on Linux.
>
> I'm lacking words I dare to put in an email to describe how bad an idea it is
> to call CancelDBBackends() from within a timeout function, particularly when
> the function enabling the timeout also calls that function itself. Before
> disabling timers.

It's been broken in different ways all the way back to 9.0, from what I can see, but I didn't check every single version.

Afaics the fix is to nuke the idea of doing anything substantial in the signal handler from orbit, and instead just set a flag in the handler. Then check whether the timeout happened after ProcWaitForSignal() and call SendRecoveryConflictWithBufferPin().

Also worth noting that the disable_all_timeouts() calls appear to break STARTUP_PROGRESS_TIMEOUT.

Greetings,

Andres Freund
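The proposed shape would be roughly the following - a minimal sketch only, with illustrative flag names, not the committed patch:

	/* set from the SIGALRM handlers; nothing else may happen there */
	static volatile sig_atomic_t got_standby_delay_timeout = false;
	static volatile sig_atomic_t got_standby_deadlock_timeout = false;

	static void
	StandbyTimeoutHandler(void)
	{
		/* handle_sig_alarm() sets MyLatch, which wakes ProcWaitForSignal() */
		got_standby_delay_timeout = true;
	}

	static void
	StandbyDeadLockHandler(void)
	{
		got_standby_deadlock_timeout = true;
	}

	void
	ResolveRecoveryConflictWithBufferPin(void)
	{
		/* ... enable the deadlock and standby-delay timeouts as before ... */

		/* woken by UnpinBuffer()'s SetLatch() or by a SIGALRM */
		ProcWaitForSignal(PG_WAIT_BUFFER_PIN);

		/*
		 * Back in straight-line code, where it's safe to take LWLocks
		 * and signal other backends.
		 */
		if (got_standby_delay_timeout)
			SendRecoveryConflictWithBufferPin(PROCSIG_RECOVERY_CONFLICT_BUFFERPIN);
		else if (got_standby_deadlock_timeout)
			SendRecoveryConflictWithBufferPin(PROCSIG_RECOVERY_CONFLICT_STARTUP_DEADLOCK);

		got_standby_delay_timeout = false;
		got_standby_deadlock_timeout = false;

		/* ... disable timeouts ... */
	}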
Andres Freund <andres@anarazel.de> writes:
> It's been broken in different ways all the way back to 9.0, from what I can
> see, but I didn't check every single version.

> Afaics the fix is to nuke the idea of doing anything substantial in the signal
> handler from orbit, and instead just set a flag in the handler.

+1. This is probably more feasible given the latch infrastructure than it was when that code was first written.

			regards, tom lane
Hi,

On 2022-04-09 19:34:26 -0400, Tom Lane wrote:
> Andres Freund <andres@anarazel.de> writes:
> > It's been broken in different ways all the way back to 9.0, from what I can
> > see, but I didn't check every single version.
>
> > Afaics the fix is to nuke the idea of doing anything substantial in the signal
> > handler from orbit, and instead just set a flag in the handler.
>
> +1. This is probably more feasible given the latch infrastructure
> than it was when that code was first written.

What do you think about just reordering the disable_all_timeouts() to be before the got_standby_deadlock_timeout check in the back branches? I think that should close at least the most obvious hole. And fix it properly in HEAD?

Greetings,

Andres Freund
Andres Freund <andres@anarazel.de> writes:
> On 2022-04-09 19:34:26 -0400, Tom Lane wrote:
>> +1. This is probably more feasible given the latch infrastructure
>> than it was when that code was first written.

> What do you think about just reordering the disable_all_timeouts() to be
> before the got_standby_deadlock_timeout check in the back branches? I think
> that should close at least the most obvious hole. And fix it properly in
> HEAD?

I don't have much faith in that, and I don't see why we can't fix it properly. Don't we just need to have the signal handler set MyLatch, and then do the unsafe stuff back in the "if (got_standby_deadlock_timeout)" stanza in mainline?

			regards, tom lane
Hi,

On 2022-04-12 15:05:22 -0400, Tom Lane wrote:
> Andres Freund <andres@anarazel.de> writes:
> > On 2022-04-09 19:34:26 -0400, Tom Lane wrote:
> >> +1. This is probably more feasible given the latch infrastructure
> >> than it was when that code was first written.
>
> > What do you think about just reordering the disable_all_timeouts() to be
> > before the got_standby_deadlock_timeout check in the back branches? I think
> > that should close at least the most obvious hole. And fix it properly in
> > HEAD?
>
> I don't have much faith in that, and I don't see why we can't fix it
> properly.

It's not too hard, agreed. It's somewhat hard to test that it works, though. The new recovery conflict tests test all recovery conflicts except for deadlock ones, because they're not easy to trigger... But I think I nearly got it working reliably. It's probably worth backpatching the tests, after stripping them of the stats checks?

Three questions:

- For HEAD we have to replace the disable_all_timeouts() calls, it breaks the replay progress reporting. Is there a reason to keep them in the backbranches? Hard to see how an extension or something could rely on it, but ...?

- I named the variable set by the signal handler got_standby_delay_timeout, because just got_standby_timeout looked weird beside the _deadlock_ one. Which makes me think that we should rename STANDBY_TIMEOUT to STANDBY_DELAY_TIMEOUT too?

- There's the following comment in ResolveRecoveryConflictWithBufferPin():

  "We assume that only UnpinBuffer() and the timeout requests established
  above can wake us up here."

  That's bogus afaict? There's plenty of other things that can cause MyProc->latch to be set. Is it worth doing something about this at the same time? Right now we seem to call ResolveRecoveryConflictWithBufferPin() in rapid succession initially.

Greetings,

Andres Freund
Hi,

Attached are patches for this issue. They add a test case for deadlock conflicts to make sure that case isn't broken. I also tested the recovery conflict tests in the back branches, and they work there with a reasonably small set of changes.

Questions:

- I'm planning to backpatch the test as 031_recovery_conflict.pl, even though preceding numbers are unused. It seems way more problematic to use a different number in the backbranches than have gaps?

- The test uses pump_until() and wait_for_log(), which don't exist in the backbranches. For now I've just inlined the implementation, but I guess we could also backpatch their introduction?

- There's a few incompatibilities in the test with older branches:
  - older branches don't have allow_in_place_tablespaces - I think just skipping tablespace conflicts is fine, they're comparatively simple. Eventually it might make sense to backpatch allow_in_place_tablespaces, our test coverage in the area is quite poor.
  - the stats tests can't easily be made reliable in the backbranches - which is ok, as the conflict itself is verified via the log
  - some branches don't have log_recovery_conflict_waits; since it's not critical to the test, it's ok to just not include it there

I played with the idea of handling the differences using version comparisons in the code, and having the test be identical across branches. Since it's something we don't do so far, I'm leaning against it, but ...

> - For HEAD we have to replace the disable_all_timeouts() calls, it breaks the
>   replay progress reporting. Is there a reason to keep them in the
>   backbranches? Hard to see how an extension or something could rely on it,
>   but ...?

I've left it as is for now, will start a separate thread.

> - There's the following comment in ResolveRecoveryConflictWithBufferPin():
>
>   "We assume that only UnpinBuffer() and the timeout requests established
>   above can wake us up here."
>
>   That's bogus afaict? There's plenty of other things that can cause MyProc->latch
>   to be set. Is it worth doing something about this at the same time? Right
>   now we seem to call ResolveRecoveryConflictWithBufferPin() in rapid
>   succession initially.

The comment is more recent than I had realized. I raised this separately in https://postgr.es/m/20220429191815.xewxjlpmq7mxhsr2%40alap3.anarazel.de

pgindent uses some crazy formatting nearby:

	SendRecoveryConflictWithBufferPin(
									  PROCSIG_RECOVERY_CONFLICT_STARTUP_DEADLOCK);

I'm tempted to clean that up in passing by having just one SendRecoveryConflictWithBufferPin() call instead of two, storing the type of conflict in a local variable? Doesn't look entirely pretty either, but ...

I'm very doubtful of this claim above ResolveRecoveryConflictWithBufferPin(), btw. But that'd be a non-backpatchable cleanup, I think:

	 * The ProcWaitForSignal() sleep normally done in LockBufferForCleanup()
	 * (when not InHotStandby) is performed here, for code clarity.

Greetings,

Andres Freund
Attachment
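To illustrate the cleanup being contemplated - one SendRecoveryConflictWithBufferPin() call with the conflict type in a local variable - a rough sketch (variable name illustrative, not the actual patch):

	ProcSignalReason reason;

	if (got_standby_delay_timeout)
		reason = PROCSIG_RECOVERY_CONFLICT_BUFFERPIN;
	else
		reason = PROCSIG_RECOVERY_CONFLICT_STARTUP_DEADLOCK;
	SendRecoveryConflictWithBufferPin(reason);

This keeps each call on one line, at the cost of an extra local variable.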
Andres Freund <andres@anarazel.de> writes:
> Questions:
> - I'm planning to backpatch the test as 031_recovery_conflict.pl, even though
>   preceding numbers are unused. It seems way more problematic to use a
>   different number in the backbranches than have gaps?

+1

> - The test uses pump_until() and wait_for_log(), which don't exist in the
>   backbranches. For now I've just inlined the implementation, but I guess we
>   could also backpatch their introduction?

I'd backpatch --- seems unlikely this will be the last need for 'em.

> pgindent uses some crazy formatting nearby:
> 	SendRecoveryConflictWithBufferPin(
> 									  PROCSIG_RECOVERY_CONFLICT_STARTUP_DEADLOCK);

I do not believe that that line break is pgindent's fault. If you just fold it into one line it should stay that way.

			regards, tom lane
Hi,

On 2022-04-29 19:26:59 -0400, Tom Lane wrote:
> Andres Freund <andres@anarazel.de> writes:
> > - The test uses pump_until() and wait_for_log(), which don't exist in the
> >   backbranches. For now I've just inlined the implementation, but I guess we
> >   could also backpatch their introduction?
>
> I'd backpatch --- seems unlikely this will be the last need for 'em.

Done.

I ended up committing the extension of the test first, before the fix. I think that's the cause of the failure on longfin and serinus. Let's hope the situation improves with the now also committed (and backpatched) fix.

> > pgindent uses some crazy formatting nearby:
> > 	SendRecoveryConflictWithBufferPin(
> > 									  PROCSIG_RECOVERY_CONFLICT_STARTUP_DEADLOCK);
>
> I do not believe that that line break is pgindent's fault.

Oh - I'm fairly certain I've seen pgindent do that in the past. But you're right, it's not. Perhaps it was an older version of pgindent?

Greetings,

Andres Freund
Andres Freund <andres@anarazel.de> writes:
> I ended up committing the extension of the test first, before the fix. I think
> that's the cause of the failure on longfin and serinus. Let's hope the
> situation improves with the now also committed (and backpatched) fix.

longfin's definitely not very happy: four out of six tries have ended with

psql:<stdin>:8: ERROR: canceling statement due to conflict with recovery
LINE 1: SELECT * FROM test_recovery_conflict_table2;
                      ^
DETAIL: User was holding shared buffer pin for too long.
timed out waiting for match: (?^:User transaction caused buffer deadlock with recovery.) at t/031_recovery_conflict.pl line 358.

I can poke into that tomorrow, but are you sure that that isn't an expectable result?

			regards, tom lane
Hi,

On 2022-05-02 23:44:32 -0400, Tom Lane wrote:
> Andres Freund <andres@anarazel.de> writes:
> > I ended up committing the extension of the test first, before the fix. I think
> > that's the cause of the failure on longfin and serinus. Let's hope the
> > situation improves with the now also committed (and backpatched) fix.
>
> longfin's definitely not very happy: four out of six tries have ended with

Too bad :(

> psql:<stdin>:8: ERROR: canceling statement due to conflict with recovery
> LINE 1: SELECT * FROM test_recovery_conflict_table2;
>                       ^
> DETAIL: User was holding shared buffer pin for too long.
> timed out waiting for match: (?^:User transaction caused buffer deadlock with recovery.) at t/031_recovery_conflict.pl line 358.
>
> I can poke into that tomorrow, but are you sure that that isn't an
> expectable result?

It's not expected. But I think I might see what the problem is:

	$psql_standby{stdin} .= qq[
	    BEGIN;
	    -- hold pin
	    DECLARE $cursor1 CURSOR FOR SELECT a FROM $table1;
	    FETCH FORWARD FROM $cursor1;
	    -- wait for lock held by prepared transaction
	    SELECT * FROM $table2;
	    ];
	ok( pump_until(
	        $psql_standby{run},     $psql_timeout,
	        \$psql_standby{stdout}, qr/^1$/m,),
	    "$sect: cursor holding conflicting pin, also waiting for lock, established"
	);

We wait for the FETCH (and thus the buffer pin to be acquired). But that doesn't guarantee that the lock has been acquired. We can't check that with pump_until() afaics, because there'll not be any output. But a query_until() checking pg_locks should do the trick?

Greetings,

Andres Freund
Andres Freund <andres@anarazel.de> writes:
> On 2022-05-02 23:44:32 -0400, Tom Lane wrote:
>> I can poke into that tomorrow, but are you sure that that isn't an
>> expectable result?

> It's not expected. But I think I might see what the problem is:
> We wait for the FETCH (and thus the buffer pin to be acquired). But that
> doesn't guarantee that the lock has been acquired. We can't check that with
> pump_until() afaics, because there'll not be any output. But a query_until()
> checking pg_locks should do the trick?

Irritatingly, it doesn't reproduce (at least not easily) in a manual build on the same box. So it's almost surely a timing issue, and your theory here seems plausible.

			regards, tom lane
On 2022-May-02, Andres Freund wrote:

> > > pgindent uses some crazy formatting nearby:
> > > 	SendRecoveryConflictWithBufferPin(
> > > 									  PROCSIG_RECOVERY_CONFLICT_STARTUP_DEADLOCK);
> >
> > I do not believe that that line break is pgindent's fault.
>
> Oh - I'm fairly certain I've seen pgindent do that in the past. But you're
> right, it's not. Perhaps it was an older version of pgindent?

No, it's never done that. We used to fold lines that way manually, because pgindent used to push the argument to the left, but that changed at commit 382ceffdf7f6.

-- 
Álvaro Herrera               PostgreSQL Developer  —  https://www.EnterpriseDB.com/
"Greatness is a transitory experience. It is never consistent. It depends
in large part on the myth-making imagination of humankind." (Irulan)
Hi,

On 2022-05-03 01:16:46 -0400, Tom Lane wrote:
> Andres Freund <andres@anarazel.de> writes:
> > On 2022-05-02 23:44:32 -0400, Tom Lane wrote:
> >> I can poke into that tomorrow, but are you sure that that isn't an
> >> expectable result?
>
> > It's not expected. But I think I might see what the problem is:
> > We wait for the FETCH (and thus the buffer pin to be acquired). But that
> > doesn't guarantee that the lock has been acquired. We can't check that with
> > pump_until() afaics, because there'll not be any output. But a query_until()
> > checking pg_locks should do the trick?
>
> Irritatingly, it doesn't reproduce (at least not easily) in a manual
> build on the same box.

Odd, given how readily it seems to reproduce on the bf. I assume you built with

> Uses -fsanitize=alignment -DWRITE_READ_PARSE_PLAN_TREES -DSTRESS_SORT_INT_MIN -DENFORCE_REGRESSION_TEST_NAME_RESTRICTIONS

> So it's almost surely a timing issue, and your theory here seems plausible.

Unfortunately I don't think my theory holds, because I actually had added a defense against this into the test that I forgot about momentarily...

	# just to make sure we're waiting for lock already
	ok( $node_standby->poll_query_until(
	        'postgres', qq[
	SELECT 'waiting' FROM pg_locks WHERE locktype = 'relation' AND NOT granted;
	], 'waiting'),
	    "$sect: lock acquisition is waiting");

and on longfin that step completes successfully.

I think what happens is that we get a buffer pin conflict, because these days we can actually process buffer pin conflicts while waiting for a lock. The easiest way to get around that is to increase the replay timeout for that test, I think? I think we need a restart, not a reload, because reloads aren't guaranteed to be processed at any certain point in time :/.

Testing a fix in a variety of timing circumstances now...

Greetings,

Andres Freund
Andres Freund <andres@anarazel.de> writes:
> On 2022-05-03 01:16:46 -0400, Tom Lane wrote:
>> Irritatingly, it doesn't reproduce (at least not easily) in a manual
>> build on the same box.

> Odd, given how readily it seems to reproduce on the bf. I assume you built with
>> Uses -fsanitize=alignment -DWRITE_READ_PARSE_PLAN_TREES -DSTRESS_SORT_INT_MIN -DENFORCE_REGRESSION_TEST_NAME_RESTRICTIONS

Yeah, I copied all that stuff ...

>> So it's almost surely a timing issue, and your theory here seems plausible.

> Unfortunately I don't think my theory holds, because I actually had added a
> defense against this into the test that I forgot about momentarily...

Oh, hm. I can try harder to repro it.

			regards, tom lane
Hi,

On 2022-05-03 14:23:23 -0400, Tom Lane wrote:
> Andres Freund <andres@anarazel.de> writes:
> >> So it's almost surely a timing issue, and your theory here seems plausible.
>
> > Unfortunately I don't think my theory holds, because I actually had added a
> > defense against this into the test that I forgot about momentarily...
>
> Oh, hm. I can try harder to repro it.

I've now reproduced it a couple times here running under rr, so it's probably not worth putting too much effort into that.

Attached is a fix for the test that I think should avoid the problem. Couldn't repro it with it applied, under both rr and valgrind.

My current problem is that I'm running into some IPC::Run issues (bug?). I get "ack Broken pipe:" iff I add "SELECT pg_sleep(1);" after "-- wait for lock held by prepared transaction". It doesn't happen without that debugging thing, but it makes me worried that it's something that'll come up in random situations.

It looks to me like it's a bug in IPC::Run - with a few changes I get the failure to happen inside pump_nb(), which seems like it shouldn't error out just because the child process exited... I *think* it might not happen without the sleep. But I'm not at all confident.

In general I'm kinda worried about how much effectively unmaintained Perl stuff we're depending on :(

Greetings,

Andres Freund
Attachment
Andres Freund <andres@anarazel.de> writes:
> Attached is a fix for the test that I think should avoid the problem. Couldn't
> repro it with it applied, under both rr and valgrind.

May I ask where we're at on this? Next week's back-branch release is getting uncomfortably close, and I'm still seeing various buildfarm animals erratically failing on 031_recovery_conflict.pl. Should we just remove that test from the back branches for now?

Also, it appears that the back-patch of pump_until failed to remove some pre-existing copies, eg check-world in v14 now reports

Subroutine pump_until redefined at t/013_crash_restart.pl line 248.
Subroutine pump_until redefined at t/022_crash_temp_files.pl line 272.

I didn't check whether these are exact duplicates.

			regards, tom lane
Hi,

On 2022-05-05 22:07:40 -0400, Tom Lane wrote:
> Andres Freund <andres@anarazel.de> writes:
> > Attached is a fix for the test that I think should avoid the problem. Couldn't
> > repro it with it applied, under both rr and valgrind.
>
> May I ask where we're at on this? Next week's back-branch release is
> getting uncomfortably close, and I'm still seeing various buildfarm
> animals erratically failing on 031_recovery_conflict.pl.

Yea, sorry. I had crappy internet connectivity / device access for the last few days, making it hard to make progress.

Looks like the problems are gone on HEAD at least.

> Should we just remove that test from the back branches for now?

That might be the best course, marking the test as TODO perhaps? Unfortunately a pg_ctl reload isn't processed reliably by the time the next test steps execute (saw that once when running in a loop), and a restart causes other problems (throws stats away).

> Also, it appears that the back-patch of pump_until failed to remove
> some pre-existing copies, eg check-world in v14 now reports
> Subroutine pump_until redefined at t/013_crash_restart.pl line 248.
> Subroutine pump_until redefined at t/022_crash_temp_files.pl line 272.
>
> I didn't check whether these are exact duplicates.

They're not quite identical copies, which is why I left them in place. But the warnings clearly make that a bad idea. I somehow mis-extrapolated from Cluster.pm, where it's not a problem (accessed via object). I'll remove them.

Greetings,

Andres Freund
Andres Freund <andres@anarazel.de> writes:
> On 2022-05-05 22:07:40 -0400, Tom Lane wrote:
>> May I ask where we're at on this? Next week's back-branch release is
>> getting uncomfortably close, and I'm still seeing various buildfarm
>> animals erratically failing on 031_recovery_conflict.pl.

> Looks like the problems are gone on HEAD at least.

It does look that way, although the number of successes is not large yet.

>> Should we just remove that test from the back branches for now?

> That might be the best course, marking the test as TODO perhaps?

I poked closer and saw that you reverted 5136967f1 et al because (I suppose) adjust_conf is not there in the back branches. While I'd certainly support back-patching that functionality, I think we need to have a discussion about how to do it. I wonder whether we shouldn't drop src/test/perl/PostgreSQL/... into the back branches in toto and make the old test APIs into a wrapper around the new ones instead of vice versa. But that's definitely not a task to undertake three days before a release deadline.

So I reluctantly vote for removing 031_recovery_conflict.pl in the back branches for now, with the expectation that we'll fix the infrastructure and put it back after the current release round is done.

			regards, tom lane
Hi,

On 2022-05-05 23:36:22 -0400, Tom Lane wrote:
> Andres Freund <andres@anarazel.de> writes:
> > On 2022-05-05 22:07:40 -0400, Tom Lane wrote:
> >> May I ask where we're at on this? Next week's back-branch release is
> >> getting uncomfortably close, and I'm still seeing various buildfarm
> >> animals erratically failing on 031_recovery_conflict.pl.
>
> > Looks like the problems are gone on HEAD at least.
>
> It does look that way, although the number of successes is not large yet.
>
> >> Should we just remove that test from the back branches for now?
>
> > That might be the best course, marking the test as TODO perhaps?
>
> I poked closer and saw that you reverted 5136967f1 et al because
> (I suppose) adjust_conf is not there in the back branches.

Yea. That one was a stupid mistake, working outside my usual environment. It's easy enough to work around adjust_conf not existing (just appending works), but then there's subsequent test failures...

> So I reluctantly vote for removing 031_recovery_conflict.pl in the
> back branches for now, with the expectation that we'll fix the
> infrastructure and put it back after the current release round
> is done.

What about instead marking the flapping test TODO? That'd still give us most of the coverage...

Greetings,

Andres Freund
Andres Freund <andres@anarazel.de> writes:
> On 2022-05-05 23:36:22 -0400, Tom Lane wrote:
>> So I reluctantly vote for removing 031_recovery_conflict.pl in the
>> back branches for now, with the expectation that we'll fix the
>> infrastructure and put it back after the current release round
>> is done.

> What about instead marking the flapping test TODO? That'd still give us most
> of the coverage...

Are you sure there's just one test that's failing? I haven't checked the buildfarm history close enough to be sure of that. But if it's true, disabling just that one would be fine (again, as a stopgap measure).

			regards, tom lane
Hi,

On 2022-05-05 23:57:28 -0400, Tom Lane wrote:
> Andres Freund <andres@anarazel.de> writes:
> > On 2022-05-05 23:36:22 -0400, Tom Lane wrote:
> >> So I reluctantly vote for removing 031_recovery_conflict.pl in the
> >> back branches for now, with the expectation that we'll fix the
> >> infrastructure and put it back after the current release round
> >> is done.
>
> > What about instead marking the flapping test TODO? That'd still give us most
> > of the coverage...
>
> Are you sure there's just one test that's failing? I haven't checked
> the buildfarm history close enough to be sure of that. But if it's
> true, disabling just that one would be fine (again, as a stopgap
> measure).

I looked through all the failures I found and it's two kinds of failures, both related to the deadlock test. So I'm thinking of skipping just that test as in the attached.

Working on committing / backpatching that, unless somebody suggests changes quickly...

Greetings,

Andres Freund
Attachment
Andres Freund <andres@anarazel.de> writes:
> I looked through all the failures I found and it's two kinds of failures, both
> related to the deadlock test. So I'm thinking of skipping just that test as in
> the attached.

> Working on committing / backpatching that, unless somebody suggests changes
> quickly...

WFM.

			regards, tom lane
On 2022-05-06 12:12:19 -0400, Tom Lane wrote:
> Andres Freund <andres@anarazel.de> writes:
> > I looked through all the failures I found and it's two kinds of failures, both
> > related to the deadlock test. So I'm thinking of skipping just that test as in
> > the attached.
>
> > Working on committing / backpatching that, unless somebody suggests changes
> > quickly...
>
> WFM.

Done. Perhaps you could trigger a run on longfin, that seems to have been the most reliably failing animal?

Greetings,

Andres Freund
Andres Freund <andres@anarazel.de> writes:
> Done. Perhaps you could trigger a run on longfin, that seems to have been the
> most reliably failing animal?

No need, its cron job launched already.

			regards, tom lane
Andres Freund <andres@anarazel.de> writes:
> On 2022-05-05 23:57:28 -0400, Tom Lane wrote:
>> Are you sure there's just one test that's failing? I haven't checked
>> the buildfarm history close enough to be sure of that. But if it's
>> true, disabling just that one would be fine (again, as a stopgap
>> measure).

> I looked through all the failures I found and it's two kinds of failures, both
> related to the deadlock test. So I'm thinking of skipping just that test as in
> the attached.

Per lapwing's latest results [1], this wasn't enough. I'm again thinking we should pull the whole test from the back branches.

			regards, tom lane

[1] https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=lapwing&dt=2022-05-07%2016%3A40%3A04
Hi,

On 2022-05-08 11:28:34 -0400, Tom Lane wrote:
> Andres Freund <andres@anarazel.de> writes:
> > On 2022-05-05 23:57:28 -0400, Tom Lane wrote:
> >> Are you sure there's just one test that's failing? I haven't checked
> >> the buildfarm history close enough to be sure of that. But if it's
> >> true, disabling just that one would be fine (again, as a stopgap
> >> measure).
>
> > I looked through all the failures I found and it's two kinds of failures, both
> > related to the deadlock test. So I'm thinking of skipping just that test as in
> > the attached.
>
> Per lapwing's latest results [1], this wasn't enough. I'm again thinking
> we should pull the whole test from the back branches.

That failure is different from the earlier failures though. I don't think it's a timing issue in the test like the deadlock check one. I rather suspect it's indicative of further problems in this area. Potentially the known problem with RecoveryConflictInterrupt() running in the signal handler? I think Thomas has a patch for that...

One failure in ~20 runs, on one animal doesn't seem worth disabling the test for.

Greetings,

Andres Freund
Andres Freund <andres@anarazel.de> writes:
> On 2022-05-08 11:28:34 -0400, Tom Lane wrote:
>> Per lapwing's latest results [1], this wasn't enough. I'm again thinking
>> we should pull the whole test from the back branches.

> That failure is different from the earlier failures though. I don't think it's
> a timing issue in the test like the deadlock check one. I rather suspect it's
> indicative of further problems in this area.

Yeah, that was my guess too.

> Potentially the known problem
> with RecoveryConflictInterrupt() running in the signal handler? I think Thomas
> has a patch for that...

Maybe; or given that it's on v10, it could be telling us about some yet-other problem we perhaps solved since then without realizing it needed to be back-patched.

> One failure in ~20 runs, on one animal doesn't seem worth disabling the test
> for.

No one is going to thank us for shipping a known-unstable test case. It does nothing to fix the problem; all it will lead to is possible failures during package builds. I have no idea whether any packagers use "make check-world" rather than just "make check" while building. But if they do, even fairly low-probability failures can be problematic.

(I still carry the scars I acquired while working at Red Hat and being responsible for packaging mysql: at least back then, their test suite was full of cases that mostly worked fine, except when getting stressed in Red Hat's build farm. Dealing with a test suite that fails 50% of the time under load, while trying to push out an urgent security fix, is NOT a pleasant situation.)

I'm happy to have this test in the stable branches once we have committed fixes that address all known problems. Until then, it will just be a nuisance for anyone who is not a developer working on those problems.

			regards, tom lane
Hi,

On 2022-05-08 13:59:09 -0400, Tom Lane wrote:
> Andres Freund <andres@anarazel.de> writes:
> > On 2022-05-08 11:28:34 -0400, Tom Lane wrote:
> >> Per lapwing's latest results [1], this wasn't enough. I'm again thinking
> >> we should pull the whole test from the back branches.
>
> > That failure is different from the earlier failures though. I don't think it's
> > a timing issue in the test like the deadlock check one. I rather suspect it's
> > indicative of further problems in this area.
>
> Yeah, that was my guess too.
>
> > Potentially the known problem
> > with RecoveryConflictInterrupt() running in the signal handler? I think Thomas
> > has a patch for that...
>
> Maybe; or given that it's on v10, it could be telling us about some
> yet-other problem we perhaps solved since then without realizing
> it needed to be back-patched.
>
> > One failure in ~20 runs, on one animal doesn't seem worth disabling the test
> > for.
>
> No one is going to thank us for shipping a known-unstable test case.

IDK, hiding failures indicating bugs isn't really better, at least if it doesn't look like a bug in the test. But you seem to have a stronger opinion on this than me, so I'll skip the entire test for now :/

Greetings,

Andres Freund
On 2022-05-08 15:11:39 -0700, Andres Freund wrote:
> But you seem to have a stronger opinion on this than me, so I'll skip the
> entire test for now :/

And done.
Andres Freund <andres@anarazel.de> writes:
> On 2022-05-08 15:11:39 -0700, Andres Freund wrote:
>> But you seem to have a stronger opinion on this than me, so I'll skip the
>> entire test for now :/

> And done.

Thanks, I appreciate that.

			regards, tom lane
On 2022-May-08, Andres Freund wrote:

> On 2022-05-08 13:59:09 -0400, Tom Lane wrote:
> > No one is going to thank us for shipping a known-unstable test case.
>
> IDK, hiding failures indicating bugs isn't really better, at least if it
> doesn't look like a bug in the test. But you seem to have a stronger opinion
> on this than me, so I'll skip the entire test for now :/

Hey, I just noticed that these tests are still disabled. The next minors are coming soon; should we wait until *those* are done and then re-enable; or re-enable them now to see how they fare and then re-disable before the next minors if there's still problems we don't find fixes for?

-- 
Álvaro Herrera         48°01'N 7°57'E  —  https://www.EnterpriseDB.com/
"How strange it is to find the words "Perl" and "saner" in such close
proximity, with no apparent sense of irony. I doubt that Larry himself
could have managed it."  (ncm, http://lwn.net/Articles/174769/)
Alvaro Herrera <alvherre@alvh.no-ip.org> writes:
> Hey, I just noticed that these tests are still disabled. The next
> minors are coming soon; should we wait until *those* are done and then
> re-enable; or re-enable them now to see how they fare and then
> re-disable before the next minors if there's still problems we don't
> find fixes for?

Maybe I missed it, but I don't think anything's been done to fix the test's problems in the back branches.

			regards, tom lane
Hi,

On 2022-07-26 12:47:38 -0400, Tom Lane wrote:
> Alvaro Herrera <alvherre@alvh.no-ip.org> writes:
> > Hey, I just noticed that these tests are still disabled. The next
> > minors are coming soon; should we wait until *those* are done and then
> > re-enable; or re-enable them now to see how they fare and then
> > re-disable before the next minors if there's still problems we don't
> > find fixes for?
>
> Maybe I missed it, but I don't think anything's been done to fix the
> test's problems in the back branches.

Yea, I don't think so either. What's worse, there are several actual problems in the recovery conflict code in all branches - which cause occasional failures of the test in HEAD as well.

There's one big set of fixes in
https://www.postgresql.org/message-id/CA%2BhUKGKrLKx7Ky1T_FHk-Y729K0oie-gOXKCbxCXyjbPDJAOOw%40mail.gmail.com
and I suspect it'll be hard to have the test be fully reliable without addressing
https://postgr.es/m/20220715172938.t7uivghdx5vj36cn%40awork3.anarazel.de

Greetings,

Andres Freund
Hello.

Even after applying the patch, we are still facing an "ack Broken pipe" problem. It occurs on the arm64 platform, presumably under high load. Here is a log snippet from buildfarm:

...
[19:08:12.150](0.394s) ok 13 - startup deadlock: cursor holding conflicting pin, also waiting for lock, established
[19:08:12.208](0.057s) ok 14 - startup deadlock: lock acquisition is waiting
Waiting for replication conn standby's replay_lsn to pass 0/33F5FE0 on primary
done
psql:<stdin>:8: ERROR: canceling statement due to conflict with recovery
LINE 1: SELECT * FROM test_recovery_conflict_table2;
                      ^
DETAIL: User transaction caused buffer deadlock with recovery.
[19:08:12.319](0.112s) ok 15 - startup deadlock: logfile contains terminated connection due to recovery conflict
ack Broken pipe: write( 13, '\\q ' ) at /usr/share/perl5/IPC/Run/IO.pm line 550.
### Stopping node "primary" using mode immediate
# Running: pg_ctl -D /home/bf/build/buildfarm-flaviventris/REL_15_STABLE/pgsql.build/src/test/recovery/tmp_check/t_031_recovery_conflict_primary_data/pgdata -m immediate stop
waiting for server to shut down... done
server stopped
# No postmaster PID for node "primary"
### Stopping node "standby" using mode immediate
# Running: pg_ctl -D /home/bf/build/buildfarm-flaviventris/REL_15_STABLE/pgsql.build/src/test/recovery/tmp_check/t_031_recovery_conflict_standby_data/pgdata -m immediate stop
waiting for server to shut down.... done
server stopped
# No postmaster PID for node "standby"
[19:08:12.450](0.131s) # Tests were run but no plan was declared and done_testing() was not seen.
[19:08:12.450](0.000s) # Looks like your test exited with 32 just after 15.
...

Below is a test report:

...
[20:46:35] t/030_stats_cleanup_replica.pl ....... ok     9956 ms ( 0.01 usr 0.00 sys + 3.49 cusr 2.49 csys = 5.99 CPU)
# Tests were run but no plan was declared and done_testing() was not seen.
# Looks like your test exited with 32 just after 15.
[20:46:43] t/031_recovery_conflict.pl ........... Dubious, test returned 32 (wstat 8192, 0x2000)
All 15 subtests passed
[20:46:56] t/032_relfilenode_reuse.pl ........... ok    13625 ms ( 0.01 usr 0.00 sys + 3.53 cusr 2.39 csys = 5.93 CPU)
[20:47:03] t/110_wal_sender_check_crc.pl ........ ok     6421 ms ( 0.00 usr 0.00 sys + 3.20 cusr 1.87 csys = 5.07 CPU)
[20:47:03]

Test Summary Report
-------------------
t/031_recovery_conflict.pl         (Wstat: 8192 Tests: 15 Failed: 0)
  Non-zero exit status: 32
  Parse errors: No plan found in TAP output

-----Original Message-----
From: Andres Freund <andres@anarazel.de>
Sent: Tuesday, May 3, 2022 11:13 PM
To: Tom Lane <tgl@sss.pgh.pa.us>
Cc: Robert Haas <robertmhaas@gmail.com>; pgsql-hackers@postgresql.org; Thomas Munro <thomas.munro@gmail.com>
Subject: Re: failures in t/031_recovery_conflict.pl on CI

Hi,

On 2022-05-03 14:23:23 -0400, Tom Lane wrote:
> Andres Freund <andres@anarazel.de> writes:
> >> So it's almost surely a timing issue, and your theory here seems plausible.
>
> > Unfortunately I don't think my theory holds, because I actually had
> > added a defense against this into the test that I forgot about momentarily...
>
> Oh, hm. I can try harder to repro it.

I've now reproduced it a couple times here running under rr, so it's probably not worth putting too much effort into that.

Attached is a fix for the test that I think should avoid the problem. Couldn't repro it with it applied, under both rr and valgrind.

My current problem is that I'm running into some IPC::Run issues (bug?). I get "ack Broken pipe:" iff I add "SELECT pg_sleep(1);" after "-- wait for lock held by prepared transaction". It doesn't happen without that debugging thing, but it makes me worried that it's something that'll come up in random situations.

It looks to me like it's a bug in IPC::Run - with a few changes I get the failure to happen inside pump_nb(), which seems like it shouldn't error out just because the child process exited... I *think* it might not happen without the sleep. But I'm not at all confident.

In general I'm kinda worried about how much effectively unmaintained Perl stuff we're depending on :(

Greetings,

Andres Freund