Re: Race condition in recovery? - Mailing list pgsql-hackers

From Robert Haas
Subject Re: Race condition in recovery?
Msg-id CA+TgmoZTWe2jyGvCCziNuEXzbaxZ6+E64GbejELYhvrPV8=k+Q@mail.gmail.com
In response to Re: Race condition in recovery?  (Robert Haas <robertmhaas@gmail.com>)
Responses Re: Race condition in recovery?
List pgsql-hackers
On Tue, Jun 8, 2021 at 12:26 PM Robert Haas <robertmhaas@gmail.com> wrote:
> I think the problem is here:
>
> Can't locate object method "lsn" via package "PostgresNode" at
> t/025_stuck_on_old_timeline.pl line 84.
>
> When that happens, it bails out, and cleans everything up, doing an
> immediate shutdown of all the nodes. The 'lsn' method was added by
> commit fb093e4cb36fe40a1c3f87618fb8362845dae0f0, so it only appears in
> v10 and later. I think maybe we can think of back-porting that to 9.6.

Here's an updated set of patches. I removed the extra teardown_node
calls per Kyotaro Horiguchi's request. I adopted his suggestion for
setting a $perlbin variable from $^X, but found that $perlbin was
undefined, so I split the incantation into two lines to fix that. I
updated the code to use ->promote() instead of calling pg_promote(),
and to use poll_query_until() afterwards to wait for promotion as
suggested by Dilip. Also, I added a comment to the change in xlog.c.
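
For readers following along, the $perlbin problem can be reproduced in
isolation. This is only a minimal sketch, not the code from the patch:
$is_windows is a hypothetical stand-in for the test suite's Windows check,
and the one-statement form is an assumption about what the original
incantation looked like. It illustrates that an `if` statement modifier
guards the entire statement, assignment included.

```perl
use strict;
use warnings;

# Hypothetical stand-in for the test suite's Windows check.
my $is_windows = ($^O eq 'MSWin32');

# One-statement form (assumed shape of the original suggestion): the
# `if` modifier guards the whole statement, so on non-Windows hosts the
# assignment never runs and the variable stays undefined.
my $one_line;
($one_line = $^X) =~ s{\\}{\\\\}g if $is_windows;

# Two-statement form: always assign, then double the backslashes only
# on Windows.
my $perlbin = $^X;
$perlbin =~ s{\\}{\\\\}g if $is_windows;

printf "one-statement form: %s\n",
  defined $one_line ? $one_line : '(undefined)';
printf "two-statement form: %s\n", $perlbin;
```

On a non-Windows host the first variable is never assigned, while the
second always holds the path of the running perl binary.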

Then I tried to get things working on 9.6. There's a patch attached to
back-port a couple of PostgresNode.pm methods from 10 to 9.6, and also
a version of the main patch attached with the necessary wal->xlog,
lsn->location renaming. Unfortunately ... the new test case still
fails on 9.6 in a way that looks an awful lot like the bug isn't
actually fixed:

LOG:  primary server contains no more WAL on requested timeline 1
cp: /Users/rhaas/pgsql/src/test/recovery/tmp_check/data_primary_enMi/archives/000000010000000000000003: No such file or directory
(repeated many times)

I find that the same failure happens if I back-port the master version
of the patch to v10 or v11, but if I back-port it to v12 or v13 then
the test passes as expected. I haven't figured out what the issue is
yet. I also noticed that if I back-port it to v12 and then revert the
code change, the test still passes. So I think there may still be something
subtly wrong with this test case. Or maybe there's a code bug.

-- 
Robert Haas
EDB: http://www.enterprisedb.com
