Re: [BUG] non archived WAL removed during production crash recovery - Mailing list pgsql-bugs

From Michael Paquier
Subject Re: [BUG] non archived WAL removed during production crash recovery
Date
Msg-id 20200427074945.GG11369@paquier.xyz
In response to Re: [BUG] non archived WAL removed during production crash recovery  (Jehan-Guillaume de Rorthais <jgdr@dalibo.com>)
Responses Re: [BUG] non archived WAL removed during production crash recovery
List pgsql-bugs
On Fri, Apr 24, 2020 at 03:03:00PM +0200, Jehan-Guillaume de Rorthais wrote:
> I agree the three tests could be removed as they were not covering the bug we
> were chasing. However, they might still be useful to detect future unexpected
> behavior changes. If you agree with this, please find attached a patch
> proposal against HEAD that recreates these three tests **after** a waiting loop
> on both standby1 and standby2. This waiting loop is inspired by the tests in
> 9.5 -> 10.

FWIW, I would prefer keeping all three tests as well.
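
For the archives, one way to express such a waiting loop before
re-running the three checks could look like the sketch below.  This is
only an illustration, not necessarily the exact query of the patch: the
node handles and the idea of comparing against an LSN captured on the
primary are assumptions of the example.

my $primary_lsn =
  $primary->safe_psql('postgres', 'SELECT pg_current_wal_lsn()');
# Sketch: make sure the standby has replayed everything generated so
# far before checking the .ready status files, so that the restore of
# the extra segment cannot race with the forced checkpoint.
$standby2->poll_query_until('postgres',
    qq{SELECT pg_wal_lsn_diff(pg_last_wal_replay_lsn(), '$primary_lsn') >= 0},
    't')
  or die "Timed out waiting for standby2 to replay up to $primary_lsn";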

So..  I have spent more time on this problem, and mereswine here is a
very good sample because it failed all three tests:
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=mereswine&dt=2020-04-24%2006%3A03%3A53

For standby2, we get this failure:
ok 11 - .ready file for WAL segment 000000010000000000000001 existing
  in backup is kept with archive_mode=always on standby
not ok 12 - .ready file for WAL segment 000000010000000000000002
  created with archive_mode=always on standby

Then, looking at 020_archive_status_standby2.log, we have the
following logs:
2020-04-24 02:08:32.032 PDT [9841:3] 020_archive_status.pl LOG:
statement: CHECKPOINT
[...]
2020-04-24 02:08:32.303 PDT [9821:7] LOG:  restored log file
"000000010000000000000002" from archive

In this case, the test forced a checkpoint to exercise segment
recycling *before* the extra segment we'd like to work on was actually
restored from the archive.  So it looks like my initial feeling about
the timing issue was right, and I am also able to reproduce the
original set of failures by adding a manual sleep to delay restores of
segments, like this for example:
--- a/src/backend/access/transam/xlogarchive.c
+++ b/src/backend/access/transam/xlogarchive.c
@@ -74,6 +74,8 @@ RestoreArchivedFile(char *path, const char *xlogfname,
    if (recoveryRestoreCommand == NULL ||
    strcmp(recoveryRestoreCommand, "") == 0)
            goto not_available;

+   pg_usleep(10 * 1000000); /* 10s */
+
    /*

With your patch the problem does not show up anymore even with the
delay added, so I would like to apply what you have sent and add back
those tests.  For now, I would just patch HEAD though as that's not
worth the risk of destabilizing stable branches in the buildfarm.
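
Once the waiting loop is in place, the three checks that get added back
remain plain existence tests on the status files, roughly like this
(the path and the use of data_dir here are only illustrative):

# Illustration only: once replay has caught up, the restored segment
# must have been marked as ready for archival with archive_mode=always.
ok( -f $standby2->data_dir
      . '/pg_wal/archive_status/000000010000000000000002.ready',
    '.ready file for WAL segment 000000010000000000000002 created with archive_mode=always on standby');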

>  $primary->poll_query_until('postgres',
>      q{SELECT archived_count FROM pg_stat_archiver}, '1')
> -  or die "Timed out while waiting for archiving to finish";
> +    or die "Timed out while waiting for archiving to finish";

Some noise in the patch.  This may have come from some unfinished
business with pgindent.
--
Michael

