Re: prevent immature WAL streaming - Mailing list pgsql-hackers

From Tom Lane
Subject Re: prevent immature WAL streaming
Date
Msg-id 45597.1637694259@sss.pgh.pa.us
In response to Re: prevent immature WAL streaming  (Alvaro Herrera <alvherre@alvh.no-ip.org>)
Responses Re: prevent immature WAL streaming
List pgsql-hackers
We're *still* not out of the woods with 026_overwrite_contrecord.pl,
as we are continuing to see occasional "mismatching overwritten LSN"
failures, further down in the test where it tries to start up the
standby:

  sysname   |    branch     |      snapshot       |     stage     |                                                     l
------------+---------------+---------------------+---------------+------------------------------------------------------------------------------------------------------------
 spurfowl   | REL_13_STABLE | 2021-10-18 03:56:26 | recoveryCheck | 2021-10-18 00:08:09.324 EDT [2455:6] FATAL: mismatching overwritten LSN 0/1FFE018 -> 0/1FFE000
 sidewinder | HEAD          | 2021-10-19 04:32:36 | recoveryCheck | 2021-10-19 06:46:23.168 CEST [26393:6] FATAL: mismatching overwritten LSN 0/1FFE018 -> 0/1FFE000
 francolin  | REL9_6_STABLE | 2021-10-26 01:41:39 | recoveryCheck | 2021-10-26 01:48:05.646 UTC [3417202][][1/0:0] FATAL: mismatching overwritten LSN 0/1FFE018 -> 0/1FFE000
 petalura   | HEAD          | 2021-11-05 00:20:03 | recoveryCheck | 2021-11-05 02:58:12.146 CET [61848fb3.28d157:6] FATAL: mismatching overwritten LSN 0/1FFE018 -> 0/1FFE000
 lapwing    | REL_11_STABLE | 2021-11-05 17:24:49 | recoveryCheck | 2021-11-05 17:39:29.741 UTC [9831:6] FATAL: mismatching overwritten LSN 0/1FFE014 -> 0/1FFE000
 morepork   | HEAD          | 2021-11-10 02:51:12 | recoveryCheck | 2021-11-10 04:03:33.576 CET [73561:6] FATAL: mismatching overwritten LSN 0/1FFE018 -> 0/1FFE000
 petalura   | HEAD          | 2021-11-16 15:20:03 | recoveryCheck | 2021-11-16 18:16:47.875 CET [6193e77f.35b87f:6] FATAL: mismatching overwritten LSN 0/1FFE018 -> 0/1FFE000
 morepork   | HEAD          | 2021-11-17 03:45:36 | recoveryCheck | 2021-11-17 04:57:04.359 CET [32089:6] FATAL: mismatching overwritten LSN 0/1FFE018 -> 0/1FFE000
 spurfowl   | REL_10_STABLE | 2021-11-22 22:21:03 | recoveryCheck | 2021-11-22 17:29:35.520 EST [16011:6] FATAL: mismatching overwritten LSN 0/1FFE018 -> 0/1FFE000
(9 rows)

Looking at adjacent successful runs, it seems that the exact point
where the "missing contrecord" starts varies substantially, even after
our previous fix to disable autovacuum in this test.  How could that be?

It's probably for the best though, because I think this is exposing
an actual bug that we would not have seen if the start point were
completely consistent.  I have not dug into the code, but it looks to
me like if the "consistent recovery state" is reached exactly at a
page boundary (0/1FFE000 in all these cases), then the standby expects
that to be what the OVERWRITE_CONTRECORD record will point at.  But
actually it points to the first WAL record on that page, resulting
in a bogus failure.
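
To make the arithmetic behind that guess concrete, here is a minimal Python sketch that decodes the two LSNs in each failure message. It assumes the default 8 kB WAL page size (XLOG_BLCKSZ = 8192); the helper names are made up for illustration and are not PostgreSQL functions. Reading the 24-byte and 20-byte offsets as the size of the page header that precedes the first record on the page is likewise an assumption, not something verified against the code.

```python
# Assumption: default WAL page size; a build with a different
# XLOG_BLCKSZ would need this constant changed.
XLOG_BLCKSZ = 8192


def parse_lsn(text: str) -> int:
    """Convert an LSN printed as 'hi/lo' (two hex words) to an integer."""
    hi, lo = text.split("/")
    return (int(hi, 16) << 32) | int(lo, 16)


def format_lsn(lsn: int) -> str:
    """Print an integer LSN in the 'hi/lo' hex form used in the logs."""
    return f"{lsn >> 32:X}/{lsn & 0xFFFFFFFF:X}"


def page_start(lsn: int) -> int:
    """Address of the first byte of the WAL page containing lsn."""
    return lsn - (lsn % XLOG_BLCKSZ)


def offset_in_page(lsn: int) -> int:
    """Byte offset of lsn within its WAL page."""
    return lsn % XLOG_BLCKSZ


# The two LSN pairs that appear in the failure messages above.
pairs = [("0/1FFE018", "0/1FFE000"), ("0/1FFE014", "0/1FFE000")]

for first, second in pairs:
    a, b = parse_lsn(first), parse_lsn(second)
    print(
        f"{first} -> {second}: "
        f"second is page boundary: {offset_in_page(b) == 0}, "
        f"same page: {page_start(a) == page_start(b)}, "
        f"first is {offset_in_page(a)} bytes into the page"
    )
```

Under these assumptions, both LSNs in every message fall on the same WAL page: one is the page boundary itself and the other lies 24 bytes (20 on lapwing) past it, which is what one would expect if one side records the page start and the other the first record on that page.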

            regards, tom lane


