On Fri, Jun 24, 2022 at 04:17:34PM +0000, Imseih (AWS), Sami wrote:
> It has been difficult to get a generic repro, but the way we reproduce
> is through our test suite. To give more details, we are running tests
> in which we constantly fail over and promote standbys. The issue
> surfaces after we have gone through a few promotions, which occur
> every few hours or so (not really important, but to give context).
Hmm. Could you describe exactly the failover scenario you are using?
Is the test using a set of cascading standbys linked to the promoted
one? Are the standbys recycled from the promoted nodes with pg_rewind
or created from scratch with a new base backup taken from the
freshly-promoted primary? I have been looking more at this thread
throughout the day, but I don't see a remaining issue. It is
perfectly possible that we are missing a piece related to the handling
of those new overwrite contrecords in some cases, like in a rewind.
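
To be clear about the scenarios I have in mind, the two recycling
strategies would look roughly like this (paths and connection strings
are placeholders, not taken from your setup):

    # Recycle the old primary as a standby of the promoted node.
    pg_ctl -D /path/to/new_primary promote
    pg_rewind --target-pgdata=/path/to/old_primary \
        --source-server="host=new-primary port=5432 dbname=postgres"

    # Or rebuild the standby from scratch with a fresh base backup.
    pg_basebackup -h new-primary -D /path/to/standby -R -X stream

Whether the old timeline's WAL is rewound or thrown away entirely
matters for how an aborted contrecord could survive the failover.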
> I am adding some additional debugging to see if I can draw a better
> picture of what is happening. Will also give aborted_contrec_reset_3.patch
> a go, although I suspect it will not handle the specific case we are dealing with.
Yeah, this is not going to change things much if you are still seeing
an issue. This patch does not change the logic; it just simplifies
the tracking of the continuation record data, resetting it once a
complete record has been read. That said, getting rid of the
dependency on StandbyMode because we cannot promote in the middle of a
record is nice (my memories around that were a bit blurry, but even
recovery_target_lsn would not recover in the middle of a continuation
record), and this is not a bug, so there is limited reason to
backpatch this part of the change.
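
For the archives, here is a tiny standalone sketch of what I mean by
the tracking reset. This is not the patch itself; apart from the
abortedRecPtr/missingContrecPtr names, which mirror the xlog.c
globals, everything here is illustrative:

    #include <stdint.h>
    #include <stdio.h>

    typedef uint64_t XLogRecPtr;
    #define InvalidXLogRecPtr ((XLogRecPtr) 0)

    typedef struct ContrecState
    {
        XLogRecPtr  abortedRecPtr;      /* start of record whose tail was missing */
        XLogRecPtr  missingContrecPtr;  /* where the missing continuation began */
    } ContrecState;

    /* A continuation record fragment could not be read: remember where. */
    static void
    record_read_aborted(ContrecState *st, XLogRecPtr rec_start,
                        XLogRecPtr broken_at)
    {
        st->abortedRecPtr = rec_start;
        st->missingContrecPtr = broken_at;
    }

    /*
     * A record has been fully assembled: reset the tracking
     * unconditionally.  Promotion (or recovery_target_lsn) cannot stop
     * in the middle of a record, so there is no need to look at
     * StandbyMode here.
     */
    static void
    record_read_complete(ContrecState *st)
    {
        st->abortedRecPtr = InvalidXLogRecPtr;
        st->missingContrecPtr = InvalidXLogRecPtr;
    }

    int
    main(void)
    {
        ContrecState st = {InvalidXLogRecPtr, InvalidXLogRecPtr};

        record_read_aborted(&st, 0x1000028, 0x2000000);
        printf("aborted at %llX, missing from %llX\n",
               (unsigned long long) st.abortedRecPtr,
               (unsigned long long) st.missingContrecPtr);

        record_read_complete(&st);
        printf("reset done: %d\n", st.abortedRecPtr == InvalidXLogRecPtr);
        return 0;
    }

The point is only that the reset happens once per fully-read record,
with no conditional logic tied to the recovery mode.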
--
Michael