Re: Recovery inconsistencies, standby much larger than primary - Mailing list pgsql-hackers

From Greg Stark
Subject Re: Recovery inconsistencies, standby much larger than primary
Msg-id CAM-w4HPvJCBRVV3dXg8aj0WzkU08dHuX-XYbfDYQhNrn5bnTQg@mail.gmail.com
In response to Re: Recovery inconsistencies, standby much larger than primary  (Tom Lane <tgl@sss.pgh.pa.us>)
Responses Re: Recovery inconsistencies, standby much larger than primary  (Tom Lane <tgl@sss.pgh.pa.us>)
List pgsql-hackers
On Mon, Feb 3, 2014 at 12:02 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> What version were you running before 9.1.11 exactly?  I took a look
> through all the diffs from 9.1.9 up to 9.1.11, and couldn't find any
> changes that seemed even vaguely related to this.  There are some
> changes in known-transaction tracking, but it's hard to see a connection
> there.  Most of the other diffs are in code that wouldn't execute during
> WAL replay at all.


Both the primary and the standby were 9.1.11 from the get-go. The
database the primary was forked off of was 9.1.10 but as far as I can
tell the primary in the current pair has no problems.

What's worse is we created a new standby from the same base backup and
replayed the same records and it didn't reproduce the problem. This
means either it's a hardware problem -- but we've seen it on multiple
standbys on this database and at least one other database which is in
a different data centre -- or it's a race condition -- but that's hard
to credit in the recovery code which is basically single-threaded.

And these records are from before the standby reaches consistency, so
it's hard to see how a connection from a hot standby client could
cause any kind of race condition. The only other thread that could
conceivably cause a heisenbug is the bgwriter. It's hard to imagine
how a race condition in there could be so easy to hit that it would
happen four times on one restore but otherwise go mostly unnoticed.
-- 
greg


