Re: Completely broken replica after PANIC: WAL contains references to invalid pages - Mailing list pgsql-bugs

From Andres Freund
Subject Re: Completely broken replica after PANIC: WAL contains references to invalid pages
Date
Msg-id 20130402101012.GB2415@alap2.anarazel.de
In response to Re: Completely broken replica after PANIC: WAL contains references to invalid pages  (Simon Riggs <simon@2ndQuadrant.com>)
Responses Re: Completely broken replica after PANIC: WAL contains references to invalid pages  (Andres Freund <andres@2ndquadrant.com>)
List pgsql-bugs
On 2013-04-01 08:49:16 +0100, Simon Riggs wrote:
> On 30 March 2013 17:21, Andres Freund <andres@2ndquadrant.com> wrote:
>
> > So if the xid is later than latestObservedXid we extend subtrans one by
> > one. So far so good. But we initialize it in
> > ProcArrayApplyRecoveryInfo() when consistency is initially reached:
> >                              latestObservedXid = running->nextXid;
> >                              TransactionIdRetreat(latestObservedXid);
> > Before that subtrans has initially been started up with:
> >                         if (wasShutdown)
> >                                 oldestActiveXID = PrescanPreparedTransactions(&xids, &nxids);
> >                         else
> >                                 oldestActiveXID = checkPoint.oldestActiveXid;
> > ...
> >                         StartupSUBTRANS(oldestActiveXID);
> >
> > That means it's only initialized up to checkPoint.oldestActiveXid. As it
> > can take some time till we reach consistency it seems rather plausible
> > that there now will be a gap in initialized pages. From
> > checkPoint.oldestActiveXid to running->nextXid if there are pages
> > in between.
>
> That was an old bug.
>
> StartupSUBTRANS() now explicitly fills that gap. Are you saying it
> does that incorrectly? How?

Well, no. I think StartupSUBTRANS does this correctly, but there's a gap
between the call to Startup* and the first call to ExtendSUBTRANS. The
latter is only called *after* we have reached STANDBY_INITIALIZED via
ProcArrayApplyRecoveryInfo(). The problem is that we StartupSUBTRANS up to
checkPoint.oldestActiveXid, while we start to ExtendSUBTRANS from
running->nextXid - 1. There can very well be a gap in between.
The window isn't terribly big, but if you use subtransactions as heavily
as Sergey seems to, it doesn't seem unlikely that you'd hit it.
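
To make the size of that window concrete, here is a rough toy model, not the
actual PostgreSQL code. It assumes 2048 xids per subtrans page (8 kB pages,
4-byte xids), assumes that extension only zeroes a page when it sees the
first xid on that page, and ignores xid wraparound. The names xid_to_page()
and count_uninitialized_pages() are made up for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Assumption: 8192-byte pages / 4-byte xids, as in a default build. */
#define XACTS_PER_PAGE 2048u

/* Page of the (toy) subtrans area that holds a given xid. */
uint32_t xid_to_page(uint32_t xid)
{
    return xid / XACTS_PER_PAGE;
}

/*
 * Hypothetical helper: count the pages that are zeroed neither at startup
 * (assumed to stop at the page holding startup_xid) nor by later extension
 * (assumed to begin after next_xid - 1). Wraparound is ignored.
 */
uint32_t count_uninitialized_pages(uint32_t startup_xid, uint32_t next_xid)
{
    assert(next_xid > 0);

    uint32_t last_startup_page = xid_to_page(startup_xid);
    uint32_t latest_observed = next_xid - 1;

    /*
     * Extension only acts on the first xid of a page, and only for xids
     * later than latest_observed, so the first page it can zero is the one
     * following latest_observed's page.
     */
    uint32_t first_extended_page = xid_to_page(latest_observed) + 1;

    if (first_extended_page <= last_startup_page + 1)
        return 0;               /* startup and extension meet, no gap */

    return first_extended_page - (last_startup_page + 1);
}
```

Under these assumptions, count_uninitialized_pages(1000, 5001) reports two
pages (pages 1 and 2) that nobody zeroes, while
count_uninitialized_pages(1000, 1500) reports none because both xids sit on
page 0.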

Let me come up with a testcase and patch.

Greetings,

Andres Freund

--
 Andres Freund                       http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services
