Re: [HACKERS] Re: 9.4.1 -> 9.4.2 problem: could not access status of transaction 1 - Mailing list pgsql-general

From Alvaro Herrera
Subject Re: [HACKERS] Re: 9.4.1 -> 9.4.2 problem: could not access status of transaction 1
Date
Msg-id 20150528021440.GI5885@postgresql.org
In response to Re: [HACKERS] Re: 9.4.1 -> 9.4.2 problem: could not access status of transaction 1  (Robert Haas <robertmhaas@gmail.com>)
Responses Re: [HACKERS] Re: 9.4.1 -> 9.4.2 problem: could not access status of transaction 1
List pgsql-general
Robert Haas wrote:
> On Wed, May 27, 2015 at 6:21 PM, Alvaro Herrera
> <alvherre@2ndquadrant.com> wrote:
> > Steve Kehlet wrote:
> >> I have a database that was upgraded from 9.4.1 to 9.4.2 (no pg_upgrade, we
> >> just dropped new binaries in place) but it wouldn't start up. I found this
> >> in the logs:
> >>
> >> waiting for server to start....2015-05-27 13:13:00 PDT [27341]: [1-1] LOG:
> >>  database system was shut down at 2015-05-27 13:12:55 PDT
> >> 2015-05-27 13:13:00 PDT [27342]: [1-1] FATAL:  the database system is
> >> starting up
> >> .2015-05-27 13:13:00 PDT [27341]: [2-1] FATAL:  could not access status of
> >> transaction 1
> >
> > I am currently debugging a problem that looks very similar to
> > this.  AFAICT the problem is that WAL replay of an online checkpoint in
> > which multixact files are removed fails because replay tries to read a
> > file that has already been removed.
>
> Hmm, so what exactly is the sequence of events here?  It's possible
> that I'm not thinking clearly just now, but it seems to me that if
> we're replaying the same checkpoint we replayed previously, the offset
> of the oldest multixact will point into the first file that we didn't
> remove.

Well, I'm not very clear on what the problematic case is.  The scenario
in which I first saw this reported involved a pg_basebackup taken from a
very large database: the master could have truncated pg_multixact, so
the standby receives an already-truncated directory, yet it tries to
replay a checkpoint that is much older than the pg_multixact contents
the master has transmitted.

> > I think the fix to this is to verify whether the file exists on disk
> > before reading it; if it doesn't, assume the truncation has already
> > happened and that it's not necessary to remove it.
>
> That might be an OK fix, but this implementation doesn't seem very
> clean.  If we're going to remove the invariant that
> MultiXactState->oldestOffset will always be valid after replaying a
> checkpoint, then we should be explicit about that and add a flag
> indicating whether or not it's currently valid.  Shoving nextOffset in
> there and hoping that's good enough seems like a bad idea to me.
>
> I think we should modify the API for find_multixact_start.  Let's have
> it return a Boolean and return oldestOffset via an out parameter.  If
> !InRecovery, it will always return true and set the out parameter; but
> if in recovery, it is allowed to return false without setting the out
> parameter.  Both values can get stored in MultiXactState, and we can
> adjust the logic elsewhere to disregard oldestOffset when the
> accompanying flag is false.

Sounds good.  I think I'd prefer that multixact creation be rejected
altogether if the new flag is false.  Is that what you mean when you say
"adjust the logic"?

> This still leaves open an ugly possibility: can we reach normal
> running without a valid oldestOffset?  If so, until the next
> checkpoint happens, autovacuum has no clue whether it needs to worry.
> There's got to be a fix for that, but it escapes me at the moment.

I think the fix for that issue is to set the oldest offset in
TrimMultiXact.  That way, once WAL replay has finished, we're certain
that we have a valid oldest offset with which to create new multixacts.
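
Roughly like this (a sketch only; "oldestOffsetKnown" stands for
whatever we end up calling the new flag, and I'm ignoring locking):

	/* at the end of TrimMultiXact(), i.e. once replay is over */
	MultiXactOffset oldestOffset;

	/*
	 * WAL replay is finished, so the segment holding the oldest multi
	 * must exist by now; recompute the oldest offset so that new
	 * multixacts can be created against a known truncation point.
	 */
	if (find_multixact_start(MultiXactState->oldestMultiXactId,
							 &oldestOffset))
	{
		MultiXactState->oldestOffset = oldestOffset;
		MultiXactState->oldestOffsetKnown = true;
	}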

I'm also wondering whether the call to DetermineSafeOldestOffset in
StartupMultiXact is a good idea.  At that point we haven't replayed any
WAL yet, so the oldest multi might point into a file that has already
been removed -- again, consider the pg_basebackup scenario, where the
pg_multixact files are copied much later than pg_control: the checkpoint
to replay is old, but the pg_multixact contents have already been
truncated on the master and are copied in their truncated state.

--
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

