Re: [GENERAL] 9.4.1 -> 9.4.2 problem: could not access status of transaction 1 - Mailing list pgsql-hackers

From Noah Misch
Subject Re: [GENERAL] 9.4.1 -> 9.4.2 problem: could not access status of transaction 1
Date
Msg-id 20150602160535.GA62378@tornado.leadboat.com
Whole thread Raw
In response to Re: [GENERAL] 9.4.1 -> 9.4.2 problem: could not access status of transaction 1  (Robert Haas <robertmhaas@gmail.com>)
List pgsql-hackers
On Tue, Jun 02, 2015 at 11:16:22AM -0400, Robert Haas wrote:
> On Tue, Jun 2, 2015 at 1:21 AM, Noah Misch <noah@leadboat.com> wrote:
> > On Mon, Jun 01, 2015 at 02:06:05PM -0400, Robert Haas wrote:

> > Granted.  Would it be better to update both functions at the same time, and
> > perhaps to make that a master-only change?  Does the status quo cause more
> > practical harm via SetMultiXactIdLimit() than via SetTransactionIdLimit()?
>
> It does in the case of the call to find_multixact_start().  If that
> fails, we take the server down for no good reason, as demonstrated by
> the original report. I'll revert the changes to the other two things
> in this function that I changed to be protected by did_trim.

Sounds good.

> > There's no data loss if 1 <= true_minmxid <= 2^31 at the time minmxid got set
> > to 1.  Otherwise, data loss is possible.
>
> Yes, but in that scenario, the log message you propose wouldn't be
> triggered.  If true_minmxid > 2^31, then the stored minmxid will not
> precede the files on disk; it will follow it (assuming the older stuff
> hasn't been truncated, as is likely).

Ah, quite right.

> >   "missing pg_multixact/members files; begins at MultiXactId %u, expected %u"
>
> This seems misleading.  In the known failure case, it's not that the
> pg_multixact files are unexpectedly missing; it's that we incorrectly
> think that they should still be there.  Maybe:
>
> oldest MultiXactId on disk %u follows expected oldest MultiXact %u

Your wording is better.

> > For good measure, perhaps emit this when lastCheckpointedOldest > earliest by
> > more than one segment:
> >
> >   "excess pg_multixact/members files; begins at MultiXactId %u, expected %u"
>
> So, this scenario will happen whenever the system was interrupted in
> the middle of a truncation, or when the system is started from a base
> backup and a truncation happened after files were copied.  I'm wary of
> giving users the idea that this is an atypical event.  Perhaps a
> message at DEBUG1?

DEBUG1 works for me, or feel free to leave it out.

> >> I can see that there might be an issue there, but I can't quite put my
> >> finger on it well enough to say that it definitely is an issue.  This
> >> code is examining the offsets space rather than the members space, and
> >> the protections against offsets wraparound have been there since the
> >> original commit of this feature
> >> (0ac5ad5134f2769ccbaefec73844f8504c4d6182).  To my knowledge we have
> >> no concrete evidence that there's ever been a problem in this area.
> >> It seems like it might be safer to rejigger that code so that it
> >> considers distance-behind-current rather than using the wrapped
> >> comparison logic, but I'm reluctant to start rejiggering more things
> >> without knowing what specifically I'm fixing.
> >
> > Anything that could cause the pg_multixact/offsets tail to rotate from being
> > in the past to being in the future poses this risk.  (This is the tail from
> > the perspective of files on disk; pg_control, datminmxid, and MultiXactState
> > notions of the tail play no part here.)  I had in mind that the pg_upgrade
> > datminmxid=1 bug could be a tool for achieving that, but I've been
> > unsuccessful so far at building a credible thought experiment around it.  Near
> > the beginning of your reply, you surmised that this could happen between a
> > VACUUM's SetMultiXactIdLimit() and the next checkpoint's TruncateMultiXact().
> > Another vector is unlink() failure on a segment file.  SlruDeleteSegment()
> > currently ignores the unlink() return value; the only harm has been some disk
> > usage.  With GetOldestMultiXactOnDisk() as-proposed, successful unlink() is
> > mandatory to forestall the appearance of a wrapped state.
>
> I'm having trouble figuring out what to do about this.  I mean, the
> essential principle of this patch is that if we can't count on
> relminmxid, datminmxid, or the control file to be accurate, we can at
> least look at what is present on the disk.  If we also cannot count on
> that to be accurate, we are left without any reliable source of
> information.  Consider a hypothetical cluster where all our stored
> minmxids of whatever form are corrupted (say, all change to 1) and in
> addition there are stray files in pg_multixact.  I don't think there's
> really any way to get ourselves out of trouble in that scenario.

We could notice the problem and halt.  You mentioned above the possibility to
have SlruScanDirCbFindEarliest() check "distance-behind-current".  Suppose it
used the current oldestMulti from pg_control as a reference point and
discovered the multixact on disk (a) most far behind that reference point and
(b) most far ahead of that reference point.  If MultiXactIdPrecedes(a, b),
we're good; adopt (a) as the new datminmxid.  Account for the possibility that
the space isn't really wrapped, but out reference point was totally wrong.
Once we determine that the space is wrapped, bail.

As you discuss downthread, the mxids actually present in t_max define the ones
we need.  We could scan them all and set authoritative datminmxid and
pg_control entries.

> There are some things that we could do that might help a little.  For
> example, if the control file's oldest-multi value points to a file
> that is not on disk, we could advance it a file at a time until it
> lands on a file that is present.  This would help if the oldest-multi
> value is pointing to the future, but everything on disk is in the
> past.

That's worth examining further.

> Another idea is that we could rely on the "on disk" value only until
> MultiXactState->lastCheckpointedOldest actually points to a real file.

When wrapping confuses the "on disk" value, I think any reliance on that value
is a dead end.

> After we get out of this kind of trouble, we should never get back
> into it.

Oh, we excel at that.


pgsql-hackers by date:

Previous
From: Robert Haas
Date:
Subject: Re: Re: [GENERAL] 9.4.1 -> 9.4.2 problem: could not access status of transaction 1
Next
From: Robert Haas
Date:
Subject: Re: WIP: Enhanced ALTER OPERATOR