Re: [HACKERS] Re: 9.4.1 -> 9.4.2 problem: could not access status of transaction 1 - Mailing list pgsql-general

From Robert Haas
Subject Re: [HACKERS] Re: 9.4.1 -> 9.4.2 problem: could not access status of transaction 1
Date
Msg-id CA+Tgmoa-pd_brvS8iX25Wm58TY37p+1iDTGseWiTFDX053xZNw@mail.gmail.com
Whole thread Raw
In response to Re: [HACKERS] Re: 9.4.1 -> 9.4.2 problem: could not access status of transaction 1  (Robert Haas <robertmhaas@gmail.com>)
Responses Re: [HACKERS] Re: 9.4.1 -> 9.4.2 problem: could not access status of transaction 1  (Alvaro Herrera <alvherre@2ndquadrant.com>)
Re: 9.4.1 -> 9.4.2 problem: could not access status of transaction 1  (Noah Misch <noah@leadboat.com>)
List pgsql-general
On Wed, Jun 3, 2015 at 8:24 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Tue, Jun 2, 2015 at 5:22 PM, Andres Freund <andres@anarazel.de> wrote:
>>> > Hm. If GetOldestMultiXactOnDisk() gets the starting point by scanning
>>> > the disk it'll always get one at a segment boundary, right? I'm not sure
>>> > that's actually ok; because the value at the beginning of the segment
>>> > can very well end up being a 0, as MaybeExtendOffsetSlru() will have
>>> > filled the page with zeros.
>>> >
>>> > I think that should be harmless, the worst that can happen is that
>>> > oldestOffset errorneously is 0, which should be correct, even though
>>> > possibly overly conservative, in these cases.
>>>
>>> Uh oh.  That seems like a real bad problem for this approach.  What
>>> keeps that from being the opposite of too conservative?  There's no
>>> "safe" value in a circular numbering space.
>>
>> I think it *might* (I'm really jetlagged) be fine because that'll only
>> happen after a upgrade from < 9.3. And in that case we initialize
>> nextOffset to 0. That ought to safe us?
>
> That's pretty much betting the farm on the bugs we know about today
> being the only ones there are.  That seems imprudent.

So here's a patch taking a different approach.  In this approach, if
the multixact whose members we want to look up doesn't exist, we don't
use a later one (that might or might not be valid).  Instead, we
attempt to cope with the unknown.  That means:

1. In TruncateMultiXact(), we don't truncate.

2. If setting the offset stop limit (the point where we refuse to
create new multixact space), we don't arm the stop point.  This means
that if you're in this situation, you run without member wraparound
protection until it's corrected.  A message gets logged once per
checkpoint telling you that you have this problem, and another message
gets logged when things get straightened out and the guards are
enabled.

3. If setting the vacuum force point, we assume that it's appropriate
to immediately force vacuum.

I've only tested this very lightly - this is just to see what you and
Noah and others think of the approach.  As compared with the previous
approach, it has the advantage of making minimal assumptions about the
sanity of what's on disk.  It has the disadvantage that, for some
people, the member-wraparound guard won't be enabled at startup -- but
note that those people can't start 9.3.7/9.4.2 *at all*, so currently
they are either running without member wraparound protection anyway
(if they haven't upgraded to those releases) or they're down entirely.
Another disadvantage is that we'll be triggering what may be quite a
bit of autovacuum activity for some people, which could be painful.
On the plus side, they'll hopefully end up with sane relminmxid and
datminmxid guards afterwards.

Thoughts?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Attachment

pgsql-general by date:

Previous
From: Alvaro Herrera
Date:
Subject: Re: [HACKERS] Re: 9.4.1 -> 9.4.2 problem: could not access status of transaction 1
Next
From: Alvaro Herrera
Date:
Subject: Re: [HACKERS] Re: 9.4.1 -> 9.4.2 problem: could not access status of transaction 1