Thread: 9.4.1 -> 9.4.2 problem: could not access status of transaction 1

9.4.1 -> 9.4.2 problem: could not access status of transaction 1

From
Steve Kehlet
Date:
I have a database that was upgraded from 9.4.1 to 9.4.2 (no pg_upgrade, we just dropped new binaries in place) but it wouldn't start up. I found this in the logs:

waiting for server to start....2015-05-27 13:13:00 PDT [27341]: [1-1] LOG:  database system was shut down at 2015-05-27 13:12:55 PDT
2015-05-27 13:13:00 PDT [27342]: [1-1] FATAL:  the database system is starting up
.2015-05-27 13:13:00 PDT [27341]: [2-1] FATAL:  could not access status of transaction 1
2015-05-27 13:13:00 PDT [27341]: [3-1] DETAIL:  Could not open file "pg_multixact/offsets/0000": No such file or directory.
2015-05-27 13:13:00 PDT [27340]: [1-1] LOG:  startup process (PID 27341) exited with exit code 1
2015-05-27 13:13:00 PDT [27340]: [2-1] LOG:  aborting startup due to startup process failure
 stopped waiting

I poked around in pg_multixact/offsets and there are lots of other files in there, just not 0000.

I tried launching postgres in standalone mode to try some debugging but it yields the same error.

AFAICT, there's nothing wrong with hardware, it's a VM, and it's been working okay. About a week ago we upgraded it from 9.3.5 to 9.4.1 via pg_upgrade --link, which worked amazingly well.

I found [this report from a couple days ago](https://bugs.archlinux.org/task/45071) from someone else that looks like the same problem.

I put this box's [postgresql.conf up on a gist](https://gist.github.com/skehlet/3589b0d83f2cafe19624).

What can I try next? Thanks!

Re: 9.4.1 -> 9.4.2 problem: could not access status of transaction 1

From
Alvaro Herrera
Date:
Steve Kehlet wrote:
> I have a database that was upgraded from 9.4.1 to 9.4.2 (no pg_upgrade, we
> just dropped new binaries in place) but it wouldn't start up. I found this
> in the logs:
>
> waiting for server to start....2015-05-27 13:13:00 PDT [27341]: [1-1] LOG:
>  database system was shut down at 2015-05-27 13:12:55 PDT
> 2015-05-27 13:13:00 PDT [27342]: [1-1] FATAL:  the database system is
> starting up
> .2015-05-27 13:13:00 PDT [27341]: [2-1] FATAL:  could not access status of
> transaction 1

I am debugging today a problem currently that looks very similar to
this.  AFAICT the problem is that WAL replay of an online checkpoint in
which multixact files are removed fails because replay tries to read a
file that has already been removed.

(I was nervous about removing the check to omit reading pg_multixact
files while on recovery.  Looks like my hunch was right, though the
actual problem is not what I was fearing.)

I think the fix to this is to verify whether the file exists on disk
before reading it; if it doesn't, assume the truncation has already
happened and that it's not necessary to remove it.

> I found [this report from a couple days ago](
> https://bugs.archlinux.org/task/45071) from someone else that looks like
> the same problem.

Right :-(

I think a patch like this should be able to fix it ... not tested yet.

--
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Attachment

Re: 9.4.1 -> 9.4.2 problem: could not access status of transaction 1

From
Steve Kehlet
Date:
On Wed, May 27, 2015 at 3:21 PM Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
I think a patch like this should be able to fix it ... not tested yet.

Thanks Alvaro. I got a compile error, so looked for other uses of SimpleLruDoesPhysicalPageExist and added MultiXactOffsetCtl, does this look right?

+ (!InRecovery || SimpleLruDoesPhysicalPageExist(MultiXactOffsetCtl, pageno)))

It compiled then, but it didn't seem to change anything: postgres gave the same errors and still would not start up. 

I'm recompiling it now just to be sure I didn't make any mistakes. I have to leave for the day in a few minutes so I thought I'd check on the above argument with you. Thanks again!

Re: 9.4.1 -> 9.4.2 problem: could not access status of transaction 1

From
Alvaro Herrera
Date:
Steve Kehlet wrote:
> On Wed, May 27, 2015 at 3:21 PM Alvaro Herrera <alvherre@2ndquadrant.com>
> wrote:
>
> > I think a patch like this should be able to fix it ... not tested yet.
> >
>
> Thanks Alvaro. I got a compile error, so looked for other uses of
> SimpleLruDoesPhysicalPageExist and added MultiXactOffsetCtl, does this look
> right?
>
> +  (!InRecovery || SimpleLruDoesPhysicalPageExist(MultiXactOffsetCtl,
> pageno)))

Meh, I sent you the wrong version of the patch.  Yeah, that's obviously
better.

> It compiled then, but it didn't seem to change anything: postgres gave the
> same errors and still would not start up.

I'm setting up a reproducer for the problem to verify that the patch
fixes it (evidently not).  Maybe I've messed up the conditional or
something else ...

--
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


On Wed, May 27, 2015 at 6:21 PM, Alvaro Herrera
<alvherre@2ndquadrant.com> wrote:
> Steve Kehlet wrote:
>> I have a database that was upgraded from 9.4.1 to 9.4.2 (no pg_upgrade, we
>> just dropped new binaries in place) but it wouldn't start up. I found this
>> in the logs:
>>
>> waiting for server to start....2015-05-27 13:13:00 PDT [27341]: [1-1] LOG:
>>  database system was shut down at 2015-05-27 13:12:55 PDT
>> 2015-05-27 13:13:00 PDT [27342]: [1-1] FATAL:  the database system is
>> starting up
>> .2015-05-27 13:13:00 PDT [27341]: [2-1] FATAL:  could not access status of
>> transaction 1
>
> I am debugging today a problem currently that looks very similar to
> this.  AFAICT the problem is that WAL replay of an online checkpoint in
> which multixact files are removed fails because replay tries to read a
> file that has already been removed.

Hmm, so what exactly is the sequence of events here?  It's possible
that I'm not thinking clearly just now, but it seems to me that if
we're replaying the same checkpoint we replayed previously, the offset
of the oldest multixact will be the first file that we didn't remove.
However, I can see that there could be a problem if we try to replay
an older checkpoint after having already replayed a new one - for
example, if a standby replays checkpoint A truncating the members
multixact and performs a restart point, and then replays checkpoint B
truncating the members multixact again but without performing a
restartpoint, and then is shut down, it will resume replay from
checkpoint A, and trouble will ensue.  Is that the scenario, or is
there something else?

> I think the fix to this is to verify whether the file exists on disk
> before reading it; if it doesn't, assume the truncation has already
> happened and that it's not necessary to remove it.

That might be an OK fix, but this implementation doesn't seem very
clean.  If we're going to remove the invariant that
MultiXactState->oldestOffset will always be valid after replaying a
checkpoint, then we should be explicit about that and add a flag
indicating whether or not it's currently valid.  Shoving nextOffset in
there and hoping that's good enough seems like a bad idea to me.

I think we should modify the API for find_multixact_start.  Let's have
it return a Boolean and return oldestOffset via an out parameter.  If
!InRecovery, it will always return true and set the out parameter; but
if in recovery, it is allowed to return false without setting the out
parameter.  Both values can get stored in MultiXactState, and we can
adjust the logic elsewhere to disregard oldestOffset when the
accompanying flag is false.

This still leaves open an ugly possibility: can we reach normal
running without a valid oldestOffset?  If so, until the next
checkpoint happens, autovacuum has no clue whether it needs to worry.
There's got to be a fix for that, but it escapes me at the moment.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Robert Haas wrote:
> On Wed, May 27, 2015 at 6:21 PM, Alvaro Herrera
> <alvherre@2ndquadrant.com> wrote:
> > Steve Kehlet wrote:
> >> I have a database that was upgraded from 9.4.1 to 9.4.2 (no pg_upgrade, we
> >> just dropped new binaries in place) but it wouldn't start up. I found this
> >> in the logs:
> >>
> >> waiting for server to start....2015-05-27 13:13:00 PDT [27341]: [1-1] LOG:
> >>  database system was shut down at 2015-05-27 13:12:55 PDT
> >> 2015-05-27 13:13:00 PDT [27342]: [1-1] FATAL:  the database system is
> >> starting up
> >> .2015-05-27 13:13:00 PDT [27341]: [2-1] FATAL:  could not access status of
> >> transaction 1
> >
> > I am debugging today a problem currently that looks very similar to
> > this.  AFAICT the problem is that WAL replay of an online checkpoint in
> > which multixact files are removed fails because replay tries to read a
> > file that has already been removed.
>
> Hmm, so what exactly is the sequence of events here?  It's possible
> that I'm not thinking clearly just now, but it seems to me that if
> we're replaying the same checkpoint we replayed previously, the offset
> of the oldest multixact will be the first file that we didn't remove.

Well I'm not very clear on what's the problematic case.  The scenario I
actually saw this first reported was a pg_basebackup taken on a very
large database, so the master could have truncated multixact and the
standby receives a truncated directory but actually tries to apply a
checkpoint that is much older than what the master currently has
transmitted as pg_multixact contents.

> > I think the fix to this is to verify whether the file exists on disk
> > before reading it; if it doesn't, assume the truncation has already
> > happened and that it's not necessary to remove it.
>
> That might be an OK fix, but this implementation doesn't seem very
> clean.  If we're going to remove the invariant that
> MultiXactState->oldestOffset will always be valid after replaying a
> checkpoint, then we should be explicit about that and add a flag
> indicating whether or not it's currently valid.  Shoving nextOffset in
> there and hoping that's good enough seems like a bad idea to me.
>
> I think we should modify the API for find_multixact_start.  Let's have
> it return a Boolean and return oldestOffset via an out parameter.  If
> !InRecovery, it will always return true and set the out parameter; but
> if in recovery, it is allowed to return false without setting the out
> parameter.  Both values can get stored in MultiXactState, and we can
> adjust the logic elsewhere to disregard oldestOffset when the
> accompanying flag is false.

Sounds good.  I think I prefer that multixact creation is rejected
altogether if the new flag is false.  Is that what you mean when you say
"adjust the logic"?

> This still leaves open an ugly possibility: can we reach normal
> running without a valid oldestOffset?  If so, until the next
> checkpoint happens, autovacuum has no clue whether it needs to worry.
> There's got to be a fix for that, but it escapes me at the moment.

I think the fix to that issue is to set the oldest offset on
TrimMultiXact.  That way, once WAL replay finished we're certain that we
have a valid oldest offset to create new multixacts with.

I'm also wondering whether the call to DetermineSafeOldestOffset on
StartupMultiXact is good.  At that point, we haven't replayed any WAL
yet, so the oldest multi might be pointing at a file that has already
been removed -- again considering the pg_basebackup scenario where the
multixact files are copied much later than pg_control, so the checkpoint
to replay is old but the pg_multixact contents have already been
truncated in the master and are copied truncated.

--
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


On Wed, May 27, 2015 at 10:14 PM, Alvaro Herrera
<alvherre@2ndquadrant.com> wrote:
> Well I'm not very clear on what's the problematic case.  The scenario I
> actually saw this first reported was a pg_basebackup taken on a very
> large database, so the master could have truncated multixact and the
> standby receives a truncated directory but actually tries to apply a
> checkpoint that is much older than what the master currently has
> transmitted as pg_multixact contents.

OK, that makes sense.

>> That might be an OK fix, but this implementation doesn't seem very
>> clean.  If we're going to remove the invariant that
>> MultiXactState->oldestOffset will always be valid after replaying a
>> checkpoint, then we should be explicit about that and add a flag
>> indicating whether or not it's currently valid.  Shoving nextOffset in
>> there and hoping that's good enough seems like a bad idea to me.
>>
>> I think we should modify the API for find_multixact_start.  Let's have
>> it return a Boolean and return oldestOffset via an out parameter.  If
>> !InRecovery, it will always return true and set the out parameter; but
>> if in recovery, it is allowed to return false without setting the out
>> parameter.  Both values can get stored in MultiXactState, and we can
>> adjust the logic elsewhere to disregard oldestOffset when the
>> accompanying flag is false.
>
> Sounds good.  I think I prefer that multixact creation is rejected
> altogether if the new flag is false.  Is that what you mean when you say
> "adjust the logic"?

No.  I'm not sure quite what you mean here.  We can't reject multixact
creation during normal running, and during recovery, we won't create
any really new mulitxacts, but we must replay the creation of
multixacts.  What I meant was stuff like this:

    if (!MultiXactIdPrecedes(result, MultiXactState->multiVacLimit) ||
        (MultiXactState->nextOffset - MultiXactState->oldestOffset
            > MULTIXACT_MEMBER_SAFE_THRESHOLD))

I meant that we'd change the second prong of the test to check
multiXactState->nextOffsetValid && MultiXactState->nextOffset -
MultiXactState->oldestOffset > MULTIXACT_MEMBER_SAFE_THRESHOLD.  And
likewise change anything else that relies on oldestOffset.  Or else we
guarantee that we can't reach those points until the oldestOffset is
valid, and then check that it is with an Assert() or elog().

>> This still leaves open an ugly possibility: can we reach normal
>> running without a valid oldestOffset?  If so, until the next
>> checkpoint happens, autovacuum has no clue whether it needs to worry.
>> There's got to be a fix for that, but it escapes me at the moment.
>
> I think the fix to that issue is to set the oldest offset on
> TrimMultiXact.  That way, once WAL replay finished we're certain that we
> have a valid oldest offset to create new multixacts with.
>
> I'm also wondering whether the call to DetermineSafeOldestOffset on
> StartupMultiXact is good.  At that point, we haven't replayed any WAL
> yet, so the oldest multi might be pointing at a file that has already
> been removed -- again considering the pg_basebackup scenario where the
> multixact files are copied much later than pg_control, so the checkpoint
> to replay is old but the pg_multixact contents have already been
> truncated in the master and are copied truncated.

Moving the call from StartupMultiXact() to TrimMultiXact() seems like
a good idea.  I'm not sure why we didn't do that before.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


On Wed, May 27, 2015 at 6:21 PM, Alvaro Herrera
<alvherre@2ndquadrant.com> wrote:
> Steve Kehlet wrote:
>> I have a database that was upgraded from 9.4.1 to 9.4.2 (no pg_upgrade, we
>> just dropped new binaries in place) but it wouldn't start up. I found this
>> in the logs:
>>
>> waiting for server to start....2015-05-27 13:13:00 PDT [27341]: [1-1] LOG:
>>  database system was shut down at 2015-05-27 13:12:55 PDT
>> 2015-05-27 13:13:00 PDT [27342]: [1-1] FATAL:  the database system is
>> starting up
>> .2015-05-27 13:13:00 PDT [27341]: [2-1] FATAL:  could not access status of
>> transaction 1
>
> I am debugging today a problem currently that looks very similar to
> this.  AFAICT the problem is that WAL replay of an online checkpoint in
> which multixact files are removed fails because replay tries to read a
> file that has already been removed.

Steve: Can you tell us more about how you shut down the old cluster?
Did you by any chance perform an immediate shutdown?  Do you have the
actual log messages that were written when the system was shut down
for the upgrade?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


On Wed, May 27, 2015 at 6:21 PM, Alvaro Herrera
<alvherre@2ndquadrant.com> wrote:
> Steve Kehlet wrote:
>> I have a database that was upgraded from 9.4.1 to 9.4.2 (no pg_upgrade, we
>> just dropped new binaries in place) but it wouldn't start up. I found this
>> in the logs:
>>
>> waiting for server to start....2015-05-27 13:13:00 PDT [27341]: [1-1] LOG:
>>  database system was shut down at 2015-05-27 13:12:55 PDT
>> 2015-05-27 13:13:00 PDT [27342]: [1-1] FATAL:  the database system is
>> starting up
>> .2015-05-27 13:13:00 PDT [27341]: [2-1] FATAL:  could not access status of
>> transaction 1
>
> I am debugging today a problem currently that looks very similar to
> this.  AFAICT the problem is that WAL replay of an online checkpoint in
> which multixact files are removed fails because replay tries to read a
> file that has already been removed.

Wait a minute, wait a minute.  There's a serious problem with this
theory, at least in Steve's scenario.  This message:

2015-05-27 13:13:00 PDT [27341]: [1-1] LOG: database system was shut
down at 2015-05-27

That message implies a *clean shutdown*.  If he had performed an
immediate shutdown or just pulled the plug, it would have said
"database system was interrupted" or some such.

There may be bugs in redo, also, but they don't explain what happened to Steve.

Steve, is there any chance we can get your pg_controldata output and a
list of all the files in pg_clog?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


On Thu, May 28, 2015 at 8:01 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Wed, May 27, 2015 at 6:21 PM, Alvaro Herrera
> <alvherre@2ndquadrant.com> wrote:
>> Steve Kehlet wrote:
>>> I have a database that was upgraded from 9.4.1 to 9.4.2 (no pg_upgrade, we
>>> just dropped new binaries in place) but it wouldn't start up. I found this
>>> in the logs:
>>>
>>> waiting for server to start....2015-05-27 13:13:00 PDT [27341]: [1-1] LOG:
>>>  database system was shut down at 2015-05-27 13:12:55 PDT
>>> 2015-05-27 13:13:00 PDT [27342]: [1-1] FATAL:  the database system is
>>> starting up
>>> .2015-05-27 13:13:00 PDT [27341]: [2-1] FATAL:  could not access status of
>>> transaction 1
>>
>> I am debugging today a problem currently that looks very similar to
>> this.  AFAICT the problem is that WAL replay of an online checkpoint in
>> which multixact files are removed fails because replay tries to read a
>> file that has already been removed.
>
> Wait a minute, wait a minute.  There's a serious problem with this
> theory, at least in Steve's scenario.  This message:
>
> 2015-05-27 13:13:00 PDT [27341]: [1-1] LOG: database system was shut
> down at 2015-05-27
>
> That message implies a *clean shutdown*.  If he had performed an
> immediate shutdown or just pulled the plug, it would have said
> "database system was interrupted" or some such.
>
> There may be bugs in redo, also, but they don't explain what happened to Steve.
>
> Steve, is there any chance we can get your pg_controldata output and a
> list of all the files in pg_clog?

Err, make that pg_multixact/members, which I assume is at issue here.
You didn't show us the DETAIL line from this message, which would
presumably clarify:

FATAL:  could not access status of transaction 1

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


On Thu, May 28, 2015 at 8:03 AM, Robert Haas <robertmhaas@gmail.com> wrote:
>> Steve, is there any chance we can get your pg_controldata output and a
>> list of all the files in pg_clog?
>
> Err, make that pg_multixact/members, which I assume is at issue here.
> You didn't show us the DETAIL line from this message, which would
> presumably clarify:
>
> FATAL:  could not access status of transaction 1

And I'm still wrong, probably.  The new code in 9.4.2 cares about
being able to look at an *offsets* file to find the corresponding
member offset.  So most likely it is an offsets file that is missing
here.  The question is, how are we ending up with an offsets file that
is referenced by the control file but not actually present on disk?

It seems like it would be good to compare the pg_controldata output to
what is actually present in pg_multixact/offsets (hopefully that's the
right directory, now that I'm on my third try) and try to understand
what is going on here.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


On Thu, May 28, 2015 at 8:51 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> [ speculation ]

OK, I finally managed to reproduce this, after some off-list help from
Steve Kehlet (the reporter), Alvaro, and Thomas Munro.  Here's how to
do it:

1. Install any pre-9.3 version of the server and generate enough
multixacts to create at least TWO new segments.  When you shut down
the server, all segments except for the most current one will be
removed.  At this point, the only thing in
$PGDATA/pg_multixact/offsets should be a single file, and the name of
that file should not be 0000 or 0001.

2. Use pg_upgrade to upgrade to 9.3.4.  It is possible that versions <
9.3.4 will also work here, but you must not use 9.3.5 or higher,
because 9.3.5 includes Bruce's commit 3d2e18510, which arranged to
preserve relminmxid and datminmxid values.   At this point,
pg_controldata on the new cluster should show an oldestMultiXid value
greater than 1 (copied from the old cluster), but all the datminmxid
values are 1.  Also, initdb will have left behind a bogus 0000 file in
pg_multixact/offsets.

3. Move to 9.3.5 (or 9.3.6), not via pg_upgrade, but just by dropping
in the new binaries.  Follow the instructions in the 9.3.5 release
notes; since you created at least TWO new segments in step one, there
will be no 0001 file, and the query there will say that you should
remove the bogus 0000 file.  So do that, leaving just the good file in
pg_multixact/offsets.  At this point, pg_multixact/offsets is OK, and
pg_controldata still says that oldestMultiXid > 1, so that is also OK.
The only problem is that we've got some bogus datminmxid values
floating around.  Our next step will be to convince vacuum to
propagate the bogus datminmxid values back into pg_controldata.

4. Consume at least one transaction ID (e.g. SELECT txid_current())
and then do this:

postgres=# set vacuum_freeze_min_age = 0;
SET
postgres=# set vacuum_freeze_table_age = 0;
SET
postgres=# vacuum;
VACUUM

Setting the GUCs forces full table scans, so that we advance
relfrozenxid.  But notice that we were careful not to just run VACUUM
FREEZE, which would have also advanced relminmxid, which, for purposes
of reproducing this bug, is not what we want to happen.  So relminmxid
is still (incorrectly) set to 1 for every database.  However, since
the vacuum did advance relfrozenxid, it will call vac_truncate_clog,
which will call SetMultiXactIdLimit, which will propagate the bogus
datminmxid = 1 setting into shared memory.

(In my testing, this step doesn't work if performed on 9.3.4; you have
to do it on 9.3.5.  I think that's because of Tom's commit 78db307bb,
but I believe in a more complex test scenario you might be able to get
this to happen on 9.3.4 also.)

I believe it's the case that an autovacuum of even a single table can
substitute for this step if it happens to advance relfrozenxid but not
relminmxid.

5. The next checkpoint, or the shutdown checkpoint in any event, will
propagate the bogus value of 1 from shared memory back into the
control file.

6. Now try to start 9.3.7.  It will see the bogus oldestMultiXid = 1
value in the control file, attempt to read the corresponding offsets
file, and die.

In the process of investigating this, we found a few other things that
seem like they may also be bugs:

- As noted upthread, replaying an older checkpoint after a newer
checkpoint has already happened may lead to similar problems.  This
may be possible when starting from an online base backup; or when
restarting a standby that did not perform a restartpoint when
replaying the last checkpoint before the shutdown.

- pg_upgrade sets datminmxid =
old_cluster.controldata.chkpnt_nxtmulti, which is correct only if
there are ZERO multixacts in use at the time of the upgrade.  It would
be best, I think, to set this to the same value it had in the old
cluster, but if we're going to use a blanket value, I think it needs
to be chkpnt_oldstMulti.

- There's a third possible problem related to boundary cases in
SlruScanDirCbRemoveMembers, but I don't understand that one well
enough to explain it.  Maybe Thomas can jump in here and explain the
concern.

Thanks,

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


On 05/28/2015 12:56 PM, Robert Haas wrote:
>

FTR: Robert, you have been a Samurai on this issue. Our many thanks.

Sincerely,

jD


--
Command Prompt, Inc. - http://www.commandprompt.com/  503-667-4564
PostgreSQL Centered full stack support, consulting and development.
Announcing "I'm offended" is basically telling the world you can't
control your own emotions, so everyone else should do it for you.


Robert Haas wrote:
> On Thu, May 28, 2015 at 8:51 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> > [ speculation ]
>
> OK, I finally managed to reproduce this, after some off-list help from
> Steve Kehlet (the reporter), Alvaro, and Thomas Munro.  Here's how to
> do it:

It's a long list of steps, but if you consider them carefully, it
becomes clear that they are natural steps that a normal production
system would go through -- essentially the only one that's really
time-critical is the decision to pg_upgrade with a version before 9.3.5.

> In the process of investigating this, we found a few other things that
> seem like they may also be bugs:
>
> - As noted upthread, replaying an older checkpoint after a newer
> checkpoint has already happened may lead to similar problems.  This
> may be possible when starting from an online base backup; or when
> restarting a standby that did not perform a restartpoint when
> replaying the last checkpoint before the shutdown.

I'm going through this one now, as it's closely related and caused
issues for us.

--
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


On Fri, May 29, 2015 at 7:56 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Thu, May 28, 2015 at 8:51 AM, Robert Haas <robertmhaas@gmail.com> wrote:
>> [ speculation ]
>
> [...]  However, since
> the vacuum did advance relfrozenxid, it will call vac_truncate_clog,
> which will call SetMultiXactIdLimit, which will propagate the bogus
> datminmxid = 1 setting into shared memory.

Ah!

> [...]
>
> - There's a third possible problem related to boundary cases in
> SlruScanDirCbRemoveMembers, but I don't understand that one well
> enough to explain it.  Maybe Thomas can jump in here and explain the
> concern.

I noticed something in passing which is probably not harmful, and not
relevant to this bug report, it was just a bit confusing while
testing:  SlruScanDirCbRemoveMembers never deletes any files if
rangeStart == rangeEnd.  In practice, if you have an idle cluster with
a lot of multixact data and you VACUUM FREEZE all databases and then
CHECKPOINT, you might be surprised to see no member files going away
quite yet, but they'll eventually be truncated by a future checkpoint,
once rangeEnd has had a chance to advance to the next page due to more
multixacts being created.

If we want to fix this one day, maybe the right thing to do is to
treat the rangeStart == rangeEnd case the same way we treat rangeStart
< rangeEnd, that is, to assume that the range of pages isn't
wrapped/inverted in this case.  Although we don't have the actual
start and end offset values to compare here, we know that for them to
fall on the same page, the start offset index must be <= the end
offset index (since we added the new error to prevent member space
wrapping, we never allow the end to get close enough to the start to
fall on the same page).  Like this (not tested):

diff --git a/src/backend/access/transam/multixact.c
b/src/backend/access/transam/multixact.c
index 9568ff1..4d0bcc4 100644
--- a/src/backend/access/transam/multixact.c
+++ b/src/backend/access/transam/multixact.c
@@ -2755,7 +2755,7 @@ SlruScanDirCbRemoveMembers(SlruCtl ctl, char
*filename, int segpage,
  /* Recheck the deletion condition.  If it still holds, perform deletion */
  if ((range->rangeStart > range->rangeEnd &&
  segpage > range->rangeEnd && segpage < range->rangeStart) ||
- (range->rangeStart < range->rangeEnd &&
+ (range->rangeStart <= range->rangeEnd &&
  (segpage < range->rangeStart || segpage > range->rangeEnd)))
  SlruDeleteSegment(ctl, filename);

--
Thomas Munro
http://www.enterprisedb.com


On Thu, May 28, 2015 at 4:06 PM, Joshua D. Drake <jd@commandprompt.com> wrote:
> FTR: Robert, you have been a Samurai on this issue. Our many thanks.

Thanks!  I really appreciate the kind words.

So, in thinking through this situation further, it seems to me that
the situation is pretty dire:

1. If you pg_upgrade to 9.3 before 9.3.5, then you may have relminmxid
or datminmxid values which are 1 instead of the correct value.
Setting the value to 1 was too far in the past if your MXID counter is
< 2B, and too far in the future if your MXID counter is > 2B.

2. If you pg_upgrade to 9.3.7 or 9.4.2, then you may have datminmxid
values which are equal to the next-mxid counter instead of the correct
value; in other words, they are two new.

3. If you pg_upgrade to 9.3.5, 9.3.6, 9.4.0, or 9.4.1, then you will
have the first problem for tables in template databases, and the
second one for the rest. (See 866f3017a.)

4. Wrong relminmxid or datminmxid values can eventually propagate into
the control file, as demonstrated in my previous post.  Therefore, we
can't count on relminmxid to be correct, we can't count on datminmxid
to be correct, and we can't count on the control file to be correct.
That's a sack of sad.

5. If the values are too far in the past, then nothing really terrible
will happen unless you upgrade to 9.3.7 or 9.4.2, at which point the
system will refuse to start.  Forcing a VACUUM FREEZE on every
database, including the unconnectable ones, should fix this and allow
you to upgrade safely - which you want to do, because 9.3.7 and 9.4.2
fix a different set of multixact data loss bugs.

6. If the values are too far in the future, the system may fail to
prevent wraparound, leading to data loss.  I am not totally clear on
whether a VACUUM FREEZE will fix this problem.  It seems like the
chances are better if you are running at least 9.3.5+ or 9.4.X,
because of 78db307bb.  But I'm not sure how complete a fix that is.

So what do we do about this?  I have a few ideas:

A. Most obviously, we should fix pg_upgrade so that it installs
chkpnt_oldstMulti instead of chkpnt_nxtmulti into datfrozenxid, so
that we stop creating new instances of this problem.  That won't get
us out of the hole we've dug for ourselves, but we can at least try to
stop digging.  (This is assuming I'm right that chkpnt_nxtmulti is the
wrong thing - anyone want to double-check me on that one?)

B. We need to change find_multixact_start() to fail softly.  This is
important because it's legitimate for it to fail in recovery, as
discussed upthread, and also because we probably want to eliminate the
fail-to-start hazard introduced in 9.4.2 and 9.3.7.
find_multixact_start() is used in three places, and they each require
separate handling:

- In SetMultiXactIdLimit, find_multixact_start() is used to set
MultiXactState->oldestOffset, which is used to determine how
aggressively to vacuum.  If find_multixact_start() fails, we don't
know how aggressively we need to vacuum to prevent members wraparound;
it's probably best to decide to vacuum as aggressively as possible.
Of course, if we're in recovery, we won't vacuum either way; the fact
that it fails softly is good enough.

- In DetermineSafeOldestOffset, find_multixact_start() is used to set
MultiXactState->offsetStopLimit.  If it fails here, we don't know when
to refuse multixact creation to prevent wraparound.  Again, in
recovery, that's fine.  If it happens in normal running, it's not
clear what to do.  Refusing multixact creation is an awfully blunt
instrument.  Maybe we can scan pg_multixact/offsets to determine a
workable stop limit: the first file greater than the current file that
exists, minus two segments, is a good stop point.  Perhaps we ought to
use this mechanism here categorically, not just when
find_multixact_start() fails.  It might be more robust than what we
have now.

- In TruncateMultiXact, find_multixact_start() is used to set the
truncation point for the members SLRU.  If it fails here, I'm guessing
the right solution is not to truncate anything - instead, rely on
intense vacuuming to eventually advance oldestMXact to a value whose
member data still exists; truncate then.

C. I think we should also change TruncateMultiXact() to truncate
offsets first, and then members.  As things stand, if we truncate
members first, we increase the risk of seeing an offset that will fail
when passed to find_multixact_start(), because TruncateMultiXact()
might get interrupted before it finishes.  That seem like an
unnecessary risk.

Thoughts?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Robert Haas wrote:

> 2. If you pg_upgrade to 9.3.7 or 9.4.2, then you may have datminmxid
> values which are equal to the next-mxid counter instead of the correct
> value; in other words, they are too new.

What you describe is what happens if you upgrade from 9.2 or earlier.
For this case we use this call:

        exec_prog(UTILITY_LOG_FILE, NULL, true,
                  "\"%s/pg_resetxlog\" -m %u,%u \"%s\"",
                  new_cluster.bindir,
                  old_cluster.controldata.chkpnt_nxtmulti + 1,
                  old_cluster.controldata.chkpnt_nxtmulti,
                  new_cluster.pgdata);

This uses the old cluster's nextMulti value as oldestMulti in the new
cluster, and that value+1 is used as nextMulti.  This is correct: we
don't want to preserve any of the multixact state from the previous
cluster; anything before that value can be truncated with no loss of
critical data.  In fact, there is no critical data before that value at
all.

If you upgrade from 9.3, this other call is used instead:

        /*
         * we preserve all files and contents, so we must preserve both "next"
         * counters here and the oldest multi present on system.
         */
        exec_prog(UTILITY_LOG_FILE, NULL, true,
                  "\"%s/pg_resetxlog\" -O %u -m %u,%u \"%s\"",
                  new_cluster.bindir,
                  old_cluster.controldata.chkpnt_nxtmxoff,
                  old_cluster.controldata.chkpnt_nxtmulti,
                  old_cluster.controldata.chkpnt_oldstMulti,
                  new_cluster.pgdata);

In this case we use the oldestMulti from the old cluster as oldestMulti
in the new cluster, which is also correct.


> A. Most obviously, we should fix pg_upgrade so that it installs
> chkpnt_oldstMulti instead of chkpnt_nxtmulti into datfrozenxid, so
> that we stop creating new instances of this problem.  That won't get
> us out of the hole we've dug for ourselves, but we can at least try to
> stop digging.  (This is assuming I'm right that chkpnt_nxtmulti is the
> wrong thing - anyone want to double-check me on that one?)

I don't think there's anything that we need to fix here.


> B. We need to change find_multixact_start() to fail softly.  This is
> important because it's legitimate for it to fail in recovery, as
> discussed upthread, and also because we probably want to eliminate the
> fail-to-start hazard introduced in 9.4.2 and 9.3.7.
> find_multixact_start() is used in three places, and they each require
> separate handling:
>
> - In SetMultiXactIdLimit, find_multixact_start() is used to set
> MultiXactState->oldestOffset, which is used to determine how
> aggressively to vacuum.  If find_multixact_start() fails, we don't
> know how aggressively we need to vacuum to prevent members wraparound;
> it's probably best to decide to vacuum as aggressively as possible.
> Of course, if we're in recovery, we won't vacuum either way; the fact
> that it fails softly is good enough.

Sounds good.

> - In DetermineSafeOldestOffset, find_multixact_start() is used to set
> MultiXactState->offsetStopLimit.  If it fails here, we don't know when
> to refuse multixact creation to prevent wraparound.  Again, in
> recovery, that's fine.  If it happens in normal running, it's not
> clear what to do.  Refusing multixact creation is an awfully blunt
> instrument.  Maybe we can scan pg_multixact/offsets to determine a
> workable stop limit: the first file greater than the current file that
> exists, minus two segments, is a good stop point.  Perhaps we ought to
> use this mechanism here categorically, not just when
> find_multixact_start() fails.  It might be more robust than what we
> have now.

Blunt instruments have the desirable property of being simple.  We don't
want any more clockwork here, I think --- this stuff is pretty
complicated already.  As far as I understand, if during normal running
we see that find_multixact_start has failed, sufficient vacuuming should
get it straight eventually with no loss of data.

> - In TruncateMultiXact, find_multixact_start() is used to set the
> truncation point for the members SLRU.  If it fails here, I'm guessing
> the right solution is not to truncate anything - instead, rely on
> intense vacuuming to eventually advance oldestMXact to a value whose
> member data still exists; truncate then.

Fine.

> C. I think we should also change TruncateMultiXact() to truncate
> offsets first, and then members.  As things stand, if we truncate
> members first, we increase the risk of seeing an offset that will fail
> when passed to find_multixact_start(), because TruncateMultiXact()
> might get interrupted before it finishes.  That seem like an
> unnecessary risk.

Not sure about this point.  We did it the way you propose previously,
and found it to be a problem because sometimes we tried to read an
offset file that was no longer there.  Do we really read member files
anywhere?  I thought we only tried to read offset files.  If we remove
member files, what is it that we try to read and find not to be present?

--
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Alvaro Herrera wrote:
> Robert Haas wrote:
>
> > 2. If you pg_upgrade to 9.3.7 or 9.4.2, then you may have datminmxid
> > values which are equal to the next-mxid counter instead of the correct
> > value; in other words, they are too new.
>
> What you describe is what happens if you upgrade from 9.2 or earlier.

Oh, you're referring to pg_database values, not the ones in pg_control.
Ugh :-(  This invalidates my argument that there's nothing to fix,
obviously ... it's clearly broken as is.

--
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


On Thu, May 28, 2015 at 10:41 PM, Alvaro Herrera
<alvherre@2ndquadrant.com> wrote:
>> 2. If you pg_upgrade to 9.3.7 or 9.4.2, then you may have datminmxid
>> values which are equal to the next-mxid counter instead of the correct
>> value; in other words, they are too new.
>
> [ discussion of how the control file's oldestMultiXid gets set ]

I'm talking about the datminmxid in pg_database.  You're talking about
the contents of pg_control.  Those are two different things.  The
relevant code is not what you quote, but rather this:

        /* set pg_database.datminmxid */
        PQclear(executeQueryOrDie(conn_template1,
                                                          "UPDATE
pg_catalog.pg_database "
                                                          "SET
datminmxid = '%u'",

old_cluster.controldata.chkpnt_nxtmulti));

Tom previously observed this to be wrong, here:

http://www.postgresql.org/message-id/9879.1405877821@sss.pgh.pa.us

Although Tom was correct to note that it's wrong, nothing ever got fixed.  :-(

>> A. Most obviously, we should fix pg_upgrade so that it installs
>> chkpnt_oldstMulti instead of chkpnt_nxtmulti into datfrozenxid, so
>> that we stop creating new instances of this problem.  That won't get
>> us out of the hole we've dug for ourselves, but we can at least try to
>> stop digging.  (This is assuming I'm right that chkpnt_nxtmulti is the
>> wrong thing - anyone want to double-check me on that one?)
>
> I don't think there's anything that we need to fix here.

I see your followup now agreeing this is broken.  Since I wrote the
previous email, I've had two new ideas that I think are both better
than the above.

1. Figure out the oldest multixact offset that actually exists in
pg_multixacts/offsets, and use that value.  If any older MXIDs still
exist, they won't be able to be looked up anyway, so if they wrap
around, it doesn't matter.  The only value that needs to be reliable
in order to do this is pg_controldata's NextMultiXactId, which to the
best of my knowledge is not implicated in any of these bugs.
pg_upgrade can check that the offsets file containing that value
exists, and if not bail out.  Then, start stepping backwards a file at
a time.  When it hits a missing file, the first multixact in the next
file is a safe value of datfrozenxid for every database in the new
cluster.  If older MXIDs exist, they're unreadable anyway, so if they
wrap, nothing lost.  If the value is older than necessary, the first
vacuum in each database will fix it.  We have to be careful: if we
step back too many files, such that our proposed datfrozenxid might
wrap, then we've got a confusing situation and had better bail out -
or at least think really carefully about what to do.

2. When we're upgrading from a version 9.3 or higher, copy the EXACT
datminmxid from each old database to the corresponding new database.
This seems like it ought to be really simple.

>> - In DetermineSafeOldestOffset, find_multixact_start() is used to set
>> MultiXactState->offsetStopLimit.  If it fails here, we don't know when
>> to refuse multixact creation to prevent wraparound.  Again, in
>> recovery, that's fine.  If it happens in normal running, it's not
>> clear what to do.  Refusing multixact creation is an awfully blunt
>> instrument.  Maybe we can scan pg_multixact/offsets to determine a
>> workable stop limit: the first file greater than the current file that
>> exists, minus two segments, is a good stop point.  Perhaps we ought to
>> use this mechanism here categorically, not just when
>> find_multixact_start() fails.  It might be more robust than what we
>> have now.
>
> Blunt instruments have the desirable property of being simple.  We don't
> want any more clockwork here, I think --- this stuff is pretty
> complicated already.  As far as I understand, if during normal running
> we see that find_multixact_start has failed, sufficient vacuuming should
> get it straight eventually with no loss of data.

Unfortunately, I don't believe that to be true.  If
find_multixact_start() fails, we have no idea how close we are to the
member wraparound point.  Sure, we can start vacuuming, but the user
can be creating new, large multixacts at top speed while we're doing
that, which could cause us to wrap around before we can finish
vacuuming.

Furthermore, if we adopted the blunt instrument, users who are in this
situation would update to 9.4.3 (or whenever these fixes get released)
and find that they can't create new MXIDs for a possibly very
protracted period of time.  That amounts to an outage for which users
won't thank us.

Looking at the files in the directory seems pretty simple in this
case, and quite a bit more fail-safe than what we're doing right now.
The current logic purports to leave a one-file gap in the member
space, but there's no guarantee that the gap really exists on disk the
way we think it does.  With this approach, we can be certain that
there is a gap.  And that is a darned good thing to be certain about.

>> C. I think we should also change TruncateMultiXact() to truncate
>> offsets first, and then members.  As things stand, if we truncate
>> members first, we increase the risk of seeing an offset that will fail
>> when passed to find_multixact_start(), because TruncateMultiXact()
>> might get interrupted before it finishes.  That seem like an
>> unnecessary risk.
>
> Not sure about this point.  We did it the way you propose previously,
> and found it to be a problem because sometimes we tried to read an
> offset file that was no longer there.  Do we really read member files
> anywhere?  I thought we only tried to read offset files.  If we remove
> member files, what is it that we try to read and find not to be present?

Do you have a link to the previous discussion?

I mean, the problem we're having right now is that sometimes we have
an offset, but the corresponding member isn't there.  So clearly
offsets reference members.  Do members also reference offsets?  I
didn't think so, but life is full of surprises.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


On Fri, May 29, 2015 at 11:24 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> A. Most obviously, we should fix pg_upgrade so that it installs
> chkpnt_oldstMulti instead of chkpnt_nxtmulti into datfrozenxid, so
> that we stop creating new instances of this problem.  That won't get
> us out of the hole we've dug for ourselves, but we can at least try to
> stop digging.  (This is assuming I'm right that chkpnt_nxtmulti is the
> wrong thing - anyone want to double-check me on that one?)

Yes, it seems like this could lead to truncation of multixacts still
referenced by tuples, leading to errors when updating, locking,
vacuuming.  Why don't we have reports of that?

> B. We need to change find_multixact_start() to fail softly.  This is
> important because it's legitimate for it to fail in recovery, as
> discussed upthread, and also because we probably want to eliminate the
> fail-to-start hazard introduced in 9.4.2 and 9.3.7.
> find_multixact_start() is used in three places, and they each require
> separate handling:

Here is an experimental WIP patch that changes StartupMultiXact and
SetMultiXactIdLimit to find the oldest multixact that exists on disk
(by scanning the directory), and uses that if it is more recent than
the oldestMultiXactId from shmem, when calling
DetermineSafeOldestOffset.  I'm not all that happy with it, see below,
but let me know what you think.

Using unpatched master, I reproduced the startup error with a bit of a
short cut:

1.  initdb, generate enough multixacts to get more than one offsets file
2.  ALTER DATABASE template0 ALLOW_CONNECTION = true;, vacuumdb
--freeze --all, CHECKPOINT
3.  verify that pg_control now holds a large oldestMultiXactId, and
note NextMultiXactId
4.  shutdown, pg_resetxlog -m (NextMultiXactId from 3),1 pg_data
5.  start up: fails

Apply this patch, and it starts up successfully.

What are the repro steps for the replay problem?  Is a basebackup of a
large database undergoing truncation and some good timing needed?

> - In SetMultiXactIdLimit, find_multixact_start() is used to set
> MultiXactState->oldestOffset, which is used to determine how
> aggressively to vacuum.  If find_multixact_start() fails, we don't
> know how aggressively we need to vacuum to prevent members wraparound;
> it's probably best to decide to vacuum as aggressively as possible.
> Of course, if we're in recovery, we won't vacuum either way; the fact
> that it fails softly is good enough.

Isn't it enough to use the start offset for the most recent of the
oldest multixact ID and the oldest multixact found by scanning
pg_multixact/offsets?  In this patch, it does that, but I'm not happy
with the time the work is done, it just doesn't seem right for
SetMultiXactIdLimit to be scanning that directory.  The result of that
operation should only change when files have been truncated anyway,
and the truncate code was already doing a filesystem scan.  Maybe the
truncate code should store the earliest multixact ID found on disk in
shared memory, so that SetMultiXactIdLimit can use it for free.  I
tried to get that working but couldn't figure out where it should be
initialised -- StartupMultiXact is too late (StartupXLOG calls
SetMultiXactIdLimit before that), but BootstrapMultiXact and
MultiXactShmemInit didn't seem like the right places either.

> - In DetermineSafeOldestOffset, find_multixact_start() is used to set
> MultiXactState->offsetStopLimit.  If it fails here, we don't know when
> to refuse multixact creation to prevent wraparound.  Again, in
> recovery, that's fine.  If it happens in normal running, it's not
> clear what to do.  Refusing multixact creation is an awfully blunt
> instrument.  Maybe we can scan pg_multixact/offsets to determine a
> workable stop limit: the first file greater than the current file that
> exists, minus two segments, is a good stop point.  Perhaps we ought to
> use this mechanism here categorically, not just when
> find_multixact_start() fails.  It might be more robust than what we
> have now.

Done in this patch -- the truncate code calls
DetermineSafeOldestOffset with the earliest SLRU found by scanning if
that's more recent than the shmem value, and then
DetermineSafeOldestOffset applies the step-back-one-whole-segment
logic to that as before.

> - In TruncateMultiXact, find_multixact_start() is used to set the
> truncation point for the members SLRU.  If it fails here, I'm guessing
> the right solution is not to truncate anything - instead, rely on
> intense vacuuming to eventually advance oldestMXact to a value whose
> member data still exists; truncate then.

TruncateMultiXact already contained logic to do nothing at all if
oldestMXact is older than the earliest it can find on disk.  I moved
that code into find_earliest_multixact_on_disk() to be able to use it
elsewhere too, in this patch.

> C. I think we should also change TruncateMultiXact() to truncate
> offsets first, and then members.  As things stand, if we truncate
> members first, we increase the risk of seeing an offset that will fail
> when passed to find_multixact_start(), because TruncateMultiXact()
> might get interrupted before it finishes.  That seem like an
> unnecessary risk.

I don't see why the order matters.  find_multixact_start() doesn't
read the members, only the offsets SLRU (ie the index into members,
not the contents of members).  As I understand it, the only time we
need to access the members themselves is when we encounter multixacts
in tuple headers (updating, locking or vacuuming).  If you have
truncated multixacts referenced in your tuples then you have a
different form of corruption than the
pg_upgrade-tramples-on-oldestMultiXactId case we're trying to handle
gracefully here.

--
Thomas Munro
http://www.enterprisedb.com

Attachment
Thomas Munro <thomas.munro@enterprisedb.com> writes:
> On Fri, May 29, 2015 at 11:24 AM, Robert Haas <robertmhaas@gmail.com> wrote:
>> B. We need to change find_multixact_start() to fail softly.

> Here is an experimental WIP patch that changes StartupMultiXact and
> SetMultiXactIdLimit to find the oldest multixact that exists on disk
> (by scanning the directory), and uses that if it is more recent than
> the oldestMultiXactId from shmem,

Not sure about the details of this patch, but I was planning to propose
what I think is the same thing: the way to make find_multixact_start()
fail softly is to have it find the oldest actually existing file if the
one that should be there isn't.

This would preserve the idea that we aren't trusting the multixact
tracking data to be completely right, which was the point of 78db307bb
and which evidently is still essential.  It would also obviate the need
for other places to contain similar logic.

            regards, tom lane


On Fri, May 29, 2015 at 10:17 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Thomas Munro <thomas.munro@enterprisedb.com> writes:
>> On Fri, May 29, 2015 at 11:24 AM, Robert Haas <robertmhaas@gmail.com> wrote:
>>> B. We need to change find_multixact_start() to fail softly.
>
>> Here is an experimental WIP patch that changes StartupMultiXact and
>> SetMultiXactIdLimit to find the oldest multixact that exists on disk
>> (by scanning the directory), and uses that if it is more recent than
>> the oldestMultiXactId from shmem,
>
> Not sure about the details of this patch, but I was planning to propose
> what I think is the same thing: the way to make find_multixact_start()
> fail softly is to have it find the oldest actually existing file if the
> one that should be there isn't.

Working on that now.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


On Fri, May 29, 2015 at 12:43 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> Working on that now.

OK, here's a patch.  Actually two patches, differing only in
whitespace, for 9.3 and for master (ha!).  I now think that the root
of the problem here is that DetermineSafeOldestOffset() and
SetMultiXactIdLimit() were largely ignorant of the possibility that
they might be called at points in time when the cluster was
inconsistent.  SetMultiXactIdLimit() bracketed certain parts of its
logic with if (!InRecovery), but those guards were ineffective because
it gets called before InRecovery is set in the first place.

It seems pretty clear that we can't effectively determine anything
about member wraparound until the cluster is consistent.  Before then,
there might be files missing from the offsets or members SLRUs which
get put back during replay.  There could even be gaps in the sequence
of files, with some things having made it to disk before the crash (or
having made it into the backup) and others not.  So all the work of
determining what the safe stop points and vacuum thresholds for
members are needs to be postponed until TrimMultiXact() time.  And
that's fine, because we don't need this information in recovery anyway
- it only affects behavior in normal running.

So this patch does the following:

1. Moves the call to DetermineSafeOldestOffset() that appears in
StartupMultiXact() to TrimMultiXact(), so that we don't try to do this
until we're consistent.  Also, instead of passing
MultiXactState->oldestMultiXactId, pass the newer of that value and
the earliest offset that exists on disk.  That way, it won't try to
read data that's not there.  Note that the second call to
DetermineSafeOldestOffset() in TruncateMultiXact() doesn't need a
similar guard, because we already bail out of that function early if
the multixacts we're going to truncate away don't exist.

2. Adds a new flag MultiXactState->didTrimMultiXact indicate whether
we've finished TrimMultiXact(), and arranges for SetMultiXactIdLimit()
to use that rather than InRecovery to test whether it's safe to do
complicated things that might require that the cluster is consistent.
This is a slight behavior change, since formerly we would have tried
to do that stuff very early in the startup process, and now it won't
happen until somebody completes a vacuum operation.  If that's a
problem, we could consider doing it in TrimMultiXact(), but I don't
think it's safe the way it was.  The new flag also prevents
oldestOffset from being set while in recovery; I think it would be
safe to do that in recovery once we've reached consistency, but I
don't believe it's necessary.

3. Arranges for TrimMultiXact() to set oldestOffset.  This is
necessary because, previously, we relied on SetMultiXactIdLimit doing
that during early startup or during recovery, and that's no longer
true.  Here too we set oldestOffset keeping in mind that our notion of
the oldest multixact may point to something that doesn't exist; if so,
we use the oldest MXID that does.

4. Modifies TruncateMultiXact() so that it doesn't re-scan the SLRU
directory on every call to find the oldest file that exists.  Instead,
it arranges to remember the value from the first scan and then updates
it thereafter to reflect its own truncation activity.  This isn't
absolutely necessary, but because this oldest-file logic is used in
multiple places (TrimMultiXact, SetMultiXactIdLimit, and
TruncateMultiXact all need it directly or indirectly) caching the
value seems like a better idea than recomputing it frequently.

I have tested that this patch fixes Steve Kehlet's problem, or at
least what I believe to be Steve Kehlet's problem based on the
reproduction scenario I described upthread.  I believe it will also
fix the problems with starting up from a base backup with Alvaro
mentioned upthread.  It won't fix the fact that pg_upgrade is putting
a wrong value into everybody's datminmxid field, which should really
be addressed too, but I've been working on this for about three days
virtually non-stop and I don't have the energy to tackle it right now.
If anyone feels the urge to step into that breech, I think what it
needs to do is: when upgrading from a 9.3-or-later instance, copy over
each database's datminmxid into the corresponding database in the new
cluster.

Aside from that, it's very possible that despite my best efforts this
has serious bugs.  Review and testing would be very much appreciated.

Thanks,

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Attachment
On Thu, May 28, 2015 at 07:24:26PM -0400, Robert Haas wrote:
> On Thu, May 28, 2015 at 4:06 PM, Joshua D. Drake <jd@commandprompt.com> wrote:
> > FTR: Robert, you have been a Samurai on this issue. Our many thanks.
>
> Thanks!  I really appreciate the kind words.
>
> So, in thinking through this situation further, it seems to me that
> the situation is pretty dire:
>
> 1. If you pg_upgrade to 9.3 before 9.3.5, then you may have relminmxid
> or datminmxid values which are 1 instead of the correct value.
> Setting the value to 1 was too far in the past if your MXID counter is
> < 2B, and too far in the future if your MXID counter is > 2B.
>
> 2. If you pg_upgrade to 9.3.7 or 9.4.2, then you may have datminmxid
> values which are equal to the next-mxid counter instead of the correct
> value; in other words, they are two new.
>
> 3. If you pg_upgrade to 9.3.5, 9.3.6, 9.4.0, or 9.4.1, then you will
> have the first problem for tables in template databases, and the
> second one for the rest. (See 866f3017a.)

I think we need to step back and look at the brain power required to
unravel the mess we have made regarding multi-xact and fixes.  (I bet
few people can even remember which multi-xact fixes went into which
releases --- I can't.)  Instead of working on actual features, we are
having to do this complex diagnosis because we didn't do a thorough
analysis at the time a pattern of multi-xact bugs started to appear.
Many projects deal with this compound bug debt regularly, but we have
mostly avoided this fate.

--
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

  + Everyone has their own god. +


On Fri, May 29, 2015 at 12:08 PM Robert Haas <robertmhaas@gmail.com> wrote:
OK, here's a patch.

I grabbed branch REL9_4_STABLE from git, and Robert got me a 9.4-specific patch. I rebuilt, installed, and postgres started up successfully!  I did a bunch of checks, had our app run several thousand SQL queries against it, had a colleague check it out, and it looks good. Looking at top and ps, I don't see anything funny (e.g. no processes spinning cpu, etc), things look normal. Let me know if I can provide anything else. 


On 2015-05-29 15:49:53 -0400, Bruce Momjian wrote:
> I think we need to step back and look at the brain power required to
> unravel the mess we have made regarding multi-xact and fixes.  (I bet
> few people can even remember which multi-xact fixes went into which
> releases --- I can't.)  Instead of working on actual features, we are
> having to do this complex diagnosis because we didn't do a thorough
> analysis at the time a pattern of multi-xact bugs started to appear.
> Many projects deal with this compound bug debt regularly, but we have
> mostly avoided this fate.

What is the consequences of that observation you're imagining?


On 2015-05-29 15:08:11 -0400, Robert Haas wrote:
> It seems pretty clear that we can't effectively determine anything
> about member wraparound until the cluster is consistent.

I wonder if this doesn't actually hints at a bigger problem.  Currently,
to determine where we need to truncate SlruScanDirectory() is
used. That, afaics, could actually be a problem during recovery when
we're not consistent.

Consider the scenario where a very large database is copied while
running. At the start of the backup we'll determine at which checkpoint
recovery will start and store it in the label. After that the copy will
start, copying everything slowly. That works because we expect recovery
to fix things up.  The problem I see WRT multixacts is that the copied
state of pg_multixact could be wildly different from the one at the
label's checkpoint. During recovery, before reaching the first
checkpoint, we'll create multixact files as used at the time of the
checkpoint. But the rest of pg_multixact may be ahead 2**31 xacts.

For clog and such that's not a problem because the truncation
etc. points are all stored in WAL - during recovery we just replay the
truncations that happened on the master, there's no need to look at the
data directory. And we won't access the clog before being consistent.

But for multixacts is different. To avoid ending up with
pg_multixact/*/* directories we have to do truncations during
recovery. As there's currently no truncation records we have to do that
scanning the data directory. But that state could be "from the future".

I considered for a second whether the solution for that could be to not
truncate while inconsistent - but I think that doesn't solve anything as
then we can end up with directories where every single offsets/member
file exists.  We could possibly try to fix that by always truncating
away slru segments in offsets that we know to be too old to exist in a
valid database. But achieving the same for members fries my brain.  It
also seems awfully risky.

I think at least for 9.5+ we should a) invent proper truncation records
for pg_multixact b) start storing oldestValidMultiOffset in pg_control.
The current hack of scanning the directories to get knowledge we should
have is a pretty bad hack, and we should not continue using it forever.
I think we might end up needing to do a) even in the backbranches.


Am I missing something?


This problem isn't conflicting with most of the fixes you describe, so
I'll continue with reviewing those.


Greetings,

Andres Freund


On Fri, May 29, 2015 at 3:08 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> It won't fix the fact that pg_upgrade is putting
> a wrong value into everybody's datminmxid field, which should really
> be addressed too, but I've been working on this for about three days
> virtually non-stop and I don't have the energy to tackle it right now.
> If anyone feels the urge to step into that breech, I think what it
> needs to do is: when upgrading from a 9.3-or-later instance, copy over
> each database's datminmxid into the corresponding database in the new
> cluster.

Bruce was kind enough to spend some time on IM with me this afternoon,
and I think this may actually be OK.  What pg_upgrade does is:

1. First, put next-xid into the relminmxid for all tables, including
catalog tables.  This is the correct behavior for upgrades from a
pre-9.3 release, and is correct for catalog tables in general.

2. Next, restoring the schema dump will set the relminmxid values for
all non-catalog tables to the value dumped from the old cluster.  At
this point, everything is fine provided that we are coming from a
release 9.3 or newer.  But if the old cluster is pre-9.3, it will have
dumped *zero* values for all of its relminmxid values; so all of the
user tables go from the correct value they had after step 1 to an
incorrect value.

3. Finally, if the old cluster is pre-9.3, repeat step 1, undoing the
damage done in step 2.

This is a bit convoluted, but I don't know of a reason why it
shouldn't work.  Sorry for the false alarm.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


On Fri, May 29, 2015 at 9:46 PM, Andres Freund <andres@anarazel.de> wrote:
> On 2015-05-29 15:08:11 -0400, Robert Haas wrote:
>> It seems pretty clear that we can't effectively determine anything
>> about member wraparound until the cluster is consistent.
>
> I wonder if this doesn't actually hints at a bigger problem.  Currently,
> to determine where we need to truncate SlruScanDirectory() is
> used. That, afaics, could actually be a problem during recovery when
> we're not consistent.

I agree.  I actually meant to mention this in my previous email, but,
owing to exhaustion and burnout, didn't.

> I think at least for 9.5+ we should a) invent proper truncation records
> for pg_multixact b) start storing oldestValidMultiOffset in pg_control.
> The current hack of scanning the directories to get knowledge we should
> have is a pretty bad hack, and we should not continue using it forever.
> I think we might end up needing to do a) even in the backbranches.

That may be the right thing to do.  I'm concerned that changing the
behavior of master too much will make it every subsequent fix twice as
hard, because we'll have to do one fix in master and another fix in
the back-branches.  I'm also concerned that it will create even more
convoluted failure scenarios. The failure-to-start problem discussed
on this thread requires a chain of four (maybe three) different
PostgreSQL versions in order to create it, and the more things we go
change, the harder it's going to be to reason about this stuff.

The diseased and rotting elephant in the room here is that clusters
with bogus relminmxid, datminmxid, and/or oldestMultiXid values may
exist in the wild and we really have no plan to get rid of them.
78db307bb may have helped somewhat - although I'm haven't grokked what
it's about well enough to be sure - but it's certainly not a complete
solution, as this bug report itself illustrates rather well.  Unless
we figure out some clever solution that is not now apparent to me, or
impose a hard pg_upgrade compatibility break at some point, we
basically can't count on pg_control's "oldest multixact" information
to be correct ever again.  We may be running into clusters 15 years
from now that have problems that are just holdovers from what was
fixed in 9.3.5.

One thing I think we should definitely do is add one or two additional
fields to pg_controldata that get filled in by pg_upgrade.  One of
them should be "the oldest known catversion in the lineage of this
cluster" and the other should be "the most recent catverson in the
lineage of this cluster before this one".   Or maybe we should store
PG_VERSION_NUM values.  Or store both things.  I think that would make
troubleshooting this kind of problem a lot easier - just from the
pg_controldata output, you'd be able to tell whether the cluster had
been pg_upgraded, whether it had been pg_upgraded once or multiple
times, and at least some of the versions involved, without relying on
the user's memory of what they did and when.  Fortunately, Steve
Kellet had a pretty clear idea of what his history was, but not all
users know that kind of thing, and I've wanted it more than once while
troubleshooting.

Another thing I think we should do is add a field to pg_class that is
propagated by pg_upgrade and stores the most recent PG_VERSION_NUM
that is known to have performed a scan_all vacuum of the table.  This
would allow us to do things in the future like (a) force a full-table
vacuum of any table that hasn't been vacuumed since $BUGGYRELEASE or
(b) advise users to manually inspect the values and manually perform
said vacuum or (c) only believe that certain information about a table
is accurate if it's been full-scanned by a vacuum newer than
$BUGGYRELEASE.  It could also be used as part of a strategy for
reclaiming HEAP_MOVED_IN/HEAP_MOVED_OFF; e.g. you can't upgrade to
10.5, which repurposes those bits, unless you've done a scan_all
vacuum on every table with a release new enough to guarantee that
they're not used for their historical purpose.

> This problem isn't conflicting with most of the fixes you describe, so
> I'll continue with reviewing those.

Thank you.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


On Sat, May 30, 2015 at 1:46 PM, Andres Freund <andres@anarazel.de> wrote:
> On 2015-05-29 15:08:11 -0400, Robert Haas wrote:
>> It seems pretty clear that we can't effectively determine anything
>> about member wraparound until the cluster is consistent.
>
> I wonder if this doesn't actually hints at a bigger problem.  Currently,
> to determine where we need to truncate SlruScanDirectory() is
> used. That, afaics, could actually be a problem during recovery when
> we're not consistent.
>
> Consider the scenario where a very large database is copied while
> running. At the start of the backup we'll determine at which checkpoint
> recovery will start and store it in the label. After that the copy will
> start, copying everything slowly. That works because we expect recovery
> to fix things up.  The problem I see WRT multixacts is that the copied
> state of pg_multixact could be wildly different from the one at the
> label's checkpoint. During recovery, before reaching the first
> checkpoint, we'll create multixact files as used at the time of the
> checkpoint. But the rest of pg_multixact may be ahead 2**31 xacts.

Yes, I think the code in TruncateMultiXact that scans for the earliest
multixact only works when the segment files span at most 2^31 of
multixact space. If they span more than that, MultiXactIdPrecedes is
no long able to provide a total ordering, so of the scan may be wrong,
depending on the order that it encounters the files.

Incidentally, your description of that scenario gave me an idea for
how to reproduce a base backup that 9.4.2 or master can't start.  I
tried this first:

1.  Set up with max_wal_senders = 1, wal_level = hot_standby, initdb
2.  Create enough multixacts to fill a couple of segments in
pg_multixacts/offsets using "explode_mxact_members 99 1000" (create
foo table first)
3.  Start a base backup with logs, but break in
src/backend/replication/basebackup.c after
sendFileWithContent(BACKUP_LABEL_FILE, labelfile); and before sending
the contents of the data dir (including pg_multixacts)... (or just put
a big sleep in there)
4.  UPDATE pg_database SET datallowconn = true; vacuumdb --freeze
--all; CHECKPOINT;, see that offsets/0000 is now gone and
oldestMultiXid is 98001 in pg_control
5.  ... allow the server backend to continue; the basebackup completes.

Inspecting the new data directory, I see that offsets/0000 is not
present as expected, and pg_control contains the oldestMultiXid 98001.

Since pg_control was copied after pg_multixacts and my database didn't
move between those copies, it points to a valid multixact (unlike the
pg_upgrade scenario) and is able to start up, but does something
different again which may or may not be good, I'm not sure:

LOG:  database system was interrupted; last known up at 2015-05-30 14:30:23 NZST
LOG:  file "pg_multixact/offsets/0000" doesn't exist, reading as zeroes
LOG:  redo starts at 0/7000028
LOG:  consistent recovery state reached at 0/70C8898
LOG:  redo done at 0/70C8898
LOG:  last completed transaction was at log time 2015-05-30 14:30:17.261436+12
LOG:  database system is ready to accept connections

My next theory about how to get a FATAL during startup is something
like this:  Break in basebackup.c in between copying pg_multixacts and
copying pg_control (simulating a very large/slow file copy, perhaps if
'base' happens to get copied after 'pg_multixacts', though I don't
know if that's possible), and while it's stopped, generate some
offsets segments, vacuum --freeze --all, checkpoint and then create a
few more multixacts, then checkpoint again (so that oldestMultiXact is
not equal to nextMultiXact).  Continue.  Now pg_control's
oldestMultiXactId now points at a segment file that didn't exist when
pg_multixacts was copied.  I haven't managed to get this to work (ie
produce a FATAL) and I'm out of time for a little while, but wanted to
share this idea in case it helps someone.

--
Thomas Munro
http://www.enterprisedb.com


Andres Freund wrote:

> I considered for a second whether the solution for that could be to not
> truncate while inconsistent - but I think that doesn't solve anything as
> then we can end up with directories where every single offsets/member
> file exists.

Hang on a minute.  We don't need to scan any files to determine the
truncate point for offsets; we have the valid range for them in
pg_control, as nextMulti + oldestMulti.  And using those end points, we
can look for the offsets corresponding to each, and determine the member
files corresponding to the whole set; it doesn't matter what other files
exist, we just remove them all.  In other words, maybe we can get away
with considering truncation separately for offset and members on
recovery: do it like today for offsets (i.e. at each restartpoint), but
do it only in TrimMultiXact for members.

One argument against this idea is that we may not want to keep a full
set of member files on standbys (due to disk space usage), but that's
what will happen unless we truncate during replay.

> I think at least for 9.5+ we should a) invent proper truncation records
> for pg_multixact b) start storing oldestValidMultiOffset in pg_control.
> The current hack of scanning the directories to get knowledge we should
> have is a pretty bad hack, and we should not continue using it forever.
> I think we might end up needing to do a) even in the backbranches.

Definitely agree with WAL-logging truncations; also +1 on backpatching
that to 9.3.  We already have experience with adding extra WAL records
on minor releases, and it didn't seem to have bitten too hard.

--
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Bruce Momjian wrote:

> I think we need to step back and look at the brain power required to
> unravel the mess we have made regarding multi-xact and fixes.  (I bet
> few people can even remember which multi-xact fixes went into which
> releases --- I can't.)  Instead of working on actual features, we are
> having to do this complex diagnosis because we didn't do a thorough
> analysis at the time a pattern of multi-xact bugs started to appear.
> Many projects deal with this compound bug debt regularly, but we have
> mostly avoided this fate.

Well, it's pretty obvious that if we had had a glimpse of the nature of
the issues back then, we wouldn't have committed the patch.  The number
of ends that we left loose we now know to be huge, but we didn't know
that back then.  (I, at least, certainly didn't.)

Simon told me when this last one showed up that what we need at this
point is a way to turn the whole thing off to stop it from affecting
users anymore.  I would love to be able to do that, because the whole
situation has become stressing, but I don't see a way.  Heck, if we
could implement Heikki's TED idea or something similar, I would be up
for back-patching it so that people can pg_upgrade from postgresql-9.3
to postgresql-ted-9.3, and just forget any further multixact pain.
Don't think that's really doable, though.  As far as I can see, for
branches 9.3 and 9.4 the best we can do is soldier on and get these bugs
fixed, hoping that this time they are really the last [serious] ones.

For 9.5, I concur with Andres that we'd do good to change the way
truncations are done by WAL-logging more stuff and keep more data in
pg_control, to avoid all these nasty games.  And for 9.6, find a better
representation of the data so that the durable data is stored separately
from the volatile data.

--
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


On 2015-05-30 00:52:37 -0300, Alvaro Herrera wrote:
> Andres Freund wrote:
>
> > I considered for a second whether the solution for that could be to not
> > truncate while inconsistent - but I think that doesn't solve anything as
> > then we can end up with directories where every single offsets/member
> > file exists.
>
> Hang on a minute.  We don't need to scan any files to determine the
> truncate point for offsets; we have the valid range for them in
> pg_control, as nextMulti + oldestMulti.  And using those end points, we
> can look for the offsets corresponding to each, and determine the member
> files corresponding to the whole set; it doesn't matter what other files
> exist, we just remove them all.  In other words, maybe we can get away
> with considering truncation separately for offset and members on
> recovery: do it like today for offsets (i.e. at each restartpoint), but
> do it only in TrimMultiXact for members.

Is oldestMulti, nextMulti - 1 really suitable for this? Are both
actually guaranteed to exist in the offsets slru and be valid?  Hm. I
guess you intend to simply truncate everything else, but just in
offsets?

> One argument against this idea is that we may not want to keep a full
> set of member files on standbys (due to disk space usage), but that's
> what will happen unless we truncate during replay.

I think that argument is pretty much the death-knell.=

> > I think at least for 9.5+ we should a) invent proper truncation records
> > for pg_multixact b) start storing oldestValidMultiOffset in pg_control.
> > The current hack of scanning the directories to get knowledge we should
> > have is a pretty bad hack, and we should not continue using it forever.
> > I think we might end up needing to do a) even in the backbranches.
>
> Definitely agree with WAL-logging truncations; also +1 on backpatching
> that to 9.3.  We already have experience with adding extra WAL records
> on minor releases, and it didn't seem to have bitten too hard.

I'm inclined to agree. My only problem is that I'm not sure whether we
can find a way of doing all this without adding a pg_control field. Let
me try to sketch this out:

1) We continue determining the oldest SlruScanDirectory(SlruScanDirCbFindEarliest)
   on the master to find the oldest offsets segment to
   truncate. Alternatively, if we determine it to be safe, we could use
   oldestMulti to find that.
2) SlruScanDirCbRemoveMembers is changed to return the range of members
   to remove, instead of doing itself
3) We wal log [oldest offset segment guaranteed to not be alive,
   nextmulti) for offsets, and [oldest members segment guaranteed to not be alive,
   nextmultioff), and redo truncations for the entire range during
   recovery.

I'm pretty tired right now, but this sounds doable.

Greetings,

Andres Freund


On Sat, May 30, 2015 at 8:55 PM, Andres Freund <andres@anarazel.de> wrote:
> Is oldestMulti, nextMulti - 1 really suitable for this? Are both
> actually guaranteed to exist in the offsets slru and be valid?  Hm. I
> guess you intend to simply truncate everything else, but just in
> offsets?

oldestMulti in theory is the right thing, I think, but in actuality we
know that some people have 1 here instead of the correct value.

>> One argument against this idea is that we may not want to keep a full
>> set of member files on standbys (due to disk space usage), but that's
>> what will happen unless we truncate during replay.
>
> I think that argument is pretty much the death-knell.=

Yes.  Truncating on the standby is really not optional.

>> > I think at least for 9.5+ we should a) invent proper truncation records
>> > for pg_multixact b) start storing oldestValidMultiOffset in pg_control.
>> > The current hack of scanning the directories to get knowledge we should
>> > have is a pretty bad hack, and we should not continue using it forever.
>> > I think we might end up needing to do a) even in the backbranches.
>>
>> Definitely agree with WAL-logging truncations; also +1 on backpatching
>> that to 9.3.  We already have experience with adding extra WAL records
>> on minor releases, and it didn't seem to have bitten too hard.
>
> I'm inclined to agree. My only problem is that I'm not sure whether we
> can find a way of doing all this without adding a pg_control field. Let
> me try to sketch this out:
>
> 1) We continue determining the oldest SlruScanDirectory(SlruScanDirCbFindEarliest)
>    on the master to find the oldest offsets segment to
>    truncate. Alternatively, if we determine it to be safe, we could use
>    oldestMulti to find that.
> 2) SlruScanDirCbRemoveMembers is changed to return the range of members
>    to remove, instead of doing itself
> 3) We wal log [oldest offset segment guaranteed to not be alive,
>    nextmulti) for offsets, and [oldest members segment guaranteed to not be alive,
>    nextmultioff), and redo truncations for the entire range during
>    recovery.
>
> I'm pretty tired right now, but this sounds doable.

I'm probably biased here, but I think we should finish reviewing,
testing, and committing my patch before we embark on designing this.
So far we have no reports of trouble attributable to the lack of the
WAL-logging support discussed here, as opposed to several reports of
trouble from the status quo within days of release.

I'm having trouble reconstructing the series of events where what
you're worried about here really becomes a problem, and I think we
ought to have a very clear idea about that before back-patching
changes of this type.  It's true that if the state of the SLRU
directory is in the future, because recovery was restarted from an
earlier checkpoint, we might replay a checkpoint and remove some of
those files from the future.  But so what?  By the time we've reached
the minimum recovery point, they will have been recreated by the same
WAL records that created them in the first place.  If, in the previous
replay, we had wrapped all the way around, some of the stuff we keep
may actually already have been overwritten by future WAL records, but
they'll be overwritten again.  Now, that could mess up our
determination of which members to remove, I guess, but I'm not clear
that actually matters either: if the offsets space has wrapped around,
the members space will certainly have wrapped around as well, so we
can remove anything we like at this stage and we're still OK.  I agree
this is ugly the way it is, but where is the actual bug?

As far as your actual outline goes, I think if we do this, we need to
be very careful about step #2.  Right now, we decide what we need to
keep and then remove everything else, but that's kind of wonky because
new stuff may be getting created at the same time, so we keep
adjusting our idea of exactly what needs to be removed.  It would be
far better to invert that logic: decide what needs to be removed -
presumably, everything from the oldest member that now exists up until
some later point - and then remove precisely that stuff and nothing
else.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: 9.4.1 -> 9.4.2 problem: could not access status of transaction 1

From
Noah Misch
Date:
Incomplete review, done in a relative rush:

On Fri, May 29, 2015 at 03:08:11PM -0400, Robert Haas wrote:
> OK, here's a patch.  Actually two patches, differing only in
> whitespace, for 9.3 and for master (ha!).  I now think that the root
> of the problem here is that DetermineSafeOldestOffset() and
> SetMultiXactIdLimit() were largely ignorant of the possibility that
> they might be called at points in time when the cluster was
> inconsistent.

A cause perhaps closer to the root is commit f741300 moving truncation from
VACUUM to checkpoints.  CLOG has given us deep experience with VACUUM-time
truncation.  Commit f6a6c46d and this patch are about bringing CHECKPOINT-time
truncation up to the same level.

Speaking of commit f6a6c46d, it seems logical that updating allocation stop
limits should happen proximate to truncation.  That's currently the case for
CLOG (vac_truncate_clog() does both) and pg_multixact/members (checkpoint's
TruncateMultiXact() call does both).  However, pg_multixact/offsets is
truncated from TruncateMultiXact(), but vac_truncate_clog() updates its limit.
I did not distill an errant test case, but this is fishy.

> SetMultiXactIdLimit() bracketed certain parts of its
> logic with if (!InRecovery), but those guards were ineffective because
> it gets called before InRecovery is set in the first place.

SetTransactionIdLimit() checks InRecovery for the same things, and it is
called at nearly the same moments as SetMultiXactIdLimit().  Do you have a
sense of whether it is subject to similar problems as a result?

> 1. Moves the call to DetermineSafeOldestOffset() that appears in
> StartupMultiXact() to TrimMultiXact(), so that we don't try to do this
> until we're consistent.  Also, instead of passing
> MultiXactState->oldestMultiXactId, pass the newer of that value and
> the earliest offset that exists on disk.  That way, it won't try to
> read data that's not there.

Perhaps emit a LOG message when we do that, since it's our last opportunity to
point to the potential data loss?

> +     * PostgreSQL 9.3.0 through 9.3.6 and PostgreSQL 9.4.0 through 9.4.1
> +     * had bugs that could allow users who reached those release through

s/release/releases/

> @@ -2859,6 +2947,14 @@ TruncateMultiXact(void)
>      SimpleLruTruncate(MultiXactOffsetCtl,
>                        MultiXactIdToOffsetPage(oldestMXact));
>
> +    /* Update oldest-on-disk value in shared memory. */
> +    earliest = range.rangeStart * MULTIXACT_OFFSETS_PER_PAGE;
> +    if (earliest < FirstMultiXactId)
> +        earliest = FirstMultiXactId;
> +    LWLockAcquire(MultiXactGenLock, LW_EXCLUSIVE);
> +    Assert(MultiXactState->oldestMultiXactOnDiskValid);
> +    MultiXactState->oldestMultiXactOnDiskValid = earliest;

That last line needs s/Valid//, I presume.  Is it okay that
oldestMultiXactOnDisk becomes too-old during TruncateMultiXact(), despite its
Valid indicator remaining true?

> +static MultiXactOffset
> +GetOldestMultiXactOnDisk(void)
> +{

> +    SlruScanDirectory(MultiXactOffsetCtl, SlruScanDirCbFindEarliest, &trunc);
> +    earliest = trunc.earliestExistingPage * MULTIXACT_OFFSETS_PER_PAGE;
> +    if (earliest < FirstMultiXactId)
> +        earliest = FirstMultiXactId;

SlruScanDirCbFindEarliest() is only meaningful if the files on disk do not
represent a wrapped state.  When the files do represent a wrapped state,
MultiXactIdPrecedes() is not transitive, and the SlruScanDirCbFindEarliest()
result is sensitive to readdir() order.  This code exists specifically to help
us make do in the face of wrong catalog and pg_control entries.  We may have
wrapped as a result of those of same catalog/pg_control entries, so I think
this function ought to account for wrap.  I haven't given enough thought to
what exactly it should do.

Thanks,
nm


Re: 9.4.1 -> 9.4.2 problem: could not access status of transaction 1

From
Noah Misch
Date:
On Fri, May 29, 2015 at 10:37:57AM +1200, Thomas Munro wrote:
> On Fri, May 29, 2015 at 7:56 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> > - There's a third possible problem related to boundary cases in
> > SlruScanDirCbRemoveMembers, but I don't understand that one well
> > enough to explain it.  Maybe Thomas can jump in here and explain the
> > concern.
>
> I noticed something in passing which is probably not harmful, and not
> relevant to this bug report, it was just a bit confusing while
> testing:  SlruScanDirCbRemoveMembers never deletes any files if
> rangeStart == rangeEnd.  In practice, if you have an idle cluster with
> a lot of multixact data and you VACUUM FREEZE all databases and then
> CHECKPOINT, you might be surprised to see no member files going away
> quite yet, but they'll eventually be truncated by a future checkpoint,
> once rangeEnd has had a chance to advance to the next page due to more
> multixacts being created.
>
> If we want to fix this one day, maybe the right thing to do is to
> treat the rangeStart == rangeEnd case the same way we treat rangeStart
> < rangeEnd, that is, to assume that the range of pages isn't
> wrapped/inverted in this case.

I agree.  Because we round rangeStart down to a segment boundary, oldest and
next member offsets falling on the same page typically implies
rangeStart<rangeEnd.  Only when the page they share happens to be the first
page of a segment does one observe rangeStart==RangeEnd.

While testing this (with inconsistent-multixact-fix-master.patch applied,
FWIW), I noticed a nearby bug with a similar symptom.  TruncateMultiXact()
omits the nextMXact==oldestMXact special case found in each other
find_multixact_start() caller, so it reads the offset of a not-yet-created
MultiXactId.  The usual outcome is to get rangeStart==0, so we truncate less
than we could.  This can't make us truncate excessively, because
nextMXact==oldestMXact implies no table contains any mxid.  If nextMXact
happens to be the first of a segment, an error is possible.  Procedure:

1. Make a fresh cluster.
2. UPDATE pg_database SET datallowconn = true
3. Consume precisely 131071 mxids.  Number of offsets per mxid is unimportant.
4. vacuumdb --freeze --all

Expected state after those steps:
$ pg_controldata | grep NextMultiXactId
Latest checkpoint's NextMultiXactId:  131072

Checkpoint will fail like this:
26699 2015-05-31 17:22:33.134 GMT LOG:  statement: checkpoint
26661 2015-05-31 17:22:33.134 GMT DEBUG:  performing replication slot checkpoint
26661 2015-05-31 17:22:33.136 GMT ERROR:  could not access status of transaction 131072
26661 2015-05-31 17:22:33.136 GMT DETAIL:  Could not open file "pg_multixact/offsets/0002": No such file or directory.
26699 2015-05-31 17:22:33.234 GMT ERROR:  checkpoint request failed
26699 2015-05-31 17:22:33.234 GMT HINT:  Consult recent messages in the server log for details.
26699 2015-05-31 17:22:33.234 GMT STATEMENT:  checkpoint

This does not block startup, and creating one mxid hides the problem again.
Thus, it is not a top-priority bug like some other parts of this thread.  I
mention it today mostly so it doesn't surprise hackers testing other fixes.

Thanks,
nm


On 2015-05-31 07:51:59 -0400, Robert Haas wrote:
> > 1) We continue determining the oldest SlruScanDirectory(SlruScanDirCbFindEarliest)
> >    on the master to find the oldest offsets segment to
> >    truncate. Alternatively, if we determine it to be safe, we could use
> >    oldestMulti to find that.
> > 2) SlruScanDirCbRemoveMembers is changed to return the range of members
> >    to remove, instead of doing itself
> > 3) We wal log [oldest offset segment guaranteed to not be alive,
> >    nextmulti) for offsets, and [oldest members segment guaranteed to not be alive,
> >    nextmultioff), and redo truncations for the entire range during
> >    recovery.
> >
> > I'm pretty tired right now, but this sounds doable.
>
> I'm probably biased here, but I think we should finish reviewing,
> testing, and committing my patch before we embark on designing this.

Probably, yes. I am wondering whether doing this immediately won't end
up making some things simpler and more robust though.

> So far we have no reports of trouble attributable to the lack of the
> WAL-logging support discussed here, as opposed to several reports of
> trouble from the status quo within days of release.

The lack of WAL logging actually has caused problems in the 9.3.3 (?)
era, where we didn't do any truncation during recovery...

> By the time we've reached the minimum recovery point, they will have
> been recreated by the same WAL records that created them in the first
> place.

I'm not sure that's true. I think we could end up errorneously removing
files that were included in the base backup. Anyway, let's focus on your
patch for now.

> If, in the previous
> replay, we had wrapped all the way around, some of the stuff we keep
> may actually already have been overwritten by future WAL records, but
> they'll be overwritten again.  Now, that could mess up our
> determination of which members to remove, I guess, but I'm not clear
> that actually matters either: if the offsets space has wrapped around,
> the members space will certainly have wrapped around as well, so we
> can remove anything we like at this stage and we're still OK.  I agree
> this is ugly the way it is, but where is the actual bug?

I'm more worried about the cases where we didn't ever actually "badly
wrap around" (i.e. overwrite needed data); but where that's not clear on
the standby because the base backup isn't in a consistent state.

Greetings,

Andres Freund


Re: 9.4.1 -> 9.4.2 problem: could not access status of transaction 1

From
Robert Haas
Date:
On Mon, Jun 1, 2015 at 12:46 AM, Noah Misch <noah@leadboat.com> wrote:
> Incomplete review, done in a relative rush:

Thanks.

> On Fri, May 29, 2015 at 03:08:11PM -0400, Robert Haas wrote:
>> OK, here's a patch.  Actually two patches, differing only in
>> whitespace, for 9.3 and for master (ha!).  I now think that the root
>> of the problem here is that DetermineSafeOldestOffset() and
>> SetMultiXactIdLimit() were largely ignorant of the possibility that
>> they might be called at points in time when the cluster was
>> inconsistent.
>
> A cause perhaps closer to the root is commit f741300 moving truncation from
> VACUUM to checkpoints.  CLOG has given us deep experience with VACUUM-time
> truncation.  Commit f6a6c46d and this patch are about bringing CHECKPOINT-time
> truncation up to the same level.
>
> Speaking of commit f6a6c46d, it seems logical that updating allocation stop
> limits should happen proximate to truncation.  That's currently the case for
> CLOG (vac_truncate_clog() does both) and pg_multixact/members (checkpoint's
> TruncateMultiXact() call does both).  However, pg_multixact/offsets is
> truncated from TruncateMultiXact(), but vac_truncate_clog() updates its limit.
> I did not distill an errant test case, but this is fishy.

Good point.  Because we confine ourselves to using half the offset
space, it seems much harder for us to get into trouble here than it is
with members. The first scenario that occurred to me is that the SLRU
might actually wrap.  That seems tough, though: between one checkpoint
and the next, vacuum would need to advance oldest_datminmxid by 2^31
MXIDs while generating 2^31 new ones, or something like that.  That
doesn't seem real plausible. But then it occurred to me that it's
probably sufficient to advance the head of the SLRU far enough that
TruncateMultiXact things that the tail is in the future instead of in
the past.  I see no reason why that couldn't happen.  Then we'd end up
leaving some files behind that we should have removed.  I'm not sure
exactly what problem that would cause; would they just get overwritten
on the next pass through the space, or would they cause errors?  I
have not had time to check.

>> SetMultiXactIdLimit() bracketed certain parts of its
>> logic with if (!InRecovery), but those guards were ineffective because
>> it gets called before InRecovery is set in the first place.
>
> SetTransactionIdLimit() checks InRecovery for the same things, and it is
> called at nearly the same moments as SetMultiXactIdLimit().  Do you have a
> sense of whether it is subject to similar problems as a result?

Well, I think it's pretty weird that those things will get done before
beginning recovery, even on an inconsistent cluster, but not during
recovery.  That is pretty strange.  I don't see that it can actually
do any worse than emit a few log messages at the start of recovery
that won't show up again until the end of recovery, though.

>> 1. Moves the call to DetermineSafeOldestOffset() that appears in
>> StartupMultiXact() to TrimMultiXact(), so that we don't try to do this
>> until we're consistent.  Also, instead of passing
>> MultiXactState->oldestMultiXactId, pass the newer of that value and
>> the earliest offset that exists on disk.  That way, it won't try to
>> read data that's not there.
>
> Perhaps emit a LOG message when we do that, since it's our last opportunity to
> point to the potential data loss?

If the problem is just that somebody minmxid got set to 1 instead of
the real value, I think that there is no data loss, because none of
those older values are actually present there.  But we could add a LOG
message anyway.  How do you suggest that we phrase that?

>> +      * PostgreSQL 9.3.0 through 9.3.6 and PostgreSQL 9.4.0 through 9.4.1
>> +      * had bugs that could allow users who reached those release through
>
> s/release/releases/

Fixed.

>> @@ -2859,6 +2947,14 @@ TruncateMultiXact(void)
>>       SimpleLruTruncate(MultiXactOffsetCtl,
>>                                         MultiXactIdToOffsetPage(oldestMXact));
>>
>> +     /* Update oldest-on-disk value in shared memory. */
>> +     earliest = range.rangeStart * MULTIXACT_OFFSETS_PER_PAGE;
>> +     if (earliest < FirstMultiXactId)
>> +             earliest = FirstMultiXactId;
>> +     LWLockAcquire(MultiXactGenLock, LW_EXCLUSIVE);
>> +     Assert(MultiXactState->oldestMultiXactOnDiskValid);
>> +     MultiXactState->oldestMultiXactOnDiskValid = earliest;
>
> That last line needs s/Valid//, I presume.  Is it okay that
> oldestMultiXactOnDisk becomes too-old during TruncateMultiXact(), despite its
> Valid indicator remaining true?

Ay yai yay.  Yes, s/Valid//.

I'm not sure what you mean about it becoming too old.  At least with
that fix, it should get updated to exactly the first file that we
didn't remove.  Isn't that right?

>> +static MultiXactOffset
>> +GetOldestMultiXactOnDisk(void)
>> +{
>
>> +     SlruScanDirectory(MultiXactOffsetCtl, SlruScanDirCbFindEarliest, &trunc);
>> +     earliest = trunc.earliestExistingPage * MULTIXACT_OFFSETS_PER_PAGE;
>> +     if (earliest < FirstMultiXactId)
>> +             earliest = FirstMultiXactId;
>
> SlruScanDirCbFindEarliest() is only meaningful if the files on disk do not
> represent a wrapped state.  When the files do represent a wrapped state,
> MultiXactIdPrecedes() is not transitive, and the SlruScanDirCbFindEarliest()
> result is sensitive to readdir() order.  This code exists specifically to help
> us make do in the face of wrong catalog and pg_control entries.  We may have
> wrapped as a result of those of same catalog/pg_control entries, so I think
> this function ought to account for wrap.  I haven't given enough thought to
> what exactly it should do.

I can see that there might be an issue there, but I can't quite put my
finger on it well enough to say that it definitely is an issue.  This
code is examining the offsets space rather than the members space, and
the protections against offsets wraparound have been there since the
original commit of this feature
(0ac5ad5134f2769ccbaefec73844f8504c4d6182).  To my knowledge we have
no concrete evidence that there's ever been a problem in this area.
It seems like it might be safer to rejigger that code so that it
considers distance-behind-current rather than using the wrapped
comparison logic, but I'm reluctant to start rejiggering more things
without knowing what specifically I'm fixing.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


On Mon, Jun 1, 2015 at 4:58 AM, Andres Freund <andres@anarazel.de> wrote:
>> I'm probably biased here, but I think we should finish reviewing,
>> testing, and committing my patch before we embark on designing this.
>
> Probably, yes. I am wondering whether doing this immediately won't end
> up making some things simpler and more robust though.

I'm open to being convinced of that, but as of this moment I'm not
seeing any clear-cut evidence that we need to go so far.

>> So far we have no reports of trouble attributable to the lack of the
>> WAL-logging support discussed here, as opposed to several reports of
>> trouble from the status quo within days of release.
>
> The lack of WAL logging actually has caused problems in the 9.3.3 (?)
> era, where we didn't do any truncation during recovery...

Right, but now we're piggybacking on the checkpoint records, and I
don't have any evidence that this approach can't be made robust.  It's
possible that it can't be made robust, but that's not currently clear.

>> By the time we've reached the minimum recovery point, they will have
>> been recreated by the same WAL records that created them in the first
>> place.
>
> I'm not sure that's true. I think we could end up errorneously removing
> files that were included in the base backup. Anyway, let's focus on your
> patch for now.

OK, but I am interested in discussing the other thing too.  I just
can't piece together the scenario myself - there may well be one.  The
base backup will begin replay from the checkpoint caused by
pg_start_backup() and remove anything that wasn't there at the start
of the backup.  But all of that stuff should get recreated by the time
we reach the minimum recovery point (end of backup).

>> If, in the previous
>> replay, we had wrapped all the way around, some of the stuff we keep
>> may actually already have been overwritten by future WAL records, but
>> they'll be overwritten again.  Now, that could mess up our
>> determination of which members to remove, I guess, but I'm not clear
>> that actually matters either: if the offsets space has wrapped around,
>> the members space will certainly have wrapped around as well, so we
>> can remove anything we like at this stage and we're still OK.  I agree
>> this is ugly the way it is, but where is the actual bug?
>
> I'm more worried about the cases where we didn't ever actually "badly
> wrap around" (i.e. overwrite needed data); but where that's not clear on
> the standby because the base backup isn't in a consistent state.

I agree. The current patch tries to make it so that we never call
find_multixact_start() while in recovery, but it doesn't quite
succeed: the call in TruncateMultiXact still happens during recovery,
but only once we're sure that the mxact we plan to call it on actually
exists on disk.  That won't be called until we replay the first
checkpoint, but that might still be prior to consistency.

Since I forgot to attach the revised patch with fixes for the points
Noah mentioned to that email, here it is attached to this one.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Attachment
Thomas Munro wrote:

> > - There's a third possible problem related to boundary cases in
> > SlruScanDirCbRemoveMembers, but I don't understand that one well
> > enough to explain it.  Maybe Thomas can jump in here and explain the
> > concern.
>
> I noticed something in passing which is probably not harmful, and not
> relevant to this bug report, it was just a bit confusing while
> testing:

Another thing I noticed in passing is that
SimpleLruDoesPhysicalFileExist uses SLRU_OPEN_FAILED in slru_errcase for
the case where lseek() fails, rather than SLRU_SEEK_FAILED.  This is
probably completely harmless, because that's lseek(..., SEEK_END), but
seems rather odd anyway.

--
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Alvaro Herrera wrote:
> Robert Haas wrote:

> > In the process of investigating this, we found a few other things that
> > seem like they may also be bugs:
> >
> > - As noted upthread, replaying an older checkpoint after a newer
> > checkpoint has already happened may lead to similar problems.  This
> > may be possible when starting from an online base backup; or when
> > restarting a standby that did not perform a restartpoint when
> > replaying the last checkpoint before the shutdown.
>
> I'm going through this one now, as it's closely related and caused
> issues for us.

FWIW I've spent a rather long while trying to reproduce the issue, but
haven't been able to figure out.  Thomas already commented on it: the
server notices that a file is missing and, instead of failing, it "reads
the file as zeroes".  This is because of this hack in slru.c, which is
there to cover for a pg_clog replay consideration:

    /*
     * In a crash-and-restart situation, it's possible for us to receive
     * commands to set the commit status of transactions whose bits are in
     * already-truncated segments of the commit log (see notes in
     * SlruPhysicalWritePage).  Hence, if we are InRecovery, allow the case
     * where the file doesn't exist, and return zeroes instead.
     */
    fd = OpenTransientFile(path, O_RDWR | PG_BINARY, S_IRUSR | S_IWUSR);
    if (fd < 0)
    {
        if (errno != ENOENT || !InRecovery)
        {
            slru_errcause = SLRU_OPEN_FAILED;
            slru_errno = errno;
            return false;
        }

        ereport(LOG,
                (errmsg("file \"%s\" doesn't exist, reading as zeroes",
                        path)));
        MemSet(shared->page_buffer[slotno], 0, BLCKSZ);
        return true;
    }

I was able to cause an actual problem by #ifdefing out the "if
errno/inRcovery" line, of course, but how can I be sure that fixing that
would also fix my customer's problem?  That's no good.

Anyway here's a quick script to almost-reproduce the problem.  I
constructed many variations of this, trying to find the failing one, to
no avail.  Note I'm using the pg_burn_multixact() function to create
multixacts quickly (this is cheating in itself, but it seems to me that
if I'm not able to reproduce the problem with this, I'm not able to do
so without it either.)  This script takes an exclusive backup
(cp -pr) excluding pg_multixact, then does some multixact stuff
(including truncation), then copies pg_multixact.  Evidently I'm missing
some critical element but I can't see what it is.

Another notable variant was to copy pg_multixact first, then do stuff
(create lots of multixacts, then mxact-freeze/checkpoint),
then copy the rest of the data dir.  No joy either: when replay starts,
the missing multixacts are created on the standby and so by the time we
reach the checkpoint record the pg_multixact/offset file has already
grown to the necessary size.

--
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Alvaro Herrera wrote:

> Anyway here's a quick script to almost-reproduce the problem.

Meh.  Really attached now.

I also wanted to post the error messages we got:

2015-05-27 16:15:17 UTC [4782]: [3-1] user=,db= LOG: entering standby mode
2015-05-27 16:15:18 UTC [4782]: [4-1] user=,db= LOG: restored log file "00000001000073DD000000AD" from archive
2015-05-27 16:15:18 UTC [4782]: [5-1] user=,db= FATAL: could not access status of transaction 4624559
2015-05-27 16:15:18 UTC [4782]: [6-1] user=,db= DETAIL: Could not read from file "pg_multixact/offsets/0046" at offset
147456:Success. 
2015-05-27 16:15:18 UTC [4778]: [4-1] user=,db= LOG: startup process (PID 4782) exited with exit code 1
2015-05-27 16:15:18 UTC [4778]: [5-1] user=,db= LOG: aborting startup due to startup process failure

We pg_xlogdumped the offending segment and see that there is a
checkpoint record with oldestMulti=4624559.  Curiously, there are no
other records with rmgr=MultiXact in that segment.

Note that the file exists (otherwise the error would say "could not
open" rather than "could not read"); this is also why errno says
"Success" rather than some actual error; this is the code:

    errno = 0;
    if (read(fd, shared->page_buffer[slotno], BLCKSZ) != BLCKSZ)
    {
        slru_errcause = SLRU_READ_FAILED;
        slru_errno = errno;
        CloseTransientFile(fd);
        return false;
    }

My guess is that the file existed, and perhaps had one or more pages,
but the wanted page doesn't exist, so we tried to read but got 0 bytes
back.  read() returns 0 in this case but doesn't set errno.

I didn't find a way to set things so that the file exists but is of
shorter contents than oldestMulti by the time the checkpoint record is
replayed.

--
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Attachment

Re: 9.4.1 -> 9.4.2 problem: could not access status of transaction 1

From
Noah Misch
Date:
On Mon, Jun 01, 2015 at 02:06:05PM -0400, Robert Haas wrote:
> On Mon, Jun 1, 2015 at 12:46 AM, Noah Misch <noah@leadboat.com> wrote:
> > On Fri, May 29, 2015 at 03:08:11PM -0400, Robert Haas wrote:
> >> SetMultiXactIdLimit() bracketed certain parts of its
> >> logic with if (!InRecovery), but those guards were ineffective because
> >> it gets called before InRecovery is set in the first place.
> >
> > SetTransactionIdLimit() checks InRecovery for the same things, and it is
> > called at nearly the same moments as SetMultiXactIdLimit().  Do you have a
> > sense of whether it is subject to similar problems as a result?
>
> Well, I think it's pretty weird that those things will get done before
> beginning recovery, even on an inconsistent cluster, but not during
> recovery.  That is pretty strange.  I don't see that it can actually
> do any worse than emit a few log messages at the start of recovery
> that won't show up again until the end of recovery, though.

Granted.  Would it be better to update both functions at the same time, and
perhaps to make that a master-only change?  Does the status quo cause more
practical harm via SetMultiXactIdLimit() than via SetTransactionIdLimit()?

> >> 1. Moves the call to DetermineSafeOldestOffset() that appears in
> >> StartupMultiXact() to TrimMultiXact(), so that we don't try to do this
> >> until we're consistent.  Also, instead of passing
> >> MultiXactState->oldestMultiXactId, pass the newer of that value and
> >> the earliest offset that exists on disk.  That way, it won't try to
> >> read data that's not there.
> >
> > Perhaps emit a LOG message when we do that, since it's our last opportunity to
> > point to the potential data loss?
>
> If the problem is just that somebody minmxid got set to 1 instead of
> the real value, I think that there is no data loss, because none of
> those older values are actually present there.  But we could add a LOG
> message anyway.  How do you suggest that we phrase that?

There's no data loss if 1 <= true_minmxid <= 2^31 at the time minmxid got set
to 1.  Otherwise, data loss is possible.  I don't hope for an actionable
message, but we might want a reporter to grep logs for it when we diagnose
future reports.  Perhaps this:

  "missing pg_multixact/members files; begins at MultiXactId %u, expected %u"

For good measure, perhaps emit this when lastCheckpointedOldest > earliest by
more than one segment:

  "excess pg_multixact/members files; begins at MultiXactId %u, expected %u"

> >> @@ -2859,6 +2947,14 @@ TruncateMultiXact(void)
> >>       SimpleLruTruncate(MultiXactOffsetCtl,
> >>                                         MultiXactIdToOffsetPage(oldestMXact));
> >>
> >> +     /* Update oldest-on-disk value in shared memory. */
> >> +     earliest = range.rangeStart * MULTIXACT_OFFSETS_PER_PAGE;
> >> +     if (earliest < FirstMultiXactId)
> >> +             earliest = FirstMultiXactId;
> >> +     LWLockAcquire(MultiXactGenLock, LW_EXCLUSIVE);
> >> +     Assert(MultiXactState->oldestMultiXactOnDiskValid);
> >> +     MultiXactState->oldestMultiXactOnDiskValid = earliest;
> >
> > That last line needs s/Valid//, I presume.  Is it okay that
> > oldestMultiXactOnDisk becomes too-old during TruncateMultiXact(), despite its
> > Valid indicator remaining true?
>
> Ay yai yay.  Yes, s/Valid//.
>
> I'm not sure what you mean about it becoming too old.  At least with
> that fix, it should get updated to exactly the first file that we
> didn't remove.  Isn't that right?

Consider a function raw_GOMXOD() that differs from GetOldestMultiXactOnDisk()
only in that it never reads or writes the cache.  I might expect
oldestMultiXactOnDisk==raw_GOMXOD() if oldestMultiXactOnDiskValid, and that
does hold most of the time.  It does not always hold between the start of the
quoted code's SimpleLruTruncate() and its oldestMultiXactOnDisk assignment.
That's because raw_GOMXOD() changes at the instant we unlink the oldest
segment, but oldestMultiXactOnDisk changes later.  Any backend may call
GetOldestMultiXactOnDisk() via SetMultiXactIdLimit().  If a backend does that
concurrent with the checkpointer running TruncateMultiXact() and sees a stale
oldestMultiXactOnDisk, is that harmless?

> >> +static MultiXactOffset
> >> +GetOldestMultiXactOnDisk(void)
> >> +{
> >
> >> +     SlruScanDirectory(MultiXactOffsetCtl, SlruScanDirCbFindEarliest, &trunc);
> >> +     earliest = trunc.earliestExistingPage * MULTIXACT_OFFSETS_PER_PAGE;
> >> +     if (earliest < FirstMultiXactId)
> >> +             earliest = FirstMultiXactId;
> >
> > SlruScanDirCbFindEarliest() is only meaningful if the files on disk do not
> > represent a wrapped state.  When the files do represent a wrapped state,
> > MultiXactIdPrecedes() is not transitive, and the SlruScanDirCbFindEarliest()
> > result is sensitive to readdir() order.  This code exists specifically to help
> > us make do in the face of wrong catalog and pg_control entries.  We may have
> > wrapped as a result of those of same catalog/pg_control entries, so I think
> > this function ought to account for wrap.  I haven't given enough thought to
> > what exactly it should do.
>
> I can see that there might be an issue there, but I can't quite put my
> finger on it well enough to say that it definitely is an issue.  This
> code is examining the offsets space rather than the members space, and
> the protections against offsets wraparound have been there since the
> original commit of this feature
> (0ac5ad5134f2769ccbaefec73844f8504c4d6182).  To my knowledge we have
> no concrete evidence that there's ever been a problem in this area.
> It seems like it might be safer to rejigger that code so that it
> considers distance-behind-current rather than using the wrapped
> comparison logic, but I'm reluctant to start rejiggering more things
> without knowing what specifically I'm fixing.

Anything that could cause the pg_multixact/offsets tail to rotate from being
in the past to being in the future poses this risk.  (This is the tail from
the perspective of files on disk; pg_control, datminmxid, and MultiXactState
notions of the tail play no part here.)  I had in mind that the pg_upgrade
datminmxid=1 bug could be a tool for achieving that, but I've been
unsuccessful so far at building a credible thought experiment around it.  Near
the beginning of your reply, you surmised that this could happen between a
VACUUM's SetMultiXactIdLimit() and the next checkpoint's TruncateMultiXact().
Another vector is unlink() failure on a segment file.  SlruDeleteSegment()
currently ignores the unlink() return value; the only harm has been some disk
usage.  With GetOldestMultiXactOnDisk() as-proposed, successful unlink() is
mandatory to forestall the appearance of a wrapped state.

Thanks,
nm


On 2015-06-01 14:22:32 -0400, Robert Haas wrote:
> On Mon, Jun 1, 2015 at 4:58 AM, Andres Freund <andres@anarazel.de> wrote:
> > The lack of WAL logging actually has caused problems in the 9.3.3 (?)
> > era, where we didn't do any truncation during recovery...
>
> Right, but now we're piggybacking on the checkpoint records, and I
> don't have any evidence that this approach can't be made robust.  It's
> possible that it can't be made robust, but that's not currently clear.

Well, it's possible that it can be made work without problems. But I
think robust is something different. Having to look at slrus, during
recovery, to find out what to truncate puts more intelligence/complexity
in the recovery path than I'm comfortable with.

> >> By the time we've reached the minimum recovery point, they will have
> >> been recreated by the same WAL records that created them in the first
> >> place.
> >
> > I'm not sure that's true. I think we could end up errorneously removing
> > files that were included in the base backup. Anyway, let's focus on your
> > patch for now.
>
> OK, but I am interested in discussing the other thing too.  I just
> can't piece together the scenario myself - there may well be one.  The
> base backup will begin replay from the checkpoint caused by
> pg_start_backup() and remove anything that wasn't there at the start
> of the backup.  But all of that stuff should get recreated by the time
> we reach the minimum recovery point (end of backup).

I'm not sure if it's reprouceably borked. What about this scenario:
1) pg_start_backup() is called, creates a checkpoint.
2) 2**31 multixacts are created, possibly with several checkpoints
   inbetween
3) pg_multixact is copied
4) basebackup finishes

Unless I'm missing something this will lead to a crash recovery startup
where the first call to TruncateMultiXact() will have
MultiXactState->lastCheckpointedOldest wildly inconsistent with
GetOldestMultiXactOnDisk() return value. Possibly leading to truncation
being skipped errorneously.  Whether that's a problem I'm not yet
entirely sure.

But what *definitely* looks wrong to me is that a TruncateMultiXact() in
this scenario now (since a couple weeks ago) does a
SimpleLruReadPage_ReadOnly() in the members slru via
find_multixact_start(). That just won't work acceptably when we're not
yet consistent. There very well could not be a valid members segment at
that point?  Am I missing something?

> > I'm more worried about the cases where we didn't ever actually "badly
> > wrap around" (i.e. overwrite needed data); but where that's not clear on
> > the standby because the base backup isn't in a consistent state.
>
> I agree. The current patch tries to make it so that we never call
> find_multixact_start() while in recovery, but it doesn't quite
> succeed: the call in TruncateMultiXact still happens during recovery,
> but only once we're sure that the mxact we plan to call it on actually
> exists on disk.  That won't be called until we replay the first
> checkpoint, but that might still be prior to consistency.

It'll pretty much *always* be before we reach consistency, right? It'll
called on the checkpoint created by pg_start_backup()?

I don't think the presence check (that's GetOldestMultiXactOnDisk() in
TruncateMultiXact(), right) helps very much. There's no guarantee at all
that offsets and members are in any way consistent with each other. Or
in themselves for that matter, the copy could very well have been in the
middle of a write the slru page.

I think at the very least we'll have to skip this step while not yet
consistent. That really sucks, because we'll possibly end up with
multixacts that are completely filled by the time we've reached
consistency.

Greetings,

Andres Freund


Re: 9.4.1 -> 9.4.2 problem: could not access status of transaction 1

From
Robert Haas
Date:
On Tue, Jun 2, 2015 at 1:21 AM, Noah Misch <noah@leadboat.com> wrote:
> On Mon, Jun 01, 2015 at 02:06:05PM -0400, Robert Haas wrote:
>> On Mon, Jun 1, 2015 at 12:46 AM, Noah Misch <noah@leadboat.com> wrote:
>> > On Fri, May 29, 2015 at 03:08:11PM -0400, Robert Haas wrote:
>> >> SetMultiXactIdLimit() bracketed certain parts of its
>> >> logic with if (!InRecovery), but those guards were ineffective because
>> >> it gets called before InRecovery is set in the first place.
>> >
>> > SetTransactionIdLimit() checks InRecovery for the same things, and it is
>> > called at nearly the same moments as SetMultiXactIdLimit().  Do you have a
>> > sense of whether it is subject to similar problems as a result?
>>
>> Well, I think it's pretty weird that those things will get done before
>> beginning recovery, even on an inconsistent cluster, but not during
>> recovery.  That is pretty strange.  I don't see that it can actually
>> do any worse than emit a few log messages at the start of recovery
>> that won't show up again until the end of recovery, though.
>
> Granted.  Would it be better to update both functions at the same time, and
> perhaps to make that a master-only change?  Does the status quo cause more
> practical harm via SetMultiXactIdLimit() than via SetTransactionIdLimit()?

It does in the case of the call to find_multixact_start().  If that
fails, we take the server down for no good reason, as demonstrated by
the original report. I'll revert the changes to the other two things
in this function that I changed to be protected by did_trim.  There's
no special reason to think that's a necessary change.

>> >> 1. Moves the call to DetermineSafeOldestOffset() that appears in
>> >> StartupMultiXact() to TrimMultiXact(), so that we don't try to do this
>> >> until we're consistent.  Also, instead of passing
>> >> MultiXactState->oldestMultiXactId, pass the newer of that value and
>> >> the earliest offset that exists on disk.  That way, it won't try to
>> >> read data that's not there.
>> >
>> > Perhaps emit a LOG message when we do that, since it's our last opportunity to
>> > point to the potential data loss?
>>
>> If the problem is just that somebody minmxid got set to 1 instead of
>> the real value, I think that there is no data loss, because none of
>> those older values are actually present there.  But we could add a LOG
>> message anyway.  How do you suggest that we phrase that?
>
> There's no data loss if 1 <= true_minmxid <= 2^31 at the time minmxid got set
> to 1.  Otherwise, data loss is possible.

Yes, but in that scenario, the log message you propose wouldn't be
triggered.  If true_minmxid > 2^31, then the stored minmxid will not
precede the files on disk; it will follow it (assuming the older stuff
hasn't been truncated, as is likely).  So the message would be
essentially:

LOG: you didn't lose data.  but if exactly the opposite of what this
message is telling you about had happened, then you would have.
DETAIL: Have a nice day.

> I don't hope for an actionable
> message, but we might want a reporter to grep logs for it when we diagnose
> future reports.  Perhaps this:
>
>   "missing pg_multixact/members files; begins at MultiXactId %u, expected %u"

This seems misleading.  In the known failure case, it's not that the
pg_multixact files are unexpectedly missing; it's that we incorrectly
think that they should still be there.  Maybe:

oldest MultiXactId on disk %u follows expected oldest MultiXact %u

> For good measure, perhaps emit this when lastCheckpointedOldest > earliest by
> more than one segment:
>
>   "excess pg_multixact/members files; begins at MultiXactId %u, expected %u"

So, this scenario will happen whenever the system was interrupted in
the middle of a truncation, or when the system is started from a base
backup and a truncation happened after files were copied.  I'm wary of
giving users the idea that this is an atypical event.  Perhaps a
message at DEBUG1?

>> I'm not sure what you mean about it becoming too old.  At least with
>> that fix, it should get updated to exactly the first file that we
>> didn't remove.  Isn't that right?
>
> Consider a function raw_GOMXOD() that differs from GetOldestMultiXactOnDisk()
> only in that it never reads or writes the cache.  I might expect
> oldestMultiXactOnDisk==raw_GOMXOD() if oldestMultiXactOnDiskValid, and that
> does hold most of the time.  It does not always hold between the start of the
> quoted code's SimpleLruTruncate() and its oldestMultiXactOnDisk assignment.
> That's because raw_GOMXOD() changes at the instant we unlink the oldest
> segment, but oldestMultiXactOnDisk changes later.  Any backend may call
> GetOldestMultiXactOnDisk() via SetMultiXactIdLimit().  If a backend does that
> concurrent with the checkpointer running TruncateMultiXact() and sees a stale
> oldestMultiXactOnDisk, is that harmless?

As far as I can tell, it's pretty much harmless.  I mean, we've
already discussed the risk that the head and tail could get too far
apart, because really it should be TruncateMultiXact(), not
SetMultiXactIdLimit(), that establishes the new stop point.  But that
problem exists independent of what you're talking about here.

>> I can see that there might be an issue there, but I can't quite put my
>> finger on it well enough to say that it definitely is an issue.  This
>> code is examining the offsets space rather than the members space, and
>> the protections against offsets wraparound have been there since the
>> original commit of this feature
>> (0ac5ad5134f2769ccbaefec73844f8504c4d6182).  To my knowledge we have
>> no concrete evidence that there's ever been a problem in this area.
>> It seems like it might be safer to rejigger that code so that it
>> considers distance-behind-current rather than using the wrapped
>> comparison logic, but I'm reluctant to start rejiggering more things
>> without knowing what specifically I'm fixing.
>
> Anything that could cause the pg_multixact/offsets tail to rotate from being
> in the past to being in the future poses this risk.  (This is the tail from
> the perspective of files on disk; pg_control, datminmxid, and MultiXactState
> notions of the tail play no part here.)  I had in mind that the pg_upgrade
> datminmxid=1 bug could be a tool for achieving that, but I've been
> unsuccessful so far at building a credible thought experiment around it.  Near
> the beginning of your reply, you surmised that this could happen between a
> VACUUM's SetMultiXactIdLimit() and the next checkpoint's TruncateMultiXact().
> Another vector is unlink() failure on a segment file.  SlruDeleteSegment()
> currently ignores the unlink() return value; the only harm has been some disk
> usage.  With GetOldestMultiXactOnDisk() as-proposed, successful unlink() is
> mandatory to forestall the appearance of a wrapped state.

I'm having trouble figuring out what to do about this.  I mean, the
essential principle of this patch is that if we can't count on
relminmxid, datminmxid, or the control file to be accurate, we can at
least look at what is present on the disk.  If we also cannot count on
that to be accurate, we are left without any reliable source of
information.  Consider a hypothetical cluster where all our stored
minmxids of whatever form are corrupted (say, all change to 1) and in
addition there are stray files in pg_multixact.  I don't think there's
really any way to get ourselves out of trouble in that scenario.

There are some things that we could do that might help a little.  For
example, if the control file's oldest-multi value points to a file
that is not on disk, we could advance it a file at a time until it
lands on a file that is present.  This would help if the oldest-multi
value is pointing to the future, but everything on disk is in the
past.

Another idea is that we could rely on the "on disk" value only until
MultiXactState->lastCheckpointedOldest actually points to a real file.
After we get out of this kind of trouble, we should never get back
into it.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


On 2015-06-02 11:16:22 -0400, Robert Haas wrote:
> I'm having trouble figuring out what to do about this.  I mean, the
> essential principle of this patch is that if we can't count on
> relminmxid, datminmxid, or the control file to be accurate, we can at
> least look at what is present on the disk.  If we also cannot count on
> that to be accurate, we are left without any reliable source of
> information.  Consider a hypothetical cluster where all our stored
> minmxids of whatever form are corrupted (say, all change to 1) and in
> addition there are stray files in pg_multixact.  I don't think there's
> really any way to get ourselves out of trouble in that scenario.

If we were to truncate after vacuum, and only on the primary (via WAL
logging), we could, afaics, just rely on all the values to be
recomputed. I mean relminmxid will be recomputed after a vacuum, and
thus, after some time, will datminmxid and the control file value.  We
could just force a value of 1 to always trigger anti-wraparound vacuums
(or wait for that to happen implicitly, to delay the impact?). That'll
then should then fix the problem in a relatively short amount of time?


On Tue, Jun 2, 2015 at 8:56 AM, Andres Freund <andres@anarazel.de> wrote:
> But what *definitely* looks wrong to me is that a TruncateMultiXact() in
> this scenario now (since a couple weeks ago) does a
> SimpleLruReadPage_ReadOnly() in the members slru via
> find_multixact_start(). That just won't work acceptably when we're not
> yet consistent. There very well could not be a valid members segment at
> that point?  Am I missing something?

Yes: that code isn't new.

TruncateMultiXact() called SimpleLruReadPage_ReadOnly() directly in
9.3.0 and every subsequent release until 9.3.7/9.4.2.  The only thing
that's changed is that we've moved that logic into a function called
find_multixact_start() instead of having it directly inside that
function.  We did that because we needed to use the same logic in some
other places.  The reason why 9.3.7/9.4.2 are causing problems for
people that they didn't have previously is because those new,
additional call sites were poorly chosen and didn't include adequate
protection against calling that function with an invalid input value.
What this patch is about is getting back to the situation that we were
in from 9.3.0 - 9.3.6 and 9.4.0 - 9.4.1, where TruncateMultiXact() did
the thing that you're complaining about here but no one else did.

From my point of view, I think that you are absolutely right to
question what's going on in TruncateMultiXact().  It's kooky, and
there may well be bugs buried there.  But I don't think fixing that
should be the priority right now, because we have zero reports of
problems attributable to that logic.  I think the priority should be
on undoing the damage that we did in 9.3.7/9.4.2, when we made other
places to do the same thing.  We started getting trouble reports
attributable to those changes *almost immediately*, which means that
whether or not TruncateMultiXact() is broken, these new call sites
definitely are.  I think we really need to fix those new places ASAP.

> I think at the very least we'll have to skip this step while not yet
> consistent. That really sucks, because we'll possibly end up with
> multixacts that are completely filled by the time we've reached
> consistency.

That would be a departure from the behavior of every existing release
that includes this code based on, to my knowledge, zero trouble
reports.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


On 2015-06-02 11:29:24 -0400, Robert Haas wrote:
> On Tue, Jun 2, 2015 at 8:56 AM, Andres Freund <andres@anarazel.de> wrote:
> > But what *definitely* looks wrong to me is that a TruncateMultiXact() in
> > this scenario now (since a couple weeks ago) does a
> > SimpleLruReadPage_ReadOnly() in the members slru via
> > find_multixact_start(). That just won't work acceptably when we're not
> > yet consistent. There very well could not be a valid members segment at
> > that point?  Am I missing something?
>
> Yes: that code isn't new.

Good point.

> TruncateMultiXact() called SimpleLruReadPage_ReadOnly() directly in
> 9.3.0 and every subsequent release until 9.3.7/9.4.2.

But back then TruncateMultiXact() wasn't called during recovery. But
you're right in that it didn't seem to have reproduced attributable
bugreprorts since 9.3.2 where vacuuming during recovery was
introduced. So it indeed doesn't seem as urgent as fixing the new
callsites.

> That would be a departure from the behavior of every existing release
> that includes this code based on, to my knowledge, zero trouble
> reports.

On the other hand we're now at about bug #5 attributeable to the odd way
truncation works for multixacts. It's obviously complex and hard to get
right. It makes it harder to cope with the wrong values left in
datminxid etc. So I'm still wondering whether fixing this for good isn't
the better approach.

Greetings,

Andres Freund


On Tue, Jun 2, 2015 at 11:27 AM, Andres Freund <andres@anarazel.de> wrote:
> On 2015-06-02 11:16:22 -0400, Robert Haas wrote:
>> I'm having trouble figuring out what to do about this.  I mean, the
>> essential principle of this patch is that if we can't count on
>> relminmxid, datminmxid, or the control file to be accurate, we can at
>> least look at what is present on the disk.  If we also cannot count on
>> that to be accurate, we are left without any reliable source of
>> information.  Consider a hypothetical cluster where all our stored
>> minmxids of whatever form are corrupted (say, all change to 1) and in
>> addition there are stray files in pg_multixact.  I don't think there's
>> really any way to get ourselves out of trouble in that scenario.
>
> If we were to truncate after vacuum, and only on the primary (via WAL
> logging), we could, afaics, just rely on all the values to be
> recomputed. I mean relminmxid will be recomputed after a vacuum, and
> thus, after some time, will datminmxid and the control file value.  We
> could just force a value of 1 to always trigger anti-wraparound vacuums
> (or wait for that to happen implicitly, to delay the impact?). That'll
> then should then fix the problem in a relatively short amount of time?

The exact circumstances under which we're willing to replace a
relminmxid with a newly-computed one that differs are not altogether
clear to me, but there's an "if" statement protecting that logic, so
there are some circumstances in which we'll leave the existing value
intact.  If we force non-stop vacuuming in that scenario, autovacuum
will just run like crazy without accomplishing anything, which
wouldn't be good.  It would similarly do so when the oldest MXID
reference in the relation is in fact 1, but that value can't be
vacuumed away yet.

Also, the database might be really big.  Even if it were true that a
full scan of every table would get us out of this state, describing
the time that it would take to do that as "relatively short" seems to
me to be considerably understating the impact of a full-cluster
VACUUM.

With regard to the more general question of WAL-logging this, are you
going to work on that?  Are you hoping Alvaro or I will work on that?
Should we draw straws?  It seems like somebody needs to do it.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


On 2015-06-02 11:37:02 -0400, Robert Haas wrote:
> The exact circumstances under which we're willing to replace a
> relminmxid with a newly-computed one that differs are not altogether
> clear to me, but there's an "if" statement protecting that logic, so
> there are some circumstances in which we'll leave the existing value
> intact.

I guess we'd have to change that then.

> It would similarly do so when the oldest MXID reference in the
> relation is in fact 1, but that value can't be vacuumed away yet.

I'd thought of just forcing consumption of one multixact in that
case. Not pretty, but imo acceptable.

> Also, the database might be really big.  Even if it were true that a
> full scan of every table would get us out of this state, describing
> the time that it would take to do that as "relatively short" seems to
> me to be considerably understating the impact of a full-cluster
> VACUUM.

Well. We're dealing with a corrupted cluster. Having a way out that's
done automatically, even if it takes a long while, is pretty good
already. In many cases the corruption (i.e. pg_upgrade) happened long
ago, so the table's relminmxid will already have been recomputed.  I
think that's acceptable.

> With regard to the more general question of WAL-logging this, are you
> going to work on that?  Are you hoping Alvaro or I will work on that?
> Should we draw straws?  It seems like somebody needs to do it.

I'm willing to invest the time to develop an initial version, but I'll
need help evaluating it. I don't have many testing resources available
atm, and I'm not going to trust stuff I developed while travelling by
just looking at the code.

Greetings,

Andres Freund


On Tue, Jun 2, 2015 at 11:36 AM, Andres Freund <andres@anarazel.de> wrote:
>> That would be a departure from the behavior of every existing release
>> that includes this code based on, to my knowledge, zero trouble
>> reports.
>
> On the other hand we're now at about bug #5 attributeable to the odd way
> truncation works for multixacts. It's obviously complex and hard to get
> right. It makes it harder to cope with the wrong values left in
> datminxid etc. So I'm still wondering whether fixing this for good isn't
> the better approach.

It may well be.  But I think we should do something more surgical
first.  Perhaps we can justify the pain and risk of making changes to
the WAL format in the back-branches, but let's not do it in a rush.
If we can get this patch to a state where it undoes the damage
inflicted in 9.3.7/9.4.2, then we will be in a state where we have as
much reliability as we had in 9.3.6 plus the protections against
member-space wraparound added in 9.3.7 - which, like the patch I'm
proposing now, were directly motivated by multiple, independent bug
reports.  That seems like a good place to get to.  If nothing else, it
will buy us some time to figure out what else we want to do.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


On Tue, Jun 2, 2015 at 11:44 AM, Andres Freund <andres@anarazel.de> wrote:
> On 2015-06-02 11:37:02 -0400, Robert Haas wrote:
>> The exact circumstances under which we're willing to replace a
>> relminmxid with a newly-computed one that differs are not altogether
>> clear to me, but there's an "if" statement protecting that logic, so
>> there are some circumstances in which we'll leave the existing value
>> intact.
>
> I guess we'd have to change that then.

Yeah, but first we'd need to assess why it's like that.  Tom was the
one who installed the current logic, but I haven't been able to fully
understand it.

>> It would similarly do so when the oldest MXID reference in the
>> relation is in fact 1, but that value can't be vacuumed away yet.
>
> I'd thought of just forcing consumption of one multixact in that
> case. Not pretty, but imo acceptable.

What if multixact 1 still has living members?

>> Also, the database might be really big.  Even if it were true that a
>> full scan of every table would get us out of this state, describing
>> the time that it would take to do that as "relatively short" seems to
>> me to be considerably understating the impact of a full-cluster
>> VACUUM.
>
> Well. We're dealing with a corrupted cluster. Having a way out that's
> done automatically, even if it takes a long while, is pretty good
> already. In many cases the corruption (i.e. pg_upgrade) happened long
> ago, so the table's relminmxid will already have been recomputed.  I
> think that's acceptable.

I'm a long way from being convinced the automated recovery is
possible.  There are so many different scenarios here that it's very
difficult to reason generally about what the "right" thing to do is.
I agree it would be nice if we had it, though.

>> With regard to the more general question of WAL-logging this, are you
>> going to work on that?  Are you hoping Alvaro or I will work on that?
>> Should we draw straws?  It seems like somebody needs to do it.
>
> I'm willing to invest the time to develop an initial version, but I'll
> need help evaluating it. I don't have many testing resources available
> atm, and I'm not going to trust stuff I developed while travelling by
> just looking at the code.

I'm willing to help with that.  Hopefully I'm not the only one, though.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: 9.4.1 -> 9.4.2 problem: could not access status of transaction 1

From
Noah Misch
Date:
On Tue, Jun 02, 2015 at 11:16:22AM -0400, Robert Haas wrote:
> On Tue, Jun 2, 2015 at 1:21 AM, Noah Misch <noah@leadboat.com> wrote:
> > On Mon, Jun 01, 2015 at 02:06:05PM -0400, Robert Haas wrote:

> > Granted.  Would it be better to update both functions at the same time, and
> > perhaps to make that a master-only change?  Does the status quo cause more
> > practical harm via SetMultiXactIdLimit() than via SetTransactionIdLimit()?
>
> It does in the case of the call to find_multixact_start().  If that
> fails, we take the server down for no good reason, as demonstrated by
> the original report. I'll revert the changes to the other two things
> in this function that I changed to be protected by did_trim.

Sounds good.

> > There's no data loss if 1 <= true_minmxid <= 2^31 at the time minmxid got set
> > to 1.  Otherwise, data loss is possible.
>
> Yes, but in that scenario, the log message you propose wouldn't be
> triggered.  If true_minmxid > 2^31, then the stored minmxid will not
> precede the files on disk; it will follow it (assuming the older stuff
> hasn't been truncated, as is likely).

Ah, quite right.

> >   "missing pg_multixact/members files; begins at MultiXactId %u, expected %u"
>
> This seems misleading.  In the known failure case, it's not that the
> pg_multixact files are unexpectedly missing; it's that we incorrectly
> think that they should still be there.  Maybe:
>
> oldest MultiXactId on disk %u follows expected oldest MultiXact %u

Your wording is better.

> > For good measure, perhaps emit this when lastCheckpointedOldest > earliest by
> > more than one segment:
> >
> >   "excess pg_multixact/members files; begins at MultiXactId %u, expected %u"
>
> So, this scenario will happen whenever the system was interrupted in
> the middle of a truncation, or when the system is started from a base
> backup and a truncation happened after files were copied.  I'm wary of
> giving users the idea that this is an atypical event.  Perhaps a
> message at DEBUG1?

DEBUG1 works for me, or feel free to leave it out.

> >> I can see that there might be an issue there, but I can't quite put my
> >> finger on it well enough to say that it definitely is an issue.  This
> >> code is examining the offsets space rather than the members space, and
> >> the protections against offsets wraparound have been there since the
> >> original commit of this feature
> >> (0ac5ad5134f2769ccbaefec73844f8504c4d6182).  To my knowledge we have
> >> no concrete evidence that there's ever been a problem in this area.
> >> It seems like it might be safer to rejigger that code so that it
> >> considers distance-behind-current rather than using the wrapped
> >> comparison logic, but I'm reluctant to start rejiggering more things
> >> without knowing what specifically I'm fixing.
> >
> > Anything that could cause the pg_multixact/offsets tail to rotate from being
> > in the past to being in the future poses this risk.  (This is the tail from
> > the perspective of files on disk; pg_control, datminmxid, and MultiXactState
> > notions of the tail play no part here.)  I had in mind that the pg_upgrade
> > datminmxid=1 bug could be a tool for achieving that, but I've been
> > unsuccessful so far at building a credible thought experiment around it.  Near
> > the beginning of your reply, you surmised that this could happen between a
> > VACUUM's SetMultiXactIdLimit() and the next checkpoint's TruncateMultiXact().
> > Another vector is unlink() failure on a segment file.  SlruDeleteSegment()
> > currently ignores the unlink() return value; the only harm has been some disk
> > usage.  With GetOldestMultiXactOnDisk() as-proposed, successful unlink() is
> > mandatory to forestall the appearance of a wrapped state.
>
> I'm having trouble figuring out what to do about this.  I mean, the
> essential principle of this patch is that if we can't count on
> relminmxid, datminmxid, or the control file to be accurate, we can at
> least look at what is present on the disk.  If we also cannot count on
> that to be accurate, we are left without any reliable source of
> information.  Consider a hypothetical cluster where all our stored
> minmxids of whatever form are corrupted (say, all change to 1) and in
> addition there are stray files in pg_multixact.  I don't think there's
> really any way to get ourselves out of trouble in that scenario.

We could notice the problem and halt.  You mentioned above the possibility to
have SlruScanDirCbFindEarliest() check "distance-behind-current".  Suppose it
used the current oldestMulti from pg_control as a reference point and
discovered the multixact on disk (a) most far behind that reference point and
(b) most far ahead of that reference point.  If MultiXactIdPrecedes(a, b),
we're good; adopt (a) as the new datminmxid.  Account for the possibility that
the space isn't really wrapped, but out reference point was totally wrong.
Once we determine that the space is wrapped, bail.

As you discuss downthread, the mxids actually present in t_max define the ones
we need.  We could scan them all and set authoritative datminmxid and
pg_control entries.

> There are some things that we could do that might help a little.  For
> example, if the control file's oldest-multi value points to a file
> that is not on disk, we could advance it a file at a time until it
> lands on a file that is present.  This would help if the oldest-multi
> value is pointing to the future, but everything on disk is in the
> past.

That's worth examining further.

> Another idea is that we could rely on the "on disk" value only until
> MultiXactState->lastCheckpointedOldest actually points to a real file.

When wrapping confuses the "on disk" value, I think any reliance on that value
is a dead end.

> After we get out of this kind of trouble, we should never get back
> into it.

Oh, we excel at that.


On 2015-06-02 11:49:56 -0400, Robert Haas wrote:
> On Tue, Jun 2, 2015 at 11:44 AM, Andres Freund <andres@anarazel.de> wrote:
> > On 2015-06-02 11:37:02 -0400, Robert Haas wrote:
> >> The exact circumstances under which we're willing to replace a
> >> relminmxid with a newly-computed one that differs are not altogether
> >> clear to me, but there's an "if" statement protecting that logic, so
> >> there are some circumstances in which we'll leave the existing value
> >> intact.
> >
> > I guess we'd have to change that then.
>
> Yeah, but first we'd need to assess why it's like that.  Tom was the
> one who installed the current logic, but I haven't been able to fully
> understand it.

We're talking about:
    /* Similarly for relminmxid */
    if (MultiXactIdIsValid(minmulti) &&
        pgcform->relminmxid != minmulti &&
        (MultiXactIdPrecedes(pgcform->relminmxid, minmulti) ||
         MultiXactIdPrecedes(ReadNextMultiXactId(), pgcform->relminmxid)))
    {
        pgcform->relminmxid = minmulti;
        dirty = true;
    }

right? Before that change (78db307bb/87f830e0ce) we only updated
relminmxid if the new value was newer than the old one. That's to avoid
values from going backwards, e.g. when a relation is first VACUUM
FREEZEd and then a normal VACUUM is performed (these values are
unfortunately based on the cutoff values instead of the observed
minima). The new thing is the || MultiXactIdPrecedes(ReadNextMultiXactId(), pgcform->relminmxid)
which is there to recognize values from the future. E.g. the 1
errorneously left in place by pg_upgrade.

Makes sense?

> >> It would similarly do so when the oldest MXID reference in the
> >> relation is in fact 1, but that value can't be vacuumed away yet.
> >
> > I'd thought of just forcing consumption of one multixact in that
> > case. Not pretty, but imo acceptable.
>
> What if multixact 1 still has living members?

I think we should just trigger the logic if 1 is below the multi
freezing horizon - in which case a) the 1 will automatically be
replaced, because the horizon is newer b) it can't have a living member
anyway.

- Andres


On 2015-06-01 14:22:32 -0400, Robert Haas wrote:

> commit d33b4eb0167f465edb00bd6c0e1bcaa67dd69fe9
> Author: Robert Haas <rhaas@postgresql.org>
> Date:   Fri May 29 14:35:53 2015 -0400
>
>     foo

Hehe!

> diff --git a/src/backend/access/transam/multixact.c b/src/backend/access/transam/multixact.c
> index 9568ff1..aca829d 100644
> --- a/src/backend/access/transam/multixact.c
> +++ b/src/backend/access/transam/multixact.c
> @@ -199,8 +199,9 @@ typedef struct MultiXactStateData
>      MultiXactOffset nextOffset;
>
>      /*
> -     * Oldest multixact that is still on disk.  Anything older than this
> -     * should not be consulted.  These values are updated by vacuum.
> +     * Oldest multixact that may still be referenced from a relation.
> +     * Anything older than this should not be consulted.  These values are
> +     * updated by vacuum.
>       */
>      MultiXactId oldestMultiXactId;
>      Oid            oldestMultiXactDB;
> @@ -213,6 +214,18 @@ typedef struct MultiXactStateData
>       */
>      MultiXactId lastCheckpointedOldest;
>
> +    /*
> +     * This is the oldest file that actually exist on the disk.  This value
> +     * is initialized by scanning pg_multixact/offsets, and subsequently
> +     * updated each time we complete a truncation.  We need a flag to
> +     * indicate whether this has been initialized yet.
> +     */
> +    MultiXactId oldestMultiXactOnDisk;
> +    bool        oldestMultiXactOnDiskValid;
> +
> +    /* Has TrimMultiXact been called yet? */
> +    bool        didTrimMultiXact;

I'm not really convinced tying things closer to having done trimming is
easier to understand than tying things to recovery having finished.

E.g.
    if (did_trim)
        oldestOffset = GetOldestReferencedOffset(oldest_datminmxid);
imo is harder to understand than if (!InRecovery).

Maybe we could just name it finishedStartup and rename the functions accordingly?

> @@ -1956,12 +1971,6 @@ StartupMultiXact(void)
>       */
>      pageno = MXOffsetToMemberPage(offset);
>      MultiXactMemberCtl->shared->latest_page_number = pageno;
> -
> -    /*
> -     * compute the oldest member we need to keep around to avoid old member
> -     * data overrun.
> -     */
> -    DetermineSafeOldestOffset(MultiXactState->oldestMultiXactId);
>  }

Maybe it's worthwhile to add a 'NB: At this stage the data directory is
not yet necessarily consistent' StartupMultiXact's comments, to avoid
reintroducing problems like this?

>      /*
> +     * We can read this without a lock, because it only changes when nothing
> +     * else is running.
> +     */
> +    did_trim = MultiXactState->didTrimMultiXact;

Err, Hot Standby? It might be ok to not lock, but the comment is
definitely wrong. I'm inclined to simply use locking, this doesn't look
sufficiently critical performancewise.

> +static MultiXactOffset
> +GetOldestReferencedOffset(MultiXactId oldestMXact)
> +{
> +    MultiXactId        earliest;
> +    MultiXactOffset oldestOffset;
> +
> +    /*
> +     * Because of bugs in early 9.3.X and 9.4.X releases (see comments in
> +     * TrimMultiXact for details), oldest_datminmxid might point to a
> +     * nonexistent multixact.  If so, use the oldest one that actually
> +     * exists.  Anything before this can't be successfully used anyway.
> +     */
> +    earliest = GetOldestMultiXactOnDisk();
> +    if (MultiXactIdPrecedes(oldestMXact, earliest))
> +        oldestMXact = earliest;

Hm. If GetOldestMultiXactOnDisk() gets the starting point by scanning
the disk it'll always get one at a segment boundary, right? I'm not sure
that's actually ok; because the value at the beginning of the segment
can very well end up being a 0, as MaybeExtendOffsetSlru() will have
filled the page with zeros.

I think that should be harmless, the worst that can happen is that
oldestOffset errorneously is 0, which should be correct, even though
possibly overly conservative, in these cases.

Greetings,

Andres Freund


On Tue, Jun 2, 2015 at 4:19 PM, Andres Freund <andres@anarazel.de> wrote:
> I'm not really convinced tying things closer to having done trimming is
> easier to understand than tying things to recovery having finished.
>
> E.g.
>         if (did_trim)
>                 oldestOffset = GetOldestReferencedOffset(oldest_datminmxid);
> imo is harder to understand than if (!InRecovery).
>
> Maybe we could just name it finishedStartup and rename the functions accordingly?

Basing that particular call site on InRecovery doesn't work, because
InRecovery isn't set early enough.  But I'm fine to rename it to
whatever.

> Maybe it's worthwhile to add a 'NB: At this stage the data directory is
> not yet necessarily consistent' StartupMultiXact's comments, to avoid
> reintroducing problems like this?

Sure.

>>       /*
>> +      * We can read this without a lock, because it only changes when nothing
>> +      * else is running.
>> +      */
>> +     did_trim = MultiXactState->didTrimMultiXact;
>
> Err, Hot Standby? It might be ok to not lock, but the comment is
> definitely wrong. I'm inclined to simply use locking, this doesn't look
> sufficiently critical performancewise.

/me nods.  Good point.

> Hm. If GetOldestMultiXactOnDisk() gets the starting point by scanning
> the disk it'll always get one at a segment boundary, right? I'm not sure
> that's actually ok; because the value at the beginning of the segment
> can very well end up being a 0, as MaybeExtendOffsetSlru() will have
> filled the page with zeros.
>
> I think that should be harmless, the worst that can happen is that
> oldestOffset errorneously is 0, which should be correct, even though
> possibly overly conservative, in these cases.

Uh oh.  That seems like a real bad problem for this approach.  What
keeps that from being the opposite of too conservative?  There's no
"safe" value in a circular numbering space.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


> > Hm. If GetOldestMultiXactOnDisk() gets the starting point by scanning
> > the disk it'll always get one at a segment boundary, right? I'm not sure
> > that's actually ok; because the value at the beginning of the segment
> > can very well end up being a 0, as MaybeExtendOffsetSlru() will have
> > filled the page with zeros.
> >
> > I think that should be harmless, the worst that can happen is that
> > oldestOffset errorneously is 0, which should be correct, even though
> > possibly overly conservative, in these cases.
>
> Uh oh.  That seems like a real bad problem for this approach.  What
> keeps that from being the opposite of too conservative?  There's no
> "safe" value in a circular numbering space.

I think it *might* (I'm really jetlagged) be fine because that'll only
happen after a upgrade from < 9.3. And in that case we initialize
nextOffset to 0. That ought to safe us?

Greetings,

Andres Freund


On Tue, Jun 2, 2015 at 9:30 AM, Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
> My guess is that the file existed, and perhaps had one or more pages,
> but the wanted page doesn't exist, so we tried to read but got 0 bytes
> back.  read() returns 0 in this case but doesn't set errno.
>
> I didn't find a way to set things so that the file exists but is of
> shorter contents than oldestMulti by the time the checkpoint record is
> replayed.

I'm just starting to learn about the recovery machinery, so forgive me
if I'm missing something basic here, but I just don't get this.  As I
understand it, offsets/0046 should either have been copied with that
page present in it if it existed before the backup started (apparently
not in this case), or extended to contain it by WAL records that come
after the backup label but before the checkpoint record that
references it (also apparently not in this case).  If neither of these
things happened then that is completely different from the
segment-does-not-exist case where we read zeroes if in recovery on the
assumption that later WAL records must be about to delete the file.
There is no way that future WAL records will make an existing segment
file shorter! So at this point don't we know that there is something
wrong with the backup itself?

Put another way, if you bring this up under 9.4.1, won't it also be
unable to access multixact 4624559 at this point?  Of course it won't
try to do so during recovery like 9.4.2 does, but I'm just trying to
understand how this is supposed to work for 9.4.1 if it needs to
access that multixact for other reasons once normal running is reached
(say you recover up to that checkpoint, and then run
pg_get_multixact_members, or a row has that xmax and its members to be
looked up by a vacuum or any normal transaction).  In other words,
isn't this a base backup that is somehow broken, not at all like the
pg_upgrade corruption case which involved the specific case of
multixact 1 and an entirely missing segment file, and 9.4.2 just tells
you about it sooner than 9.4.1?

For what it's worth, I've also spent a lot of time trying to reproduce
basebackup problems with multixact creation, vacuums and checkpoints
injected at various points between copying backup label, pg_multixact,
and pg_control.  I have so far failed to produce anything more
interesting than the 'reading zeroes' case (see attached
"copy-after-trunction.sh") and a case where the control file points at
a segment that doesn't exist, but it doesn't matter because the backup
label points at a checkpoint from a time when it did and
oldestMultiXactId is updated from there, and then procedes exactly as
it should (see "copy-before-truncation.sh").  I updated my scripts to
look a bit more like your nicely automated example (though mine use a
different trick to create small quantities of multixacts so they run
against unpatched master).  I have also been considering a scenario
where multixact ID wraparound occurs during basebackup with some
ordering that causes trouble, but I don't yet see why it would break
if you replay the WAL from the backup label checkpoint (and I think
the repro would take days/weeks to run...)

--
Thomas Munro
http://www.enterprisedb.com

Attachment
Thomas Munro wrote:
> On Tue, Jun 2, 2015 at 9:30 AM, Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
> > My guess is that the file existed, and perhaps had one or more pages,
> > but the wanted page doesn't exist, so we tried to read but got 0 bytes
> > back.  read() returns 0 in this case but doesn't set errno.
> >
> > I didn't find a way to set things so that the file exists but is of
> > shorter contents than oldestMulti by the time the checkpoint record is
> > replayed.
>
> I'm just starting to learn about the recovery machinery, so forgive me
> if I'm missing something basic here, but I just don't get this.  As I
> understand it, offsets/0046 should either have been copied with that
> page present in it if it existed before the backup started (apparently
> not in this case), or extended to contain it by WAL records that come
> after the backup label but before the checkpoint record that
> references it (also apparently not in this case).

Exactly --- that's the spot at which I am, also.  I have had this
spinning in my head for three days now, and tried every single variation
that I could think of, but like you I was unable to reproduce the issue.
However, our customer took a second base backup and it failed in exactly
the same way, module some changes to the counters (the file that
didn't exist was 004B rather than 0046).  I'm still at a loss at what
the failure mode is.  We must be missing some crucial detail ...


--
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


On Wed, Jun 3, 2015 at 3:42 PM, Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
> Thomas Munro wrote:
>> On Tue, Jun 2, 2015 at 9:30 AM, Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
>> > My guess is that the file existed, and perhaps had one or more pages,
>> > but the wanted page doesn't exist, so we tried to read but got 0 bytes
>> > back.  read() returns 0 in this case but doesn't set errno.
>> >
>> > I didn't find a way to set things so that the file exists but is of
>> > shorter contents than oldestMulti by the time the checkpoint record is
>> > replayed.
>>
>> I'm just starting to learn about the recovery machinery, so forgive me
>> if I'm missing something basic here, but I just don't get this.  As I
>> understand it, offsets/0046 should either have been copied with that
>> page present in it if it existed before the backup started (apparently
>> not in this case), or extended to contain it by WAL records that come
>> after the backup label but before the checkpoint record that
>> references it (also apparently not in this case).
>
> Exactly --- that's the spot at which I am, also.  I have had this
> spinning in my head for three days now, and tried every single variation
> that I could think of, but like you I was unable to reproduce the issue.
> However, our customer took a second base backup and it failed in exactly
> the same way, module some changes to the counters (the file that
> didn't exist was 004B rather than 0046).  I'm still at a loss at what
> the failure mode is.  We must be missing some crucial detail ...

I have finally reproduced that error!  See attached repro shell script.

The conditions are:

1.  next multixact == oldest multixact (no active multixacts, pointing
past the end)
2.  next multixact would be the first item on a new page (multixact % 2048 == 0)
3.  the page must not be the first in a segment (or we'd get the
read-zeroes case)

That gives you odds of 1/2048 * 31/32 * (probability of a wraparound
vacuum followed by no multixact creations right before your backup
checkpoint).  That seems like reasonably low odds... if it happened
twice in a row, maybe I'm missing something here and there is some
other way to get this...

I realise now that this is actually a symptom of a problem spotted by
Noah recently:

http://www.postgresql.org/message-id/20150601045534.GB23587@tornado.leadboat.com

He noticed the problem for segment boundaries, when not in recovery.
In recovery, segment boundaries don't raise an error (the read-zeroes
case applies), but page boundaries do.  The fix is probably to do
nothing if they are the same, as we do elsewhere, like in the attached
patch.

--
Thomas Munro
http://www.enterprisedb.com

Attachment
On Wed, Jun 3, 2015 at 4:48 AM, Thomas Munro
<thomas.munro@enterprisedb.com> wrote:
> On Wed, Jun 3, 2015 at 3:42 PM, Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
>> Thomas Munro wrote:
>>> On Tue, Jun 2, 2015 at 9:30 AM, Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
>>> > My guess is that the file existed, and perhaps had one or more pages,
>>> > but the wanted page doesn't exist, so we tried to read but got 0 bytes
>>> > back.  read() returns 0 in this case but doesn't set errno.
>>> >
>>> > I didn't find a way to set things so that the file exists but is of
>>> > shorter contents than oldestMulti by the time the checkpoint record is
>>> > replayed.
>>>
>>> I'm just starting to learn about the recovery machinery, so forgive me
>>> if I'm missing something basic here, but I just don't get this.  As I
>>> understand it, offsets/0046 should either have been copied with that
>>> page present in it if it existed before the backup started (apparently
>>> not in this case), or extended to contain it by WAL records that come
>>> after the backup label but before the checkpoint record that
>>> references it (also apparently not in this case).
>>
>> Exactly --- that's the spot at which I am, also.  I have had this
>> spinning in my head for three days now, and tried every single variation
>> that I could think of, but like you I was unable to reproduce the issue.
>> However, our customer took a second base backup and it failed in exactly
>> the same way, module some changes to the counters (the file that
>> didn't exist was 004B rather than 0046).  I'm still at a loss at what
>> the failure mode is.  We must be missing some crucial detail ...
>
> I have finally reproduced that error!  See attached repro shell script.
>
> The conditions are:
>
> 1.  next multixact == oldest multixact (no active multixacts, pointing
> past the end)
> 2.  next multixact would be the first item on a new page (multixact % 2048 == 0)
> 3.  the page must not be the first in a segment (or we'd get the
> read-zeroes case)
>
> That gives you odds of 1/2048 * 31/32 * (probability of a wraparound
> vacuum followed by no multixact creations right before your backup
> checkpoint).  That seems like reasonably low odds... if it happened
> twice in a row, maybe I'm missing something here and there is some
> other way to get this...
>
> I realise now that this is actually a symptom of a problem spotted by
> Noah recently:
>
> http://www.postgresql.org/message-id/20150601045534.GB23587@tornado.leadboat.com
>
> He noticed the problem for segment boundaries, when not in recovery.
> In recovery, segment boundaries don't raise an error (the read-zeroes
> case applies), but page boundaries do.  The fix is probably to do
> nothing if they are the same, as we do elsewhere, like in the attached
> patch.

Actually, we still need to call DetermineSafeOldestOffset in that
case.  Otherwise, if someone goes from lots of multixacts to none, the
stop point won't advance.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


On Tue, Jun 2, 2015 at 5:22 PM, Andres Freund <andres@anarazel.de> wrote:
>> > Hm. If GetOldestMultiXactOnDisk() gets the starting point by scanning
>> > the disk it'll always get one at a segment boundary, right? I'm not sure
>> > that's actually ok; because the value at the beginning of the segment
>> > can very well end up being a 0, as MaybeExtendOffsetSlru() will have
>> > filled the page with zeros.
>> >
>> > I think that should be harmless, the worst that can happen is that
>> > oldestOffset errorneously is 0, which should be correct, even though
>> > possibly overly conservative, in these cases.
>>
>> Uh oh.  That seems like a real bad problem for this approach.  What
>> keeps that from being the opposite of too conservative?  There's no
>> "safe" value in a circular numbering space.
>
> I think it *might* (I'm really jetlagged) be fine because that'll only
> happen after a upgrade from < 9.3. And in that case we initialize
> nextOffset to 0. That ought to safe us?

That's pretty much betting the farm on the bugs we know about today
being the only ones there are.  That seems imprudent.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Thomas Munro wrote:

> I have finally reproduced that error!  See attached repro shell script.
>
> The conditions are:
>
> 1.  next multixact == oldest multixact (no active multixacts, pointing
> past the end)
> 2.  next multixact would be the first item on a new page (multixact % 2048 == 0)
> 3.  the page must not be the first in a segment (or we'd get the
> read-zeroes case)
>
> That gives you odds of 1/2048 * 31/32 * (probability of a wraparound
> vacuum followed by no multixact creations right before your backup
> checkpoint).  That seems like reasonably low odds... if it happened
> twice in a row, maybe I'm missing something here and there is some
> other way to get this...

Thanks, this is pretty cool (as far as these things go), but it's not
the scenario I see, in which the complained-about segment is very old by
the time it's truncated away by a checkpoint after freeze.  Segment
requested by checkpoint.oldestMulti is 0046 (offset 140k something --
just to clarify it's not at segment boundary), and the base backup
contains segments from 004B to 00D5.  My problem scenario has
oldestMulti close to 5 million and nextMulti close to 14 million.

--
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


On 2015-06-03 00:42:55 -0300, Alvaro Herrera wrote:
> Thomas Munro wrote:
> > On Tue, Jun 2, 2015 at 9:30 AM, Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
> > > My guess is that the file existed, and perhaps had one or more pages,
> > > but the wanted page doesn't exist, so we tried to read but got 0 bytes
> > > back.  read() returns 0 in this case but doesn't set errno.
> > >
> > > I didn't find a way to set things so that the file exists but is of
> > > shorter contents than oldestMulti by the time the checkpoint record is
> > > replayed.
> >
> > I'm just starting to learn about the recovery machinery, so forgive me
> > if I'm missing something basic here, but I just don't get this.  As I
> > understand it, offsets/0046 should either have been copied with that
> > page present in it if it existed before the backup started (apparently
> > not in this case), or extended to contain it by WAL records that come
> > after the backup label but before the checkpoint record that
> > references it (also apparently not in this case).

That's not necessarily the case though, given how the code currently
works. In a bunch of places the SLRUs are accessed *before* having been
made consistent by WAL replay. Especially if several checkpoints/vacuums
happened during the base backup the assumed state (i.e. the mxacts
checkpoints refer to) of the data directory soon after the initial
start, and the state of pg_multixact/ won't necessarily match at all.

> Exactly --- that's the spot at which I am, also.  I have had this
> spinning in my head for three days now, and tried every single variation
> that I could think of, but like you I was unable to reproduce the issue.
> However, our customer took a second base backup and it failed in exactly
> the same way, module some changes to the counters (the file that
> didn't exist was 004B rather than 0046).  I'm still at a loss at what
> the failure mode is.  We must be missing some crucial detail ...

I might have missed it in this already long thread. Could you share a
bunch of details about hte case? It'd be very interesting to see the
contents of the backup label (to see where start/end are), the contents
of the initial checkpoint (to see which mxacts we assume to exist at
start) and what the initial contents of pg_multixact are (to match up).

Greetings,

Andres Freund


Andres Freund wrote:
> On 2015-06-03 00:42:55 -0300, Alvaro Herrera wrote:
> > Thomas Munro wrote:
> > > On Tue, Jun 2, 2015 at 9:30 AM, Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
> > > > My guess is that the file existed, and perhaps had one or more pages,
> > > > but the wanted page doesn't exist, so we tried to read but got 0 bytes
> > > > back.  read() returns 0 in this case but doesn't set errno.
> > > >
> > > > I didn't find a way to set things so that the file exists but is of
> > > > shorter contents than oldestMulti by the time the checkpoint record is
> > > > replayed.
> > >
> > > I'm just starting to learn about the recovery machinery, so forgive me
> > > if I'm missing something basic here, but I just don't get this.  As I
> > > understand it, offsets/0046 should either have been copied with that
> > > page present in it if it existed before the backup started (apparently
> > > not in this case), or extended to contain it by WAL records that come
> > > after the backup label but before the checkpoint record that
> > > references it (also apparently not in this case).
>
> That's not necessarily the case though, given how the code currently
> works. In a bunch of places the SLRUs are accessed *before* having been
> made consistent by WAL replay. Especially if several checkpoints/vacuums
> happened during the base backup the assumed state (i.e. the mxacts
> checkpoints refer to) of the data directory soon after the initial
> start, and the state of pg_multixact/ won't necessarily match at all.

Well, the log extract is quite simple; it says that as soon as the
standby starts replay WAL, it fetches exactly one WAL segment, which
contains exactly one online checkpoint record; this record contains
oldestMulti=4624559, which belongs in offset file 0046.  But the
basebackup only contains files starting from 004B onwards.  The base
backup takes a long time, so my guess of what happened is that the
backup label points to the start of the WAL stream (obviously), then the
data files are copied to the standby; during this long process, a
multixact truncation happens.  So by the time the base backup reaches
the pg_multixact directory, the 0046 file has been removed.

2015-05-27 16:15:17 UTC [4782]: [3-1] user=,db= LOG: entering standby mode
2015-05-27 16:15:18 UTC [4782]: [4-1] user=,db= LOG: restored log file "00000001000073DD000000AD" from archive
2015-05-27 16:15:18 UTC [4782]: [5-1] user=,db= FATAL: could not access status of transaction 4624559
2015-05-27 16:15:18 UTC [4782]: [6-1] user=,db= DETAIL: Could not read from file "pg_multixact/offsets/0046" at offset
147456:Success. 
2015-05-27 16:15:18 UTC [4778]: [4-1] user=,db= LOG: startup process (PID 4782) exited with exit code 1
2015-05-27 16:15:18 UTC [4778]: [5-1] user=,db= LOG: aborting startup due to startup process failure

One idea I had was: what if the oldestMulti pointed to another multi
earlier in the same 0046 file, so that it is read-as-zeroes (and the
file is created), and then a subsequent multixact truncate tries to read
a later page in the file.  In SlruPhysicalReadPage() this would give a
change for open() to not fail, and then read() can fail as above.
However, we would have an earlier LOG message about "reading as zeroes".

Really, the whole question of how this code goes past the open() failure
in SlruPhysicalReadPage baffles me.  I don't see any possible way for
the file to be created ...

> > Exactly --- that's the spot at which I am, also.  I have had this
> > spinning in my head for three days now, and tried every single variation
> > that I could think of, but like you I was unable to reproduce the issue.
> > However, our customer took a second base backup and it failed in exactly
> > the same way, module some changes to the counters (the file that
> > didn't exist was 004B rather than 0046).  I'm still at a loss at what
> > the failure mode is.  We must be missing some crucial detail ...
>
> I might have missed it in this already long thread. Could you share a
> bunch of details about hte case? It'd be very interesting to see the
> contents of the backup label (to see where start/end are), the contents
> of the initial checkpoint (to see which mxacts we assume to exist at
> start) and what the initial contents of pg_multixact are (to match up).

pg_xlogdump says about the checkpoint record:

rmgr: XLOG        len (rec/tot):     72/   104, tx:          0, lsn: 73DD/AD834470, prev 73DD/AD8343B8, bkp: 0000,
desc:checkpoint: redo 73DD/4003F648; tli 1; prev tli 1; fpw true; xid 1/3530575042; oid 81303440; multi 14003656;
offset66311341; oldest xid 2530472730 in DB 17081243; oldest multi 4624559 in DB 17081243; oldest running xid
3530551992;online 

I don't have the backup label.  But a listing of the files in
pg_multixact/offsets in the master has this:

./offsets:
total 35816
drwx------ 2 postgres postgres 4096 May 27 16:25 .
drwx------ 4 postgres postgres 4096 Jan 3 03:32 ..
-rw------- 1 postgres postgres 262144 Apr 20 13:07 004C
-rw------- 1 postgres postgres 262144 Apr 20 14:36 004D
-rw------- 1 postgres postgres 262144 Apr 20 14:39 004E
-rw------- 1 postgres postgres 262144 Apr 20 14:43 004F
-rw------- 1 postgres postgres 262144 Apr 20 15:06 0050
-rw------- 1 postgres postgres 262144 Apr 20 15:08 0051
-rw------- 1 postgres postgres 262144 Apr 20 15:11 0052
-rw------- 1 postgres postgres 262144 Apr 20 15:15 0053
-rw------- 1 postgres postgres 262144 Apr 20 15:20 0054
-rw------- 1 postgres postgres 262144 Apr 20 15:23 0055
-rw------- 1 postgres postgres 262144 Apr 20 15:25 0056
-rw------- 1 postgres postgres 262144 Apr 20 15:28 0057
-rw------- 1 postgres postgres 262144 Apr 20 15:31 0058
-rw------- 1 postgres postgres 262144 Apr 20 15:33 0059
-rw------- 1 postgres postgres 262144 Apr 20 15:38 005A
-rw------- 1 postgres postgres 262144 Apr 20 15:39 005B
-rw------- 1 postgres postgres 262144 Apr 20 15:41 005C
-rw------- 1 postgres postgres 262144 Apr 20 15:42 005D
-rw------- 1 postgres postgres 262144 Apr 20 15:47 005E
-rw------- 1 postgres postgres 262144 Apr 20 16:12 005F
-rw------- 1 postgres postgres 262144 Apr 20 16:41 0060
-rw------- 1 postgres postgres 262144 Apr 20 16:50 0061
-rw------- 1 postgres postgres 262144 Apr 20 17:51 0062
-rw------- 1 postgres postgres 262144 Apr 20 19:23 0063
-rw------- 1 postgres postgres 262144 Apr 20 22:08 0064
-rw------- 1 postgres postgres 262144 Apr 21 07:29 0065
-rw------- 1 postgres postgres 262144 Apr 21 11:43 0066
-rw------- 1 postgres postgres 262144 Apr 21 14:35 0067
-rw------- 1 postgres postgres 262144 Apr 21 16:02 0068
-rw------- 1 postgres postgres 262144 Apr 21 21:51 0069
-rw------- 1 postgres postgres 262144 Apr 22 08:40 006A
-rw------- 1 postgres postgres 262144 Apr 22 12:18 006B
-rw------- 1 postgres postgres 262144 Apr 22 16:22 006C
-rw------- 1 postgres postgres 262144 Apr 23 00:01 006D
-rw------- 1 postgres postgres 262144 Apr 23 09:08 006E
-rw------- 1 postgres postgres 262144 Apr 23 13:28 006F
-rw------- 1 postgres postgres 262144 Apr 23 17:35 0070
-rw------- 1 postgres postgres 262144 Apr 24 06:37 0071
-rw------- 1 postgres postgres 262144 Apr 24 12:27 0072
-rw------- 1 postgres postgres 262144 Apr 24 16:27 0073
-rw------- 1 postgres postgres 262144 Apr 25 06:37 0074
-rw------- 1 postgres postgres 262144 Apr 25 17:18 0075
-rw------- 1 postgres postgres 262144 Apr 26 15:05 0076
-rw------- 1 postgres postgres 262144 Apr 27 07:48 0077
-rw------- 1 postgres postgres 262144 Apr 27 13:11 0078
-rw------- 1 postgres postgres 262144 Apr 27 18:13 0079
-rw------- 1 postgres postgres 262144 Apr 28 08:41 007A
-rw------- 1 postgres postgres 262144 Apr 28 14:36 007B
-rw------- 1 postgres postgres 262144 Apr 29 00:34 007C
-rw------- 1 postgres postgres 262144 Apr 29 09:44 007D
-rw------- 1 postgres postgres 262144 Apr 29 14:49 007E
-rw------- 1 postgres postgres 262144 Apr 29 19:53 007F
-rw------- 1 postgres postgres 262144 Apr 30 08:43 0080
-rw------- 1 postgres postgres 262144 Apr 30 12:41 0081
-rw------- 1 postgres postgres 262144 Apr 30 15:16 0082
-rw------- 1 postgres postgres 262144 Apr 30 15:21 0083
-rw------- 1 postgres postgres 262144 Apr 30 17:15 0084
-rw------- 1 postgres postgres 262144 May 1 07:45 0085
-rw------- 1 postgres postgres 262144 May 1 17:06 0086
-rw------- 1 postgres postgres 262144 May 2 08:37 0087
-rw------- 1 postgres postgres 262144 May 3 00:14 0088
-rw------- 1 postgres postgres 262144 May 3 18:46 0089
-rw------- 1 postgres postgres 262144 May 4 06:33 008A
-rw------- 1 postgres postgres 262144 May 4 09:29 008B
-rw------- 1 postgres postgres 262144 May 4 13:42 008C
-rw------- 1 postgres postgres 262144 May 4 18:08 008D
-rw------- 1 postgres postgres 262144 May 5 06:56 008E
-rw------- 1 postgres postgres 262144 May 5 11:06 008F
-rw------- 1 postgres postgres 262144 May 5 14:47 0090
-rw------- 1 postgres postgres 262144 May 5 19:39 0091
-rw------- 1 postgres postgres 262144 May 6 07:36 0092
-rw------- 1 postgres postgres 262144 May 6 10:30 0093
-rw------- 1 postgres postgres 262144 May 6 15:23 0094
-rw------- 1 postgres postgres 262144 May 6 22:18 0095
-rw------- 1 postgres postgres 262144 May 7 08:22 0096
-rw------- 1 postgres postgres 262144 May 7 11:31 0097
-rw------- 1 postgres postgres 262144 May 7 14:18 0098
-rw------- 1 postgres postgres 262144 May 7 16:34 0099
-rw------- 1 postgres postgres 262144 May 7 18:39 009A
-rw------- 1 postgres postgres 262144 May 7 21:42 009B
-rw------- 1 postgres postgres 262144 May 8 07:57 009C
-rw------- 1 postgres postgres 262144 May 8 12:34 009D
-rw------- 1 postgres postgres 262144 May 8 17:53 009E
-rw------- 1 postgres postgres 262144 May 9 07:22 009F
-rw------- 1 postgres postgres 262144 May 9 16:42 00A0
-rw------- 1 postgres postgres 262144 May 10 08:43 00A1
-rw------- 1 postgres postgres 262144 May 10 19:23 00A2
-rw------- 1 postgres postgres 262144 May 11 08:20 00A3
-rw------- 1 postgres postgres 262144 May 11 08:48 00A4
-rw------- 1 postgres postgres 262144 May 11 12:23 00A5
-rw------- 1 postgres postgres 262144 May 11 15:36 00A6
-rw------- 1 postgres postgres 262144 May 11 21:24 00A7
-rw------- 1 postgres postgres 262144 May 12 08:15 00A8
-rw------- 1 postgres postgres 262144 May 12 09:20 00A9
-rw------- 1 postgres postgres 262144 May 12 09:42 00AA
-rw------- 1 postgres postgres 262144 May 12 11:04 00AB
-rw------- 1 postgres postgres 262144 May 12 11:06 00AC
-rw------- 1 postgres postgres 262144 May 12 11:09 00AD
-rw------- 1 postgres postgres 262144 May 12 11:26 00AE
-rw------- 1 postgres postgres 262144 May 12 11:26 00AF
-rw------- 1 postgres postgres 262144 May 12 11:31 00B0
-rw------- 1 postgres postgres 262144 May 12 12:06 00B1
-rw------- 1 postgres postgres 262144 May 12 12:37 00B2
-rw------- 1 postgres postgres 262144 May 12 13:18 00B3
-rw------- 1 postgres postgres 262144 May 12 14:34 00B4
-rw------- 1 postgres postgres 262144 May 12 14:29 00B5
-rw------- 1 postgres postgres 262144 May 12 14:44 00B6
-rw------- 1 postgres postgres 262144 May 12 15:16 00B7
-rw------- 1 postgres postgres 262144 May 12 15:26 00B8
-rw------- 1 postgres postgres 262144 May 12 15:35 00B9
-rw------- 1 postgres postgres 262144 May 12 15:47 00BA
-rw------- 1 postgres postgres 262144 May 12 16:02 00BB
-rw------- 1 postgres postgres 262144 May 12 16:18 00BC
-rw------- 1 postgres postgres 262144 May 12 17:00 00BD
-rw------- 1 postgres postgres 262144 May 12 18:48 00BE
-rw------- 1 postgres postgres 262144 May 12 19:59 00BF
-rw------- 1 postgres postgres 262144 May 12 21:39 00C0
-rw------- 1 postgres postgres 262144 May 13 05:31 00C1
-rw------- 1 postgres postgres 262144 May 13 07:35 00C2
-rw------- 1 postgres postgres 262144 May 13 08:16 00C3
-rw------- 1 postgres postgres 262144 May 13 08:30 00C4
-rw------- 1 postgres postgres 262144 May 13 09:00 00C5
-rw------- 1 postgres postgres 262144 May 13 09:19 00C6
-rw------- 1 postgres postgres 262144 May 13 09:41 00C7
-rw------- 1 postgres postgres 262144 May 13 10:03 00C8
-rw------- 1 postgres postgres 262144 May 13 10:17 00C9
-rw------- 1 postgres postgres 262144 May 13 10:40 00CA
-rw------- 1 postgres postgres 262144 May 13 11:05 00CB
-rw------- 1 postgres postgres 262144 May 13 11:28 00CC
-rw------- 1 postgres postgres 262144 May 13 12:02 00CD
-rw------- 1 postgres postgres 262144 May 13 12:58 00CE
-rw------- 1 postgres postgres 262144 May 13 13:06 00CF
-rw------- 1 postgres postgres 262144 May 13 13:12 00D0
-rw------- 1 postgres postgres 262144 May 13 13:21 00D1
-rw------- 1 postgres postgres 262144 May 13 13:31 00D2
-rw------- 1 postgres postgres 262144 May 13 13:40 00D3
-rw------- 1 postgres postgres 262144 May 13 13:53 00D4
-rw------- 1 postgres postgres 188416 May 27 16:40 00D5


--
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Alvaro Herrera wrote:

> Really, the whole question of how this code goes past the open() failure
> in SlruPhysicalReadPage baffles me.  I don't see any possible way for
> the file to be created ...

Hmm, the checkpointer can call TruncateMultiXact when in recovery, on
restartpoints. I wonder if in recovery this could trigger file creation.

--
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


On 2015-06-03 15:01:46 -0300, Alvaro Herrera wrote:
> Andres Freund wrote:
> > That's not necessarily the case though, given how the code currently
> > works. In a bunch of places the SLRUs are accessed *before* having been
> > made consistent by WAL replay. Especially if several checkpoints/vacuums
> > happened during the base backup the assumed state (i.e. the mxacts
> > checkpoints refer to) of the data directory soon after the initial
> > start, and the state of pg_multixact/ won't necessarily match at all.
>
> Well, the log extract is quite simple; it says that as soon as the
> standby starts replay WAL, it fetches exactly one WAL segment, which
> contains exactly one online checkpoint record; this record contains
> oldestMulti=4624559, which belongs in offset file 0046.  But the
> basebackup only contains files starting from 004B onwards.  The base
> backup takes a long time, so my guess of what happened is that the
> backup label points to the start of the WAL stream (obviously), then the
> data files are copied to the standby; during this long process, a
> multixact truncation happens.  So by the time the base backup reaches
> the pg_multixact directory, the 0046 file has been removed.

Yes. That's precisely what I was concerned about above and elsewhere in
the thread. It's simply not ok to access a SLRU while not yet
consistent...

> One idea I had was: what if the oldestMulti pointed to another multi
> earlier in the same 0046 file, so that it is read-as-zeroes (and the
> file is created), and then a subsequent multixact truncate tries to read
> a later page in the file.  In SlruPhysicalReadPage() this would give a
> change for open() to not fail, and then read() can fail as above.
> However, we would have an earlier LOG message about "reading as zeroes".
>
> Really, the whole question of how this code goes past the open() failure
> in SlruPhysicalReadPage baffles me.  I don't see any possible way for
> the file to be created ...

Wouldn't a previous WAL record zeroing another part of that segment
explain this? A zero sized segment pretty much would lead to this error,
right? Or were you able to check how things look after the failure?

> 2015-05-27 16:15:17 UTC [4782]: [3-1] user=,db= LOG: entering standby mode
> 2015-05-27 16:15:18 UTC [4782]: [4-1] user=,db= LOG: restored log file "00000001000073DD000000AD" from archive
> 2015-05-27 16:15:18 UTC [4782]: [5-1] user=,db= FATAL: could not access status of transaction 4624559
> 2015-05-27 16:15:18 UTC [4782]: [6-1] user=,db= DETAIL: Could not read from file "pg_multixact/offsets/0046" at
offset147456: Success. 
> 2015-05-27 16:15:18 UTC [4778]: [4-1] user=,db= LOG: startup process (PID 4782) exited with exit code 1
> 2015-05-27 16:15:18 UTC [4778]: [5-1] user=,db= LOG: aborting startup due to startup process failure

From this isn't not entirely clear where this error was triggered from.

Greetings,

Andres Freund


Andres Freund wrote:
> On 2015-06-03 15:01:46 -0300, Alvaro Herrera wrote:

> > One idea I had was: what if the oldestMulti pointed to another multi
> > earlier in the same 0046 file, so that it is read-as-zeroes (and the
> > file is created), and then a subsequent multixact truncate tries to read
> > a later page in the file.  In SlruPhysicalReadPage() this would give a
> > change for open() to not fail, and then read() can fail as above.
> > However, we would have an earlier LOG message about "reading as zeroes".
> >
> > Really, the whole question of how this code goes past the open() failure
> > in SlruPhysicalReadPage baffles me.  I don't see any possible way for
> > the file to be created ...
>
> Wouldn't a previous WAL record zeroing another part of that segment
> explain this? A zero sized segment pretty much would lead to this error,
> right? Or were you able to check how things look after the failure?

But why would there be a previous WAL record zeroing another part of
that segment?  Note that this segment is very old -- hasn't been written
in quite a while, it's certainly not in slru buffers anymore.

> > 2015-05-27 16:15:17 UTC [4782]: [3-1] user=,db= LOG: entering standby mode
> > 2015-05-27 16:15:18 UTC [4782]: [4-1] user=,db= LOG: restored log file "00000001000073DD000000AD" from archive
> > 2015-05-27 16:15:18 UTC [4782]: [5-1] user=,db= FATAL: could not access status of transaction 4624559
> > 2015-05-27 16:15:18 UTC [4782]: [6-1] user=,db= DETAIL: Could not read from file "pg_multixact/offsets/0046" at
offset147456: Success. 
> > 2015-05-27 16:15:18 UTC [4778]: [4-1] user=,db= LOG: startup process (PID 4782) exited with exit code 1
> > 2015-05-27 16:15:18 UTC [4778]: [5-1] user=,db= LOG: aborting startup due to startup process failure
>
> From this isn't not entirely clear where this error was triggered from.

Well, reading code, it seems reasonable that to assume that replay of
the checkpoint record I mentioned leads to that error message when the
file exists but is not long enough to contain the given offset.  There
are not MultiXact wal records in the segment.  Also note that there's no
other "restored log file" message after the "entering standby mode"
message.

--
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


On Wed, Jun 3, 2015 at 8:24 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Tue, Jun 2, 2015 at 5:22 PM, Andres Freund <andres@anarazel.de> wrote:
>>> > Hm. If GetOldestMultiXactOnDisk() gets the starting point by scanning
>>> > the disk it'll always get one at a segment boundary, right? I'm not sure
>>> > that's actually ok; because the value at the beginning of the segment
>>> > can very well end up being a 0, as MaybeExtendOffsetSlru() will have
>>> > filled the page with zeros.
>>> >
>>> > I think that should be harmless, the worst that can happen is that
>>> > oldestOffset errorneously is 0, which should be correct, even though
>>> > possibly overly conservative, in these cases.
>>>
>>> Uh oh.  That seems like a real bad problem for this approach.  What
>>> keeps that from being the opposite of too conservative?  There's no
>>> "safe" value in a circular numbering space.
>>
>> I think it *might* (I'm really jetlagged) be fine because that'll only
>> happen after a upgrade from < 9.3. And in that case we initialize
>> nextOffset to 0. That ought to safe us?
>
> That's pretty much betting the farm on the bugs we know about today
> being the only ones there are.  That seems imprudent.

So here's a patch taking a different approach.  In this approach, if
the multixact whose members we want to look up doesn't exist, we don't
use a later one (that might or might not be valid).  Instead, we
attempt to cope with the unknown.  That means:

1. In TruncateMultiXact(), we don't truncate.

2. If setting the offset stop limit (the point where we refuse to
create new multixact space), we don't arm the stop point.  This means
that if you're in this situation, you run without member wraparound
protection until it's corrected.  A message gets logged once per
checkpoint telling you that you have this problem, and another message
gets logged when things get straightened out and the guards are
enabled.

3. If setting the vacuum force point, we assume that it's appropriate
to immediately force vacuum.

I've only tested this very lightly - this is just to see what you and
Noah and others think of the approach.  As compared with the previous
approach, it has the advantage of making minimal assumptions about the
sanity of what's on disk.  It has the disadvantage that, for some
people, the member-wraparound guard won't be enabled at startup -- but
note that those people can't start 9.3.7/9.4.2 *at all*, so currently
they are either running without member wraparound protection anyway
(if they haven't upgraded to those releases) or they're down entirely.
Another disadvantage is that we'll be triggering what may be quite a
bit of autovacuum activity for some people, which could be painful.
On the plus side, they'll hopefully end up with sane relminmxid and
datminmxid guards afterwards.

Thoughts?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Attachment
Robert Haas wrote:

> So here's a patch taking a different approach.

I tried to apply this to 9.3 but it's messy because of pgindent.  Anyone
would have a problem with me backpatching a pgindent run of multixact.c?

Also, you have a new function SlruPageExists, but we already have
SimpleLruDoesPhysicalPageExist.  No need for two copies of pretty much
the same code, I think.  Your code uses fstat() instead of
lseek(.., SEEK_END) but the exact technique used is probably not
relevant.

I think I like this approach better than your other patch, FWIW --
mainly because it seems simpler.

Will review.

--
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: 9.4.1 -> 9.4.2 problem: could not access status of transaction 1

From
Thomas Munro
Date:
On Mon, Jun 1, 2015 at 4:55 PM, Noah Misch <noah@leadboat.com> wrote:
> While testing this (with inconsistent-multixact-fix-master.patch applied,
> FWIW), I noticed a nearby bug with a similar symptom.  TruncateMultiXact()
> omits the nextMXact==oldestMXact special case found in each other
> find_multixact_start() caller, so it reads the offset of a not-yet-created
> MultiXactId.  The usual outcome is to get rangeStart==0, so we truncate less
> than we could.  This can't make us truncate excessively, because
> nextMXact==oldestMXact implies no table contains any mxid.  If nextMXact
> happens to be the first of a segment, an error is possible.  Procedure:
>
> 1. Make a fresh cluster.
> 2. UPDATE pg_database SET datallowconn = true
> 3. Consume precisely 131071 mxids.  Number of offsets per mxid is unimportant.
> 4. vacuumdb --freeze --all
>
> Expected state after those steps:
> $ pg_controldata | grep NextMultiXactId
> Latest checkpoint's NextMultiXactId:  131072
>
> Checkpoint will fail like this:
> 26699 2015-05-31 17:22:33.134 GMT LOG:  statement: checkpoint
> 26661 2015-05-31 17:22:33.134 GMT DEBUG:  performing replication slot checkpoint
> 26661 2015-05-31 17:22:33.136 GMT ERROR:  could not access status of transaction 131072
> 26661 2015-05-31 17:22:33.136 GMT DETAIL:  Could not open file "pg_multixact/offsets/0002": No such file or
directory.
> 26699 2015-05-31 17:22:33.234 GMT ERROR:  checkpoint request failed
> 26699 2015-05-31 17:22:33.234 GMT HINT:  Consult recent messages in the server log for details.
> 26699 2015-05-31 17:22:33.234 GMT STATEMENT:  checkpoint
>
> This does not block startup, and creating one mxid hides the problem again.
> Thus, it is not a top-priority bug like some other parts of this thread.  I
> mention it today mostly so it doesn't surprise hackers testing other fixes.

Thanks.   As mentioned elsewhere in the thread, I discovered that the
same problem exists for page boundaries, with a different error
message.  I've tried the attached repro scripts on 9.3.0, 9.3.5, 9.4.1
and master with the same results:

FATAL:  could not access status of transaction 2048
DETAIL:  Could not read from file "pg_multixact/offsets/0000" at
offset 8192: Undefined error: 0.

FATAL:  could not access status of transaction 131072
DETAIL:  Could not open file "pg_multixact/offsets/0002": No such file
or directory.

But, yeah, this isn't the bug we're looking for.

--
Thomas Munro
http://www.enterprisedb.com

Attachment

Re: 9.4.1 -> 9.4.2 problem: could not access status of transaction 1

From
Noah Misch
Date:
On Wed, Jun 03, 2015 at 04:53:46PM -0400, Robert Haas wrote:
> So here's a patch taking a different approach.  In this approach, if
> the multixact whose members we want to look up doesn't exist, we don't
> use a later one (that might or might not be valid).  Instead, we
> attempt to cope with the unknown.  That means:
>
> 1. In TruncateMultiXact(), we don't truncate.

I like that change a lot.  It's much easier to seek forgiveness for wasting <=
28 GiB of disk than for deleting visibility information wrongly.

> 2. If setting the offset stop limit (the point where we refuse to
> create new multixact space), we don't arm the stop point.  This means
> that if you're in this situation, you run without member wraparound
> protection until it's corrected.  A message gets logged once per
> checkpoint telling you that you have this problem, and another message
> gets logged when things get straightened out and the guards are
> enabled.
>
> 3. If setting the vacuum force point, we assume that it's appropriate
> to immediately force vacuum.

Those seem reasonable, too.

> I've only tested this very lightly - this is just to see what you and
> Noah and others think of the approach.  As compared with the previous
> approach, it has the advantage of making minimal assumptions about the
> sanity of what's on disk.  It has the disadvantage that, for some
> people, the member-wraparound guard won't be enabled at startup -- but
> note that those people can't start 9.3.7/9.4.2 *at all*, so currently
> they are either running without member wraparound protection anyway
> (if they haven't upgraded to those releases) or they're down entirely.

That disadvantage is negligible, considering.

> Another disadvantage is that we'll be triggering what may be quite a
> bit of autovacuum activity for some people, which could be painful.
> On the plus side, they'll hopefully end up with sane relminmxid and
> datminmxid guards afterwards.

That sounds good so long as each table requires just one successful emergency
autovacuum.  I'm not seeing code to ensure that the launched autovacuum will
indeed perform a full-table scan and update relminmxid; is it there?

For sites that can't tolerate an autovacuum storm, what alternative can we
provide?  Is "SET vacuum_multixact_freeze_table_age = 0; VACUUM <table>" of
every table, done before applying the minor update, sufficient?

>  static void
> -DetermineSafeOldestOffset(MultiXactId oldestMXact)
> +DetermineSafeOldestOffset(MultiXactOffset oldestMXact)

Leftover change from an earlier iteration?  The values passed continue to be
MultiXactId values.

>      /* move back to start of the corresponding segment */
> -    oldestOffset -= oldestOffset %
> -        (MULTIXACT_MEMBERS_PER_PAGE * SLRU_PAGES_PER_SEGMENT);
> +    offsetStopLimit = oldestOffset - (oldestOffset %
> +        (MULTIXACT_MEMBERS_PER_PAGE * SLRU_PAGES_PER_SEGMENT));
> +    /* always leave one segment before the wraparound point */
> +    offsetStopLimit -= (MULTIXACT_MEMBERS_PER_PAGE * SLRU_PAGES_PER_SEGMENT);
> +
> +    /* if nothing has changed, we're done */
> +    if (prevOffsetStopLimitKnown && offsetStopLimit == prevOffsetStopLimit)
> +        return;
>
>      LWLockAcquire(MultiXactGenLock, LW_EXCLUSIVE);
> -    /* always leave one segment before the wraparound point */
> -    MultiXactState->offsetStopLimit = oldestOffset -
> -        (MULTIXACT_MEMBERS_PER_PAGE * SLRU_PAGES_PER_SEGMENT);
> +    MultiXactState->offsetStopLimit = oldestOffset;

That last line needs s/oldestOffset/offsetStopLimit/, I presume.

Thanks,
nm


Re: 9.4.1 -> 9.4.2 problem: could not access status of transaction 1

From
Robert Haas
Date:
On Thu, Jun 4, 2015 at 2:42 AM, Noah Misch <noah@leadboat.com> wrote:
> I like that change a lot.  It's much easier to seek forgiveness for wasting <=
> 28 GiB of disk than for deleting visibility information wrongly.

I'm glad you like it.  I concur.

>> 2. If setting the offset stop limit (the point where we refuse to
>> create new multixact space), we don't arm the stop point.  This means
>> that if you're in this situation, you run without member wraparound
>> protection until it's corrected.  A message gets logged once per
>> checkpoint telling you that you have this problem, and another message
>> gets logged when things get straightened out and the guards are
>> enabled.
>>
>> 3. If setting the vacuum force point, we assume that it's appropriate
>> to immediately force vacuum.
>
> Those seem reasonable, too.

Cool.

>> I've only tested this very lightly - this is just to see what you and
>> Noah and others think of the approach.  As compared with the previous
>> approach, it has the advantage of making minimal assumptions about the
>> sanity of what's on disk.  It has the disadvantage that, for some
>> people, the member-wraparound guard won't be enabled at startup -- but
>> note that those people can't start 9.3.7/9.4.2 *at all*, so currently
>> they are either running without member wraparound protection anyway
>> (if they haven't upgraded to those releases) or they're down entirely.
>
> That disadvantage is negligible, considering.

All right.

>> Another disadvantage is that we'll be triggering what may be quite a
>> bit of autovacuum activity for some people, which could be painful.
>> On the plus side, they'll hopefully end up with sane relminmxid and
>> datminmxid guards afterwards.
>
> That sounds good so long as each table requires just one successful emergency
> autovacuum.  I'm not seeing code to ensure that the launched autovacuum will
> indeed perform a full-table scan and update relminmxid; is it there?

No.  Oops.

> For sites that can't tolerate an autovacuum storm, what alternative can we
> provide?  Is "SET vacuum_multixact_freeze_table_age = 0; VACUUM <table>" of
> every table, done before applying the minor update, sufficient?

I don't know.  In practical terms, they probably need to ensure that
if pg_multixact/offsets/0000 does not exist, no relations have
relminmxid = 1 and no remaining databases have datminmxid = 1.
Exactly what it will take to get there is possibly dependent on which
minor release you are running; on current minor releases, I am hopeful
that what you propose is sufficient.

>>  static void
>> -DetermineSafeOldestOffset(MultiXactId oldestMXact)
>> +DetermineSafeOldestOffset(MultiXactOffset oldestMXact)
>
> Leftover change from an earlier iteration?  The values passed continue to be
> MultiXactId values.

Oopsie.

>>       /* move back to start of the corresponding segment */
>> -     oldestOffset -= oldestOffset %
>> -             (MULTIXACT_MEMBERS_PER_PAGE * SLRU_PAGES_PER_SEGMENT);
>> +     offsetStopLimit = oldestOffset - (oldestOffset %
>> +             (MULTIXACT_MEMBERS_PER_PAGE * SLRU_PAGES_PER_SEGMENT));
>> +     /* always leave one segment before the wraparound point */
>> +     offsetStopLimit -= (MULTIXACT_MEMBERS_PER_PAGE * SLRU_PAGES_PER_SEGMENT);
>> +
>> +     /* if nothing has changed, we're done */
>> +     if (prevOffsetStopLimitKnown && offsetStopLimit == prevOffsetStopLimit)
>> +             return;
>>
>>       LWLockAcquire(MultiXactGenLock, LW_EXCLUSIVE);
>> -     /* always leave one segment before the wraparound point */
>> -     MultiXactState->offsetStopLimit = oldestOffset -
>> -             (MULTIXACT_MEMBERS_PER_PAGE * SLRU_PAGES_PER_SEGMENT);
>> +     MultiXactState->offsetStopLimit = oldestOffset;
>
> That last line needs s/oldestOffset/offsetStopLimit/, I presume.

Another oops.

Thanks for the review.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: 9.4.1 -> 9.4.2 problem: could not access status of transaction 1

From
Robert Haas
Date:
On Thu, Jun 4, 2015 at 9:42 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> Thanks for the review.

Here's a new version.  I've fixed the things Alvaro and Noah noted,
and some compiler warnings about set but unused variables.

I also tested it, and it doesn't quite work as hoped.  If started on a
cluster where oldestMultiXid is incorrectly set to 1, it starts up and
indicates that the member wraparound guards are disabled.  But even
after everything is fixed, they don't get enabled until after the next
full restart.  I think that's because TruncateMultiXact() bails out
too early, without calling DetermineSafeOldestOffset.

My attempt at a quick fix for that problem didn't work out, so I'm
posting this version for now to facilitate further review and testing.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Attachment
Hi,

On 2015-06-04 12:57:42 -0400, Robert Haas wrote:
> +    /*
> +     * Do we need an emergency autovacuum?  If we're not sure, assume yes.
> +     */
> +    return !oldestOffsetKnown ||
> +        (nextOffset - oldestOffset > MULTIXACT_MEMBER_SAFE_THRESHOLD);

I think without teaching autovac about those rules, this might just lead
to lots of autovac processes starting without knowing they should do
something? They know about autovacuum_multixact_freeze_age, but they
know neither about !oldestOffsetKnown nor, afaics, about pending offset
wraparounds?

> -static MultiXactOffset
> -find_multixact_start(MultiXactId multi)
> +static bool
> +find_multixact_start(MultiXactId multi, MultiXactOffset *result)
>  {
>      MultiXactOffset offset;
>      int            pageno;
> @@ -2630,6 +2741,9 @@ find_multixact_start(MultiXactId multi)
>      pageno = MultiXactIdToOffsetPage(multi);
>      entryno = MultiXactIdToOffsetEntry(multi);
>
> +    if (!SimpleLruDoesPhysicalPageExist(MultiXactOffsetCtl, pageno))
> +        return false;
> +
>      /* lock is acquired by SimpleLruReadPage_ReadOnly */
>      slotno = SimpleLruReadPage_ReadOnly(MultiXactOffsetCtl, pageno, multi);
>      offptr = (MultiXactOffset *) MultiXactOffsetCtl->shared->page_buffer[slotno];
> @@ -2642,25 +2756,31 @@ find_multixact_start(MultiXactId multi)
>

I think it'd be a good idea to also return false in case of a
InvalidMultiXactId - that'll be returned if the page has been zeroed.


Andres


On Thu, Jun 4, 2015 at 1:27 PM, Andres Freund <andres@anarazel.de> wrote:
> On 2015-06-04 12:57:42 -0400, Robert Haas wrote:
>> +     /*
>> +      * Do we need an emergency autovacuum?  If we're not sure, assume yes.
>> +      */
>> +     return !oldestOffsetKnown ||
>> +             (nextOffset - oldestOffset > MULTIXACT_MEMBER_SAFE_THRESHOLD);
>
> I think without teaching autovac about those rules, this might just lead
> to lots of autovac processes starting without knowing they should do
> something? They know about autovacuum_multixact_freeze_age, but they
> know neither about !oldestOffsetKnown nor, afaics, about pending offset
> wraparounds?

You're right, but that's why the latest patch has changes in
MultiXactMemberFreezeThreshold.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Alvaro Herrera wrote:
> Robert Haas wrote:
>
> > So here's a patch taking a different approach.
>
> I tried to apply this to 9.3 but it's messy because of pgindent.  Anyone
> would have a problem with me backpatching a pgindent run of multixact.c?

Done.

--
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: 9.4.1 -> 9.4.2 problem: could not access status of transaction 1

From
Robert Haas
Date:
On Thu, Jun 4, 2015 at 12:57 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Thu, Jun 4, 2015 at 9:42 AM, Robert Haas <robertmhaas@gmail.com> wrote:
>> Thanks for the review.
>
> Here's a new version.  I've fixed the things Alvaro and Noah noted,
> and some compiler warnings about set but unused variables.
>
> I also tested it, and it doesn't quite work as hoped.  If started on a
> cluster where oldestMultiXid is incorrectly set to 1, it starts up and
> indicates that the member wraparound guards are disabled.  But even
> after everything is fixed, they don't get enabled until after the next
> full restart.  I think that's because TruncateMultiXact() bails out
> too early, without calling DetermineSafeOldestOffset.
>
> My attempt at a quick fix for that problem didn't work out, so I'm
> posting this version for now to facilitate further review and testing.

Here's a new version with some more fixes and improvements:

- SetOffsetVacuumLimit was failing to set MultiXactState->oldestOffset
when the oldest offset became known if the now-known value happened to
be zero.  Fixed.

- SetOffsetVacuumLimit now logs useful information at the DEBUG1
level, so that you can see that it's doing what it's supposed to.

- TruncateMultiXact now calls DetermineSafeOldestOffset to adjust the
offsetStopLimit even if it can't truncate anything.  This seems
useless, but it's not, because it may be that the last checkpoint
advanced lastCheckpointedOldest from a bogus value (i.e. 1) to a real
value, and now we can actually set offsetStopLimit properly.

- TruncateMultiXact no longer calls find_multixact_start when there
are no remaining multixacts.  This is actually a completely separate
bug that goes all the way back to 9.3.0 and can potentially cause
TruncateMultiXact to remove every file in pg_multixact/offsets.
Restarting the cluster becomes impossible because TrimMultiXact barfs.

- TruncateMultiXact now logs a message if the oldest multixact does
not precede the earliest one on disk and is not equal to the next
multixact and yet does not exist.  The value of the log message is
that it discovered the bug mentioned in the previous line, so I think
it's earning its keep.

With this version, I'm able to see that when you start up a
9.3.latest+this patch with a cluster that has a bogus value of 1 in
relminmxid, datminmxid, and the control file, autovacuum vacuums
everything in sight, all the values get set back to the right thing,
and the next checkpoint enables the member-wraparound guards.  This
works with both autovacuum=on and autovacuum=off; the emergency
mechanism kicks in as intended.  We'll want to warn people with big
databases who upgrade to 9.3.0 - 9.3.4 via pg_upgrade that they may
want to pre-vacuum those tables before upgrading to avoid a vacuum
storm.  But generally I'm pretty happy with this: forcing those values
to get fixed so that we can guard against member-space wraparound
seems like the right thing to do.

So, to summarize, this patch does the following:

- Fixes the failure-to-start problems introduced in 9.4.2 in
complicated pg_upgrade scenarios.
- Prevents the new calls to find_multixact_start we added in 9.4.2
from happening during recovery, where they can only create failure
scenarios.  The call in TruncateMultiXact that has been there all
along is not eliminated, but now handles failure more gracefully.
- Fixes possible incorrect removal of every single
pg_multixact/offsets file when no multixacts exist; one file should be
kept.
- Forces aggressive autovacuuming when the control file's
oldestMultiXid doesn't point to a valid MultiXact and enables member
wraparound at the next checkpoint following the correction of that
problem.

Thanks,

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Attachment

Re: 9.4.1 -> 9.4.2 problem: could not access status of transaction 1

From
Robert Haas
Date:
On Thu, Jun 4, 2015 at 5:29 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> - Forces aggressive autovacuuming when the control file's
> oldestMultiXid doesn't point to a valid MultiXact and enables member
> wraparound at the next checkpoint following the correction of that
> problem.

Err, enables member wraparound *protection* at the next checkpoint,
not the wraparound itself.

It's worth noting that every startup will now include one of the
following two messages:

LOG:  MultiXact member wraparound protections are now enabled

Or:

LOG:  MultiXact member wraparound protections are disabled because
oldest checkpointed MultiXact %u does not exist on disk
...where %u is probably 1

If you get the second one, you'll get the first one later after vacuum
has done its thing and a checkpoint has happened.

This is, obviously, some log chatter for people who don't have a
problem and never have, but I think it's worth emitting the first
message at startup even when there's no problem, so that people don't
have to make inferences from the absence of a message.  We can tell
people very simply that (1) if they see the first message, everything
is fine; (2) if they see the second message, autovacuum is going to
clean things up and they will eventually see the first message; and
(3) if they see neither message, they haven't upgraded to a fixed
version yet.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: 9.4.1 -> 9.4.2 problem: could not access status of transaction 1

From
Thomas Munro
Date:
On Fri, Jun 5, 2015 at 9:29 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> Here's a new version with some more fixes and improvements:
>
> - SetOffsetVacuumLimit was failing to set MultiXactState->oldestOffset
> when the oldest offset became known if the now-known value happened to
> be zero.  Fixed.
>
> - SetOffsetVacuumLimit now logs useful information at the DEBUG1
> level, so that you can see that it's doing what it's supposed to.
>
> - TruncateMultiXact now calls DetermineSafeOldestOffset to adjust the
> offsetStopLimit even if it can't truncate anything.  This seems
> useless, but it's not, because it may be that the last checkpoint
> advanced lastCheckpointedOldest from a bogus value (i.e. 1) to a real
> value, and now we can actually set offsetStopLimit properly.
>
> - TruncateMultiXact no longer calls find_multixact_start when there
> are no remaining multixacts.  This is actually a completely separate
> bug that goes all the way back to 9.3.0 and can potentially cause
> TruncateMultiXact to remove every file in pg_multixact/offsets.
> Restarting the cluster becomes impossible because TrimMultiXact barfs.
>
> - TruncateMultiXact now logs a message if the oldest multixact does
> not precede the earliest one on disk and is not equal to the next
> multixact and yet does not exist.  The value of the log message is
> that it discovered the bug mentioned in the previous line, so I think
> it's earning its keep.
>
> With this version, I'm able to see that when you start up a
> 9.3.latest+this patch with a cluster that has a bogus value of 1 in
> relminmxid, datminmxid, and the control file, autovacuum vacuums
> everything in sight, all the values get set back to the right thing,
> and the next checkpoint enables the member-wraparound guards.  This
> works with both autovacuum=on and autovacuum=off; the emergency
> mechanism kicks in as intended.  We'll want to warn people with big
> databases who upgrade to 9.3.0 - 9.3.4 via pg_upgrade that they may
> want to pre-vacuum those tables before upgrading to avoid a vacuum
> storm.  But generally I'm pretty happy with this: forcing those values
> to get fixed so that we can guard against member-space wraparound
> seems like the right thing to do.
>
> So, to summarize, this patch does the following:
>
> - Fixes the failure-to-start problems introduced in 9.4.2 in
> complicated pg_upgrade scenarios.
> - Prevents the new calls to find_multixact_start we added in 9.4.2
> from happening during recovery, where they can only create failure
> scenarios.  The call in TruncateMultiXact that has been there all
> along is not eliminated, but now handles failure more gracefully.
> - Fixes possible incorrect removal of every single
> pg_multixact/offsets file when no multixacts exist; one file should be
> kept.
> - Forces aggressive autovacuuming when the control file's
> oldestMultiXid doesn't point to a valid MultiXact and enables member
> wraparound at the next checkpoint following the correction of that
> problem.

With this patch, when I run the script
"checkpoint-segment-boundary.sh" from
http://www.postgresql.org/message-id/CAEepm=1_KbHGbmPVmkUGE5qTP+B4efoCJYS0unGo-Mc5NV=UDg@mail.gmail.com
I see the following during shutdown checkpoint:

LOG:  could not truncate directory "pg_multixact/offsets": apparent wraparound

That message comes from SimpleLruTruncate.

--
Thomas Munro
http://www.enterprisedb.com


Re: 9.4.1 -> 9.4.2 problem: could not access status of transaction 1

From
Thomas Munro
Date:
On Fri, Jun 5, 2015 at 11:47 AM, Thomas Munro
<thomas.munro@enterprisedb.com> wrote:
> On Fri, Jun 5, 2015 at 9:29 AM, Robert Haas <robertmhaas@gmail.com> wrote:
>> Here's a new version with some more fixes and improvements:
>>
>> - SetOffsetVacuumLimit was failing to set MultiXactState->oldestOffset
>> when the oldest offset became known if the now-known value happened to
>> be zero.  Fixed.
>>
>> - SetOffsetVacuumLimit now logs useful information at the DEBUG1
>> level, so that you can see that it's doing what it's supposed to.
>>
>> - TruncateMultiXact now calls DetermineSafeOldestOffset to adjust the
>> offsetStopLimit even if it can't truncate anything.  This seems
>> useless, but it's not, because it may be that the last checkpoint
>> advanced lastCheckpointedOldest from a bogus value (i.e. 1) to a real
>> value, and now we can actually set offsetStopLimit properly.
>>
>> - TruncateMultiXact no longer calls find_multixact_start when there
>> are no remaining multixacts.  This is actually a completely separate
>> bug that goes all the way back to 9.3.0 and can potentially cause
>> TruncateMultiXact to remove every file in pg_multixact/offsets.
>> Restarting the cluster becomes impossible because TrimMultiXact barfs.
>>
>> - TruncateMultiXact now logs a message if the oldest multixact does
>> not precede the earliest one on disk and is not equal to the next
>> multixact and yet does not exist.  The value of the log message is
>> that it discovered the bug mentioned in the previous line, so I think
>> it's earning its keep.
>>
>> With this version, I'm able to see that when you start up a
>> 9.3.latest+this patch with a cluster that has a bogus value of 1 in
>> relminmxid, datminmxid, and the control file, autovacuum vacuums
>> everything in sight, all the values get set back to the right thing,
>> and the next checkpoint enables the member-wraparound guards.  This
>> works with both autovacuum=on and autovacuum=off; the emergency
>> mechanism kicks in as intended.  We'll want to warn people with big
>> databases who upgrade to 9.3.0 - 9.3.4 via pg_upgrade that they may
>> want to pre-vacuum those tables before upgrading to avoid a vacuum
>> storm.  But generally I'm pretty happy with this: forcing those values
>> to get fixed so that we can guard against member-space wraparound
>> seems like the right thing to do.
>>
>> So, to summarize, this patch does the following:
>>
>> - Fixes the failure-to-start problems introduced in 9.4.2 in
>> complicated pg_upgrade scenarios.
>> - Prevents the new calls to find_multixact_start we added in 9.4.2
>> from happening during recovery, where they can only create failure
>> scenarios.  The call in TruncateMultiXact that has been there all
>> along is not eliminated, but now handles failure more gracefully.
>> - Fixes possible incorrect removal of every single
>> pg_multixact/offsets file when no multixacts exist; one file should be
>> kept.
>> - Forces aggressive autovacuuming when the control file's
>> oldestMultiXid doesn't point to a valid MultiXact and enables member
>> wraparound at the next checkpoint following the correction of that
>> problem.
>
> With this patch, when I run the script
> "checkpoint-segment-boundary.sh" from
> http://www.postgresql.org/message-id/CAEepm=1_KbHGbmPVmkUGE5qTP+B4efoCJYS0unGo-Mc5NV=UDg@mail.gmail.com
> I see the following during shutdown checkpoint:
>
> LOG:  could not truncate directory "pg_multixact/offsets": apparent wraparound
>
> That message comes from SimpleLruTruncate.

Suggested patch attached.

--
Thomas Munro
http://www.enterprisedb.com

Attachment

Re: 9.4.1 -> 9.4.2 problem: could not access status of transaction 1

From
Noah Misch
Date:
On Thu, Jun 04, 2015 at 05:29:51PM -0400, Robert Haas wrote:
> Here's a new version with some more fixes and improvements:

I read through this version and found nothing to change.  I encourage other
hackers to study the patch, though.  The surrounding code is challenging.

> With this version, I'm able to see that when you start up a
> 9.3.latest+this patch with a cluster that has a bogus value of 1 in
> relminmxid, datminmxid, and the control file, autovacuum vacuums
> everything in sight, all the values get set back to the right thing,
> and the next checkpoint enables the member-wraparound guards.

Nice.


Re: 9.4.1 -> 9.4.2 problem: could not access status of transaction 1

From
Thomas Munro
Date:
On Fri, Jun 5, 2015 at 1:47 PM, Thomas Munro
<thomas.munro@enterprisedb.com> wrote:
> On Fri, Jun 5, 2015 at 11:47 AM, Thomas Munro
> <thomas.munro@enterprisedb.com> wrote:
>> On Fri, Jun 5, 2015 at 9:29 AM, Robert Haas <robertmhaas@gmail.com> wrote:
>>> Here's a new version with some more fixes and improvements:
>>> [...]
>>
>> With this patch, when I run the script
>> "checkpoint-segment-boundary.sh" from
>> http://www.postgresql.org/message-id/CAEepm=1_KbHGbmPVmkUGE5qTP+B4efoCJYS0unGo-Mc5NV=UDg@mail.gmail.com
>> I see the following during shutdown checkpoint:
>>
>> LOG:  could not truncate directory "pg_multixact/offsets": apparent wraparound
>>
>> That message comes from SimpleLruTruncate.
>
> Suggested patch attached.

Is it a problem that we don't drop/forget page buffers from the
members SLRU (unlike SimpleLruTruncate, which is used for the offsets
SLRU)?

I may be missing something but it seems to me that it isn't, because
(1) CheckPointMultiXact is called to flush any dirty pages to disk
before TruncateMultiXact is called and (2) no pages older than the one
holding the oldest offset should be dirtied after CheckPointMultiXact
runs (member space is 'append only', at least until it is recycled),
so any pages in the SLRU whose underlying file has been truncated
should just naturally fall out of the LRU slots.  So they can't create
problems by being written to disk after the unlink.

--
Thomas Munro
http://www.enterprisedb.com


Re: 9.4.1 -> 9.4.2 problem: could not access status of transaction 1

From
Robert Haas
Date:
On Fri, Jun 5, 2015 at 2:20 AM, Noah Misch <noah@leadboat.com> wrote:
> On Thu, Jun 04, 2015 at 05:29:51PM -0400, Robert Haas wrote:
>> Here's a new version with some more fixes and improvements:
>
> I read through this version and found nothing to change.  I encourage other
> hackers to study the patch, though.  The surrounding code is challenging.

Andres tested this and discovered that my changes to
find_multixact_start() were far more creative than intended.
Committed and back-patched with a trivial fix for that stupidity and a
novel-length explanation of the changes.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: 9.4.1 -> 9.4.2 problem: could not access status of transaction 1

From
Tom Lane
Date:
Robert Haas <robertmhaas@gmail.com> writes:
> On Fri, Jun 5, 2015 at 2:20 AM, Noah Misch <noah@leadboat.com> wrote:
>> I read through this version and found nothing to change.  I encourage other
>> hackers to study the patch, though.  The surrounding code is challenging.

> Andres tested this and discovered that my changes to
> find_multixact_start() were far more creative than intended.
> Committed and back-patched with a trivial fix for that stupidity and a
> novel-length explanation of the changes.

So where are we on this?  Are we ready to schedule a new set of
back-branch releases?  If not, what issues remain to be looked at?

            regards, tom lane


On 2015-06-05 11:43:45 -0400, Tom Lane wrote:
> Robert Haas <robertmhaas@gmail.com> writes:
> > On Fri, Jun 5, 2015 at 2:20 AM, Noah Misch <noah@leadboat.com> wrote:
> >> I read through this version and found nothing to change.  I encourage other
> >> hackers to study the patch, though.  The surrounding code is challenging.
>
> > Andres tested this and discovered that my changes to
> > find_multixact_start() were far more creative than intended.
> > Committed and back-patched with a trivial fix for that stupidity and a
> > novel-length explanation of the changes.
>
> So where are we on this?  Are we ready to schedule a new set of
> back-branch releases?  If not, what issues remain to be looked at?

We're currently still doing bad things while the database is in an
inconsistent state (i.e. read from SLRUs and truncate based on the
results of that). It's quite easy to reproduce base backup startup
failures.

On the other hand, that's not new. And the fix requires, afaics, a new
type of WAL record (issued very infrequently). I'll post a first version
of the patch, rebased ontop of Robert's commit, tonight or tomorrow. I
guess we can then decide what we'd like to do.


On Fri, Jun 5, 2015 at 12:00 PM, Andres Freund <andres@anarazel.de> wrote:
> On 2015-06-05 11:43:45 -0400, Tom Lane wrote:
>> Robert Haas <robertmhaas@gmail.com> writes:
>> > On Fri, Jun 5, 2015 at 2:20 AM, Noah Misch <noah@leadboat.com> wrote:
>> >> I read through this version and found nothing to change.  I encourage other
>> >> hackers to study the patch, though.  The surrounding code is challenging.
>>
>> > Andres tested this and discovered that my changes to
>> > find_multixact_start() were far more creative than intended.
>> > Committed and back-patched with a trivial fix for that stupidity and a
>> > novel-length explanation of the changes.
>>
>> So where are we on this?  Are we ready to schedule a new set of
>> back-branch releases?  If not, what issues remain to be looked at?
>
> We're currently still doing bad things while the database is in an
> inconsistent state (i.e. read from SLRUs and truncate based on the
> results of that). It's quite easy to reproduce base backup startup
> failures.
>
> On the other hand, that's not new. And the fix requires, afaics, a new
> type of WAL record (issued very infrequently). I'll post a first version
> of the patch, rebased ontop of Robert's commit, tonight or tomorrow. I
> guess we can then decide what we'd like to do.

There are at least two other known issues that seem like they should
be fixed before we release:

1. The problem that we might truncate an SLRU members page away when
it's in the buffers, but not drop it from the buffers, leading to a
failure when we try to write it later.

2. Thomas's bug fix for another longstanding but that occurs when you
run his checkpoint-segment-boundary.sh script.

I think we might want to try to fix one or both of those before
cutting a new release.  I'm less sold on the idea of installing
WAL-logging in this minor release.  That probably needs to be done,
but right now we've got stuff that worked in early 9.3.X release and
is now broken, and I'm in favor of fixing that first.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Robert Haas <robertmhaas@gmail.com> writes:
> On Fri, Jun 5, 2015 at 12:00 PM, Andres Freund <andres@anarazel.de> wrote:
>> On 2015-06-05 11:43:45 -0400, Tom Lane wrote:
>>> So where are we on this?  Are we ready to schedule a new set of
>>> back-branch releases?  If not, what issues remain to be looked at?

>> We're currently still doing bad things while the database is in an
>> inconsistent state (i.e. read from SLRUs and truncate based on the
>> results of that). It's quite easy to reproduce base backup startup
>> failures.
>>
>> On the other hand, that's not new. And the fix requires, afaics, a new
>> type of WAL record (issued very infrequently). I'll post a first version
>> of the patch, rebased ontop of Robert's commit, tonight or tomorrow. I
>> guess we can then decide what we'd like to do.

> There are at least two other known issues that seem like they should
> be fixed before we release:

> 1. The problem that we might truncate an SLRU members page away when
> it's in the buffers, but not drop it from the buffers, leading to a
> failure when we try to write it later.

> 2. Thomas's bug fix for another longstanding but that occurs when you
> run his checkpoint-segment-boundary.sh script.

> I think we might want to try to fix one or both of those before
> cutting a new release.  I'm less sold on the idea of installing
> WAL-logging in this minor release.  That probably needs to be done,
> but right now we've got stuff that worked in early 9.3.X release and
> is now broken, and I'm in favor of fixing that first.

Okay, but if we're not committing today to a release wrap on Monday,
I don't see it happening till after PGCon.

            regards, tom lane


Re: [HACKERS] 9.4.1 -> 9.4.2 problem: could not access status of transaction 1

From
Alvaro Herrera
Date:
Tom Lane wrote:
> Robert Haas <robertmhaas@gmail.com> writes:

> > There are at least two other known issues that seem like they should
> > be fixed before we release:
>
> > 1. The problem that we might truncate an SLRU members page away when
> > it's in the buffers, but not drop it from the buffers, leading to a
> > failure when we try to write it later.
>
> > 2. Thomas's bug fix for another longstanding but that occurs when you
> > run his checkpoint-segment-boundary.sh script.
>
> > I think we might want to try to fix one or both of those before
> > cutting a new release.  I'm less sold on the idea of installing
> > WAL-logging in this minor release.  That probably needs to be done,
> > but right now we've got stuff that worked in early 9.3.X release and
> > is now broken, and I'm in favor of fixing that first.
>
> Okay, but if we're not committing today to a release wrap on Monday,
> I don't see it happening till after PGCon.

In that case, I think we should get a release out next week.  The
current situation is rather badly broken and dangerous, and the above
two bugs are nowhere as problematic.  If we can get fixes for these over
the weekend, that would be additional bonus.

--
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


On 2015-06-05 14:33:12 -0400, Tom Lane wrote:
> Robert Haas <robertmhaas@gmail.com> writes:
> > 1. The problem that we might truncate an SLRU members page away when
> > it's in the buffers, but not drop it from the buffers, leading to a
> > failure when we try to write it later.

I've got a fix for this, and about three other issues I found during
development of the new truncation codepath.

I'll commit the fix tomorrow.

> > I think we might want to try to fix one or both of those before
> > cutting a new release.  I'm less sold on the idea of installing
> > WAL-logging in this minor release.  That probably needs to be done,
> > but right now we've got stuff that worked in early 9.3.X release and
> > is now broken, and I'm in favor of fixing that first.

I've implemented this, and so far it removes more code than it
adds. It's imo also a pretty clear win in how understandable the code
is.  The remaining work, besides testing, is primarily going over lots
of comment and updating them. Some of them are outdated by the patch,
and some already were.

Will post tonight, together with the other fixes, after I get back from
climbing.

My gut feeling right now is that it's a significant improvement, and
that it'll be reasonable to include it. But I'd definitely like some
independent testing for it, and I'm not sure if that's doable in time
for the wrap.

> Okay, but if we're not committing today to a release wrap on Monday,
> I don't see it happening till after PGCon.

I wonder if, with all the recent, err, training, we could wrap it on
Tuesday instead. Independent of the truncation rework going in or not,
an additional work day to go over all the changes and do some more
testing would be good from my POV.

Greetings,

Andres Freund


Re: 9.4.1 -> 9.4.2 problem: could not access status of transaction 1

From
Alvaro Herrera
Date:
Robert Haas wrote:
> On Fri, Jun 5, 2015 at 2:20 AM, Noah Misch <noah@leadboat.com> wrote:
> > On Thu, Jun 04, 2015 at 05:29:51PM -0400, Robert Haas wrote:
> >> Here's a new version with some more fixes and improvements:
> >
> > I read through this version and found nothing to change.  I encourage other
> > hackers to study the patch, though.  The surrounding code is challenging.
>
> Andres tested this and discovered that my changes to
> find_multixact_start() were far more creative than intended.
> Committed and back-patched with a trivial fix for that stupidity and a
> novel-length explanation of the changes.

I think novel-length is fine.  The bug itself is pretty complicated, and
so is the solution.  Many thanks for working through this.

FWIW I tested with the (attached) reproducer script(*) for my customer's
problem, and it works fine now where it failed before.  One thing which
surprised me a bit, but in hindsight should have been pretty obvious, is
that the "multixact member protections are fully armed" message is only
printed once the standby gets out of recovery, instead of when it
reaches consistent state or some such earlier point.

(*) Actually the script cheats to get past an issue, which I couldn't
actually figure out, that a file can't be "seeked"; I just do a "touch"
to create an empty file there, which causes the same error situation as
on my customer's log.

--
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Attachment
On Fri, Jun 5, 2015 at 11:47 AM Andres Freund <andres@anarazel.de> wrote:
But I'd definitely like some
independent testing for it, and I'm not sure if that's doable in time
for the wrap.

I'd be happy to test on my database that was broken, for however much that helps. It's a VM so I can easily revert back as needed. I'm just losing track of all the patches, and what's committed and what I need to manually apply :-). I was about to test what's on REL9_4_STABLE. Let me know if I should do this.

Thanks so much everyone.
On Fri, Jun 5, 2015 at 2:47 PM, Andres Freund <andres@anarazel.de> wrote:
> On 2015-06-05 14:33:12 -0400, Tom Lane wrote:
>> Robert Haas <robertmhaas@gmail.com> writes:
>> > 1. The problem that we might truncate an SLRU members page away when
>> > it's in the buffers, but not drop it from the buffers, leading to a
>> > failure when we try to write it later.
>
> I've got a fix for this, and about three other issues I found during
> development of the new truncation codepath.
>
> I'll commit the fix tomorrow.

OK.  Then I think we should release next week, so we get the fixes we
have out before PGCon.  The current situation is not good.

>> > I think we might want to try to fix one or both of those before
>> > cutting a new release.  I'm less sold on the idea of installing
>> > WAL-logging in this minor release.  That probably needs to be done,
>> > but right now we've got stuff that worked in early 9.3.X release and
>> > is now broken, and I'm in favor of fixing that first.
>
> I've implemented this, and so far it removes more code than it
> adds. It's imo also a pretty clear win in how understandable the code
> is.  The remaining work, besides testing, is primarily going over lots
> of comment and updating them. Some of them are outdated by the patch,
> and some already were.
>
> Will post tonight, together with the other fixes, after I get back from
> climbing.
>
> My gut feeling right now is that it's a significant improvement, and
> that it'll be reasonable to include it. But I'd definitely like some
> independent testing for it, and I'm not sure if that's doable in time
> for the wrap.

I think we would be foolish to rush that part into the tree.  We
probably got here in the first place by rushing the last round of
fixes too much; let's try not to double down on that mistake.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


On Fri, Jun 5, 2015 at 2:36 PM, Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
> Tom Lane wrote:
>> Robert Haas <robertmhaas@gmail.com> writes:
>
>> > There are at least two other known issues that seem like they should
>> > be fixed before we release:
>>
>> > 1. The problem that we might truncate an SLRU members page away when
>> > it's in the buffers, but not drop it from the buffers, leading to a
>> > failure when we try to write it later.
>>
>> > 2. Thomas's bug fix for another longstanding but that occurs when you
>> > run his checkpoint-segment-boundary.sh script.
>>
>> > I think we might want to try to fix one or both of those before
>> > cutting a new release.  I'm less sold on the idea of installing
>> > WAL-logging in this minor release.  That probably needs to be done,
>> > but right now we've got stuff that worked in early 9.3.X release and
>> > is now broken, and I'm in favor of fixing that first.
>>
>> Okay, but if we're not committing today to a release wrap on Monday,
>> I don't see it happening till after PGCon.
>
> In that case, I think we should get a release out next week.  The
> current situation is rather badly broken and dangerous, and the above
> two bugs are nowhere as problematic.  If we can get fixes for these over
> the weekend, that would be additional bonus.

Yeah, I think I agree.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


On June 5, 2015 10:02:37 PM GMT+02:00, Robert Haas <robertmhaas@gmail.com> wrote:
>On Fri, Jun 5, 2015 at 2:47 PM, Andres Freund <andres@anarazel.de>
>wrote:
>> On 2015-06-05 14:33:12 -0400, Tom Lane wrote:
>>> Robert Haas <robertmhaas@gmail.com> writes:
>>> > 1. The problem that we might truncate an SLRU members page away
>when
>>> > it's in the buffers, but not drop it from the buffers, leading to
>a
>>> > failure when we try to write it later.
>>
>> I've got a fix for this, and about three other issues I found during
>> development of the new truncation codepath.
>>
>> I'll commit the fix tomorrow.
>
>OK.  Then I think we should release next week, so we get the fixes we
>have out before PGCon.  The current situation is not good.
>
>>> > I think we might want to try to fix one or both of those before
>>> > cutting a new release.  I'm less sold on the idea of installing
>>> > WAL-logging in this minor release.  That probably needs to be
>done,
>>> > but right now we've got stuff that worked in early 9.3.X release
>and
>>> > is now broken, and I'm in favor of fixing that first.
>>
>> I've implemented this, and so far it removes more code than it
>> adds. It's imo also a pretty clear win in how understandable the code
>> is.  The remaining work, besides testing, is primarily going over
>lots
>> of comment and updating them. Some of them are outdated by the patch,
>> and some already were.
>>
>> Will post tonight, together with the other fixes, after I get back
>from
>> climbing.
>>
>> My gut feeling right now is that it's a significant improvement, and
>> that it'll be reasonable to include it. But I'd definitely like some
>> independent testing for it, and I'm not sure if that's doable in time
>> for the wrap.
>
>I think we would be foolish to rush that part into the tree.  We
>probably got here in the first place by rushing the last round of
>fixes too much; let's try not to double down on that mistake.

My problem with that approach is that I think the code has gotten significantly more complex in the least few weeks. I
havevery little trust that the interactions between vacuum, the deferred truncations in the checkpointer, the state
managementin shared memory and recovery are correct. There's just too many non-local subtleties here.  

I don't know what the right thing to do here is.



---
Please excuse brevity and formatting - I am writing this on my mobile phone.


Andres Freund <andres@anarazel.de> writes:
> On June 5, 2015 10:02:37 PM GMT+02:00, Robert Haas <robertmhaas@gmail.com> wrote:
>> I think we would be foolish to rush that part into the tree.  We
>> probably got here in the first place by rushing the last round of
>> fixes too much; let's try not to double down on that mistake.

> My problem with that approach is that I think the code has gotten significantly more complex in the least few weeks.
Ihave very little trust that the interactions between vacuum, the deferred truncations in the checkpointer, the state
managementin shared memory and recovery are correct. There's just too many non-local subtleties here.  

> I don't know what the right thing to do here is.

My gut feeling is that rushing to make a release date is the wrong thing.

If we have confidence that we can ship something on Monday that is
materially more trustworthy than the current releases, then let's aim to
do that; but let's ship only patches we are confident in.  We can do
another set of releases later that incorporate additional fixes.  (As some
wise man once said, there's always another bug.)

If what you're saying is that you don't trust the already-committed patch
very much, then maybe we'd better hold off another couple weeks for more
review and testing.

            regards, tom lane


Re: [HACKERS] 9.4.1 -> 9.4.2 problem: could not access status of transaction 1

From
"Joshua D. Drake"
Date:
On 06/05/2015 01:56 PM, Tom Lane wrote:

> If we have confidence that we can ship something on Monday that is
> materially more trustworthy than the current releases, then let's aim to
> do that; but let's ship only patches we are confident in.  We can do
> another set of releases later that incorporate additional fixes.  (As some
> wise man once said, there's always another bug.)
>
> If what you're saying is that you don't trust the already-committed patch
> very much, then maybe we'd better hold off another couple weeks for more
> review and testing.
>
>             regards, tom lane
>

I believe there are likely quite a few parties willing to help test, if
we knew how?

Sincerely,

jD


--
Command Prompt, Inc. - http://www.commandprompt.com/  503-667-4564
PostgreSQL Centered full stack support, consulting and development.
Announcing "I'm offended" is basically telling the world you can't
control your own emotions, so everyone else should do it for you.


Re: [HACKERS] 9.4.1 -> 9.4.2 problem: could not access status of transaction 1

From
Alvaro Herrera
Date:
Joshua D. Drake wrote:

> I believe there are likely quite a few parties willing to help test, if we
> knew how?

The code involved is related to checkpoints, pg_basebackups that take a
long time to run, and multixact freezing and truncation.  If you can set
up test servers that eat lots of multixacts(*), then have many multixact
freezes and truncations occur, that would probably hit the right spots.
(You can set very frequent freezing by lowering
vacuum_multixact_freeze_min_age and vacuum_multixact_freeze_table_age
settings.  Perhaps changing multixact_freeze_max_age would lead to other
interesting results too.  Truncation occurs during checkpoint, some time
after freezing, so it's probably good that those are frequent too.)

Also, pg_upgrade prior to 9.3.4 is able to produce database with
invalid oldestMulti=1, if you start from a 9.2-or-earlier database that
has already consumed some number of multis.  It would be good to test
starting from those, too, just to make sure the mechanism that deals
with that is good.  There are at least two variations: those that have
nextMulti larger than 65k but less than 2 billion, and those that have
nextMulti closer to 4 billion.  (I think a 9.2 database with nextMulti
less than 65k is uninteresting, because the resulting oldestMulti=1 is
the correct value there.)

(*) Thomas Munro posted a sample program that does that; I believe with
minimal changes you could turn it into infinite looping instead of a
pre-set number of iteration.  Also, perhaps it's possible to come up
with programs that consume multixacts even faster than that, and that
create larger multixacts too.  All variations are useful.

--
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


On Fri, Jun 5, 2015 at 4:40 PM, Andres Freund <andres@anarazel.de> wrote:
>>I think we would be foolish to rush that part into the tree.  We
>>probably got here in the first place by rushing the last round of
>>fixes too much; let's try not to double down on that mistake.
>
> My problem with that approach is that I think the code has gotten significantly more complex in the least few weeks.
Ihave very little trust that the interactions between vacuum, the deferred truncations in the checkpointer, the state
managementin shared memory and recovery are correct. There's just too many non-local subtleties here. 
>
> I don't know what the right thing to do here is.

That may be true, but we don't need to get to perfect to be better
than 9.4.2 and 9.4.3, where some people can't start the database.

I will grant you that, if the patch I committed today introduces some
regression that is even worse, life will suck.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


On 2015-06-05 20:47:33 +0200, Andres Freund wrote:
> On 2015-06-05 14:33:12 -0400, Tom Lane wrote:
> > Robert Haas <robertmhaas@gmail.com> writes:
> > > 1. The problem that we might truncate an SLRU members page away when
> > > it's in the buffers, but not drop it from the buffers, leading to a
> > > failure when we try to write it later.
>
> I've got a fix for this, and about three other issues I found during
> development of the new truncation codepath.
>
> I'll commit the fix tomorrow.

I've looked through multixact.c/slru.c and afaics there currently is, as
observed by Thomas, no codepath that exercises the broken behaviour. Due
to the way checkpoints and SLRU truncation are linked "problematic"
pages will have been flushed beforehand.

I think we should fix this either way as it seems like a bad trap, but
I'd rather commit it after the the next minor releases are out.


On 2015-06-05 16:56:18 -0400, Tom Lane wrote:
> Andres Freund <andres@anarazel.de> writes:
> > On June 5, 2015 10:02:37 PM GMT+02:00, Robert Haas <robertmhaas@gmail.com> wrote:
> >> I think we would be foolish to rush that part into the tree.  We
> >> probably got here in the first place by rushing the last round of
> >> fixes too much; let's try not to double down on that mistake.
>
> > My problem with that approach is that I think the code has gotten significantly more complex in the least few
weeks.I have very little trust that the interactions between vacuum, the deferred truncations in the checkpointer, the
statemanagement in shared memory and recovery are correct. There's just too many non-local subtleties here. 
>
> > I don't know what the right thing to do here is.
>
> My gut feeling is that rushing to make a release date is the wrong thing.
>
> If we have confidence that we can ship something on Monday that is
> materially more trustworthy than the current releases, then let's aim to
> do that; but let's ship only patches we are confident in.  We can do
> another set of releases later that incorporate additional fixes.  (As some
> wise man once said, there's always another bug.)

I've tortured hardware a fair bit with HEAD. So far it looks much better
than 9.4.2+ et al. I've noticed a bunch of, to me at least, new issues:

1) the autovacuum trigger logic isn't perfect yet. I.e. especially with
  autovacuum=off you can get into situations where emergency vacuums
  aren't started when necessary. This is particularly likely to happen
  if either very large multixacts are used, or if the server has been
  shut down while emergency autovacuum where happening. No corruption
  ensues, but it's not easy to get out of.

2) I've managed to corrupt a cluster when a standby performed
  restartpoints less frequently than the master performed
  checkpoints. Because truncations happen in the checkpointer it's not
  that hard to end up with entirely full multixact slrus. This is a
  problem on several fronts. We can IIUC end up truncating away the
  wrong data, and we can be in a bad state upon promotion.  None of that
  is new.

3) It's really confusing that truncation (and thus the limits in shared
  memory) happens in checkpoints. If you hit a limit and manually do all
  the necessary vacuums you'll see a "good" limit in
  pg_database.datminmxid, but you'll still into the error. You manually
  have to force a checkpoint for the truncation to actually
  happen. That's particularly problematic because larger installations,
  where I presume wraparound issues are more likely, often have a large
  checkpoint_timeout setting.

Since none of these are really new, I don't think they should prevent us
from doing a back branch release. While I'm still not convinced we're
better of with 9.4.4 than with 9.4.1, we're certainly better of than
with 9.4.[23] et al.

If we want to go ahead with the release I plan to do a bit more testing
today and tomorrow. If not I'm first going to continue working on fixing
the above.

I've a "good" fix for 1). I'm not 100% sure I'll feel confident with
pushing if we wrap today. I am wondering if we shouldn't at least apply
the portion that unconditionally sends a signal in the ERROR
case. That's still an improvement.


One more thing:
Our testing infrastructure sucks. Without writing C code it's basically
impossible to test wraparounds and such. Even if not particularly useful
for non-devs, I really think we should have functions for creating
burning xids/multixacts in core. Or at least in some extension.


On 2015-06-08 15:15:04 +0200, Andres Freund wrote:
> 1) the autovacuum trigger logic isn't perfect yet. I.e. especially with
>   autovacuum=off you can get into situations where emergency vacuums
>   aren't started when necessary. This is particularly likely to happen
>   if either very large multixacts are used, or if the server has been
>   shut down while emergency autovacuum where happening. No corruption
>   ensues, but it's not easy to get out of.

A first version to address this problem can be found appended to this
email.

Basically it does:
* Whenever more than MULTIXACT_MEMBER_SAFE_THRESHOLD are used, signal
  autovacuum once per members segment
* For both members and offsets, once hitting the hard limits, signal
  autovacuum everytime. Otherwise we loose the information when
  restarting the database, or when autovac is killed. I ran into this a
  bunch of times while testing.

Regards,

Andres

Attachment

Re: [HACKERS] 9.4.1 -> 9.4.2 problem: could not access status of transaction 1

From
Alvaro Herrera
Date:
Andres Freund wrote:

> A first version to address this problem can be found appended to this
> email.
>
> Basically it does:
> * Whenever more than MULTIXACT_MEMBER_SAFE_THRESHOLD are used, signal
>   autovacuum once per members segment
> * For both members and offsets, once hitting the hard limits, signal
>   autovacuum everytime. Otherwise we loose the information when
>   restarting the database, or when autovac is killed. I ran into this a
>   bunch of times while testing.

I might be misreading the code, but PMSIGNAL_START_AUTOVAC_LAUNCHER only
causes things to happen (i.e. a new worker to be started) when
autovacuum is disabled.  If autovacuum is enabled, postmaster receives
the signal and doesn't do anything about it, because the launcher is
already running.  Of course, regularly scheduled autovac workers will
eventually start running, but perhaps this is not good enough.

--
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


On June 8, 2015 7:06:31 PM GMT+02:00, Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
>Andres Freund wrote:
>
>> A first version to address this problem can be found appended to this
>> email.
>>
>> Basically it does:
>> * Whenever more than MULTIXACT_MEMBER_SAFE_THRESHOLD are used, signal
>>   autovacuum once per members segment
>> * For both members and offsets, once hitting the hard limits, signal
>>   autovacuum everytime. Otherwise we loose the information when
>>   restarting the database, or when autovac is killed. I ran into this
>a
>>   bunch of times while testing.
>
>I might be misreading the code, but PMSIGNAL_START_AUTOVAC_LAUNCHER
>only
>causes things to happen (i.e. a new worker to be started) when
>autovacuum is disabled.  If autovacuum is enabled, postmaster receives
>the signal and doesn't do anything about it, because the launcher is
>already running.  Of course, regularly scheduled autovac workers will
>eventually start running, but perhaps this is not good enough.

Well that's just the same for the plain xid precedent? I'd not mind improving further, but that seems like a separate
thing.

Andres

---
Please excuse brevity and formatting - I am writing this on my mobile phone.


Re: [HACKERS] 9.4.1 -> 9.4.2 problem: could not access status of transaction 1

From
Alvaro Herrera
Date:
Andres Freund wrote:
> On June 8, 2015 7:06:31 PM GMT+02:00, Alvaro Herrera <alvherre@2ndquadrant.com> wrote:

> >I might be misreading the code, but PMSIGNAL_START_AUTOVAC_LAUNCHER
> >only causes things to happen (i.e. a new worker to be started) when
> >autovacuum is disabled.  If autovacuum is enabled, postmaster
> >receives the signal and doesn't do anything about it, because the
> >launcher is already running.  Of course, regularly scheduled autovac
> >workers will eventually start running, but perhaps this is not good
> >enough.
>
> Well that's just the same for the plain xid precedent? I'd not mind
> improving further, but that seems like a separate thing.

Sure.  I just concern that we might be putting excessive trust on
emergency workers being launched at a high pace.  With normally
configured systems (naptime=1min) it shouldn't be a problem, but we have
seen systems with naptime set to one hour or so, and those might feel
some pain; and it would get worse the more databases you have, because
people might feel the need to space the autovac runs even more.

(My personal alarm bells go off when I see autovac_naptime=15min or
more, but apparently not everybody sees things that way.)

> ---
> Please excuse brevity and formatting - I am writing this on my mobile phone.

I wonder if these notices are useful at all.

--
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


On Mon, Jun 8, 2015 at 1:23 PM, Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
> Andres Freund wrote:
>> On June 8, 2015 7:06:31 PM GMT+02:00, Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
>> >I might be misreading the code, but PMSIGNAL_START_AUTOVAC_LAUNCHER
>> >only causes things to happen (i.e. a new worker to be started) when
>> >autovacuum is disabled.  If autovacuum is enabled, postmaster
>> >receives the signal and doesn't do anything about it, because the
>> >launcher is already running.  Of course, regularly scheduled autovac
>> >workers will eventually start running, but perhaps this is not good
>> >enough.
>>
>> Well that's just the same for the plain xid precedent? I'd not mind
>> improving further, but that seems like a separate thing.
>
> Sure.  I just concern that we might be putting excessive trust on
> emergency workers being launched at a high pace.  With normally
> configured systems (naptime=1min) it shouldn't be a problem, but we have
> seen systems with naptime set to one hour or so, and those might feel
> some pain; and it would get worse the more databases you have, because
> people might feel the need to space the autovac runs even more.
>
> (My personal alarm bells go off when I see autovac_naptime=15min or
> more, but apparently not everybody sees things that way.)

Uh, I'd echo that sentiment if you did s/15min/1min/

I think Andres's patch is just improving the existing mechanism so
that it's reliable, and you're proposing something notably different
which might be better, but which is really a different proposal
altogether.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


On 2015-06-08 14:23:32 -0300, Alvaro Herrera wrote:
> Sure.  I just concern that we might be putting excessive trust on
> emergency workers being launched at a high pace.

I'm not sure what to do about that. I mean, it'd not be hard to simply
ignore naptime upon wraparound, but I'm not sure that'd be well
received.

> (My personal alarm bells go off when I see autovac_naptime=15min or
> more, but apparently not everybody sees things that way.)

Understandably so. I'd be alarmed at much lower values than that
actually.

> > ---
> > Please excuse brevity and formatting - I am writing this on my mobile phone.
>
> I wonder if these notices are useful at all.

I only know that I'm less annoyed at reading a untrimmed/badly wrapped
email if it's sent from a mobile phone, where it's hard to impossible to
write a well formatted email, than when sent from a full desktop.
That's why I added the notice...

Andres


Re: [HACKERS] 9.4.1 -> 9.4.2 problem: could not access status of transaction 1

From
Alvaro Herrera
Date:
Robert Haas wrote:
> On Mon, Jun 8, 2015 at 1:23 PM, Alvaro Herrera <alvherre@2ndquadrant.com> wrote:

> > (My personal alarm bells go off when I see autovac_naptime=15min or
> > more, but apparently not everybody sees things that way.)
>
> Uh, I'd echo that sentiment if you did s/15min/1min/

Yeah, well, that too I guess.

> I think Andres's patch is just improving the existing mechanism so
> that it's reliable, and you're proposing something notably different
> which might be better, but which is really a different proposal
> altogether.

Fair enough.

--
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


On Mon, Jun 08, 2015 at 03:15:04PM +0200, Andres Freund wrote:
> One more thing:
> Our testing infrastructure sucks. Without writing C code it's basically
> impossible to test wraparounds and such. Even if not particularly useful
> for non-devs, I really think we should have functions for creating
> burning xids/multixacts in core. Or at least in some extension.

+1.  This keeps coming up, so it's worth maintaining a verified and speedy
implementation.


On Wed, Jun 10, 2015 at 7:16 PM, Noah Misch <noah@leadboat.com> wrote:
On Mon, Jun 08, 2015 at 03:15:04PM +0200, Andres Freund wrote:
> One more thing:
> Our testing infrastructure sucks. Without writing C code it's basically
> impossible to test wraparounds and such. Even if not particularly useful
> for non-devs, I really think we should have functions for creating
> burning xids/multixacts in core. Or at least in some extension.

+1.  This keeps coming up, so it's worth maintaining a verified and speedy
implementation.

+1 from me as well.

Also, I've pretty much given up on testing this area myself, because of the issue pointed out here:


I think this is the same issue as part of Andres' point 1.

It is pretty frustrating and futile to test wrap around when the database doesn't live long enough to wrap around under the high-stress conditions.

I had thought that all changes to ShmemVariableCache except nextXid should be WAL logged at the time they occur, not just at the next checkpoint.  But that wouldn't fix the problem, as the change to ShmemVariableCache has to be transactional with the change to pg_database.  So it would have to be WAL logged inside the commit record or any transaction which changes pg_database.

Cheers,

Jeff
Just wanted to report that I rolled back my VM to where it was with 9.4.2 installed and it wouldn't start. I installed 9.4.4 and now it starts up just fine:

> 2015-06-12 16:05:58 PDT [6453]: [1-1] LOG:  database system was shut down at 2015-05-27 13:12:55 PDT
> 2015-06-12 16:05:58 PDT [6453]: [2-1] LOG:  MultiXact member wraparound protections are disabled because oldest checkpointed MultiXact 1 does not exist on disk
> 2015-06-12 16:05:58 PDT [6457]: [1-1] LOG:  autovacuum launcher started
> 2015-06-12 16:05:58 PDT [6452]: [1-1] LOG:  database system is ready to accept connections
>  done
> server started

And this is showing up in my serverlog periodically as the emergency autovacuums are running:

> 2015-06-12 16:13:44 PDT [6454]: [1-1] LOG:  MultiXact member wraparound protections are disabled because oldest checkpointed MultiXact 1 does not exist on disk

**Thank you Robert and all involved for the resolution to this.**

> With the fixes introduced in this release, such a situation will result in immediate emergency autovacuuming until a correct oldestMultiXid value can be determined

Okay, I notice these vacuums are of the "to prevent wraparound" type (like VACUUM FREEZE), that do hold locks preventing ALTER TABLEs and such. Good to know, we'll plan our software updates accordingly.

Is there any risk until these autovacuums finish?

Re: [HACKERS] 9.4.1 -> 9.4.2 problem: could not access status of transaction 1

From
Alvaro Herrera
Date:
Andres Freund wrote:

> A first version to address this problem can be found appended to this
> email.
>
> Basically it does:
> * Whenever more than MULTIXACT_MEMBER_SAFE_THRESHOLD are used, signal
>   autovacuum once per members segment
> * For both members and offsets, once hitting the hard limits, signal
>   autovacuum everytime. Otherwise we loose the information when
>   restarting the database, or when autovac is killed. I ran into this a
>   bunch of times while testing.

Sounds reasonable.

I see another hole in this area.  See do_start_worker() -- there we only
consider the offsets limit to determine a database to be in
almost-wrapped-around state (causing emergency attention).  If the
database in members trouble has no pgstat entry, it might get completely
ignored.  I think the way to close this hole is to
find_multixact_start() in the autovac launcher for the database with the
oldest datminmxid, to determine whether we need to activate emergency
mode for it.  (Maybe instead of having this logic in autovacuum, it
should be a new function that receives database datminmulti and returns
a boolean indicating whether the database is in trouble or not.)

--
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


On Fri, Jun 12, 2015 at 7:27 PM, Steve Kehlet <steve.kehlet@gmail.com> wrote:
> Just wanted to report that I rolled back my VM to where it was with 9.4.2
> installed and it wouldn't start. I installed 9.4.4 and now it starts up just
> fine:
>
>> 2015-06-12 16:05:58 PDT [6453]: [1-1] LOG:  database system was shut down
>> at 2015-05-27 13:12:55 PDT
>> 2015-06-12 16:05:58 PDT [6453]: [2-1] LOG:  MultiXact member wraparound
>> protections are disabled because oldest checkpointed MultiXact 1 does not
>> exist on disk
>> 2015-06-12 16:05:58 PDT [6457]: [1-1] LOG:  autovacuum launcher started
>> 2015-06-12 16:05:58 PDT [6452]: [1-1] LOG:  database system is ready to
>> accept connections
>>  done
>> server started
>
> And this is showing up in my serverlog periodically as the emergency
> autovacuums are running:
>
>> 2015-06-12 16:13:44 PDT [6454]: [1-1] LOG:  MultiXact member wraparound
>> protections are disabled because oldest checkpointed MultiXact 1 does not
>> exist on disk
>
> **Thank you Robert and all involved for the resolution to this.**
>
>> With the fixes introduced in this release, such a situation will result in
>> immediate emergency autovacuuming until a correct oldestMultiXid value can
>> be determined
>
> Okay, I notice these vacuums are of the "to prevent wraparound" type (like
> VACUUM FREEZE), that do hold locks preventing ALTER TABLEs and such. Good to
> know, we'll plan our software updates accordingly.
>
> Is there any risk until these autovacuums finish?

As long as you see only a modest number of files in
pg_multixact/members, you're OK.  But in theory, until that emergency
autovacuuming finishes, there's nothing keeping that directory from
wrapping around.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: [HACKERS] 9.4.1 -> 9.4.2 problem: could not access status of transaction 1

From
Alvaro Herrera
Date:
Alvaro Herrera wrote:

> I see another hole in this area.  See do_start_worker() -- there we only
> consider the offsets limit to determine a database to be in
> almost-wrapped-around state (causing emergency attention).  If the
> database in members trouble has no pgstat entry, it might get completely
> ignored.

For the record -- it was pointed out to me that this was actually fixed
by 53bb309d2.  \o/

--
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: 9.4.1 -> 9.4.2 problem: could not access status of transaction 1

From
Alvaro Herrera
Date:
Thomas Munro wrote:

> Thanks.   As mentioned elsewhere in the thread, I discovered that the
> same problem exists for page boundaries, with a different error
> message.  I've tried the attached repro scripts on 9.3.0, 9.3.5, 9.4.1
> and master with the same results:
>
> FATAL:  could not access status of transaction 2048
> DETAIL:  Could not read from file "pg_multixact/offsets/0000" at
> offset 8192: Undefined error: 0.
>
> FATAL:  could not access status of transaction 131072
> DETAIL:  Could not open file "pg_multixact/offsets/0002": No such file
> or directory.

So I checked this bug against current master, because it's claimed to be
closed.  The first script doesn't emit a message at all; the second
script does emit a message:

LOG:  could not truncate directory "pg_multixact/offsets": apparent wraparound

If you start and stop again, there's no more noise in the logs.  That's
pretty innocuous -- great.

But then I modified your script to do two segments instead of one.  Then
after the second cycle is done, start the server and stop it again.  The
end result is a bit surprising: you end up with no files in
pg_multixact/offsets at all!

I applied Andres' patch to WAL-log truncations and rerun the test, and
the files do not automatically disappear anymore.  So after that patch
we might be fine, although I'd like to actually understand the failure
mode here to see whether it's actually fixed.

--
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Attachment

Re: 9.4.1 -> 9.4.2 problem: could not access status of transaction 1

From
Thomas Munro
Date:
On Wed, Jun 17, 2015 at 6:58 AM, Alvaro Herrera
<alvherre@2ndquadrant.com> wrote:
> Thomas Munro wrote:
>
>> Thanks.   As mentioned elsewhere in the thread, I discovered that the
>> same problem exists for page boundaries, with a different error
>> message.  I've tried the attached repro scripts on 9.3.0, 9.3.5, 9.4.1
>> and master with the same results:
>>
>> FATAL:  could not access status of transaction 2048
>> DETAIL:  Could not read from file "pg_multixact/offsets/0000" at
>> offset 8192: Undefined error: 0.
>>
>> FATAL:  could not access status of transaction 131072
>> DETAIL:  Could not open file "pg_multixact/offsets/0002": No such file
>> or directory.
>
> So I checked this bug against current master, because it's claimed to be
> closed.  The first script doesn't emit a message at all; the second
> script does emit a message:
>
> LOG:  could not truncate directory "pg_multixact/offsets": apparent wraparound
>
> If you start and stop again, there's no more noise in the logs.  That's
> pretty innocuous -- great.

Right, I included a fix for this in
https://commitfest.postgresql.org/5/265/ which handles both
pg_subtrans and pg_multixact, since it was lost in the noise in this
thread...  Hopefully someone can review that.

> But then I modified your script to do two segments instead of one.  Then
> after the second cycle is done, start the server and stop it again.  The
> end result is a bit surprising: you end up with no files in
> pg_multixact/offsets at all!

Ouch.  I see why: latest_page_number gets initialised to a different
value when you restart (computed from oldest multixact ID, whereas
during normal running it remembers the last created page number), so
in this case (next == oldest, next % 2048 == 0), restarting the server
moves latest_page_number forwards by one, so SimpleLruTruncate no
longer bails out with the above error message and it happily deletes
all files.  That is conceptually OK (there are no multixacts, so no
files should be OK), but see below...  Applying the page linked above
prevents this problem (it always keeps at least one multixact and
therefore at least one page and therefore at least one segment,
because it steps back one multixact to avoid boundary problems when
oldest == next).

As for whether it's actually OK to have no files in
pg_multixact/offsets, it seems that if you restart *twice* after
running checkpoint-segment-boundary.sh, you finish up with earliest =
4294965248 in TruncateMultiXact, because this code assumes that there
was at least one file found and then proceeds to assign (-1 * 2048) to
earliest (which is unsigned).

        trunc.earliestExistingPage = -1;
        SlruScanDirectory(MultiXactOffsetCtl,
SlruScanDirCbFindEarliest, &trunc);
        earliest = trunc.earliestExistingPage * MULTIXACT_OFFSETS_PER_PAGE;
        if (earliest < FirstMultiXactId)
                earliest = FirstMultiXactId;

I think this should bail out if earliestExistingPage is still -1 after
the call to SlruScanDirectory.

--
Thomas Munro
http://www.enterprisedb.com