9.2.3 crashes during archive recovery

From
Kyotaro HORIGUCHI
Date:
Hello, 9.2.3 crashes during archive recovery.

This was also corrected at some point on origin/master, along with
another problem fixed by the commit below, if my memory is correct.
But both current HEAD and 9.2.3 crash during archive recovery (not on
standby) due to the 'marking deleted page visible' problem.

http://www.postgresql.org/message-id/50C7612E.5060008@vmware.com

The attached script reproduces the situation.

This could be illustrated as below:

1. StartupXLOG() initializes minRecoveryPoint with
   ControlFile->CheckPoint.redo (virtually) at the start of the
   archive recovery process.

2. The consistency state becomes true just before redo starts.

3. Redoing a certain XLOG_HEAP2_VISIBLE record causes a crash, because
   the page for the visibility map has already been removed by
   smgr_truncate(), which emitted an XLOG_SMGR_TRUNCATE record after
   that point.

   > PANIC: WAL contains references to invalid pages

After all, the consistency point should be postponed until the
XLOG_SMGR_TRUNCATE record.


In this case, the FINAL consistency point is at the
XLOG_SMGR_TRUNCATE record, but the current implementation does not
record the consistency point (checkpoint, commit, or smgr_truncate)
itself, so we cannot predict the final consistency point at the
start of recovery.

Recovery completed successfully with the small, rough patch below.
This allows multiple consistency points, but it also defeats quick
cessation of recovery. (And of course no checking has been done
against replication/promotion yet X-)

> --- a/src/backend/access/transam/xlog.c
> +++ b/src/backend/access/transam/xlog.c
> @@ -6029,7 +6029,9 @@ CheckRecoveryConsistency(void)
>           */
>          XLogCheckInvalidPages();
>  
> -        reachedConsistency = true;
> +        //reachedConsistency = true;
> +        minRecoveryPoint = InvalidXLogRecPtr;
> +
>          ereport(LOG,
>                  (errmsg("consistent recovery state reached at %X/%X",
>                          (uint32) (XLogCtl->lastReplayedEndRecPtr >> 32),

On the other hand, updating the control file on every commit or
smgr_truncate would slow down transactions..


Regards,

-- 
Kyotaro Horiguchi
NTT Open Source Software Center


====== Replay script.
#! /bin/sh
PGDATA=$HOME/pgdata
PGARC=$HOME/pgarc
rm -rf $PGDATA/* $PGARC/*
initdb
cat >> $PGDATA/postgresql.conf <<EOF
wal_level = hot_standby
checkpoint_segments = 300
checkpoint_timeout = 1h
archive_mode = on
archive_command = 'cp %p $PGARC/%f'
EOF
pg_ctl -w start
createdb
psql <<EOF
select version();
create table t (a int);
insert into t values (5);
checkpoint;
vacuum;
delete from t;
vacuum;
EOF
pg_ctl -w stop -m i
cat >> $PGDATA/recovery.conf <<EOF
restore_command='if [ -f $PGARC/%f ]; then cp $PGARC/%f %p; fi'
EOF
postgres
=====



Re: 9.2.3 crashes during archive recovery

From
Heikki Linnakangas
Date:
On 13.02.2013 09:46, Kyotaro HORIGUCHI wrote:
> In this case, the FINAL consistency point is at the
> XLOG_SMGR_TRUNCATE record, but the current implementation does not
> record the consistency point (checkpoint, commit, or smgr_truncate)
> itself, so we cannot predict the final consistency point at the
> start of recovery.

Hmm, what you did was basically:

1. Run server normally.
2. Kill it with "pg_ctl stop -m immediate".
3. Create a recovery.conf file, turning the server into a hot standby.

Without step 3, the server would perform crash recovery, and it would 
work. But because of the recovery.conf file, the server goes into 
archive recovery, and because minRecoveryPoint is not set, it assumes 
that the system is consistent from the start.

Aside from the immediate issue with truncation, the system really isn't 
consistent until the WAL has been replayed far enough, so it shouldn't 
open for hot standby queries. There might be other, later, changes 
already flushed to data files. The system has no way of knowing how far 
it needs to replay the WAL to become consistent.

At least in back-branches, I'd call this a pilot error. You can't turn a 
master into a standby just by creating a recovery.conf file. At least 
not if the master was not shut down cleanly first.

If there's a use case for doing that, maybe we can do something better 
in HEAD. If the control file says that the system was running 
(DB_IN_PRODUCTION), but there is a recovery.conf file, we could do crash 
recovery first, until we reach the end of WAL, and go into archive 
recovery mode after that. We'd recover all the WAL files in pg_xlog as 
far as we can, same as in crash recovery, and only start restoring files 
from the archive once we reach the end of WAL in pg_xlog. At that point, 
we'd also consider the system as consistent, and start up for hot standby.

I'm not sure that's worth the trouble, though. Perhaps it would be 
better to just throw an error if the control file state is 
DB_IN_PRODUCTION and a recovery.conf file exists. The admin can always 
start the server normally first, shut it down cleanly, and then create 
the recovery.conf file.
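
Concretely, the safe sequence would be something like this (a sketch;
the paths and the fast-shutdown mode are just examples):

pg_ctl -D $PGDATA -w start          # let normal crash recovery finish
pg_ctl -D $PGDATA -w stop -m fast   # clean shutdown
cat > $PGDATA/recovery.conf <<EOF
restore_command = 'cp /path/to/archive/%f %p'
EOF
pg_ctl -D $PGDATA -w start          # archive recovery now starts from a clean state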

> On the other hand, updating the control file on every commit or
> smgr_truncate would slow down transactions..

To be precise, we'd need to update the control file on every 
XLogFlush(), like we do during archive recovery. That would indeed be 
unacceptable from a performance point of view. Updating the control file 
that often would also be bad for robustness.
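
Roughly, that would mean something like this on every flush (a
simplified sketch, not actual code; the placement inside XLogFlush()
is the hypothetical part):

    void
    XLogFlush(XLogRecPtr record)
    {
        /* ... one write+fsync of WAL up to 'record', as today ... */

        /* hypothetical extra step: persist the flush point */
        ControlFile->minRecoveryPoint = record;
        UpdateControlFile();    /* a second write+fsync, of pg_control,
                                 * for every commit -- the cost */
    }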

- Heikki



Re: 9.2.3 crashes during archive recovery

From
Tom Lane
Date:
Heikki Linnakangas <hlinnakangas@vmware.com> writes:
> At least in back-branches, I'd call this a pilot error. You can't turn a 
> master into a standby just by creating a recovery.conf file. At least 
> not if the master was not shut down cleanly first.
> ...
> I'm not sure that's worth the trouble, though. Perhaps it would be 
> better to just throw an error if the control file state is 
> DB_IN_PRODUCTION and a recovery.conf file exists.

+1 for that approach, at least until it's clear there's a market for
doing this sort of thing.  I think the error check could be
back-patched, too.
        regards, tom lane



Re: 9.2.3 crashes during archive recovery

From
Simon Riggs
Date:
On 13 February 2013 09:04, Heikki Linnakangas <hlinnakangas@vmware.com> wrote:

> To be precise, we'd need to update the control file on every XLogFlush(),
> like we do during archive recovery. That would indeed be unacceptable from a
> performance point of view. Updating the control file that often would also
> be bad for robustness.

If those arguments make sense, then why don't they apply to recovery as well?

It sounds like we need to look at something better for use during
archive recovery.

-- 
Simon Riggs                   http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services



Re: 9.2.3 crashes during archive recovery

From
Tom Lane
Date:
Simon Riggs <simon@2ndQuadrant.com> writes:
> On 13 February 2013 09:04, Heikki Linnakangas <hlinnakangas@vmware.com> wrote:
>> To be precise, we'd need to update the control file on every XLogFlush(),
>> like we do during archive recovery. That would indeed be unacceptable from a
>> performance point of view. Updating the control file that often would also
>> be bad for robustness.

> If those arguments make sense, then why don't they apply to recovery as well?

In plain old crash recovery, don't the checks on whether to apply WAL
records based on LSN take care of this?
        regards, tom lane



Re: 9.2.3 crashes during archive recovery

From
Heikki Linnakangas
Date:
On 13.02.2013 20:25, Simon Riggs wrote:
> On 13 February 2013 09:04, Heikki Linnakangas<hlinnakangas@vmware.com>  wrote:
>
>> To be precise, we'd need to update the control file on every XLogFlush(),
>> like we do during archive recovery. That would indeed be unacceptable from a
>> performance point of view. Updating the control file that often would also
>> be bad for robustness.
>
> If those arguments make sense, then why don't they apply to recovery as well?

To some degree, they do. The big difference is that during normal 
operation, every commit is XLogFlushed(). During recovery, XLogFlush() 
happens much less frequently - certainly not after replaying each commit 
record.

> It sounds like we need to look at something better for use during
> archive recovery.

Well, no-one's complained about the performance. From a robustness point 
of view, it might be good to keep the minRecoveryPoint value in a 
separate file, for example, to avoid rewriting the control file that 
often. Then again, why fix it when it's not broken.

- Heikki



Re: 9.2.3 crashes during archive recovery

From
Tom Lane
Date:
Heikki Linnakangas <hlinnakangas@vmware.com> writes:
> Well, no-one's complained about the performance. From a robustness point 
> of view, it might be good to keep the minRecoveryPoint value in a 
> separate file, for example, to avoid rewriting the control file that 
> often. Then again, why fix it when it's not broken.

It would only be broken if someone interrupted a crash recovery
mid-flight and tried to establish a recovery stop point before the end
of WAL, no?  Why don't we just forbid that case?  This would either be
the same as, or a small extension of, the pg_control state vs existence
of recovery.conf error check that was just discussed.
        regards, tom lane



Re: 9.2.3 crashes during archive recovery

From
Heikki Linnakangas
Date:
On 13.02.2013 21:21, Tom Lane wrote:
> Heikki Linnakangas<hlinnakangas@vmware.com>  writes:
>> Well, no-one's complained about the performance. From a robustness point
>> of view, it might be good to keep the minRecoveryPoint value in a
>> separate file, for example, to avoid rewriting the control file that
>> often. Then again, why fix it when it's not broken.
>
> It would only be broken if someone interrupted a crash recovery
> mid-flight and tried to establish a recovery stop point before the end
> of WAL, no?  Why don't we just forbid that case? This would either be
> the same as, or a small extension of, the pg_control state vs existence
> of recovery.conf error check that was just discussed.

The problem is when you interrupt archive recovery (kill -9), and 
restart. After restart, the system needs to know how far the WAL was 
replayed before the crash, because it must not open for hot standby 
queries, or allow the database to be started up in master-mode, until 
it's replayed the WAL up to that same point again.

- Heikki



Re: 9.2.3 crashes during archive recovery

From
Heikki Linnakangas
Date:
On 13.02.2013 21:03, Tom Lane wrote:
> Simon Riggs<simon@2ndQuadrant.com>  writes:
>> On 13 February 2013 09:04, Heikki Linnakangas<hlinnakangas@vmware.com>  wrote:
>>> To be precise, we'd need to update the control file on every XLogFlush(),
>>> like we do during archive recovery. That would indeed be unacceptable from a
>>> performance point of view. Updating the control file that often would also
>>> be bad for robustness.
>
>> If those arguments make sense, then why don't they apply to recovery as well?
>
> In plain old crash recovery, don't the checks on whether to apply WAL
> records based on LSN take care of this?

The problem we're trying to solve is determining how much WAL needs to 
be replayed until the database is consistent again. In crash recovery, 
the answer is "all of it". That's why the CRC in the WAL is essential; 
it's required to determine where the WAL ends. But if we had some other 
mechanism, like if we updated minRecoveryPoint after every XLogFlush() 
like Simon suggested, we wouldn't necessarily need the CRC to detect end 
of WAL (not that I'd suggest removing it anyway), and we could throw an 
error if there is a corrupt bit somewhere in the WAL before the true end 
of WAL.

In archive recovery, we can't just say "replay all the WAL", because the 
whole idea of PITR is to not recover all the WAL. So we use 
minRecoveryPoint to keep track of how far the WAL needs to be replayed 
at a minimum, for the database to be consistent.
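
In code terms the gate is roughly this (a simplified sketch of the
check in CheckRecoveryConsistency(), using the names from the snippet
upthread):

    if (!reachedConsistency &&
        minRecoveryPoint <= XLogCtl->lastReplayedEndRecPtr)
    {
        /* Replayed at least up to minRecoveryPoint: the data files
         * are consistent, so hot standby connections may be allowed. */
        XLogCheckInvalidPages();
        reachedConsistency = true;
    }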

- Heikki



Re: 9.2.3 crashes during archive recovery

From
Tom Lane
Date:
Heikki Linnakangas <hlinnakangas@vmware.com> writes:
> On 13.02.2013 21:21, Tom Lane wrote:
>> It would only be broken if someone interrupted a crash recovery
>> mid-flight and tried to establish a recovery stop point before the end
>> of WAL, no?  Why don't we just forbid that case? This would either be
>> the same as, or a small extension of, the pg_control state vs existence
>> of recovery.conf error check that was just discussed.

> The problem is when you interrupt archive recovery (kill -9), and 
> restart. After restart, the system needs to know how far the WAL was 
> replayed before the crash, because it must not open for hot standby 
> queries, or allow the database to be started up in master-mode, until 
> it's replayed the WAL up to that same point again.

Well, archive recovery is a different scenario --- Simon was questioning
whether we need a minRecoveryPoint mechanism in crash recovery, or at
least that's what I thought he asked.
        regards, tom lane



Re: 9.2.3 crashes during archive recovery

From
Heikki Linnakangas
Date:
On 13.02.2013 21:30, Tom Lane wrote:
> Heikki Linnakangas<hlinnakangas@vmware.com>  writes:
>> On 13.02.2013 21:21, Tom Lane wrote:
>>> It would only be broken if someone interrupted a crash recovery
>>> mid-flight and tried to establish a recovery stop point before the end
>>> of WAL, no?  Why don't we just forbid that case? This would either be
>>> the same as, or a small extension of, the pg_control state vs existence
>>> of recovery.conf error check that was just discussed.
>
>> The problem is when you interrupt archive recovery (kill -9), and
>> restart. After restart, the system needs to know how far the WAL was
>> replayed before the crash, because it must not open for hot standby
>> queries, or allow the database to be started up in master-mode, until
>> it's replayed the WAL up to that same point again.
>
> Well, archive recovery is a different scenario --- Simon was questioning
> whether we need a minRecoveryPoint mechanism in crash recovery, or at
> least that's what I thought he asked.

Ah, ok. The short answer to that is "no", because in crash recovery, we 
just replay the WAL all the way to the end. I thought he was questioning 
updating the control file at every XLogFlush() during archive recovery. 
The answer to that is that it's not so bad, because XLogFlush() is 
called so infrequently during recovery.

- Heikki



Re: 9.2.3 crashes during archive recovery

From
Tom Lane
Date:
Heikki Linnakangas <hlinnakangas@vmware.com> writes:
> The problem we're trying to solve is determining how much WAL needs to 
> be replayed until the database is consistent again. In crash recovery, 
> the answer is "all of it". That's why the CRC in the WAL is essential; 
> it's required to determine where the WAL ends. But if we had some other 
> mechanism, like if we updated minRecoveryPoint after every XLogFlush() 
> like Simon suggested, we wouldn't necessarily need the CRC to detect end 
> of WAL (not that I'd suggest removing it anyway), and we could throw an 
> error if there is corrupt bit somewhere in the WAL before the true end 
> of WAL.

Meh.  I think that would be disastrous from both performance and
reliability standpoints.  (Performance because the whole point of WAL is
to commit with only one disk write in one place, and reliability because
of greatly increasing the number of writes to the utterly-critical
pg_control file.)
        regards, tom lane



Re: 9.2.3 crashes during archive recovery

From
Tom Lane
Date:
Heikki Linnakangas <hlinnakangas@vmware.com> writes:
> On 13.02.2013 21:30, Tom Lane wrote:
>> Well, archive recovery is a different scenario --- Simon was questioning
>> whether we need a minRecoveryPoint mechanism in crash recovery, or at
>> least that's what I thought he asked.

> Ah, ok. The short answer to that is "no", because in crash recovery, we 
> just replay the WAL all the way to the end. I thought he was questioning 
> updating the control file at every XLogFlush() during archive recovery. 
> The answer to that is that it's not so bad, because XLogFlush() is 
> called so infrequently during recovery.

Right, and it's not so evil from a reliability standpoint either, partly
because of that and partly because, by definition, this isn't your only
copy of the data.
        regards, tom lane



Re: 9.2.3 crashes during archive recovery

From
Heikki Linnakangas
Date:
On 13.02.2013 17:02, Tom Lane wrote:
> Heikki Linnakangas<hlinnakangas@vmware.com>  writes:
>> At least in back-branches, I'd call this a pilot error. You can't turn a
>> master into a standby just by creating a recovery.conf file. At least
>> not if the master was not shut down cleanly first.
>> ...
>> I'm not sure that's worth the trouble, though. Perhaps it would be
>> better to just throw an error if the control file state is
>> DB_IN_PRODUCTION and a recovery.conf file exists.
>
> +1 for that approach, at least until it's clear there's a market for
> doing this sort of thing.  I think the error check could be
> back-patched, too.

Hmm, I just realized a little problem with that approach. If you take a 
base backup using an atomic filesystem backup from a running server, and 
start archive recovery from that, that's essentially the same thing as 
Kyotaro's test case.

- Heikki



Re: 9.2.3 crashes during archive recovery

From
Simon Riggs
Date:
On 13 February 2013 09:04, Heikki Linnakangas <hlinnakangas@vmware.com> wrote:

> Without step 3, the server would perform crash recovery, and it would work.
> But because of the recovery.conf file, the server goes into archive
> recovery, and because minRecoveryPoint is not set, it assumes that the
> system is consistent from the start.
>
> Aside from the immediate issue with truncation, the system really isn't
> consistent until the WAL has been replayed far enough, so it shouldn't open
> for hot standby queries. There might be other, later, changes already
> flushed to data files. The system has no way of knowing how far it needs to
> replay the WAL to become consistent.
>
> At least in back-branches, I'd call this a pilot error. You can't turn a
> master into a standby just by creating a recovery.conf file. At least not if
> the master was not shut down cleanly first.
>
> If there's a use case for doing that, maybe we can do something better in
> HEAD. If the control file says that the system was running
> (DB_IN_PRODUCTION), but there is a recovery.conf file, we could do crash
> recovery first, until we reach the end of WAL, and go into archive recovery
> mode after that. We'd recover all the WAL files in pg_xlog as far as we can,
> same as in crash recovery, and only start restoring files from the archive
> once we reach the end of WAL in pg_xlog. At that point, we'd also consider
> the system as consistent, and start up for hot standby.
>
> I'm not sure that's worth the trouble, though. Perhaps it would be better to
> just throw an error if the control file state is DB_IN_PRODUCTION and a
> recovery.conf file exists. The admin can always start the server normally
> first, shut it down cleanly, and then create the recovery.conf file.

Now I've read the whole thing...

The problem is that we start up Hot Standby before we hit the min
recovery point because that isn't recorded. For me, the thing to do is
to make the min recovery point == end of WAL when the state is
DB_IN_PRODUCTION. That way we don't need to do any new writes and we
don't risk people seeing inconsistent results if they do this.

But I think that still gives you a timeline problem when putting a
master back into a standby.

-- 
Simon Riggs                   http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services



Re: 9.2.3 crashes during archive recovery

From
Ants Aasma
Date:
On Feb 13, 2013 10:29 PM, "Heikki Linnakangas" <hlinnakangas@vmware.com> wrote:
> Hmm, I just realized a little problem with that approach. If you take a
> base backup using an atomic filesystem backup from a running server, and
> start archive recovery from that, that's essentially the same thing as
> Kyotaro's test case.

Coincidentally I was wondering about the same thing when I was reviewing
our slave provisioning procedures. There didn't seem to be any
communication channel from pg_stop_backup for the slave to know when it
was safe to allow connections.

Maybe there should be some mechanism akin to the backup label to
communicate the minimum recovery point? When the min recovery point file
exists, the value inside it is used; when the recovery point is reached,
the file is removed. pg_basebackup would just append the file to the
archive. Custom backup procedures could also use it to communicate the
necessary WAL location.
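
Something like this, say (file name and format invented purely for
illustration):

$ cat $PGDATA/min_recovery_point
0/3015F20

The backup tool would write the file next to backup_label; recovery
would refuse hot standby connections until replay passes that LSN, and
then remove the file.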
-- 
Ants Aasma

Re: 9.2.3 crashes during archive recovery

From
Fujii Masao
Date:
On Thu, Feb 14, 2013 at 5:15 AM, Heikki Linnakangas
<hlinnakangas@vmware.com> wrote:
> On 13.02.2013 17:02, Tom Lane wrote:
>>
>> Heikki Linnakangas<hlinnakangas@vmware.com>  writes:
>>>
>>> At least in back-branches, I'd call this a pilot error. You can't turn a
>>> master into a standby just by creating a recovery.conf file. At least
>>> not if the master was not shut down cleanly first.
>>> ...
>>> I'm not sure that's worth the trouble, though. Perhaps it would be
>>> better to just throw an error if the control file state is
>>> DB_IN_PRODUCTION and a recovery.conf file exists.
>>
>>
>> +1 for that approach, at least until it's clear there's a market for
>> doing this sort of thing.  I think the error check could be
>> back-patched, too.
>
>
> Hmm, I just realized a little problem with that approach. If you take a base
> backup using an atomic filesystem backup from a running server, and start
> archive recovery from that, that's essentially the same thing as Kyotaro's
> test case.

Yes. And the resource agent for streaming replication in Pacemaker (the
OSS clusterware) is a user of that archive recovery scenario, too. When it
starts up the server, it always creates the recovery.conf and starts the
server as the standby. It cannot start the master directly, IOW the server
is always promoted to the master from the standby. So when it starts up
the server after the server crashes, obviously it executes the same
recovery scenario (i.e., forced archive recovery instead of crash
recovery) as Kyotaro described.

The reason why that resource agent cannot start up the master directly is
that it manages three server states, called Master, Slave and Down. It can
move the server state from Down to Slave, and in the reverse direction.
Also it can move the state from Slave to Master, and in the reverse
direction. But there is no way to move the state between Down and Master
directly. This kind of state transition model is an isolated case in
clusterware, I think.

Regards,

-- 
Fujii Masao



Re: 9.2.3 crashes during archive recovery

From
Fujii Masao
Date:
On Thu, Feb 14, 2013 at 5:52 AM, Simon Riggs <simon@2ndquadrant.com> wrote:
> On 13 February 2013 09:04, Heikki Linnakangas <hlinnakangas@vmware.com> wrote:
>
>> Without step 3, the server would perform crash recovery, and it would work.
>> But because of the recovery.conf file, the server goes into archive
>> recovery, and because minRecoveryPoint is not set, it assumes that the
>> system is consistent from the start.
>>
>> Aside from the immediate issue with truncation, the system really isn't
>> consistent until the WAL has been replayed far enough, so it shouldn't open
>> for hot standby queries. There might be other, later, changes already
>> flushed to data files. The system has no way of knowing how far it needs to
>> replay the WAL to become consistent.
>>
>> At least in back-branches, I'd call this a pilot error. You can't turn a
>> master into a standby just by creating a recovery.conf file. At least not if
>> the master was not shut down cleanly first.
>>
>> If there's a use case for doing that, maybe we can do something better in
>> HEAD. If the control file says that the system was running
>> (DB_IN_PRODUCTION), but there is a recovery.conf file, we could do crash
>> recovery first, until we reach the end of WAL, and go into archive recovery
>> mode after that. We'd recover all the WAL files in pg_xlog as far as we can,
>> same as in crash recovery, and only start restoring files from the archive
>> once we reach the end of WAL in pg_xlog. At that point, we'd also consider
>> the system as consistent, and start up for hot standby.
>>
>> I'm not sure that's worth the trouble, though. Perhaps it would be better to
>> just throw an error if the control file state is DB_IN_PRODUCTION and a
>> recovery.conf file exists. The admin can always start the server normally
>> first, shut it down cleanly, and then create the recovery.conf file.
>
> Now I've read the whole thing...
>
> The problem is that we start up Hot Standby before we hit the min
> recovery point because that isn't recorded. For me, the thing to do is
> to make the min recovery point == end of WAL when the state is
> DB_IN_PRODUCTION. That way we don't need to do any new writes and we
> don't risk people seeing inconsistent results if they do this.

+1

And in the standby case, the min recovery point can be set to the end
of the WAL files located on the standby. IOW, we can regard the database
as consistent once we have replayed all the local WAL files and try to
connect to the master.

Regards,

-- 
Fujii Masao



Re: 9.2.3 crashes during archive recovery

From
Kyotaro HORIGUCHI
Date:
Sorry, I omitted to explain how we found this issue.

In an HA DB cluster consisting of Pacemaker and PostgreSQL, PostgreSQL
is stopped by 'pg_ctl stop -m i' regardless of the situation.

On the other hand, the PostgreSQL RA (Resource Agent) is obliged to
start the master node via the hot standby state because of the
restrictions of Pacemaker's state transitions.

So simply stopping and then starting the master node can fall into
this situation.

> Hmm, I just realized a little problem with that approach. If you take
> a base backup using an atomic filesystem backup from a running server,
> and start archive recovery from that, that's essentially the same
> thing as Kyotaro's test case.

Regards,

-- 
Kyotaro Horiguchi
NTT Open Source Software Center



Re: 9.2.3 crashes during archive recovery

From
Ants Aasma
Date:
On Wed, Feb 13, 2013 at 10:52 PM, Simon Riggs <simon@2ndquadrant.com> wrote:
> The problem is that we start up Hot Standby before we hit the min
> recovery point because that isn't recorded. For me, the thing to do is
> to make the min recovery point == end of WAL when the state is
> DB_IN_PRODUCTION. That way we don't need to do any new writes and we
> don't risk people seeing inconsistent results if they do this.

While this solution would help solve my issue, it assumes that the
correct amount of WAL files are actually there. Currently the docs for
setting up a standby refer to "24.3.4. Recovering Using a Continuous
Archive Backup", and that step recommends emptying the contents of
pg_xlog. If this is chosen as the solution the docs should be adjusted
to recommend using pg_basebackup -x for setting up the standby. As a
related point, pointing standby setup to that section has confused at
least one of my clients. That chapter is rather scarily complicated
compared to what's usually necessary.

Ants Aasma
--
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt
Web: http://www.postgresql-support.de



Re: 9.2.3 crashes during archive recovery

From
Heikki Linnakangas
Date:
On 15.02.2013 13:05, Ants Aasma wrote:
> On Wed, Feb 13, 2013 at 10:52 PM, Simon Riggs<simon@2ndquadrant.com>  wrote:
>> The problem is that we start up Hot Standby before we hit the min
>> recovery point because that isn't recorded. For me, the thing to do is
>> to make the min recovery point == end of WAL when the state is
>> DB_IN_PRODUCTION. That way we don't need to do any new writes and we
>> don't risk people seeing inconsistent results if they do this.
>
> While this solution would help solve my issue, it assumes that the
> correct amount of WAL files are actually there. Currently the docs for
> setting up a standby refer to "24.3.4. Recovering Using a Continuous
> Archive Backup", and that step recommends emptying the contents of
> pg_xlog. If this is chosen as the solution the docs should be adjusted
> to recommend using pg_basebackup -x for setting up the standby.

When the backup is taken using pg_start_backup or pg_basebackup,
minRecoveryPoint is set correctly anyway, and it's OK to clear out
pg_xlog. It's only if you take the backup using an atomic filesystem
snapshot, or just kill -9 the server and take a backup while it's not
running, that we have a problem. In those scenarios, you should not
clear pg_xlog.
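
For example, a standby created like this can be started directly,
since -x includes the WAL needed to reach consistency in the backup
itself (host and user are placeholders):

pg_basebackup -D /path/to/standby -x -h master.example.com -U repluser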

Attached is a patch for git master. The basic idea is to split
InArchiveRecovery into two variables, InArchiveRecovery and
ArchiveRecoveryRequested. ArchiveRecoveryRequested is set when
recovery.conf exists. But if we don't know how far we need to recover,
we first perform crash recovery with InArchiveRecovery=false. When we
reach the end of WAL in pg_xlog, InArchiveRecovery is set, and we
continue with normal archive recovery.
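
In outline, the startup logic becomes something like this (a condensed
sketch of the idea, not the patch itself; recovery_conf_exists stands
in for the recovery.conf check):

    ArchiveRecoveryRequested = recovery_conf_exists;

    if (ArchiveRecoveryRequested && ControlFile->state == DB_IN_PRODUCTION)
        InArchiveRecovery = false;    /* crash-recover from pg_xlog first */
    else
        InArchiveRecovery = ArchiveRecoveryRequested;

    /* ... later, when replay reaches the end of WAL in pg_xlog ... */
    if (ArchiveRecoveryRequested && !InArchiveRecovery)
        InArchiveRecovery = true;     /* now consistent; switch to
                                       * restoring from the archive */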

> As a
> related point, pointing standby setup to that section has confused at
> least one of my clients. That chapter is rather scarily complicated
> compared to what's usually necessary.

Yeah, it probably could use some editing, as the underlying code has
evolved a lot since it was written. The suggestion to clear out pg_xlog
seems like an unnecessary complication. It's safe to do so, if you
restore with an archive, but unnecessary.

The "File System Level Backup" chapter
(http://www.postgresql.org/docs/devel/static/backup-file.html) probably
should mention "pg_basebackup -x", too.

Docs patches are welcome..

- Heikki

Attachment

Re: 9.2.3 crashes during archive recovery

From
Ants Aasma
Date:
On Fri, Feb 15, 2013 at 3:49 PM, Heikki Linnakangas
<hlinnakangas@vmware.com> wrote:
>> While this solution would help solve my issue, it assumes that the
>> correct amount of WAL files are actually there. Currently the docs for
>> setting up a standby refer to "24.3.4. Recovering Using a Continuous
>> Archive Backup", and that step recommends emptying the contents of
>> pg_xlog. If this is chosen as the solution the docs should be adjusted
>> to recommend using pg_basebackup -x for setting up the standby.
>
> When the backup is taken using pg_start_backup or pg_basebackup,
> minRecoveryPoint is set correctly anyway, and it's OK to clear out pg_xlog.

How is minRecoveryPoint supposed to get set? I just tried taking a
pg_basebackup on a master running pgbench. The resulting data dir's
control file had this:

Min recovery ending location:         0/0

The end location was not in the backup_label either.

Looking at basebackup.c the process seems to be:
1. pg_start_backup
2. copy out backup_label
3. copy out rest of the datadir
4. copy out control file
5. pg_stop_backup
6. copy out WAL
7. send backup stop location

And pg_basebackup.c only uses the stop location to decide how much WAL to fetch.

I don't see anything here that could correctly communicate min
recovery point. Maybe I'm missing something.

> Yeah, it probably could use some editing, as the underlying code has evolved
> a lot since it was written. The suggestion to clear out pg_xlog seems like
> an unnecessary complication. It's safe to do so, if you restore with an
> archive, but unnecessary.
>
> The "File System Level Backup" chapter
> (http://www.postgresql.org/docs/devel/static/backup-file.html) probably
> should mention "pg_basebackup -x", too.
>
> Docs patches are welcome..

I will give it a shot.

Ants Aasma
--
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt
Web: http://www.postgresql-support.de



Re: 9.2.3 crashes during archive recovery

From
Heikki Linnakangas
Date:
On 16.02.2013 10:40, Ants Aasma wrote:
> On Fri, Feb 15, 2013 at 3:49 PM, Heikki Linnakangas
> <hlinnakangas@vmware.com>  wrote:
>>> While this solution would help solve my issue, it assumes that the
>>> correct amount of WAL files are actually there. Currently the docs for
>>> setting up a standby refer to "24.3.4. Recovering Using a Continuous
>>> Archive Backup", and that step recommends emptying the contents of
>>> pg_xlog. If this is chosen as the solution the docs should be adjusted
>>> to recommend using pg_basebackup -x for setting up the standby.
>>
>> When the backup is taken using pg_start_backup or pg_basebackup,
>> minRecoveryPoint is set correctly anyway, and it's OK to clear out pg_xlog.
>
> How is minRecoveryPoint supposed to get set?

I was slightly imprecise. minRecoveryPoint is not set at backup, but 
backupStartPoint is. minRecoveryPoint is set later, once the 
end-of-backup record is seen. A valid backupStartPoint has the same 
effect as minRecoveryPoint: the system is considered inconsistent until 
the end-of-backup record is seen.

> I just tried taking a
> pg_basebackup on master running pgbench. The resulting data dir
> controlfile had this:
>
> Min recovery ending location:         0/0
>
> The end location was not in the backup_label either.
>
> Looking at basebackup.c the process seems to be:
> 1. pg_start_backup
> 2. copy out backup_label
> 3. copy out rest of the datadir
> 4. copy out control file
> 5. pg_stop_backup
> 6. copy out WAL
> 7. send backup stop location
>
> And pg_basebackup.c only uses the stop location to decide how much WAL to fetch.
>
> I don't see anything here that could correctly communicate min
> recovery point. Maybe I'm missing something.

backupStartPoint is set, which signals recovery to wait for an 
end-of-backup record, until the system is considered consistent. If the 
backup is taken from a hot standby, backupEndPoint is set, instead of 
inserting an end-of-backup record.
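
Schematically, the consistency test then looks something like this (a
simplification; the field names are from pg_control):

    if (!XLogRecPtrIsInvalid(ControlFile->backupStartPoint))
        consistent = false;     /* wait for the XLOG_BACKUP_END record
                                 * (or for backupEndPoint, when the
                                 * backup was taken from a hot standby) */
    else
        consistent = (minRecoveryPoint <= lastReplayedEndRecPtr);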

- Heikki



Re: 9.2.3 crashes during archive recovery

From
Ants Aasma
Date:
On Mon, Feb 18, 2013 at 8:27 PM, Heikki Linnakangas
<hlinnakangas@vmware.com> wrote:
> backupStartPoint is set, which signals recovery to wait for an end-of-backup
> record, until the system is considered consistent. If the backup is taken
> from a hot standby, backupEndPoint is set, instead of inserting an
> end-of-backup record.

Thank you for explaining this, I can see how it works now. I'll see if
I can document this better so others won't have to go through as much
effort.

Ants Aasma
--
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt
Web: http://www.postgresql-support.de



Re: 9.2.3 crashes during archive recovery

From
Kyotaro HORIGUCHI
Date:
Hello, I looked at this from another point of view.

I consider the current discussion to be based on how to predict
the last consistency point. But there is another aspect of this
issue.

I tried to postpone smgrtruncate after the next checkpoint. This
is similar to what hot standby feedback does to vacuum.  It seems
to be working fine, but I worry that it might also bloat the
table. I haven't found a way to postpone only the targeted
smgrtruncate.

The patch below is an immediate verification patch for this
solution.

- CreateCheckPoint records the oldest xmin at that point. Let's call it 'checkpoint xmin'.

- vacuum skips modifications made by transactions at or after the checkpoint xmin.
What do you think of this?

-- 
Kyotaro Horiguchi
NTT Open Source Software Center


=========
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 479c14d..a274393 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -493,6 +493,7 @@ typedef struct XLogCtlData
 	XLogRecPtr	lastFpwDisableRecPtr;
 
 	slock_t		info_lck;	/* locks shared variables shown above */
+	TransactionId chkptxactid;
 } XLogCtlData;
 
 static XLogCtlData *XLogCtl = NULL;
@@ -676,6 +677,11 @@ static bool read_backup_label(XLogRecPtr *checkPointLoc,
 static void rm_redo_error_callback(void *arg);
 static int	get_sync_bit(int method);
 
+TransactionId
+XLogGetChkptTrId()
+{
+	return XLogCtl->chkptxactid;
+}
 
 /*
  * Insert an XLOG record having the specified RMID and info bytes,
@@ -3875,7 +3881,7 @@ XLOGShmemInit(void)
 	SpinLockInit(&XLogCtl->info_lck);
 	SpinLockInit(&XLogCtl->ulsn_lck);
 	InitSharedLatch(&XLogCtl->recoveryWakeupLatch);
-
+	XLogCtl->chkptxactid = InvalidTransactionId;
 	/*
 	 * If we are not in bootstrap mode, pg_control should already exist. Read
 	 * and validate it immediately (see comments in ReadControlFile() for the
@@ -6700,6 +6706,8 @@ CreateCheckPoint(int flags)
 	else
 		checkPoint.oldestActiveXid = InvalidTransactionId;
 
+	XLogCtl->chkptxactid = GetOldestXmin(true, true);
+
 	/*
 	 * We must hold WALInsertLock while examining insert state to determine
 	 * the checkpoint REDO pointer.
diff --git a/src/backend/commands/vacuum.c b/src/backend/commands/vacuum.c
index 4800b43..79205a0 100644
--- a/src/backend/commands/vacuum.c
+++ b/src/backend/commands/vacuum.c
@@ -374,6 +374,7 @@ get_rel_oids(Oid relid, const RangeVar *vacrel)
 /*
  * vacuum_set_xid_limits() -- compute oldest-Xmin and freeze cutoff points
  */
+extern TransactionId XLogGetChkptTrId(void);
 void
 vacuum_set_xid_limits(int freeze_min_age,
 					  int freeze_table_age,
@@ -397,6 +398,11 @@ vacuum_set_xid_limits(int freeze_min_age,
 	 * always an independent transaction.
 	 */
 	*oldestXmin = GetOldestXmin(sharedRel, true);
 
+	{
+		TransactionId chkpttrid = XLogGetChkptTrId();
+		if (chkpttrid < *oldestXmin)
+			*oldestXmin = chkpttrid;
+	}
 	Assert(TransactionIdIsNormal(*oldestXmin));
==========



Re: 9.2.3 crashes during archive recovery

From
Kyotaro HORIGUCHI
Date:
Sorry, let me correct a bit.

> I tried to postpone smgrtruncate after the next checkpoint. This

I tried to postpone smgrtruncate TO the next checkpoint.

> is similar to what hot standby feedback does to vacuum.  It seems
> to be working fine, but I worry that it might also bloat the
> table. I haven't found a way to postpone only the targeted
> smgrtruncate.
> 
> The patch below is an immediate verification patch for this
> solution.
> 
> - CreateCheckPoint records the oldest xmin at that point. Let's
>   call it 'checkpoint xmin'.
> 
> - vacuum skips modifications made by transactions at or after the
>   checkpoint xmin.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center



Re: 9.2.3 crashes during archive recovery

From
Heikki Linnakangas
Date:
On 20.02.2013 10:01, Kyotaro HORIGUCHI wrote:
> Sorry, Let me correct a bit.
>
>> I tried to postpone smgrtruncate after the next checkpoint. This
>
> I tried to postpone smgrtruncate TO the next checkpoint.

Umm, why? I don't understand this patch at all.

- Heikki



Re: 9.2.3 crashes during archive recovery

From
Heikki Linnakangas
Date:
On 15.02.2013 15:49, Heikki Linnakangas wrote:
> Attached is a patch for git master. The basic idea is to split
> InArchiveRecovery into two variables, InArchiveRecovery and
> ArchiveRecoveryRequested. ArchiveRecoveryRequested is set when
> recovery.conf exists. But if we don't know how far we need to recover,
> we first perform crash recovery with InArchiveRecovery=false. When we
> reach the end of WAL in pg_xlog, InArchiveRecovery is set, and we
> continue with normal archive recovery.

New version of this attached, with a few bugs fixed.

I'm thinking that this should be back-patched to 9.2, but not to earlier
branches. Before 9.2, we don't PANIC at a reference to a non-existent
page until end of recovery, even if we've already reached consistency.
The same basic issue still exists in earlier versions, though: if you
have hot_standby=on, the system will open for read-only queries too
early, before the database is consistent. But this patch is invasive
enough that I'm wary of back-patching it further, when the worst that
can happen is that there's a small window right after startup when you
can see an inconsistent database in hot standby mode. Maybe after we get
some more testing of this in 9.2 and master. Opinions on that?

- Heikki

Attachment

Re: 9.2.3 crashes during archive recovery

From
Michael Paquier
Date:


On Thu, Feb 21, 2013 at 11:09 PM, Heikki Linnakangas
<hlinnakangas@vmware.com> wrote:
> On 15.02.2013 15:49, Heikki Linnakangas wrote:
>> Attached is a patch for git master. The basic idea is to split
>> InArchiveRecovery into two variables, InArchiveRecovery and
>> ArchiveRecoveryRequested. ArchiveRecoveryRequested is set when
>> recovery.conf exists. But if we don't know how far we need to recover,
>> we first perform crash recovery with InArchiveRecovery=false. When we
>> reach the end of WAL in pg_xlog, InArchiveRecovery is set, and we
>> continue with normal archive recovery.
>
> New version of this attached, with a few bugs fixed.
>
> I'm thinking that this should be back-patched to 9.2, but not to earlier
> branches. Before 9.2, we don't PANIC at a reference to a non-existent
> page until end of recovery, even if we've already reached consistency.
> The same basic issue still exists in earlier versions, though: if you
> have hot_standby=on, the system will open for read-only queries too
> early, before the database is consistent. But this patch is invasive
> enough that I'm wary of back-patching it further, when the worst that
> can happen is that there's a small window right after startup when you
> can see an inconsistent database in hot standby mode. Maybe after we
> get some more testing of this in 9.2 and master. Opinions on that?

People have not yet complained about this problem with versions prior to
9.1. Is it worth backpatching in this case?
--
Michael

Re: 9.2.3 crashes during archive recovery

From
Heikki Linnakangas
Date:
On 22.02.2013 02:13, Michael Paquier wrote:
> On Thu, Feb 21, 2013 at 11:09 PM, Heikki Linnakangas<
> hlinnakangas@vmware.com>  wrote:
>
>> On 15.02.2013 15:49, Heikki Linnakangas wrote:
>>
>>> Attached is a patch for git master. The basic idea is to split
>>> InArchiveRecovery into two variables, InArchiveRecovery and
>>> ArchiveRecoveryRequested. ArchiveRecoveryRequested is set when
>>> recovery.conf exists. But if we don't know how far we need to recover,
>>> we first perform crash recovery with InArchiveRecovery=false. When we
>>> reach the end of WAL in pg_xlog, InArchiveRecovery is set, and we
>>> continue with normal archive recovery.
>>>
>>
>> New version of this attached, with a few bugs fixed.
>>
>> I'm thinking that this should be back-patched to 9.2, but not to earlier
>> branches. Before 9.2, we don't PANIC at a reference to a non-existent page
>> until end of recovery, even if we've already reached consistency. The same
>> basic issue still exists in earlier versions, though: if you have
>> hot_standby=on, the system will open for read-only queries too early,
>> before the database is consistent. But this patch is invasive enough that
>> I'm wary of back-patching it further, when the worst that can happen is
>> that there's a small window right after startup when you can see an
>> inconsistent database in hot standby mode. Maybe after we get some more
>> testing of this in 9.2 and master. Opinions on that?
>>
> People have not yet complained about this problem with versions prior to
> 9.1. Is it worth backpatching in this case?

Possibly not..

Anyway, I've committed this to master and 9.2 now.

- Heikki



Re: 9.2.3 crashes during archive recovery

From
Heikki Linnakangas
Date:
On 14.02.2013 19:18, Fujii Masao wrote:
> Yes. And the resource agent for streaming replication in Pacemaker (the
> OSS clusterware) is a user of that archive recovery scenario, too. When it
> starts up the server, it always creates the recovery.conf and starts the
> server as the standby. It cannot start the master directly, IOW the server
> is always promoted to the master from the standby. So when it starts up
> the server after the server crashes, obviously it executes the same
> recovery scenario (i.e., forced archive recovery instead of crash
> recovery) as Kyotaro described.
>
> The reason why that resource agent cannot start up the master directly is
> that it manages three server states, called Master, Slave and Down. It can
> move the server state from Down to Slave, and in the reverse direction.
> Also it can move the state from Slave to Master, and in the reverse
> direction. But there is no way to move the state between Down and Master
> directly. This kind of state transition model is an isolated case in
> clusterware, I think.

I don't have much sympathy for that to be honest. Seems like something 
that should be fixed in Pacemaker or the scripts used to glue it with 
PostgreSQL. However, this patch should make that work, so I guess 
everyone is happy.

- Heikki



Re: 9.2.3 crashes during archive recovery

From
Heikki Linnakangas
Date:
On 15.02.2013 10:33, Kyotaro HORIGUCHI wrote:
> Sorry, I omitted to show how we found this issue.
>
> In an HA DB cluster consisting of Pacemaker and PostgreSQL, PostgreSQL
> is stopped by 'pg_ctl stop -m i' regardless of the situation.

That seems like a bad idea. If nothing else, crash recovery can take a 
long time. I don't know much about Pacemaker, but wouldn't it make more 
sense to at least try fast shutdown first, falling back to immediate 
shutdown after a timeout?

- Heikki



Re: 9.2.3 crashes during archive recovery

From
Kyotaro HORIGUCHI
Date:
Hello, 

> Anyway, I've committed this to master and 9.2 now.

This seems to fix the issue. We'll examine this further.

Thank you.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center



Re: 9.2.3 crashes during archive recovery

From
Kyotaro HORIGUCHI
Date:
At Fri, 22 Feb 2013 11:42:39 +0200, Heikki Linnakangas <hlinnakangas@vmware.com> wrote in
<51273D8F.7060609@vmware.com>
> On 15.02.2013 10:33, Kyotaro HORIGUCHI wrote:
> > In an HA DB cluster consisting of Pacemaker and PostgreSQL, PostgreSQL
> > is stopped by 'pg_ctl stop -m i' regardless of the situation.
> 
> That seems like a bad idea. If nothing else, crash recovery can take a
> long time. I don't know much about Pacemaker, but wouldn't it make
> more sense to at least try fast shutdown first, falling back to
> immediate shutdown after a timeout?

I completely agree with you. But I hear that they want to shut
down as fast as possible on failure and also don't want more
states in the RA.. On the other hand, the result of this failure
is catastrophic, so this should be fixed regardless of what
Pacemaker does.

regards,
-- 
Kyotaro Horiguchi
NTT Open Source Software Center



Re: 9.2.3 crashes during archive recovery

From
Kyotaro HORIGUCHI
Date:
Though this has become moot, I want to explain how this patch
works.

> > I tried to postpone smgrtruncate TO the next checkpoint.
> 
> Umm, why? I don't understand this patch at all.

This inhibits truncating files after (quite vaguely in the patch :-)
the previous checkpoint, by hiding the deleted tuples from vacuum
and thereby keeping the vm pages from being truncated. They are
truncated only if no tuples have been deleted on the corresponding
pages since the last checkpoint. As a result, any nonexistent pages
in the visibility map won't be touched after the consistency point -
which is after the last checkpoint - during archive recovery.

# Umm.. sorry for the confused explanation...

regards,

-- 
Kyotaro Horiguchi
NTT Open Source Software Center



Re: 9.2.3 crashes during archive recovery

From
Josh Berkus
Date:
Folks,

Is there any way this particular issue could cause data corruption
without causing a crash?  I don't see a way for it to do so, but I
wanted to verify.


-- 
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com



Re: 9.2.3 crashes during archive recovery

From
Kyotaro HORIGUCHI
Date:
This is an interim report on this patch.

We found that PostgreSQL with this patch unexpectedly becomes
the primary when starting up as a standby. We'll do further
investigation of the behavior.

> > Anyway, I've committed this to master and 9.2 now.
> 
> This seems to fix the issue. We'll examine this further.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center



Re: 9.2.3 crashes during archive recovery

From
Kyotaro HORIGUCHI
Date:
Hello, I could reproduce the behavior and think I understand the cause.

The head of origin/REL9_2_STABLE shows the behavior I mentioned in
the last message when using the attached shell script. 9.3dev
runs as expected.

In XLogPageRead, when RecPtr goes beyond the last page, the
current xlog file is released and a new page is requested.

The variables were as below at that point.

 StandbyRequested == true
 StandbyMode == false
 ArchiveRecoveryRequested == true
 InArchiveRecovery == false

In this case, XLogPageRead immediately returns NULL before trying
to get xlogs via streaming or from the archive. So ReadRecord
returns NULL, then unexpectedly exits the 'main redo apply loop' and
increases the timeline ID as if it had been promoted.

This seems fixed by letting it try all requested sources. The
attached patch does that, and the test script runs as expected.

> We found that PostgreSQL with this patch unexpectedly becomes
> the primary when starting up as a standby. We'll do further
> investigation of the behavior.
> 
> > > Anyway, I've committed this to master and 9.2 now.
> > 
> > This seems to fix the issue. We'll examine this further.

regards,

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
#! /bin/sh
version="abf5c5"
version="924b"
killall -9 postgres
source pgsetpath $version 0
rm -rf $PGDATA/* $PGARC/*
PGDATA0=$PGDATA
PGPORT0=$PGPORT
initdb -D $PGDATA0
cat >> $PGDATA0/postgresql.conf <<EOF
wal_level = hot_standby
checkpoint_segments = 300
checkpoint_timeout = 1h
archive_mode = on
archive_command = 'cp %p $PGARC/%f'
max_wal_senders = 3
hot_standby = on
EOF
cat >> $PGDATA0/pg_hba.conf <<EOF
local   replication     horiguti                                trust
EOF
echo ## Startup master
pg_ctl -D $PGDATA0 -w start
source pgsetpath $version 1 -p 5433
PGDATA1=$PGDATA
PGPORT1=$PGPORT
rm -rf $PGDATA/* $PGARC/*
echo "## basebackup"
pg_basebackup -h /tmp -p $PGPORT0 -F p -X s -D $PGDATA1
chmod 700 $PGDATA1
cat >> $PGDATA1/recovery.conf <<EOF
standby_mode = yes
primary_conninfo='host=/tmp port=5432'
restore_command='a=$PGARC; if [ -f \$a/%f ]; then cp \$a/%f %p; else exit 1; fi'
#restore_command='a=$PGARC; if [ ! -d \$a ]; then echo Archive directory \$a is not found.; exit 1; elif [ -f \$a/%f ]; then cp \$a/%f %p; else exit 1; fi'
EOF

echo "## Startup standby"
pg_ctl -D $PGDATA1 start
echo "## Sleep for 5 seconds"
sleep 5

echo "## Shutdown standby"
pg_ctl -D $PGDATA1 -w stop -m f

echo "## Shutdown master in immediate mode"
pg_ctl -D $PGDATA0 -w stop -m i

cat >> $PGDATA0/recovery.conf <<EOF
standby_mode = yes
primary_conninfo='host=/tmp port=5433'
restore_command='a=$PGARC; if [ -f \$a/%f ]; then cp \$a/%f %p; else exit 1; fi'
EOF

echo "## Starting master as a standby"
if [ "$1" == "w" ]; then touch /tmp/xlogwait; fi
PGPORT=5432 pg_ctl -D $PGDATA0  start
#psql postgres -c "select pg_is_in_recovery();"
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 92adc4e..00b5bc5 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -10604,7 +10604,19 @@ retry:
 											 sources);
 				switched_segment = true;
 				if (readFile < 0)
+				{
+					if (!InArchiveRecovery && ArchiveRecoveryRequested)
+					{
+						InArchiveRecovery = true;
+						goto retry;
+					}
+					else if (!StandbyMode && StandbyModeRequested)
+					{
+						StandbyMode = true;
+						goto retry;
+					}
 					return false;
+				}
 			}
 		}
 	}

Re: 9.2.3 crashes during archive recovery

From
Kyotaro HORIGUCHI
Date:
Sorry, I sent the wrong script.

> The head of origin/REL9_2_STABLE shows the behavior I mentioned in
> the last message when using the attached shell script. 9.3dev
> runs as expected.

regards,

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
#! /bin/sh
pgpath="$HOME/bin/pgsql_924b"
echo $PATH | grep $pgpath | wc -l
if [ `echo $PATH | grep $pgpath | wc -l` == 0 ]; then
    PATH=$pgpath/bin:$PATH
fi

PGDATA0="$HOME/data/pgdata_0"
PGDATA1="$HOME/data/pgdata_1"
PGARC0="$HOME/data/pgarc_0"
PGARC1="$HOME/data/pgarc_1"
PGPORT0=5432
PGPORT1=5433
unset PGPORT
unset PGDATA
echo "Postgresql is \"`which postgres`\""
killall -9 postgres
rm -rf $PGDATA0/* $PGARC0/*
initdb -D $PGDATA0
cat >> $PGDATA0/postgresql.conf <<EOF
port=5432
wal_level = hot_standby
checkpoint_segments = 300
checkpoint_timeout = 1h
archive_mode = on
archive_command = 'cp %p $PGARC0/%f'
max_wal_senders = 3
hot_standby = on
#log_min_messages = debug5
EOF
cat >> $PGDATA0/pg_hba.conf <<EOF
local   replication     horiguti                                trust
EOF
echo ## Startup master
pg_ctl -D $PGDATA0 -w  -o "-p $PGPORT0" start

rm -rf $PGDATA1/* $PGARC1/*
echo "## basebackup"
pg_basebackup -h /tmp -p $PGPORT0 -F p -X s -D $PGDATA1
chmod 700 $PGDATA1
cat >> $PGDATA1/recovery.conf <<EOF
standby_mode = yes
primary_conninfo='host=/tmp port=5432'
restore_command='a=$PGARC1; if [ -f \$a/%f ]; then cp \$a/%f %p; else exit 1; fi'
EOF

echo "## Startup standby"
pg_ctl -D $PGDATA1 -o "-p $PGPORT1" start
echo "## Sleep for 5 seconds"
sleep 5

echo "## Shutdown standby"
pg_ctl -D $PGDATA1 -w stop -m f

echo "## Shutdown master in immediate mode"
pg_ctl -D $PGDATA0 -w stop -m i

cat >> $PGDATA0/recovery.conf <<EOF
standby_mode = yes
primary_conninfo='host=/tmp port=5433'
restore_command='a=$PGARC0; if [ -f \$a/%f ]; then cp \$a/%f %p; else exit 1; fi'
EOF

echo "## Starting master as a standby"
pg_ctl -D $PGDATA0  start

Re: 9.2.3 crashes during archive recovery

From
KONDO Mitsumasa
Date:
Hi,

Horiguchi's patch does not seem to record minRecoveryPoint in ReadRecord();
the attached patch records minRecoveryPoint.
[crash recovery -> record minRecoveryPoint in control file -> archive recovery]
I think that this is the original intention of Heikki's patch.

I also found a bug in the latest 9.2_stable. It does not get the latest
timeline and recovery history file in archive recovery when the master and
standby timelines are different.

Best regards,

(2013/03/05 18:22), Kyotaro HORIGUCHI wrote:
> Hello, I could reproduce the behavior and think I understand the cause.
>
> The head of origin/REL9_2_STABLE shows the behavior I mentioned in
> the last message when using the attached shell script. 9.3dev
> runs as expected.
>
> In XLogPageRead, when RecPtr goes beyond the last page, the
> current xlog file is released and a new page is requested.
>
> The variables were as below at that point.
>
>    StandbyRequested == true
>    StandbyMode == false
>    ArchiveRecoveryRequested == true
>    InArchiveRecovery == false
>
> In this case, XLogPageRead immediately returns NULL before trying
> to get xlogs via streaming or from the archive. So ReadRecord
> returns NULL, then unexpectedly exits the 'main redo apply loop' and
> increases the timeline ID as if it had been promoted.
>
> This seems fixed by letting it try all requested sources. The
> attached patch does that, and the test script runs as expected.
>
>> We found that PostgreSQL with this patch unexpectedly becomes
>> the primary when starting up as a standby. We'll do further
>> investigation of the behavior.
>>
>>>> Anyway, I've committed this to master and 9.2 now.
>>>
>>> This seems to fix the issue. We'll examine this further.
>
> regards,
>
>
>
>
--
Mitsumasa KONDO
NTT OSS Center

Attachment

Re: 9.2.3 crashes during archive recovery

From
Kyotaro HORIGUCHI
Date:
Hmm..

> Horiguchi's patch does not seem to record minRecoveryPoint in
> ReadRecord();
> the attached patch records minRecoveryPoint.
> [crash recovery -> record minRecoveryPoint in control file -> archive
> recovery]
> I think that this is the original intention of Heikki's patch.

It could be. Before that, my patch did not wake up hot standby
because I didn't take care of minRecoveryPoint in it :-p

On the other hand, your patch fixes that point, but ReadRecord
then runs on the bogus page data. Although the wrong record on the
bogus page would be identified as broken, that is still an
undesirable behavior.

If we want to show the final solution by ourselves, we need to
make another patch that settles them all. Let me give it further
thought..

regards,

-- 
Kyotaro Horiguchi
NTT Open Source Software Center



Re: 9.2.3 crashes during archive recovery

From
Kyotaro HORIGUCHI
Date:
Hi, I suppose the attached patch is close to the solution.

> I think that this is an original intention of Heikki's patch.

I noticed that archive recovery will be turned on in
next_record_is_invalid thanks to your patch.

> On the other hand, your patch fixes that point, but ReadRecord
> then runs on bogus page data. Even though a wrong record on a
> bogus page would be identified as broken, that is still
> undesirable behavior.

All we need to do to update minRecoveryPoint when XLogPageRead fails
in ReadRecord is to simply jump to next_record_is_invalid if archive
recovery was requested but we are still doing crash recovery.

Your modification to readTimeLineHistory and my modifications to
XLogPageRead do not seem necessary for the objective of this thread,
so I removed them.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 92adc4e..28d6f2e 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -4010,7 +4010,15 @@ ReadRecord(XLogRecPtr *RecPtr, int emode, bool fetching_ckpt)
 retry:
     /* Read the page containing the record */
     if (!XLogPageRead(RecPtr, emode, fetching_ckpt, randAccess))
+    {
+        /*
+         * If archive recovery was requested when crash recovery failed, go to
+         * the label next_record_is_invalid to switch to archive recovery.
+         */
+        if (!InArchiveRecovery && ArchiveRecoveryRequested)
+            goto next_record_is_invalid;
         return NULL;
+    }
 
     pageHeaderSize = XLogPageHeaderSize((XLogPageHeader) readBuf);
     targetRecOff = RecPtr->xrecoff % XLOG_BLCKSZ;

Re: 9.2.3 crashes during archive recovery

From
Heikki Linnakangas
Date:
On 05.03.2013 14:09, KONDO Mitsumasa wrote:
> Hi,
>
> Horiguchi's patch does not seem to record minRecoveryPoint in ReadRecord();
> the attached patch records minRecoveryPoint:
> [crash recovery -> record minRecoveryPoint in control file -> archive
> recovery]
> I think that this is the original intention of Heikki's patch.

Yeah. That fix isn't right, though; XLogPageRead() is supposed to return
true on success, and false on error, and the patch makes it return
'true' on error, if archive recovery was requested but we're still in
crash recovery. The real issue here is that I missed the two "return
NULL;"s in ReadRecord(), so the code that I put in the
next_record_is_invalid codepath isn't run if XLogPageRead() doesn't find
the file at all. Attached patch is the proper fix for this.
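
Roughly, the fix makes the failure path go through next_record_is_invalid
instead of returning immediately; a condensed sketch of the intent (not
the committed diff):

    /* in ReadRecord() */
    if (!XLogPageRead(RecPtr, emode, fetching_ckpt, randAccess))
        goto next_record_is_invalid;    /* was: return NULL; */
    ...
next_record_is_invalid:
    failedSources |= readSource;
    if (readFile >= 0)
    {
        close(readFile);
        readFile = -1;
    }
    /*
     * If archive recovery was requested but we are still in crash
     * recovery, switch to archive recovery and retry; this is the code
     * the early returns used to skip.
     */
    if (!InArchiveRecovery && ArchiveRecoveryRequested)
    {
        InArchiveRecovery = true;
        goto retry;
    }
    /* In standby mode, keep retrying the sources; otherwise give up. */
    if (StandbyMode)
        goto retry;
    return NULL;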

> I also found a bug in the latest 9.2_stable: it does not fetch the latest
> timeline and recovery history file during archive recovery when the
> master and standby timelines differ.

Works for me.. Can you create a test script for that? Remember to set
"recovery_target_timeline='latest'".

- Heikki

Attachment

Re: 9.2.3 crashes during archive recovery

From
KONDO Mitsumasa
Date:
(2013/03/06 16:50), Heikki Linnakangas wrote:
>> Hi,
>>
>> Horiguchi's patch does not seem to record minRecoveryPoint in ReadRecord();
>> the attached patch records minRecoveryPoint:
>> [crash recovery -> record minRecoveryPoint in control file -> archive
>> recovery]
>> I think that this is the original intention of Heikki's patch.
>
> Yeah. That fix isn't right, though; XLogPageRead() is supposed to return
> true on success, and false on error, and the patch makes it return 'true'
> on error, if archive recovery was requested but we're still in crash
> recovery. The real issue here is that I missed the two "return NULL;"s in
> ReadRecord(), so the code that I put in the next_record_is_invalid
> codepath isn't run if XLogPageRead() doesn't find the file at all.
> Attached patch is the proper fix for this.
>
Thanks for creating the patch! I tested your patch on 9.2_STABLE, but it
does not work with the promote command...
When XLogPageRead() returns false, it ends the standby loop, the crash
recovery loop, and the archive recovery loop.
Your patch is not good for promoting a standby to master: the server never
leaves the standby loop.
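
(The standby loop I mean has roughly this shape -- a simplified sketch of
9.2's XLogPageRead() retry logic, not the exact code; TryNextWALSource is
a hypothetical helper standing in for the streaming/archive/pg_xlog
attempts, while CheckForStandbyTrigger() is the real promote check:)

    for (;;)
    {
        /* Try streaming replication, then the archive, then pg_xlog. */
        if (TryNextWALSource())         /* hypothetical helper */
            return true;                /* got the page, continue replay */

        /*
         * Promotion should be the only way out of this loop; an
         * unconditional "return false" ends recovery prematurely.
         */
        if (CheckForStandbyTrigger())
            return false;

        pg_usleep(5000000L);            /* wait 5 seconds, then retry */
    }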

So I made a new patch based on Heikki's and Horiguchi's patches.
I also attach a test script, modified from Horiguchi's script. It does not
depend on the shell environment; only PGPATH needs to be adjusted.
Please run this test script.


>> I also found a bug in the latest 9.2_stable: it does not fetch the
>> latest timeline and recovery history file during archive recovery when
>> the master and standby timelines differ.
>
> Works for me.. Can you create a test script for that? Remember to set "recovery_target_timeline='latest'".
I set recovery_target_timeline=latest. hmm...

Here is my recovery.conf.
> [mitsu-ko@localhost postgresql]$ cat Standby/recovery.conf
> standby_mode = 'yes'
> recovery_target_timeline='latest'
> primary_conninfo='host=localhost port=65432'
> restore_command='cp ../arc/%f %p'
And here is my system's log message:
> waiting for server to start....[Standby] LOG:  database system was shut down in recovery at 2013-03-07 02:56:05 EST
> [Standby] LOG:  restored log file "00000002.history" from archive
> cp: cannot stat `../arc/00000003.history': No such file or directory
> [Standby] FATAL:  requested timeline 2 is not a child of database system timeline 1
> [Standby] LOG:  startup process (PID 20941) exited with exit code 1
> [Standby] LOG:  aborting startup due to startup process failure
It can be reproduced in my test script, too.
The last master start command in my test script might seem unusual, but it
is typical of a PostgreSQL-with-Pacemaker system.


Best regards,
--
Mitsumasa KONDO
NTT OSS Center

Attachment

Re: 9.2.3 crashes during archive recovery

From
Heikki Linnakangas
Date:
On 07.03.2013 10:05, KONDO Mitsumasa wrote:
> (2013/03/06 16:50), Heikki Linnakangas wrote:
>> Yeah. That fix isn't right, though; XLogPageRead() is supposed to
>> return true on success, and false on error, and the patch makes it
>> return 'true' on error, if archive recovery was requested but we're
>> still in crash recovery. The real issue here is that I missed the two
>> "return NULL;"s in ReadRecord(), so the code that I put in the
>> next_record_is_invalid codepath isn't run if XLogPageRead() doesn't
>> find the file at all. Attached patch is the proper fix for this.
>>
> Thanks for creating the patch! I tested your patch on 9.2_STABLE, but it
> does not work with the promote command...
> When XLogPageRead() returns false, it ends the standby loop, the crash
> recovery loop, and the archive recovery loop.
> Your patch is not good for promoting a standby to master: the server
> never leaves the standby loop.
>
> So I made a new patch based on Heikki's and Horiguchi's patches.

Ah, I see. I committed a slightly modified version of this.

>>> I also found a bug in the latest 9.2_stable: it does not fetch the
>>> latest timeline and recovery history file during archive recovery when
>>> the master and standby timelines differ.
>>
>> Works for me.. Can you create a test script for that? Remember to set
>> "recovery_target_timeline='latest'".
> ...
> It can be reproduced in my test script, too.

I see the problem now, with that script. So what happens is that the 
startup process first scans the timeline history files to choose the 
recovery target timeline. For that scan, I temporarily set 
InArchiveRecovery=true, in readRecoveryCommandFile. However, after 
readRecoveryCommandFile returns, we then try to read the timeline 
history file corresponding the chosen recovery target timeline, but 
InArchiveRecovery is no longer set, so we don't fetch the file from 
archive, and return a "dummy" history, with just the target timeline in 
it. That doesn't contain the older timeline, so you get an error at 
recovery.

Fixed per your patch to check for ArchiveRecoveryRequested instead of 
InArchiveRecovery, when reading timeline history files. This also makes 
it unnecessary to temporarily set InArchiveRecovery=true, so removed that.
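
In other words, the relevant branch in readTimeLineHistory() now looks
roughly like this (abbreviated from the real code):

    if (ArchiveRecoveryRequested)       /* was: if (InArchiveRecovery) */
    {
        TLHistoryFileName(histfname, targetTLI);
        fromArchive = RestoreArchivedFile(path, histfname,
                                          "RECOVERYHISTORY", 0);
    }
    else
        TLHistoryFilePath(path, targetTLI);

so the history file is fetched from the archive whenever archive recovery
was requested, even before InArchiveRecovery is set.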

Committed both fixes. Please confirm that this fixed the problem in your 
test environment. Many thanks for the testing and the patches!

- Heikki



Re: 9.2.3 crashes during archive recovery

From
KONDO Mitsumasa
Date:
(2013/03/07 19:41), Heikki Linnakangas wrote:
> On 07.03.2013 10:05, KONDO Mitsumasa wrote:
>> (2013/03/06 16:50), Heikki Linnakangas wrote:
>>> Yeah. That fix isn't right, though; XLogPageRead() is supposed to
>>> return true on success, and false on error, and the patch makes it
>>> return 'true' on error, if archive recovery was requested but we're
>>> still in crash recovery. The real issue here is that I missed the two
>>> "return NULL;"s in ReadRecord(), so the code that I put in the
>>> next_record_is_invalid codepath isn't run if XLogPageRead() doesn't
>>> find the file at all. Attached patch is the proper fix for this.
>>>
>> Thanks for creating the patch! I tested your patch on 9.2_STABLE, but it
>> does not work with the promote command...
>> When XLogPageRead() returns false, it ends the standby loop, the crash
>> recovery loop, and the archive recovery loop.
>> Your patch is not good for promoting a standby to master: the server
>> never leaves the standby loop.
>>
>> So I made a new patch based on Heikki's and Horiguchi's patches.
>
> Ah, I see. I committed a slightly modified version of this.
I find your modification quite readable. Thanks for your modification and for committing the patch!
>>>> I also found a bug in the latest 9.2_stable: it does not fetch the
>>>> latest timeline and recovery history file during archive recovery
>>>> when the master and standby timelines differ.
>>>
>>> Works for me.. Can you create a test script for that? Remember to set
>>> "recovery_target_timeline='latest'".
>> ...
>> It can be reproduced in my test script, too.
>
> I see the problem now, with that script. So what happens is that the
> startup process first scans the timeline history files to choose the
> recovery target timeline. For that scan, I temporarily set
> InArchiveRecovery=true, in readRecoveryCommandFile. However, after
> readRecoveryCommandFile returns, we then try to read the timeline
> history file corresponding to the chosen recovery target timeline, but
> InArchiveRecovery is no longer set, so we don't fetch the file from
> archive, and return a "dummy" history, with just the target timeline in
> it. That doesn't contain the older timeline, so you get an error at
> recovery.
>
> Fixed per your patch to check for ArchiveRecoveryRequested instead of
> InArchiveRecovery, when reading timeline history files. This also makes
> it unnecessary to temporarily set InArchiveRecovery=true, so removed that.
>
> Committed both fixes. Please confirm that this fixed the problem in your
> test environment. Many thanks for the testing and the patches!
I understand this problem now. Thank you for the additional modification and
the detailed explanation! I tested the latest REL9_2_STABLE on my system and
confirmed that it runs fine without problems. If I find another problem, I
will report it and send you a patch and a test script!


Best regards,
-- 
Mitsumasa KONDO
NTT OSS Center



Re: 9.2.3 crashes during archive recovery

From
Kyotaro HORIGUCHI
Date:
Everything seems to have been settled above my head while I was sleeping..

Sorry for the clumsy test script, and thank you for refining it, Mitsumasa.

And thank you for fixing the bugs and for the detailed explanation, Heikki.

I confirmed that the problem is fixed for me as well, on origin/REL9_2_STABLE.

> I understand this problem now. Thank you for the additional modification
> and the detailed explanation! I tested the latest REL9_2_STABLE on my
> system and confirmed that it runs fine without problems. If I find
> another problem, I will report it and send you a patch and a test script!

regards,

-- 
Kyotaro Horiguchi

NTT Open Source Software Center, Nippon Telegraph and Telephone Corporation
Phone: 03-5860-5115 / Fax: 03-5463-5490