Thread: 9.2.3 crashes during archive recovery
Hello, 9.2.3 crashes during archive recovery. This was also corrected at some point on origin/master along with another problem fixed by the commit below, if my memory is correct. But current HEAD and 9.2.3 crash during archive recovery (not on standby) because of the 'marking deleted page visible' problem.

http://www.postgresql.org/message-id/50C7612E.5060008@vmware.com

The attached script replays the situation, which can be illustrated as below:

1. StartupXLOG() initializes minRecoveryPoint with ControlFile->CheckPoint.redo (virtually) at the start of the archive recovery process.

2. The consistency state becomes true just before redo starts.

3. Redoing a certain XLOG_HEAP2_VISIBLE record causes a crash because the visibility map page has already been removed by smgr_truncate(), whose XLOG_SMGR_TRUNCATE record appears later in the WAL.

> PANIC: WAL contains references to invalid pages

After all, the consistency point should be postponed until the XLOG_SMGR_TRUNCATE. In this case, the FINAL consistency point is at the XLOG_SMGR_TRUNCATE record, but the current implementation does not record the consistency point (checkpoint, commit or smgr_truncate) itself, so we cannot predict the final consistency point when starting recovery.

Recovery completed successfully with the small and rough patch below. It allows multiple consistency points, but it also kills quick cessation of recovery. (And of course no check has been done against replication/promotion so far X-)

> --- a/src/backend/access/transam/xlog.c
> +++ b/src/backend/access/transam/xlog.c
> @@ -6029,7 +6029,9 @@ CheckRecoveryConsistency(void)
>          */
>         XLogCheckInvalidPages();
>
> -       reachedConsistency = true;
> +       //reachedConsistency = true;
> +       minRecoveryPoint = InvalidXLogRecPtr;
> +
>         ereport(LOG,
>                 (errmsg("consistent recovery state reached at %X/%X",
>                         (uint32) (XLogCtl->lastReplayedEndRecPtr >> 32),

On the other hand, updating the control file on every commit or smgr_truncate would slow down transactions..

Regards,

--
Kyotaro Horiguchi
NTT Open Source Software Center

====== Replay script.

#! /bin/sh
PGDATA=$HOME/pgdata
PGARC=$HOME/pgarc

rm -rf $PGDATA/* $PGARC/*
initdb
cat >> $PGDATA/postgresql.conf <<EOF
wal_level = hot_standby
checkpoint_segments = 300
checkpoint_timeout = 1h
archive_mode = on
archive_command = 'cp %p $PGARC/%f'
EOF
pg_ctl -w start
createdb
psql <<EOF
select version();
create table t (a int);
insert into t values (5);
checkpoint;
vacuum;
delete from t;
vacuum;
EOF
pg_ctl -w stop -m i
cat >> $PGDATA/recovery.conf <<EOF
restore_command='if [ -f $PGARC/%f ]; then cp $PGARC/%f %p; fi'
EOF
postgres
=====
On 13.02.2013 09:46, Kyotaro HORIGUCHI wrote: > In this case, the FINAL consistency point is at the > XLOG_SMGR_TRUNCATE record, but current implemet does not record > the consistency point (checkpoint, or commit or smgr_truncate) > itself, so we cannot predict the final consistency point on > starting of recovery. Hmm, what you did was basically: 1. Run server normally. 2. Kill it with "pg_ctl stop -m immediate". 3. Create a recovery.conf file, turning the server into a hot standby. Without step 3, the server would perform crash recovery, and it would work. But because of the recovery.conf file, the server goes into archive recovery, and because minRecoveryPoint is not set, it assumes that the system is consistent from the start. Aside from the immediate issue with truncation, the system really isn't consistent until the WAL has been replayed far enough, so it shouldn't open for hot standby queries. There might be other, later, changes already flushed to data files. The system has no way of knowing how far it needs to replay the WAL to become consistent. At least in back-branches, I'd call this a pilot error. You can't turn a master into a standby just by creating a recovery.conf file. At least not if the master was not shut down cleanly first. If there's a use case for doing that, maybe we can do something better in HEAD. If the control file says that the system was running (DB_IN_PRODUCTION), but there is a recovery.conf file, we could do crash recovery first, until we reach the end of WAL, and go into archive recovery mode after that. We'd recover all the WAL files in pg_xlog as far as we can, same as in crash recovery, and only start restoring files from the archive once we reach the end of WAL in pg_xlog. At that point, we'd also consider the system as consistent, and start up for hot standby. I'm not sure that's worth the trouble, though. Perhaps it would be better to just throw an error if the control file state is DB_IN_PRODUCTION and a recovery.conf file exists. The admin can always start the server normally first, shut it down cleanly, and then create the recovery.conf file. > On the other hand, updating control file on every commits or > smgr_truncate's should slow the transactions.. To be precise, we'd need to update the control file on every XLogFlush(), like we do during archive recovery. That would indeed be unacceptable from a performance point of view. Updating the control file that often would also be bad for robustness. - Heikki
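For concreteness, the error check suggested here could look something like the sketch below, placed early in startup; the recovery.conf test (`have_recovery_conf`) and the message wording are placeholders, not an actual patch:

    /*
     * Sketch only: refuse to start archive recovery from a cluster that was
     * not shut down cleanly.  "have_recovery_conf" is a placeholder for
     * however the startup code detects the presence of recovery.conf.
     */
    if (have_recovery_conf && ControlFile->state == DB_IN_PRODUCTION)
        ereport(FATAL,
                (errmsg("recovery.conf exists but the database system was not shut down cleanly"),
                 errhint("Start the server normally, shut it down cleanly, "
                         "and then create recovery.conf.")));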
Heikki Linnakangas <hlinnakangas@vmware.com> writes: > At least in back-branches, I'd call this a pilot error. You can't turn a > master into a standby just by creating a recovery.conf file. At least > not if the master was not shut down cleanly first. > ... > I'm not sure that's worth the trouble, though. Perhaps it would be > better to just throw an error if the control file state is > DB_IN_PRODUCTION and a recovery.conf file exists. +1 for that approach, at least until it's clear there's a market for doing this sort of thing. I think the error check could be back-patched, too. regards, tom lane
On 13 February 2013 09:04, Heikki Linnakangas <hlinnakangas@vmware.com> wrote: > To be precise, we'd need to update the control file on every XLogFlush(), > like we do during archive recovery. That would indeed be unacceptable from a > performance point of view. Updating the control file that often would also > be bad for robustness. If those arguments make sense, then why don't they apply to recovery as well? It sounds like we need to look at something better for use during archive recovery. -- Simon Riggs http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
Simon Riggs <simon@2ndQuadrant.com> writes: > On 13 February 2013 09:04, Heikki Linnakangas <hlinnakangas@vmware.com> wrote: >> To be precise, we'd need to update the control file on every XLogFlush(), >> like we do during archive recovery. That would indeed be unacceptable from a >> performance point of view. Updating the control file that often would also >> be bad for robustness. > If those arguments make sense, then why don't they apply to recovery as well? In plain old crash recovery, don't the checks on whether to apply WAL records based on LSN take care of this? regards, tom lane
On 13.02.2013 20:25, Simon Riggs wrote: > On 13 February 2013 09:04, Heikki Linnakangas<hlinnakangas@vmware.com> wrote: > >> To be precise, we'd need to update the control file on every XLogFlush(), >> like we do during archive recovery. That would indeed be unacceptable from a >> performance point of view. Updating the control file that often would also >> be bad for robustness. > > If those arguments make sense, then why don't they apply to recovery as well? To some degree, they do. The big difference is that during normal operation, every commit is XLogFlushed(). During recovery, XLogFlush() happens much less frequently - certainly not after replaying each commit record. > It sounds like we need to look at something better for use during > archive recovery. Well, no-one's complained about the performance. From a robustness point of view, it might be good to keep the minRecoveryPoint value in a separate file, for example, to avoid rewriting the control file that often. Then again, why fix it when it's not broken. - Heikki
Heikki Linnakangas <hlinnakangas@vmware.com> writes: > Well, no-one's complained about the performance. From a robustness point > of view, it might be good to keep the minRecoveryPoint value in a > separate file, for example, to avoid rewriting the control file that > often. Then again, why fix it when it's not broken. It would only be broken if someone interrupted a crash recovery mid-flight and tried to establish a recovery stop point before the end of WAL, no? Why don't we just forbid that case? This would either be the same as, or a small extension of, the pg_control state vs existence of recovery.conf error check that was just discussed. regards, tom lane
On 13.02.2013 21:21, Tom Lane wrote: > Heikki Linnakangas<hlinnakangas@vmware.com> writes: >> Well, no-one's complained about the performance. From a robustness point >> of view, it might be good to keep the minRecoveryPoint value in a >> separate file, for example, to avoid rewriting the control file that >> often. Then again, why fix it when it's not broken. > > It would only be broken if someone interrupted a crash recovery > mid-flight and tried to establish a recovery stop point before the end > of WAL, no? Why don't we just forbid that case? This would either be > the same as, or a small extension of, the pg_control state vs existence > of recovery.conf error check that was just discussed. The problem is when you interrupt archive recovery (kill -9), and restart. After restart, the system needs to know how far the WAL was replayed before the crash, because it must not open for hot standby queries, or allow the database to be started up in master-mode, until it's replayed the WAL up to that same point again. - Heikki
On 13.02.2013 21:03, Tom Lane wrote: > Simon Riggs<simon@2ndQuadrant.com> writes: >> On 13 February 2013 09:04, Heikki Linnakangas<hlinnakangas@vmware.com> wrote: >>> To be precise, we'd need to update the control file on every XLogFlush(), >>> like we do during archive recovery. That would indeed be unacceptable from a >>> performance point of view. Updating the control file that often would also >>> be bad for robustness. > >> If those arguments make sense, then why don't they apply to recovery as well? > > In plain old crash recovery, don't the checks on whether to apply WAL > records based on LSN take care of this? The problem we're trying to solve is determining how much WAL needs to be replayed until the database is consistent again. In crash recovery, the answer is "all of it". That's why the CRC in the WAL is essential; it's required to determine where the WAL ends. But if we had some other mechanism, like if we updated minRecoveryPoint after every XLogFlush() like Simon suggested, we wouldn't necessarily need the CRC to detect end of WAL (not that I'd suggest removing it anyway), and we could throw an error if there is corrupt bit somewhere in the WAL before the true end of WAL. In archive recovery, we can't just say "replay all the WAL", because the whole idea of PITR is to not recover all the WAL. So we use minRecoveryPoint to keep track of how far the WAL needs to be replayed at a minimum, for the database to be consistent. - Heikki
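For reference, the consistency test being discussed amounts to comparing the last replayed LSN against minRecoveryPoint; a simplified sketch of the check in CheckRecoveryConsistency(), assuming the 64-bit XLogRecPtr of the master branch and with some conditions omitted:

    if (!reachedConsistency &&
        XLogCtl->lastReplayedEndRecPtr >= minRecoveryPoint &&
        XLogRecPtrIsInvalid(ControlFile->backupStartPoint))
    {
        /* no references to not-yet-existing pages may remain at this point */
        XLogCheckInvalidPages();

        reachedConsistency = true;
        ereport(LOG,
                (errmsg("consistent recovery state reached at %X/%X",
                        (uint32) (XLogCtl->lastReplayedEndRecPtr >> 32),
                        (uint32) XLogCtl->lastReplayedEndRecPtr)));
    }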
Heikki Linnakangas <hlinnakangas@vmware.com> writes: > On 13.02.2013 21:21, Tom Lane wrote: >> It would only be broken if someone interrupted a crash recovery >> mid-flight and tried to establish a recovery stop point before the end >> of WAL, no? Why don't we just forbid that case? This would either be >> the same as, or a small extension of, the pg_control state vs existence >> of recovery.conf error check that was just discussed. > The problem is when you interrupt archive recovery (kill -9), and > restart. After restart, the system needs to know how far the WAL was > replayed before the crash, because it must not open for hot standby > queries, or allow the database to be started up in master-mode, until > it's replayed the WAL up to that same point again. Well, archive recovery is a different scenario --- Simon was questioning whether we need a minRecoveryPoint mechanism in crash recovery, or at least that's what I thought he asked. regards, tom lane
On 13.02.2013 21:30, Tom Lane wrote: > Heikki Linnakangas<hlinnakangas@vmware.com> writes: >> On 13.02.2013 21:21, Tom Lane wrote: >>> It would only be broken if someone interrupted a crash recovery >>> mid-flight and tried to establish a recovery stop point before the end >>> of WAL, no? Why don't we just forbid that case? This would either be >>> the same as, or a small extension of, the pg_control state vs existence >>> of recovery.conf error check that was just discussed. > >> The problem is when you interrupt archive recovery (kill -9), and >> restart. After restart, the system needs to know how far the WAL was >> replayed before the crash, because it must not open for hot standby >> queries, or allow the database to be started up in master-mode, until >> it's replayed the WAL up to that same point again. > > Well, archive recovery is a different scenario --- Simon was questioning > whether we need a minRecoveryPoint mechanism in crash recovery, or at > least that's what I thought he asked. Ah, ok. The short answer to that is "no", because in crash recovery, we just replay the WAL all the way to the end. I thought he was questioning updating the control file at every XLogFlush() during archive recovery. The answer to that is that it's not so bad, because XLogFlush() is called so infrequently during recovery. - Heikki
Heikki Linnakangas <hlinnakangas@vmware.com> writes: > The problem we're trying to solve is determining how much WAL needs to > be replayed until the database is consistent again. In crash recovery, > the answer is "all of it". That's why the CRC in the WAL is essential; > it's required to determine where the WAL ends. But if we had some other > mechanism, like if we updated minRecoveryPoint after every XLogFlush() > like Simon suggested, we wouldn't necessarily need the CRC to detect end > of WAL (not that I'd suggest removing it anyway), and we could throw an > error if there is corrupt bit somewhere in the WAL before the true end > of WAL. Meh. I think that would be disastrous from both performance and reliability standpoints. (Performance because the whole point of WAL is to commit with only one disk write in one place, and reliability because of greatly increasing the number of writes to the utterly-critical pg_control file.) regards, tom lane
Heikki Linnakangas <hlinnakangas@vmware.com> writes: > On 13.02.2013 21:30, Tom Lane wrote: >> Well, archive recovery is a different scenario --- Simon was questioning >> whether we need a minRecoveryPoint mechanism in crash recovery, or at >> least that's what I thought he asked. > Ah, ok. The short answer to that is "no", because in crash recovery, we > just replay the WAL all the way to the end. I thought he was questioning > updating the control file at every XLogFlush() during archive recovery. > The answer to that is that it's not so bad, because XLogFlush() is > called so infrequently during recovery. Right, and it's not so evil from a reliability standpoint either, partly because of that and partly because, by definition, this isn't your only copy of the data. regards, tom lane
On 13.02.2013 17:02, Tom Lane wrote: > Heikki Linnakangas<hlinnakangas@vmware.com> writes: >> At least in back-branches, I'd call this a pilot error. You can't turn a >> master into a standby just by creating a recovery.conf file. At least >> not if the master was not shut down cleanly first. >> ... >> I'm not sure that's worth the trouble, though. Perhaps it would be >> better to just throw an error if the control file state is >> DB_IN_PRODUCTION and a recovery.conf file exists. > > +1 for that approach, at least until it's clear there's a market for > doing this sort of thing. I think the error check could be > back-patched, too. Hmm, I just realized a little problem with that approach. If you take a base backup using an atomic filesystem backup from a running server, and start archive recovery from that, that's essentially the same thing as Kyotaro's test case. - Heikki
On 13 February 2013 09:04, Heikki Linnakangas <hlinnakangas@vmware.com> wrote: > Without step 3, the server would perform crash recovery, and it would work. > But because of the recovery.conf file, the server goes into archive > recovery, and because minRecoveryPoint is not set, it assumes that the > system is consistent from the start. > > Aside from the immediate issue with truncation, the system really isn't > consistent until the WAL has been replayed far enough, so it shouldn't open > for hot standby queries. There might be other, later, changes already > flushed to data files. The system has no way of knowing how far it needs to > replay the WAL to become consistent. > > At least in back-branches, I'd call this a pilot error. You can't turn a > master into a standby just by creating a recovery.conf file. At least not if > the master was not shut down cleanly first. > > If there's a use case for doing that, maybe we can do something better in > HEAD. If the control file says that the system was running > (DB_IN_PRODUCTION), but there is a recovery.conf file, we could do crash > recovery first, until we reach the end of WAL, and go into archive recovery > mode after that. We'd recover all the WAL files in pg_xlog as far as we can, > same as in crash recovery, and only start restoring files from the archive > once we reach the end of WAL in pg_xlog. At that point, we'd also consider > the system as consistent, and start up for hot standby. > > I'm not sure that's worth the trouble, though. Perhaps it would be better to > just throw an error if the control file state is DB_IN_PRODUCTION and a > recovery.conf file exists. The admin can always start the server normally > first, shut it down cleanly, and then create the recovery.conf file. Now I've read the whole thing... The problem is that we startup Hot Standby before we hit the min recovery point because that isn't recorded. For me, the thing to do is to make the min recovery point == end of WAL when state is DB_IN_PRODUCTION. That way we don't need to do any new writes and we don't need to risk people seeing inconsistent results if they do this. But I think that still gives you a timeline problem when putting a master back into a standby. -- Simon Riggs http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
<p dir="ltr"><br /> On Feb 13, 2013 10:29 PM, "Heikki Linnakangas" <<a href="mailto:hlinnakangas@vmware.com">hlinnakangas@vmware.com</a>>wrote:<br /> > Hmm, I just realized a little problemwith that approach. If you take a base backup using an atomic filesystem backup from a running server, and start archiverecovery from that, that's essentially the same thing as Kyotaro's test case.<p dir="ltr">Coincidentally I was wonderingabout the same thing when I was reviewing our slave provisioning procedures. There didn't seem to be any communicationchannel from pg_stop_backup for the slave to know when it was safe to allow connections.<p dir="ltr">Maybe thereshould be some mechanism akin to backup label to communicate the minimum recovery point? When the min recovery pointfile exists the value inside it is used, when the recovery point is reached the file is removed. pg_basebackup wouldjust append the file to the archive. Custom backup procedures could also use it to communicate the necessary WAL location.<pdir="ltr">--<br /> Ants Aasma<br />
On Thu, Feb 14, 2013 at 5:15 AM, Heikki Linnakangas <hlinnakangas@vmware.com> wrote: > On 13.02.2013 17:02, Tom Lane wrote: >> >> Heikki Linnakangas<hlinnakangas@vmware.com> writes: >>> >>> At least in back-branches, I'd call this a pilot error. You can't turn a >>> master into a standby just by creating a recovery.conf file. At least >>> not if the master was not shut down cleanly first. >>> ... >>> I'm not sure that's worth the trouble, though. Perhaps it would be >>> better to just throw an error if the control file state is >>> DB_IN_PRODUCTION and a recovery.conf file exists. >> >> >> +1 for that approach, at least until it's clear there's a market for >> doing this sort of thing. I think the error check could be >> back-patched, too. > > > Hmm, I just realized a little problem with that approach. If you take a base > backup using an atomic filesystem backup from a running server, and start > archive recovery from that, that's essentially the same thing as Kyotaro's > test case. Yes. And the resource agent for streaming replication in Pacemaker (it's the OSS clusterware) is the user of that archive recovery scenario, too. When it starts up the server, it always creates the recovery.conf and starts the server as the standby. It cannot start the master directly, IOW the server is always promoted to the master from the standby. So when it starts up the server after the server crashes, obviously it executes the same recovery scenario (i.e., force archive recovery instead of crash one) as Kyotaro described. The reason why that resource agent cannot start up the master directly is that it manages three server states, called Master, Slave and Down. It can move the server state from Down to Slave, and the reverse direction. Also it can move the state from Slave to Master, and the reverse direction. But there is no way to move the state between Down and Master directly. This kind of the state transition model is isolated case in clusterware, I think. Regards, -- Fujii Masao
On Thu, Feb 14, 2013 at 5:52 AM, Simon Riggs <simon@2ndquadrant.com> wrote: > On 13 February 2013 09:04, Heikki Linnakangas <hlinnakangas@vmware.com> wrote: > >> Without step 3, the server would perform crash recovery, and it would work. >> But because of the recovery.conf file, the server goes into archive >> recovery, and because minRecoveryPoint is not set, it assumes that the >> system is consistent from the start. >> >> Aside from the immediate issue with truncation, the system really isn't >> consistent until the WAL has been replayed far enough, so it shouldn't open >> for hot standby queries. There might be other, later, changes already >> flushed to data files. The system has no way of knowing how far it needs to >> replay the WAL to become consistent. >> >> At least in back-branches, I'd call this a pilot error. You can't turn a >> master into a standby just by creating a recovery.conf file. At least not if >> the master was not shut down cleanly first. >> >> If there's a use case for doing that, maybe we can do something better in >> HEAD. If the control file says that the system was running >> (DB_IN_PRODUCTION), but there is a recovery.conf file, we could do crash >> recovery first, until we reach the end of WAL, and go into archive recovery >> mode after that. We'd recover all the WAL files in pg_xlog as far as we can, >> same as in crash recovery, and only start restoring files from the archive >> once we reach the end of WAL in pg_xlog. At that point, we'd also consider >> the system as consistent, and start up for hot standby. >> >> I'm not sure that's worth the trouble, though. Perhaps it would be better to >> just throw an error if the control file state is DB_IN_PRODUCTION and a >> recovery.conf file exists. The admin can always start the server normally >> first, shut it down cleanly, and then create the recovery.conf file. > > Now I've read the whole thing... > > The problem is that we startup Hot Standby before we hit the min > recovery point because that isn't recorded. For me, the thing to do is > to make the min recovery point == end of WAL when state is > DB_IN_PRODUCTION. That way we don't need to do any new writes and we > don't need to risk people seeing inconsistent results if they do this. +1 And if it's the standby case, the min recovery point can be set to the end of WAL files located in the standby. IOW, we can regard the database as consistent when we replay all the WAL files in local and try to connect to the master. Regards, -- Fujii Masao
Sorry, I omitted to show how we found this issue.

In an HA DB cluster consisting of Pacemaker and PostgreSQL, PostgreSQL is stopped with 'pg_ctl stop -m i' regardless of the situation. On the other hand, the PostgreSQL RA (Resource Agent) is obliged to start the master node via the hot standby state because of the restrictions on Pacemaker's state transitions, so simply stopping and then starting the master node can fall into this situation.

> Hmm, I just realized a little problem with that approach. If you take a base backup using an atomic filesystem backup from a running server, and start archive recovery from that, that's essentially the same thing as Kyotaro's test case.

Regards,

--
Kyotaro Horiguchi
NTT Open Source Software Center
On Wed, Feb 13, 2013 at 10:52 PM, Simon Riggs <simon@2ndquadrant.com> wrote: > The problem is that we startup Hot Standby before we hit the min > recovery point because that isn't recorded. For me, the thing to do is > to make the min recovery point == end of WAL when state is > DB_IN_PRODUCTION. That way we don't need to do any new writes and we > don't need to risk people seeing inconsistent results if they do this. While this solution would help solve my issue, it assumes that the correct amount of WAL files are actually there. Currently the docs for setting up a standby refer to "24.3.4. Recovering Using a Continuous Archive Backup", and that step recommends emptying the contents of pg_xlog. If this is chosen as the solution the docs should be adjusted to recommend using pg_basebackup -x for setting up the standby. As a related point, pointing standby setup to that section has confused at least one of my clients. That chapter is rather scarily complicated compared to what's usually necessary. Ants Aasma -- Cybertec Schönig & Schönig GmbH Gröhrmühlgasse 26 A-2700 Wiener Neustadt Web: http://www.postgresql-support.de
On 15.02.2013 13:05, Ants Aasma wrote: > On Wed, Feb 13, 2013 at 10:52 PM, Simon Riggs<simon@2ndquadrant.com> wrote: >> The problem is that we startup Hot Standby before we hit the min >> recovery point because that isn't recorded. For me, the thing to do is >> to make the min recovery point == end of WAL when state is >> DB_IN_PRODUCTION. That way we don't need to do any new writes and we >> don't need to risk people seeing inconsistent results if they do this. > > While this solution would help solve my issue, it assumes that the > correct amount of WAL files are actually there. Currently the docs for > setting up a standby refer to "24.3.4. Recovering Using a Continuous > Archive Backup", and that step recommends emptying the contents of > pg_xlog. If this is chosen as the solution the docs should be adjusted > to recommend using pg_basebackup -x for setting up the standby. When the backup is taken using pg_start_backup or pg_basebackup, minRecoveryPoint is set correctly anyway, and it's OK to clear out pg_xlog. It's only if you take the backup using an atomic filesystem snapshot, or just kill -9 the server and take a backup while it's not running, that we have a problem. In those scenarios, you should not clear pg_xlog. Attached is a patch for git master. The basic idea is to split InArchiveRecovery into two variables, InArchiveRecovery and ArchiveRecoveryRequested. ArchiveRecoveryRequested is set when recovery.conf exists. But if we don't know how far we need to recover, we first perform crash recovery with InArchiveRecovery=false. When we reach the end of WAL in pg_xlog, InArchiveRecovery is set, and we continue with normal archive recovery. > As a > related point, pointing standby setup to that section has confused at > least one of my clients. That chapter is rather scarily complicated > compared to what's usually necessary. Yeah, it probably could use some editing, as the underlying code has evolved a lot since it was written. The suggestion to clear out pg_xlog seems like an unnecessary complication. It's safe to do so, if you restore with an archive, but unnecessary. The "File System Level Backup" chapter (http://www.postgresql.org/docs/devel/static/backup-file.html) probably should mention "pg_basebackup -x", too. Docs patches are welcome.. - Heikki
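In outline, the startup decision described above could be sketched as follows; the variable names follow the description, `have_recovery_conf` is a placeholder, and this is a sketch of the intended control flow rather than the patch itself:

    ArchiveRecoveryRequested = have_recovery_conf;      /* placeholder test */

    if (ArchiveRecoveryRequested && ControlFile->state == DB_IN_PRODUCTION)
    {
        /*
         * The cluster was not shut down cleanly: replay everything in
         * pg_xlog as plain crash recovery first.  Only when we run out of
         * local WAL is InArchiveRecovery set to true, the end of the
         * replayed WAL taken as the consistency point, and restoring from
         * the archive begun.
         */
        InArchiveRecovery = false;
    }
    else if (ArchiveRecoveryRequested)
    {
        /* consistent starting point is known; normal archive recovery */
        InArchiveRecovery = true;
    }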
On Fri, Feb 15, 2013 at 3:49 PM, Heikki Linnakangas <hlinnakangas@vmware.com> wrote: >> While this solution would help solve my issue, it assumes that the >> correct amount of WAL files are actually there. Currently the docs for >> setting up a standby refer to "24.3.4. Recovering Using a Continuous >> Archive Backup", and that step recommends emptying the contents of >> pg_xlog. If this is chosen as the solution the docs should be adjusted >> to recommend using pg_basebackup -x for setting up the standby. > > When the backup is taken using pg_start_backup or pg_basebackup, > minRecoveryPoint is set correctly anyway, and it's OK to clear out pg_xlog. How is minRecoveryPoint supposed to get set? I just tried taking a pg_basebackup on master running pgbench. The resulting data dir controlfile had this: Min recovery ending location: 0/0 The end location was not in the backup_label either. Looking at basebackup.c the process seems to be: 1. pg_start_backup 2. copy out backup_label 3. copy out rest of the datadir 4. copy out control file 5. pg_stop_backup 6. copy out WAL 7. send backup stop location And pg_basebackup.c only uses the stop location to decide how much WAL to fetch. I don't see anything here that could correctly communicate min recovery point. Maybe I'm missing something. > Yeah, it probably could use some editing, as the underlying code has evolved > a lot since it was written. The suggestion to clear out pg_xlog seems like > an unnecessary complication. It's safe to do so, if you restore with an > archive, but unnecessary. > > The "File System Level Backup" chapter > (http://www.postgresql.org/docs/devel/static/backup-file.html) probably > should mention "pg_basebackup -x", too. > > Docs patches are welcome.. I will give it a shot. Ants Aasma -- Cybertec Schönig & Schönig GmbH Gröhrmühlgasse 26 A-2700 Wiener Neustadt Web: http://www.postgresql-support.de
On 16.02.2013 10:40, Ants Aasma wrote: > On Fri, Feb 15, 2013 at 3:49 PM, Heikki Linnakangas > <hlinnakangas@vmware.com> wrote: >>> While this solution would help solve my issue, it assumes that the >>> correct amount of WAL files are actually there. Currently the docs for >>> setting up a standby refer to "24.3.4. Recovering Using a Continuous >>> Archive Backup", and that step recommends emptying the contents of >>> pg_xlog. If this is chosen as the solution the docs should be adjusted >>> to recommend using pg_basebackup -x for setting up the standby. >> >> When the backup is taken using pg_start_backup or pg_basebackup, >> minRecoveryPoint is set correctly anyway, and it's OK to clear out pg_xlog. > > How is minRecoveryPoint supposed to get set? I was slightly imprecise. minRecoveryPoint is not set at backup, but backupStartPoint is. minRecoveryPoint is set later, once the end-of-backup record is seen. A valid backupStartPoint has the same effect as minRecoveryPoint: the system is considered inconsistent until the end-of-backup record is seen. > I just tried taking a > pg_basebackup on master running pgbench. The resulting data dir > controlfile had this: > > Min recovery ending location: 0/0 > > The end location was not in the backup_label either. > > Looking at basebackup.c the process seems to be: > 1. pg_start_backup > 2. copy out backup_label > 3. copy out rest of the datadir > 4. copy out control file > 5. pg_stop_backup > 6. copy out WAL > 7. send backup stop location > > And pg_basebackup.c only uses the stop location to decide how much WAL to fetch. > > I don't see anything here that could correctly communicate min > recovery point. Maybe I'm missing something. backupStartPoint is set, which signals recovery to wait for an end-of-backup record, until the system is considered consistent. If the backup is taken from a hot standby, backupEndPoint is set, instead of inserting an end-of-backup record. - Heikki
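Roughly, when the end-of-backup record matching backupStartPoint is replayed (or, for a backup taken from a hot standby, when backupEndPoint is reached), something like the following happens. This is a simplified sketch assuming the 64-bit XLogRecPtr of the master branch; `record_end_lsn` is a placeholder for the end LSN of the replayed record:

    /* the whole base backup has now been replayed */
    ControlFile->backupStartPoint = InvalidXLogRecPtr;

    /* consistency must not be claimed before the end of the backup */
    if (ControlFile->minRecoveryPoint < record_end_lsn)
        ControlFile->minRecoveryPoint = record_end_lsn;

    UpdateControlFile();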
On Mon, Feb 18, 2013 at 8:27 PM, Heikki Linnakangas <hlinnakangas@vmware.com> wrote: > backupStartPoint is set, which signals recovery to wait for an end-of-backup > record, until the system is considered consistent. If the backup is taken > from a hot standby, backupEndPoint is set, instead of inserting an > end-of-backup record. Thank you for explaining this, I can see how it works now. I'll see if I can document this better so others won't have to go through as much effort. Ants Aasma -- Cybertec Schönig & Schönig GmbH Gröhrmühlgasse 26 A-2700 Wiener Neustadt Web: http://www.postgresql-support.de
Hello, I looked at this from another point of view. I consider the current discussion to be about how to predict the last consistency point, but there is another aspect of this issue.

I tried to postpone smgrtruncate after the next checkpoint. This is similar to what hot standby feedback does to vacuum. It seems to be working fine, but I worry that it might also bloat the table. I haven't found a way to postpone only the relevant smgrtruncate.

The patch below is an immediate verification patch for this solution.

- CreateCheckPoint records the oldest xmin at that point. Let's call it the 'checkpoint xmin'.

- vacuum skips modifications by transactions at or after the checkpoint xmin.

What do you think of this?

--
Kyotaro Horiguchi
NTT Open Source Software Center

=========
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 479c14d..a274393 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -493,6 +493,7 @@ typedef struct XLogCtlData
 	XLogRecPtr	lastFpwDisableRecPtr;
 
 	slock_t		info_lck;		/* locks shared variables shown above */
+	TransactionId chkptxactid;
 } XLogCtlData;
 
 static XLogCtlData *XLogCtl = NULL;
@@ -676,6 +677,11 @@ static bool read_backup_label(XLogRecPtr *checkPointLoc,
 static void rm_redo_error_callback(void *arg);
 static int	get_sync_bit(int method);
 
+TransactionId
+XLogGetChkptTrId()
+{
+	return XLogCtl->chkptxactid;
+}
 
 /*
  * Insert an XLOG record having the specified RMID and info bytes,
@@ -3875,7 +3881,7 @@ XLOGShmemInit(void)
 	SpinLockInit(&XLogCtl->info_lck);
 	SpinLockInit(&XLogCtl->ulsn_lck);
 	InitSharedLatch(&XLogCtl->recoveryWakeupLatch);
-
+	XLogCtl->chkptxactid = InvalidTransactionId;
 	/*
 	 * If we are not in bootstrap mode, pg_control should already exist. Read
 	 * and validate it immediately (see comments in ReadControlFile() for the
@@ -6700,6 +6706,8 @@ CreateCheckPoint(int flags)
 	else
 		checkPoint.oldestActiveXid = InvalidTransactionId;
 
+	XLogCtl->chkptxactid = GetOldestXmin(true, true);
+
 	/*
 	 * We must hold WALInsertLock while examining insert state to determine
 	 * the checkpoint REDO pointer.
diff --git a/src/backend/commands/vacuum.c b/src/backend/commands/vacuum.c
index 4800b43..79205a0 100644
--- a/src/backend/commands/vacuum.c
+++ b/src/backend/commands/vacuum.c
@@ -374,6 +374,7 @@ get_rel_oids(Oid relid, const RangeVar *vacrel)
 
 /*
  * vacuum_set_xid_limits() -- compute oldest-Xmin and freeze cutoff points
  */
+extern TransactionId XLogGetChkptTrId(void);
 void
 vacuum_set_xid_limits(int freeze_min_age,
 					  int freeze_table_age,
@@ -397,6 +398,11 @@ vacuum_set_xid_limits(int freeze_min_age,
 	 * always an independent transaction.
 	 */
 	*oldestXmin = GetOldestXmin(sharedRel, true);
+	{
+		TransactionId chkpttrid = XLogGetChkptTrId();
+		if (chkpttrid < *oldestXmin)
+			*oldestXmin = chkpttrid;
+	}
 
 	Assert(TransactionIdIsNormal(*oldestXmin));
==========
Sorry, let me correct a bit.

> I tried to postpone smgrtruncate after the next checkpoint. This

I tried to postpone smgrtruncate TO the next checkpoint.

> is similar to what hot standby feedback does to vacuum. It seems to be working fine, but I worry that it might also bloat the table. I haven't found a way to postpone only the relevant smgrtruncate.
>
> The patch below is an immediate verification patch for this solution.
>
> - CreateCheckPoint records the oldest xmin at that point. Let's call it the 'checkpoint xmin'.
>
> - vacuum skips modifications by transactions at or after the checkpoint xmin.

--
Kyotaro Horiguchi
NTT Open Source Software Center
On 20.02.2013 10:01, Kyotaro HORIGUCHI wrote:
> Sorry, let me correct a bit.
>
>> I tried to postpone smgrtruncate after the next checkpoint. This
>
> I tried to postpone smgrtruncate TO the next checkpoint.

Umm, why? I don't understand this patch at all.

- Heikki
On 15.02.2013 15:49, Heikki Linnakangas wrote: > Attached is a patch for git master. The basic idea is to split > InArchiveRecovery into two variables, InArchiveRecovery and > ArchiveRecoveryRequested. ArchiveRecoveryRequested is set when > recovery.conf exists. But if we don't know how far we need to recover, > we first perform crash recovery with InArchiveRecovery=false. When we > reach the end of WAL in pg_xlog, InArchiveRecovery is set, and we > continue with normal archive recovery. New version of this attached, with a few bugs fixed. I'm thinking that this should be back-patched to 9.2, but not to earlier branches. Before 9.2, we don't PANIC at a reference to a non-existent page until end of recovery, even if we've already reached consistency. The same basic issue still exists in earlier versions, though: if you have hot_standby=on, the system will open for read-only queries too early, before the database is consistent. But this patch is invasive enough that I'm wary of back-patching it further, when the worst that can happen is that there's a small window right after startup when you can see an inconsistent database in hot standby mode. Maybe after we get some more testing of this in 9.2 and master. Opinions on that? - Heikki
On Thu, Feb 21, 2013 at 11:09 PM, Heikki Linnakangas <hlinnakangas@vmware.com> wrote:
> On 15.02.2013 15:49, Heikki Linnakangas wrote:
>> Attached is a patch for git master. The basic idea is to split InArchiveRecovery into two variables, InArchiveRecovery and ArchiveRecoveryRequested. ArchiveRecoveryRequested is set when recovery.conf exists. But if we don't know how far we need to recover, we first perform crash recovery with InArchiveRecovery=false. When we reach the end of WAL in pg_xlog, InArchiveRecovery is set, and we continue with normal archive recovery.
>
> New version of this attached, with a few bugs fixed.
>
> I'm thinking that this should be back-patched to 9.2, but not to earlier branches. Before 9.2, we don't PANIC at a reference to a non-existent page until end of recovery, even if we've already reached consistency. The same basic issue still exists in earlier versions, though: if you have hot_standby=on, the system will open for read-only queries too early, before the database is consistent. But this patch is invasive enough that I'm wary of back-patching it further, when the worst that can happen is that there's a small window right after startup when you can see an inconsistent database in hot standby mode. Maybe after we get some more testing of this in 9.2 and master. Opinions on that?

People have not yet complained about this problem with versions prior to 9.1. Is it worth backpatching in this case?

--
Michael
On 22.02.2013 02:13, Michael Paquier wrote: > On Thu, Feb 21, 2013 at 11:09 PM, Heikki Linnakangas< > hlinnakangas@vmware.com> wrote: > >> On 15.02.2013 15:49, Heikki Linnakangas wrote: >> >>> Attached is a patch for git master. The basic idea is to split >>> InArchiveRecovery into two variables, InArchiveRecovery and >>> ArchiveRecoveryRequested. ArchiveRecoveryRequested is set when >>> recovery.conf exists. But if we don't know how far we need to recover, >>> we first perform crash recovery with InArchiveRecovery=false. When we >>> reach the end of WAL in pg_xlog, InArchiveRecovery is set, and we >>> continue with normal archive recovery. >>> >> >> New version of this attached, with a few bugs fixed. >> >> I'm thinking that this should be back-patched to 9.2, but not to earlier >> branches. Before 9.2, we don't PANIC at a reference to a non-existent page >> until end of recovery, even if we've already reached consistency. The same >> basic issue still exists in earlier versions, though: if you have >> hot_standby=on, the system will open for read-only queries too early, >> before the database is consistent. But this patch is invasive enough that >> I'm weary of back-patching it further, when the worst that can happen is >> that there's a small window right after startup when you can see an >> inconsistent database in hot standby mode. Maybe after we get some more >> testing of this in 9.2 and master. Opinions on that? >> > People have not yet complained about this problem with versions prior to > 9.1. Is it worth backpatching in this case? Possibly not.. Anyway, I've committed this to master and 9.2 now. - Heikki
On 14.02.2013 19:18, Fujii Masao wrote: > Yes. And the resource agent for streaming replication in Pacemaker (it's the > OSS clusterware) is the user of that archive recovery scenario, too. When it > starts up the server, it always creates the recovery.conf and starts the server > as the standby. It cannot start the master directly, IOW the server is always > promoted to the master from the standby. So when it starts up the server > after the server crashes, obviously it executes the same recovery scenario > (i.e., force archive recovery instead of crash one) as Kyotaro described. > > The reason why that resource agent cannot start up the master directly is > that it manages three server states, called Master, Slave and Down. It can > move the server state from Down to Slave, and the reverse direction. > Also it can move the state from Slave to Master, and the reverse direction. > But there is no way to move the state between Down and Master directly. > This kind of the state transition model is isolated case in > clusterware, I think. I don't have much sympathy for that to be honest. Seems like something that should be fixed in Pacemaker or the scripts used to glue it with PostgreSQL. However, this patch should make that work, so I guess everyone is happy. - Heikki
On 15.02.2013 10:33, Kyotaro HORIGUCHI wrote:
> Sorry, I omitted to show how we found this issue.
>
> In an HA DB cluster consisting of Pacemaker and PostgreSQL, PostgreSQL is stopped with 'pg_ctl stop -m i' regardless of the situation.

That seems like a bad idea. If nothing else, crash recovery can take a long time. I don't know much about Pacemaker, but wouldn't it make more sense to at least try a fast shutdown first, falling back to immediate shutdown after a timeout?

- Heikki
Hello, > Anyway, I've committed this to master and 9.2 now. This seems to fix the issue. We'll examine this further. Thank you. -- Kyotaro Horiguchi NTT Open Source Software Center
At Fri, 22 Feb 2013 11:42:39 +0200, Heikki Linnakangas <hlinnakangas@vmware.com> wrote in <51273D8F.7060609@vmware.com>
> On 15.02.2013 10:33, Kyotaro HORIGUCHI wrote:
> > In an HA DB cluster consisting of Pacemaker and PostgreSQL, PostgreSQL is stopped with 'pg_ctl stop -m i' regardless of the situation.
>
> That seems like a bad idea. If nothing else, crash recovery can take a long time. I don't know much about Pacemaker, but wouldn't it make more sense to at least try a fast shutdown first, falling back to immediate shutdown after a timeout?

I completely agree with you. But I hear that they want to shut down as fast as possible on failure and also don't want more states in the RA..

On the other hand, the result of this failure is catastrophic, so it should be fixed regardless of what Pacemaker does.

regards,

--
Kyotaro Horiguchi
NTT Open Source Software Center
Although this has become moot, I want to explain how this works.

> > I tried to postpone smgrtruncate TO the next checkpoint.
>
> Umm, why? I don't understand this patch at all.

This inhibits truncating files after (quite vaguely, in the patch :-) the previous checkpoint, by hiding the deleted tuples from vacuum and thereby keeping the VM pages from being truncated. They are truncated only if no tuples on the pages corresponding to them have been deleted since the last checkpoint. As a result, no nonexistent visibility map pages will be touched after the consistency point - which is after the last checkpoint - during archive recovery.

# Umm.. sorry for the confusing explanation...

regards,

--
Kyotaro Horiguchi
NTT Open Source Software Center
Folks, Is there any way this particular issue could cause data corruption without causing a crash? I don't see a way for it to do so, but I wanted to verify. -- Josh Berkus PostgreSQL Experts Inc. http://pgexperts.com
This is an interim report for this patch. We found that PostgreSQL with this patch unexpectedly becomes primary when starting up as a standby. We'll investigate the behavior further.

> > Anyway, I've committed this to master and 9.2 now.
>
> This seems to fix the issue. We'll examine this further.

--
Kyotaro Horiguchi
NTT Open Source Software Center
Hello, I could reproduce the behavior and think I understand the cause.

The head of origin/REL9_2_STABLE shows the behavior I mentioned in the last message when using the attached shell script. 9.3dev runs as expected.

In XLogPageRead, when RecPtr goes beyond the last page, the current xlog file is released and a new page is requested.

The variables were as below at that point.

StandbyRequested == true
StandbyMode == false
ArchiveRecoveryRequested == true
InArchiveRecovery == false

In this case, XLogPageRead immediately returns false before trying to get WAL via streaming or from the archive. So ReadRecord returns NULL, then unexpectedly exits the 'main redo apply loop' and increases the timeline ID as if it had been promoted.

This seems to be fixed by letting it try all requested sources. The attached patch does that, and the test script then runs as expected.

> We found that PostgreSQL with this patch unexpectedly becomes primary when starting up as a standby. We'll investigate the behavior further.
>
> > > Anyway, I've committed this to master and 9.2 now.
> >
> > This seems to fix the issue. We'll examine this further.

regards,

--
Kyotaro Horiguchi
NTT Open Source Software Center

#! /bin/sh
version="abf5c5"
version="924b"
killall -9 postgres
source pgsetpath $version 0
rm -rf $PGDATA/* $PGARC/*
PGDATA0=$PGDATA
PGPORT0=$PGPORT
initdb -D $PGDATA0
cat >> $PGDATA0/postgresql.conf <<EOF
wal_level = hot_standby
checkpoint_segments = 300
checkpoint_timeout = 1h
archive_mode = on
archive_command = 'cp %p $PGARC/%f'
max_wal_senders = 3
hot_standby = on
EOF
cat >> $PGDATA0/pg_hba.conf <<EOF
local replication horiguti trust
EOF
echo ## Startup master
pg_ctl -D $PGDATA0 -w start
source pgsetpath $version 1 -p 5433
PGDATA1=$PGDATA
PGPORT1=$PGPORT
rm -rf $PGDATA/* $PGARC/*
echo "## basebackup"
pg_basebackup -h /tmp -p $PGPORT0 -F p -X s -D $PGDATA1
chmod 700 $PGDATA1
cat >> $PGDATA1/recovery.conf <<EOF
standby_mode = yes
primary_conninfo='host=/tmp port=5432'
restore_command='a=$PGARC; if [ -f \$a/%f ]; then cp \$a/%f %p; else exit 1; fi'
#restore_command='a=$PGARC; if [ ! -d \$a ]; then echo Archive directory \$a is not found.; exit 1; elif [ -f \$a/%f ]; then cp \$a/%f %p; else exit 1; fi'
EOF
echo "## Startup standby"
pg_ctl -D $PGDATA1 start
echo "## Sleep for 5 seconds"
sleep 5
echo "## Shutdown standby"
pg_ctl -D $PGDATA1 -w stop -m f
echo "## Shutdown master in immediate mode"
pg_ctl -D $PGDATA0 -w stop -m i
cat >> $PGDATA0/recovery.conf <<EOF
standby_mode = yes
primary_conninfo='host=/tmp port=5433'
restore_command='a=$PGARC; if [ -f \$a/%f ]; then cp \$a/%f %p; else exit 1; fi'
EOF
echo "## Starting master as a standby"
if [ "$1" == "w" ]; then touch /tmp/xlogwait; fi
PGPORT=5432 pg_ctl -D $PGDATA0 start
#psql postgres -c "select pg_is_in_recovery();"

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 92adc4e..00b5bc5 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -10604,7 +10604,19 @@ retry:
 											 sources);
 			switched_segment = true;
 			if (readFile < 0)
+			{
+				if (!InArchiveRecovery && ArchiveRecoveryRequested)
+				{
+					InArchiveRecovery = true;
+					goto retry;
+				}
+				else if (!StandbyMode && StandbyModeRequested)
+				{
+					StandbyMode = true;
+					goto retry;
+				}
 				return false;
+			}
 		}
 	}
 }
Sorry, I sent the wrong script.

> The head of origin/REL9_2_STABLE shows the behavior I mentioned in the last message when using the attached shell script. 9.3dev runs as expected.

regards,

--
Kyotaro Horiguchi
NTT Open Source Software Center

#! /bin/sh
pgpath="$HOME/bin/pgsql_924b"
echo $PATH | grep $pgpath | wc -l
if [ `echo $PATH | grep $pgpath | wc -l` == 0 ]; then
    PATH=$pgpath/bin:$PATH
fi
PGDATA0="$HOME/data/pgdata_0"
PGDATA1="$HOME/data/pgdata_1"
PGARC0="$HOME/data/pgarc_0"
PGARC1="$HOME/data/pgarc_1"
PGPORT0=5432
PGPORT1=5433
unset PGPORT
unset PGDATA
echo "Postgresql is \"`which postgres`\""
killall -9 postgres
rm -rf $PGDATA0/* $PGARC0/*
initdb -D $PGDATA0
cat >> $PGDATA0/postgresql.conf <<EOF
port=5432
wal_level = hot_standby
checkpoint_segments = 300
checkpoint_timeout = 1h
archive_mode = on
archive_command = 'cp %p $PGARC0/%f'
max_wal_senders = 3
hot_standby = on
#log_min_messages = debug5
EOF
cat >> $PGDATA0/pg_hba.conf <<EOF
local replication horiguti trust
EOF
echo ## Startup master
pg_ctl -D $PGDATA0 -w -o "-p $PGPORT0" start
rm -rf $PGDATA1/* $PGARC1/*
echo "## basebackup"
pg_basebackup -h /tmp -p $PGPORT0 -F p -X s -D $PGDATA1
chmod 700 $PGDATA1
cat >> $PGDATA1/recovery.conf <<EOF
standby_mode = yes
primary_conninfo='host=/tmp port=5432'
restore_command='a=$PGARC1; if [ -f \$a/%f ]; then cp \$a/%f %p; else exit 1; fi'
EOF
echo "## Startup standby"
pg_ctl -D $PGDATA1 -o "-p $PGPORT1" start
echo "## Sleep for 5 seconds"
sleep 5
echo "## Shutdown standby"
pg_ctl -D $PGDATA1 -w stop -m f
echo "## Shutdown master in immediate mode"
pg_ctl -D $PGDATA0 -w stop -m i
cat >> $PGDATA0/recovery.conf <<EOF
standby_mode = yes
primary_conninfo='host=/tmp port=5433'
restore_command='a=$PGARC0; if [ -f \$a/%f ]; then cp \$a/%f %p; else exit 1; fi'
EOF
echo "## Starting master as a standby"
pg_ctl -D $PGDATA0 start
Hi,

Horiguchi's patch does not seem to record minRecoveryPoint in ReadRecord(); the attached patch records minRecoveryPoint. [crash recovery -> record minRecoveryPoint in control file -> archive recovery] I think that this is the original intention of Heikki's patch.

I also found a bug in the latest 9.2_stable: it does not get the latest timeline and recovery history file during archive recovery when the master and standby timelines differ.

Best regards,

(2013/03/05 18:22), Kyotaro HORIGUCHI wrote:
> Hello, I could reproduce the behavior and think I understand the cause.
>
> The head of origin/REL9_2_STABLE shows the behavior I mentioned in the last message when using the attached shell script. 9.3dev runs as expected.
>
> In XLogPageRead, when RecPtr goes beyond the last page, the current xlog file is released and a new page is requested.
>
> The variables were as below at that point.
>
> StandbyRequested == true
> StandbyMode == false
> ArchiveRecoveryRequested == true
> InArchiveRecovery == false
>
> In this case, XLogPageRead immediately returns false before trying to get WAL via streaming or from the archive. So ReadRecord returns NULL, then unexpectedly exits the 'main redo apply loop' and increases the timeline ID as if it had been promoted.
>
> This seems to be fixed by letting it try all requested sources. The attached patch does that, and the test script then runs as expected.
>
>> We found that PostgreSQL with this patch unexpectedly becomes primary when starting up as a standby. We'll investigate the behavior further.
>
>>>> Anyway, I've committed this to master and 9.2 now.
>>>
>>> This seems to fix the issue. We'll examine this further.
>
> regards,

--
Mitsumasa KONDO
NTT OSS Center
Hmm..

> Horiguchi's patch does not seem to record minRecoveryPoint in ReadRecord(); the attached patch records minRecoveryPoint. [crash recovery -> record minRecoveryPoint in control file -> archive recovery] I think that this is the original intention of Heikki's patch.

It could be. Before that, my patch did not wake up hot standby because I didn't take care of minRecoveryPoint in it :-p

On the other hand, your patch fixes that point, but ReadRecord then runs on the false page data. Although the wrong record on the false page will be identified as broken, that is still undesirable behavior.

If we want to propose the final solution ourselves, we need to make another patch that settles all of this. Let me think about it further..

regards,

--
Kyotaro Horiguchi
NTT Open Source Software Center
Hi, I suppose the attached patch is close to the solution.

> I think that this is the original intention of Heikki's patch.

I noticed that archive recovery will be turned on in next_record_is_invalid thanks to your patch.

> On the other hand, your patch fixes that point, but ReadRecord then runs on the false page data. Although the wrong record on the false page will be identified as broken, that is still undesirable behavior.

All we need to do to update minRecoveryPoint as needed when XLogPageRead fails in ReadRecord is to simply jump to next_record_is_invalid if archive recovery is requested but we are still doing crash recovery.

Your modification in readTimeLineHistory and my modifications in XLogPageRead do not seem necessary for the objective of this thread, so I removed them.

--
Kyotaro Horiguchi
NTT Open Source Software Center

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 92adc4e..28d6f2e 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -4010,7 +4010,15 @@ ReadRecord(XLogRecPtr *RecPtr, int emode, bool fetching_ckpt)
 retry:
 	/* Read the page containing the record */
 	if (!XLogPageRead(RecPtr, emode, fetching_ckpt, randAccess))
+	{
+		/*
+		 * If archive recovery was requested when crash recovery failed, go to
+		 * the label next_record_is_invalid to switch to archive recovery.
+		 */
+		if (!InArchiveRecovery && ArchiveRecoveryRequested)
+			goto next_record_is_invalid;
 		return NULL;
+	}
 
 	pageHeaderSize = XLogPageHeaderSize((XLogPageHeader) readBuf);
 	targetRecOff = RecPtr->xrecoff % XLOG_BLCKSZ;
On 05.03.2013 14:09, KONDO Mitsumasa wrote:
> Hi,
>
> Horiguchi's patch does not seem to record minRecoveryPoint in ReadRecord(); the attached patch records minRecoveryPoint. [crash recovery -> record minRecoveryPoint in control file -> archive recovery]
> I think that this is the original intention of Heikki's patch.

Yeah. That fix isn't right, though; XLogPageRead() is supposed to return true on success, and false on error, and the patch makes it return 'true' on error, if archive recovery was requested but we're still in crash recovery. The real issue here is that I missed the two "return NULL;"s in ReadRecord(), so the code that I put in the next_record_is_invalid codepath isn't run if XLogPageRead() doesn't find the file at all. Attached patch is the proper fix for this.

> I also found a bug in the latest 9.2_stable: it does not get the latest timeline and recovery history file during archive recovery when the master and standby timelines differ.

Works for me.. Can you create a test script for that? Remember to set "recovery_target_timeline='latest'".

- Heikki
(2013/03/06 16:50), Heikki Linnakangas wrote:
>> Hi,
>>
>> Horiguchi's patch does not seem to record minRecoveryPoint in ReadRecord(); the attached patch records minRecoveryPoint. [crash recovery -> record minRecoveryPoint in control file -> archive recovery]
>> I think that this is the original intention of Heikki's patch.
>
> Yeah. That fix isn't right, though; XLogPageRead() is supposed to return true on success, and false on error, and the patch makes it return 'true' on error, if archive recovery was requested but we're still in crash recovery. The real issue here is that I missed the two "return NULL;"s in ReadRecord(), so the code that I put in the next_record_is_invalid codepath isn't run if XLogPageRead() doesn't find the file at all. Attached patch is the proper fix for this.

Thanks for creating the patch! I tested your patch on 9.2_STABLE, but it does not work with the promote command... When XLogPageRead() returns false, it means the end of the standby loop, the crash recovery loop, and the archive recovery loop. Your patch does not handle promoting a standby to master; it never leaves the standby loop.

So I made a new patch based on Heikki's and Horiguchi's patches.

I attached a test script modified from Horiguchi's script. This script does not depend on the shell environment; only PGPATH needs to be fixed. Please execute this test script.

>> I also found a bug in the latest 9.2_stable: it does not get the latest timeline and recovery history file during archive recovery when the master and standby timelines differ.
>
> Works for me.. Can you create a test script for that? Remember to set "recovery_target_timeline='latest'".

I set recovery_target_timeline=latest. hmm... Here is my recovery.conf.

> [mitsu-ko@localhost postgresql]$ cat Standby/recovery.conf
> standby_mode = 'yes'
> recovery_target_timeline='latest'
> primary_conninfo='host=localhost port=65432'
> restore_command='cp ../arc/%f %p'

And my system's log messages are here.

> waiting for server to start....[Standby] LOG:  database system was shut down in recovery at 2013-03-07 02:56:05 EST
> [Standby] LOG:  restored log file "00000002.history" from archive
> cp: cannot stat `../arc/00000003.history': No such file or directory
> [Standby] FATAL:  requested timeline 2 is not a child of database system timeline 1
> [Standby] LOG:  startup process (PID 20941) exited with exit code 1
> [Standby] LOG:  aborting startup due to startup process failure

It can be reproduced by my test script, too. The last command in my test script, which starts the former master as a standby, might not look like a common scenario, but it is what generally happens in a Pacemaker-managed PostgreSQL system.

Best regards,

--
Mitsumasa KONDO
NTT OSS Center
On 07.03.2013 10:05, KONDO Mitsumasa wrote:
> (2013/03/06 16:50), Heikki Linnakangas wrote:
>> Yeah. That fix isn't right, though; XLogPageRead() is supposed to return true on success, and false on error, and the patch makes it return 'true' on error, if archive recovery was requested but we're still in crash recovery. The real issue here is that I missed the two "return NULL;"s in ReadRecord(), so the code that I put in the next_record_is_invalid codepath isn't run if XLogPageRead() doesn't find the file at all. Attached patch is the proper fix for this.
>
> Thanks for creating the patch! I tested your patch on 9.2_STABLE, but it does not work with the promote command... When XLogPageRead() returns false, it means the end of the standby loop, the crash recovery loop, and the archive recovery loop. Your patch does not handle promoting a standby to master; it never leaves the standby loop.
>
> So I made a new patch based on Heikki's and Horiguchi's patches.

Ah, I see. I committed a slightly modified version of this.

>>> I also found a bug in the latest 9.2_stable: it does not get the latest timeline and recovery history file during archive recovery when the master and standby timelines differ.
>>
>> Works for me.. Can you create a test script for that? Remember to set "recovery_target_timeline='latest'".
> ...
> It can be reproduced by my test script, too.

I see the problem now, with that script. So what happens is that the startup process first scans the timeline history files to choose the recovery target timeline. For that scan, I temporarily set InArchiveRecovery=true, in readRecoveryCommandFile. However, after readRecoveryCommandFile returns, we then try to read the timeline history file corresponding to the chosen recovery target timeline, but InArchiveRecovery is no longer set, so we don't fetch the file from the archive, and return a "dummy" history, with just the target timeline in it. That doesn't contain the older timeline, so you get an error at recovery.

Fixed per your patch to check for ArchiveRecoveryRequested instead of InArchiveRecovery, when reading timeline history files. This also makes it unnecessary to temporarily set InArchiveRecovery=true, so removed that.

Committed both fixes. Please confirm that this fixed the problem in your test environment. Many thanks for the testing and the patches!

- Heikki
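The effect of the committed change on the history-file lookup is roughly the following (simplified, not the literal diff): fetch the history file from the archive whenever archive recovery has been requested, even while still in the crash-recovery phase.

    /* in readTimeLineHistory() and friends */
    if (ArchiveRecoveryRequested)
        RestoreArchivedFile(path, histfname, "RECOVERYHISTORY", 0);
    else
        TLHistoryFilePath(path, targetTLI);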
(2013/03/07 19:41), Heikki Linnakangas wrote:
> On 07.03.2013 10:05, KONDO Mitsumasa wrote:
>> (2013/03/06 16:50), Heikki Linnakangas wrote:
>>> Yeah. That fix isn't right, though; XLogPageRead() is supposed to return true on success, and false on error, and the patch makes it return 'true' on error, if archive recovery was requested but we're still in crash recovery. The real issue here is that I missed the two "return NULL;"s in ReadRecord(), so the code that I put in the next_record_is_invalid codepath isn't run if XLogPageRead() doesn't find the file at all. Attached patch is the proper fix for this.
>>
>> Thanks for creating the patch! I tested your patch on 9.2_STABLE, but it does not work with the promote command... When XLogPageRead() returns false, it means the end of the standby loop, the crash recovery loop, and the archive recovery loop. Your patch does not handle promoting a standby to master; it never leaves the standby loop.
>>
>> So I made a new patch based on Heikki's and Horiguchi's patches.
>
> Ah, I see. I committed a slightly modified version of this.

I feel that your modification is very readable. Thanks for modifying and committing the patch!

>>>> I also found a bug in the latest 9.2_stable: it does not get the latest timeline and recovery history file during archive recovery when the master and standby timelines differ.
>>>
>>> Works for me.. Can you create a test script for that? Remember to set "recovery_target_timeline='latest'".
>> ...
>> It can be reproduced by my test script, too.
>
> I see the problem now, with that script. So what happens is that the startup process first scans the timeline history files to choose the recovery target timeline. For that scan, I temporarily set InArchiveRecovery=true, in readRecoveryCommandFile. However, after readRecoveryCommandFile returns, we then try to read the timeline history file corresponding to the chosen recovery target timeline, but InArchiveRecovery is no longer set, so we don't fetch the file from the archive, and return a "dummy" history, with just the target timeline in it. That doesn't contain the older timeline, so you get an error at recovery.
>
> Fixed per your patch to check for ArchiveRecoveryRequested instead of InArchiveRecovery, when reading timeline history files. This also makes it unnecessary to temporarily set InArchiveRecovery=true, so removed that.
>
> Committed both fixes. Please confirm that this fixed the problem in your test environment. Many thanks for the testing and the patches!

I understand this problem. Thank you for the additional modification and the detailed explanation! I tested the latest REL9_2_STABLE on my system and confirmed that it runs well without problems. If I find another problem, I will report it and send you a patch and test script!

Best regards,

--
Mitsumasa KONDO
NTT OSS Center
Everything seems to have been settled above my head while I was sleeping.. Sorry for the clumsy test script, and thank you for refining it, Mitsumasa. And thank you for fixing the bug and for the detailed explanation, Heikki.

I confirmed that the problem is fixed also for me at origin/REL9_2_STABLE.

> I understand this problem. Thank you for the additional modification and the detailed explanation! I tested the latest REL9_2_STABLE on my system and confirmed that it runs well without problems. If I find another problem, I will report it and send you a patch and test script!

regards,

--
Kyotaro Horiguchi
Nippon Telegraph and Telephone Corporation, NTT Open Source Software Center
Phone: 03-5860-5115 / Fax: 03-5463-5490