Thread: PITR failing to stop before DROP DATABASE
In 9.3.5, if I set up archiving, create a database, pull a base backup, look at the clock, drop database, stop the server, rm -rf datadir, put back the backup, edit recovery.conf: cd /tmp; initdb foo edit postgresql.conf with archive_mode = on, archive_command, max_wal_senders = 1, wal_level = hot_standby edit pg_hba.conf pg_ctl start psql -c 'create database db1' psql -c 'checkpoint' pg_basebackup -D /tmp/back1 date Di 25. Nov 17:00:31 CET 2014 psql -c 'drop database db1' pg_ctl stop rm -rf foo; cp -a back1 foo restore_command = 'cp /tmp/%f %p' standby_mode = on recovery_target_time = '2014-11-25 17:00:31' ... and then restart the server, I get: postgres=# \c db1 FATAL: database "db1" does not exist DETAIL: The database subdirectory "base/70262" is missing. 2014-11-25 16:11:22 CET [15225-1] LOG: database system was interrupted; last known up at 2014-11-25 15:56:51 CET 2014-11-25 16:11:22 CET [15225-2] LOG: creating missing WAL directory "pg_xlog/archive_status" 2014-11-25 16:11:22 CET [15225-3] LOG: entering standby mode 2014-11-25 16:11:22 CET [15225-4] LOG: restored log file "000000010000000D0000007B" from archive 2014-11-25 16:11:22 CET [15225-5] LOG: redo starts at D/7B000028 2014-11-25 16:11:22 CET [15225-6] LOG: consistent recovery state reached at D/7B0000F0 2014-11-25 16:11:22 CET [15225-7] LOG: restored log file "000000010000000D0000007C" from archive 2014-11-25 16:11:22 CET [15225-8] LOG: recovery stopping before commit of transaction 60681, time 2014-11-25 15:59:57.071137+01 2014-11-25 16:11:22 CET [15225-9] LOG: redo done at D/7C001110 cp: cannot stat '/tmp/00000002.history': No such file or directory 2014-11-25 16:11:22 CET [15225-10] LOG: selected new timeline ID: 2 cp: cannot stat '/tmp/00000001.history': No such file or directory 2014-11-25 16:11:23 CET [15225-11] LOG: archive recovery complete 2014-11-25 16:11:23 CET [15236-1] [unknown]@[unknown] LOG: incomplete startup packet 2014-11-25 16:11:23 CET [15238-1] LOG: autovacuum launcher started 2014-11-25 16:11:23 CET [15224-1] LOG: database system is ready to accept connections 2014-11-25 16:11:41 CET [15268-1] postgres@db1 FATAL: database "db1" does not exist 2014-11-25 16:11:41 CET [15268-2] postgres@db1 DETAIL: The database subdirectory "base/70262" is missing. db1 is registered in pg_database, but the directory is missing on disk. (The actual log here is from a server with a longer history, but the problem is reproducible in a freshly initdb'ed server as well.) Christoph -- cb@df7cb.de | http://www.df7cb.de/
On 11/25/2014 06:06 PM, Christoph Berg wrote: > In 9.3.5, if I set up archiving, create a database, pull a base > backup, look at the clock, drop database, stop the server, rm -rf > datadir, put back the backup, edit recovery.conf: > > cd /tmp; initdb foo > edit postgresql.conf with archive_mode = on, archive_command, > max_wal_senders = 1, wal_level = hot_standby > edit pg_hba.conf > pg_ctl start > psql -c 'create database db1' > psql -c 'checkpoint' > pg_basebackup -D /tmp/back1 > date > Di 25. Nov 17:00:31 CET 2014 > psql -c 'drop database db1' > pg_ctl stop > rm -rf foo; cp -a back1 foo > > restore_command = 'cp /tmp/%f %p' > standby_mode = on > recovery_target_time = '2014-11-25 17:00:31' > > ... and then restart the server, I get: > > postgres=# \c db1 > FATAL: database "db1" does not exist > DETAIL: The database subdirectory "base/70262" is missing. > > 2014-11-25 16:11:22 CET [15225-1] LOG: database system was interrupted; last known up at 2014-11-25 15:56:51 CET > 2014-11-25 16:11:22 CET [15225-2] LOG: creating missing WAL directory "pg_xlog/archive_status" > 2014-11-25 16:11:22 CET [15225-3] LOG: entering standby mode > 2014-11-25 16:11:22 CET [15225-4] LOG: restored log file "000000010000000D0000007B" from archive > 2014-11-25 16:11:22 CET [15225-5] LOG: redo starts at D/7B000028 > 2014-11-25 16:11:22 CET [15225-6] LOG: consistent recovery state reached at D/7B0000F0 > 2014-11-25 16:11:22 CET [15225-7] LOG: restored log file "000000010000000D0000007C" from archive > 2014-11-25 16:11:22 CET [15225-8] LOG: recovery stopping before commit of transaction 60681, time 2014-11-25 15:59:57.071137+01 > 2014-11-25 16:11:22 CET [15225-9] LOG: redo done at D/7C001110 > cp: cannot stat '/tmp/00000002.history': No such file or directory > 2014-11-25 16:11:22 CET [15225-10] LOG: selected new timeline ID: 2 > cp: cannot stat '/tmp/00000001.history': No such file or directory > 2014-11-25 16:11:23 CET [15225-11] LOG: archive recovery complete > 2014-11-25 16:11:23 CET [15236-1] [unknown]@[unknown] LOG: incomplete startup packet > 2014-11-25 16:11:23 CET [15238-1] LOG: autovacuum launcher started > 2014-11-25 16:11:23 CET [15224-1] LOG: database system is ready to accept connections > 2014-11-25 16:11:41 CET [15268-1] postgres@db1 FATAL: database "db1" does not exist > 2014-11-25 16:11:41 CET [15268-2] postgres@db1 DETAIL: The database subdirectory "base/70262" is missing. > > db1 is registered in pg_database, but the directory is missing on > disk. Yeah, DROP DATABASE cheats. It deletes all the files first, and commits the transaction only after that. There's this comment at the end of dropdb() function: > /* > * Force synchronous commit, thus minimizing the window between removal of > * the database files and commital of the transaction. If we crash before > * committing, we'll have a DB that's gone on disk but still there > * according to pg_database, which is not good. > */ So you could see the same after crash recovery, but it's a lot easier to reproduce with PITR. This could be fixed by doing DROP DATABASE the same way we do DROP TABLE. At the DROP DATABASE command, just memorize the OID of the dropped database, but don't delete anything yet. Perform the actual deletion after flushing the commit record to disk. But then you would have the opposite problem - you might be left with a database that's dropped according to pg_database, but the files are still present on disk. - Heikki
On 25.11.2014 18:11, Heikki Linnakangas wrote: > On 11/25/2014 06:06 PM, Christoph Berg wrote: > >> db1 is registered in pg_database, but the directory is missing on >> disk. > > Yeah, DROP DATABASE cheats. It deletes all the files first, and commits > the transaction only after that. There's this comment at the end of > dropdb() function: > >> /* >> * Force synchronous commit, thus minimizing the window between >> removal of >> * the database files and commital of the transaction. If we crash >> before >> * committing, we'll have a DB that's gone on disk but still there >> * according to pg_database, which is not good. >> */ > > So you could see the same after crash recovery, but it's a lot easier to > reproduce with PITR. So we remove the files, and if there happens to be a crash at the right moment, it results in a database with a record in pg_database, but no directory/files? Is it possible to cancel the command half-way through, leaving the database in an essentially broken state (half the files exists, half is already deleted? > This could be fixed by doing DROP DATABASE the same way we do DROP > TABLE. At the DROP DATABASE command, just memorize the OID of the > dropped database, but don't delete anything yet. Perform the actual > deletion after flushing the commit record to disk. But then you would > have the opposite problem - you might be left with a database that's > dropped according to pg_database, but the files are still present on disk. I'm in favor of modifying DROP DATABASE so that it behaves like DROP TABLE. I find it more consistent (two DROP commands should not really do such wildly different things). ISTM this would allow us to remove the ForceSyncCommit(); and allow DROP DATABASE to be executed within a transaction (say, within a management script etc.). Tomas
On 11/25/2014 11:01 PM, Tomas Vondra wrote: > [snip] >> So you could see the same after crash recovery, but it's a lot easier to >> reproduce with PITR. > So we remove the files, and if there happens to be a crash at the right > moment, it results in a database with a record in pg_database, but no > directory/files? > > Is it possible to cancel the command half-way through, leaving the > database in an essentially broken state (half the files exists, half is > already deleted? > > >> This could be fixed by doing DROP DATABASE the same way we do DROP >> TABLE. At the DROP DATABASE command, just memorize the OID of the >> dropped database, but don't delete anything yet. Perform the actual >> deletion after flushing the commit record to disk. But then you would >> have the opposite problem - you might be left with a database that's >> dropped according to pg_database, but the files are still present on disk. A rename from <oid>/ to <oid>.del/ is indeed atomic ... ... and so DROP DATABASE could become: - lock database (sic) - rename dir - delete from pg_database - flush commit recordto disk - background? deletion of all files (the command would not return until this last phase finished) If we cancel before flushing, the undo (and recovery after a crash) would just rename the directory back in place. If we flushed, recovery after re-starting postmaster would simply delete the remaining files within <oid>.del/ and then rmdir the directory itself. ... since essentially any other possibility will just leave an inconsistent db around and wastes space unnecesarily. Just my two cents, of course. > I'm in favor of modifying DROP DATABASE so that it behaves like DROP > TABLE. I find it more consistent (two DROP commands should not really do > such wildly different things). +1 > ISTM this would allow us to remove the ForceSyncCommit(); and allow DROP > DATABASE to be executed within a transaction (say, within a management > script etc.). Regards, / J.L.
Re: Heikki Linnakangas 2014-11-25 <5474B848.3060909@vmware.com> > >db1 is registered in pg_database, but the directory is missing on > >disk. > > Yeah, DROP DATABASE cheats. It deletes all the files first, and commits the > transaction only after that. There's this comment at the end of dropdb() > function: Oh ok. So this is an artifact of the non-transactionality (is this a word?) of CREATE DATABASE. > > > /* > > * Force synchronous commit, thus minimizing the window between removal of > > * the database files and commital of the transaction. If we crash before > > * committing, we'll have a DB that's gone on disk but still there > > * according to pg_database, which is not good. > > */ > > So you could see the same after crash recovery, but it's a lot easier to > reproduce with PITR. > > This could be fixed by doing DROP DATABASE the same way we do DROP TABLE. At > the DROP DATABASE command, just memorize the OID of the dropped database, > but don't delete anything yet. Perform the actual deletion after flushing > the commit record to disk. But then you would have the opposite problem - > you might be left with a database that's dropped according to pg_database, > but the files are still present on disk. My concern is mostly that "PITR to just before an accidental DROP DATABASE" is one of the primary use cases for PITR, so it should Just Work. (I ran into this during a training and had a hard time explaining why PITR bugs exist :) I just did another test, and as expected, the problem goes away if I execute any transaction just before the DROP DATABASE - even a simple "SELECT txid_current()" is enough. So my suggestion for a simple fix would be to make DROP DATABASE execute a short fake transaction before it starts deleting files and then continue as before. This would serve as a stopping point for recovery_target_time to run into. (We could still fix this properly later, but this idea seems like a good fix for a practical problem that doesn't break anything else.) Christoph -- cb@df7cb.de | http://www.df7cb.de/
On 11/26/2014 11:19 AM, Christoph Berg wrote: > Re: Heikki Linnakangas 2014-11-25 <5474B848.3060909@vmware.com> >>> db1 is registered in pg_database, but the directory is missing on >>> disk. >> >> Yeah, DROP DATABASE cheats. It deletes all the files first, and commits the >> transaction only after that. There's this comment at the end of dropdb() >> function: > > Oh ok. So this is an artifact of the non-transactionality (is this a > word?) of CREATE DATABASE. DROP DATABASE. CREATE DATABASE is a different story. It does similar non-transactional tricks and has similar issues, but it's a completely different codepath and could be fixed independently of DROP DATABASE. >>> /* >>> * Force synchronous commit, thus minimizing the window between removal of >>> * the database files and commital of the transaction. If we crash before >>> * committing, we'll have a DB that's gone on disk but still there >>> * according to pg_database, which is not good. >>> */ >> >> So you could see the same after crash recovery, but it's a lot easier to >> reproduce with PITR. >> >> This could be fixed by doing DROP DATABASE the same way we do DROP TABLE. At >> the DROP DATABASE command, just memorize the OID of the dropped database, >> but don't delete anything yet. Perform the actual deletion after flushing >> the commit record to disk. But then you would have the opposite problem - >> you might be left with a database that's dropped according to pg_database, >> but the files are still present on disk. > > My concern is mostly that "PITR to just before an accidental DROP > DATABASE" is one of the primary use cases for PITR, so it should Just > Work. (I ran into this during a training and had a hard time > explaining why PITR bugs exist :) > > I just did another test, and as expected, the problem goes away if I > execute any transaction just before the DROP DATABASE - even a simple > "SELECT txid_current()" is enough. > > So my suggestion for a simple fix would be to make DROP DATABASE > execute a short fake transaction before it starts deleting files and > then continue as before. This would serve as a stopping point for > recovery_target_time to run into. (We could still fix this properly > later, but this idea seems like a good fix for a practical problem > that doesn't break anything else.) Yeah, seems reasonable. - Heikki
Re: Heikki Linnakangas 2014-11-26 <54759BC0.4070505@vmware.com> > >Oh ok. So this is an artifact of the non-transactionality (is this a > >word?) of CREATE DATABASE. > > DROP DATABASE. CREATE DATABASE is a different story. It does similar > non-transactional tricks and has similar issues, but it's a completely > different codepath and could be fixed independently of DROP DATABASE. Err right. Too early in the morning... > >So my suggestion for a simple fix would be to make DROP DATABASE > >execute a short fake transaction before it starts deleting files and > >then continue as before. This would serve as a stopping point for > >recovery_target_time to run into. (We could still fix this properly > >later, but this idea seems like a good fix for a practical problem > >that doesn't break anything else.) > > Yeah, seems reasonable. Here's a first shot at a patch. It's not working yet because I think the commit isn't doing anything because no work was done in the transaction yet. *** a/src/backend/commands/dbcommands.c --- b/src/backend/commands/dbcommands.c *************** dropdb(const char *dbname, bool missing_ *** 778,783 **** --- 778,798 ---- nslots_active; /* + * Commit now to cause a commit xlog record to be logged. (We are outside + * any transaction so this is safe to do.) If we don't do this here, doing + * a PITR restore to just before DROP DATABASE will cause the files on disk + * to be deleted, while PITR stops before removing the database from the + * system catalogs, so the database is still visible while it is in fact + * already deleted. It is still possible to get to this intermediate state + * by selecting the correct transaction number in recovery.conf, but this + * fixes the common use case of specifying a recovery target time just + * before DROP DATABASE. + */ + PopActiveSnapshot(); + CommitTransactionCommand(); + StartTransactionCommand(); + + /* * Look up the target database's OID, and get exclusive lock on it. We * need this to ensure that no newbackend starts up in the target * database while we are deleting it (see postinit.c), and that no one is Christoph -- cb@df7cb.de | http://www.df7cb.de/
On Wed, Nov 26, 2014 at 12:35:27PM +0100, Christoph Berg wrote: > Re: Heikki Linnakangas 2014-11-26 <54759BC0.4070505@vmware.com> > > >Oh ok. So this is an artifact of the non-transactionality (is this a > > >word?) of CREATE DATABASE. > > > > DROP DATABASE. CREATE DATABASE is a different story. It does similar > > non-transactional tricks and has similar issues, but it's a completely > > different codepath and could be fixed independently of DROP DATABASE. > > Err right. Too early in the morning... > > > >So my suggestion for a simple fix would be to make DROP DATABASE > > >execute a short fake transaction before it starts deleting files and > > >then continue as before. This would serve as a stopping point for > > >recovery_target_time to run into. (We could still fix this properly > > >later, but this idea seems like a good fix for a practical problem > > >that doesn't break anything else.) > > > > Yeah, seems reasonable. > > Here's a first shot at a patch. It's not working yet because I think > the commit isn't doing anything because no work was done in the > transaction yet. Where are we on this? -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://enterprisedb.com + Everyone has their own god. +
Re: Bruce Momjian 2015-03-20 <20150320223549.GZ6317@momjian.us> > > > >So my suggestion for a simple fix would be to make DROP DATABASE > > > >execute a short fake transaction before it starts deleting files and > > > >then continue as before. This would serve as a stopping point for > > > >recovery_target_time to run into. (We could still fix this properly > > > >later, but this idea seems like a good fix for a practical problem > > > >that doesn't break anything else.) > > > > > > Yeah, seems reasonable. > > > > Here's a first shot at a patch. It's not working yet because I think > > the commit isn't doing anything because no work was done in the > > transaction yet. > > Where are we on this? I guess my patch could be fixed by forcing it to acquire a transaction id (SELECT txid_current() or whatever seems suitable), but my insight into the gory backend details is limited, so it'd be better if someone with more clue tried to fix it. Re: Heikki Linnakangas 2014-11-25 <5474B848.3060909@vmware.com> > >db1 is registered in pg_database, but the directory is missing on > >disk. > > Yeah, DROP DATABASE cheats. It deletes all the files first, and commits the > transaction only after that. There's this comment at the end of dropdb() > function: > > > /* > > * Force synchronous commit, thus minimizing the window between removal of > > * the database files and commital of the transaction. If we crash before > > * committing, we'll have a DB that's gone on disk but still there > > * according to pg_database, which is not good. > > */ > > So you could see the same after crash recovery, but it's a lot easier to > reproduce with PITR. > > This could be fixed by doing DROP DATABASE the same way we do DROP TABLE. At > the DROP DATABASE command, just memorize the OID of the dropped database, > but don't delete anything yet. Perform the actual deletion after flushing > the commit record to disk. But then you would have the opposite problem - > you might be left with a database that's dropped according to pg_database, > but the files are still present on disk. This seems to be the better idea anyway (and was mentioned again when I talked to Heikki and Andres about it at FOSDEM). The "opposite" problem wouldn't be so bad I guess - it might be visible in practise where you could easily clean up later, but the original problem is pretty bad if you hit it when trying to do PITR because something bad happened. (And we treat DROP TABLE the same.) Christoph -- cb@df7cb.de | http://www.df7cb.de/
On March 22, 2015 12:17:57 PM GMT+01:00, Christoph Berg <cb@df7cb.de> wrote: >Re: Bruce Momjian 2015-03-20 <20150320223549.GZ6317@momjian.us> >> > > >So my suggestion for a simple fix would be to make DROP DATABASE >> > > >execute a short fake transaction before it starts deleting files >and >> > > >then continue as before. This would serve as a stopping point >for >> > > >recovery_target_time to run into. (We could still fix this >properly >> > > >later, but this idea seems like a good fix for a practical >problem >> > > >that doesn't break anything else.) >> > > >> > > Yeah, seems reasonable. >> > >> > Here's a first shot at a patch. It's not working yet because I >think >> > the commit isn't doing anything because no work was done in the >> > transaction yet. >> >> Where are we on this? > >I guess my patch could be fixed by forcing it to acquire a transaction >id (SELECT txid_current() or whatever seems suitable), but my insight >into the gory backend details is limited, so it'd be better if someone >with more clue tried to fix it. It'd need to do a GetTopTransactionId(). >Re: Heikki Linnakangas 2014-11-25 <5474B848.3060909@vmware.com> >> >db1 is registered in pg_database, but the directory is missing on >> >disk. >> >> Yeah, DROP DATABASE cheats. It deletes all the files first, and >commits the >> transaction only after that. There's this comment at the end of >dropdb() >> function: >> >> > /* >> > * Force synchronous commit, thus minimizing the window between >removal of >> > * the database files and commital of the transaction. If we crash >before >> > * committing, we'll have a DB that's gone on disk but still there >> > * according to pg_database, which is not good. >> > */ >> >> So you could see the same after crash recovery, but it's a lot easier >to >> reproduce with PITR. >> >> This could be fixed by doing DROP DATABASE the same way we do DROP >TABLE. At >> the DROP DATABASE command, just memorize the OID of the dropped >database, >> but don't delete anything yet. Perform the actual deletion after >flushing >> the commit record to disk. But then you would have the opposite >problem - >> you might be left with a database that's dropped according to >pg_database, >> but the files are still present on disk. > >This seems to be the better idea anyway (and was mentioned again when >I talked to Heikki and Andres about it at FOSDEM). Actually allowing for things like this was one of the reasons for the commit/abort extensibility stuff I committed a fewdays ago. --- Please excuse brevity and formatting - I am writing this on my mobile phone.