Thread: PITR failing to stop before DROP DATABASE

PITR failing to stop before DROP DATABASE

From
Christoph Berg
Date:
In 9.3.5, if I set up archiving, create a database, pull a base
backup, look at the clock, drop database, stop the server, rm -rf
datadir, put back the backup, edit recovery.conf:

cd /tmp; initdb foo
edit postgresql.conf with archive_mode = on, archive_command,
max_wal_senders = 1, wal_level = hot_standby
edit pg_hba.conf
pg_ctl start
psql -c 'create database db1'
psql -c 'checkpoint'
pg_basebackup -D /tmp/back1
date
Di 25. Nov 17:00:31 CET 2014
psql -c 'drop database db1'
pg_ctl stop
rm -rf foo; cp -a back1 foo

restore_command = 'cp /tmp/%f %p'
standby_mode = on
recovery_target_time = '2014-11-25 17:00:31'

... and then restart the server, I get:

postgres=# \c db1
FATAL:  database "db1" does not exist
DETAIL:  The database subdirectory "base/70262" is missing.

2014-11-25 16:11:22 CET [15225-1] LOG:  database system was interrupted; last known up at 2014-11-25 15:56:51 CET
2014-11-25 16:11:22 CET [15225-2] LOG:  creating missing WAL directory "pg_xlog/archive_status"
2014-11-25 16:11:22 CET [15225-3] LOG:  entering standby mode
2014-11-25 16:11:22 CET [15225-4] LOG:  restored log file "000000010000000D0000007B" from archive
2014-11-25 16:11:22 CET [15225-5] LOG:  redo starts at D/7B000028
2014-11-25 16:11:22 CET [15225-6] LOG:  consistent recovery state reached at D/7B0000F0
2014-11-25 16:11:22 CET [15225-7] LOG:  restored log file "000000010000000D0000007C" from archive
2014-11-25 16:11:22 CET [15225-8] LOG:  recovery stopping before commit of transaction 60681, time 2014-11-25
15:59:57.071137+01
2014-11-25 16:11:22 CET [15225-9] LOG:  redo done at D/7C001110
cp: cannot stat '/tmp/00000002.history': No such file or directory
2014-11-25 16:11:22 CET [15225-10] LOG:  selected new timeline ID: 2
cp: cannot stat '/tmp/00000001.history': No such file or directory
2014-11-25 16:11:23 CET [15225-11] LOG:  archive recovery complete
2014-11-25 16:11:23 CET [15236-1] [unknown]@[unknown] LOG:  incomplete startup packet
2014-11-25 16:11:23 CET [15238-1] LOG:  autovacuum launcher started
2014-11-25 16:11:23 CET [15224-1] LOG:  database system is ready to accept connections
2014-11-25 16:11:41 CET [15268-1] postgres@db1 FATAL:  database "db1" does not exist
2014-11-25 16:11:41 CET [15268-2] postgres@db1 DETAIL:  The database subdirectory "base/70262" is missing.

db1 is registered in pg_database, but the directory is missing on
disk.

(The actual log here is from a server with a longer history, but the
problem is reproducible in a freshly initdb'ed server as well.)

Christoph
-- 
cb@df7cb.de | http://www.df7cb.de/



Re: PITR failing to stop before DROP DATABASE

From
Heikki Linnakangas
Date:
On 11/25/2014 06:06 PM, Christoph Berg wrote:
> In 9.3.5, if I set up archiving, create a database, pull a base
> backup, look at the clock, drop database, stop the server, rm -rf
> datadir, put back the backup, edit recovery.conf:
>
> cd /tmp; initdb foo
> edit postgresql.conf with archive_mode = on, archive_command,
> max_wal_senders = 1, wal_level = hot_standby
> edit pg_hba.conf
> pg_ctl start
> psql -c 'create database db1'
> psql -c 'checkpoint'
> pg_basebackup -D /tmp/back1
> date
> Di 25. Nov 17:00:31 CET 2014
> psql -c 'drop database db1'
> pg_ctl stop
> rm -rf foo; cp -a back1 foo
>
> restore_command = 'cp /tmp/%f %p'
> standby_mode = on
> recovery_target_time = '2014-11-25 17:00:31'
>
> ... and then restart the server, I get:
>
> postgres=# \c db1
> FATAL:  database "db1" does not exist
> DETAIL:  The database subdirectory "base/70262" is missing.
>
> 2014-11-25 16:11:22 CET [15225-1] LOG:  database system was interrupted; last known up at 2014-11-25 15:56:51 CET
> 2014-11-25 16:11:22 CET [15225-2] LOG:  creating missing WAL directory "pg_xlog/archive_status"
> 2014-11-25 16:11:22 CET [15225-3] LOG:  entering standby mode
> 2014-11-25 16:11:22 CET [15225-4] LOG:  restored log file "000000010000000D0000007B" from archive
> 2014-11-25 16:11:22 CET [15225-5] LOG:  redo starts at D/7B000028
> 2014-11-25 16:11:22 CET [15225-6] LOG:  consistent recovery state reached at D/7B0000F0
> 2014-11-25 16:11:22 CET [15225-7] LOG:  restored log file "000000010000000D0000007C" from archive
> 2014-11-25 16:11:22 CET [15225-8] LOG:  recovery stopping before commit of transaction 60681, time 2014-11-25
15:59:57.071137+01
> 2014-11-25 16:11:22 CET [15225-9] LOG:  redo done at D/7C001110
> cp: cannot stat '/tmp/00000002.history': No such file or directory
> 2014-11-25 16:11:22 CET [15225-10] LOG:  selected new timeline ID: 2
> cp: cannot stat '/tmp/00000001.history': No such file or directory
> 2014-11-25 16:11:23 CET [15225-11] LOG:  archive recovery complete
> 2014-11-25 16:11:23 CET [15236-1] [unknown]@[unknown] LOG:  incomplete startup packet
> 2014-11-25 16:11:23 CET [15238-1] LOG:  autovacuum launcher started
> 2014-11-25 16:11:23 CET [15224-1] LOG:  database system is ready to accept connections
> 2014-11-25 16:11:41 CET [15268-1] postgres@db1 FATAL:  database "db1" does not exist
> 2014-11-25 16:11:41 CET [15268-2] postgres@db1 DETAIL:  The database subdirectory "base/70262" is missing.
>
> db1 is registered in pg_database, but the directory is missing on
> disk.

Yeah, DROP DATABASE cheats. It deletes all the files first, and commits 
the transaction only after that. There's this comment at the end of 
dropdb() function:

>     /*
>      * Force synchronous commit, thus minimizing the window between removal of
>      * the database files and commital of the transaction. If we crash before
>      * committing, we'll have a DB that's gone on disk but still there
>      * according to pg_database, which is not good.
>      */

So you could see the same after crash recovery, but it's a lot easier to 
reproduce with PITR.

This could be fixed by doing DROP DATABASE the same way we do DROP 
TABLE. At the DROP DATABASE command, just memorize the OID of the 
dropped database, but don't delete anything yet. Perform the actual 
deletion after flushing the commit record to disk. But then you would 
have the opposite problem - you might be left with a database that's 
dropped according to pg_database, but the files are still present on disk.

- Heikki




Re: PITR failing to stop before DROP DATABASE

From
Tomas Vondra
Date:
On 25.11.2014 18:11, Heikki Linnakangas wrote:
> On 11/25/2014 06:06 PM, Christoph Berg wrote:
>
>> db1 is registered in pg_database, but the directory is missing on
>> disk.
> 
> Yeah, DROP DATABASE cheats. It deletes all the files first, and commits
> the transaction only after that. There's this comment at the end of
> dropdb() function:
> 
>>     /*
>>      * Force synchronous commit, thus minimizing the window between
>> removal of
>>      * the database files and commital of the transaction. If we crash
>> before
>>      * committing, we'll have a DB that's gone on disk but still there
>>      * according to pg_database, which is not good.
>>      */
> 
> So you could see the same after crash recovery, but it's a lot easier to
> reproduce with PITR.

So we remove the files, and if there happens to be a crash at the right
moment, it results in a database with a record in pg_database, but no
directory/files?

Is it possible to cancel the command half-way through, leaving the
database in an essentially broken state (half the files exists, half is
already deleted?


> This could be fixed by doing DROP DATABASE the same way we do DROP
> TABLE. At the DROP DATABASE command, just memorize the OID of the
> dropped database, but don't delete anything yet. Perform the actual
> deletion after flushing the commit record to disk. But then you would
> have the opposite problem - you might be left with a database that's
> dropped according to pg_database, but the files are still present on disk.

I'm in favor of modifying DROP DATABASE so that it behaves like DROP
TABLE. I find it more consistent (two DROP commands should not really do
such wildly different things).

ISTM this would allow us to remove the ForceSyncCommit(); and allow DROP
DATABASE to be executed within a transaction (say, within a management
script etc.).

Tomas





Re: PITR failing to stop before DROP DATABASE

From
José Luis Tallón
Date:
On 11/25/2014 11:01 PM, Tomas Vondra wrote:
> [snip]
>> So you could see the same after crash recovery, but it's a lot easier to
>> reproduce with PITR.
> So we remove the files, and if there happens to be a crash at the right
> moment, it results in a database with a record in pg_database, but no
> directory/files?
>
> Is it possible to cancel the command half-way through, leaving the
> database in an essentially broken state (half the files exists, half is
> already deleted?
>
>
>> This could be fixed by doing DROP DATABASE the same way we do DROP
>> TABLE. At the DROP DATABASE command, just memorize the OID of the
>> dropped database, but don't delete anything yet. Perform the actual
>> deletion after flushing the commit record to disk. But then you would
>> have the opposite problem - you might be left with a database that's
>> dropped according to pg_database, but the files are still present on disk.

A rename from <oid>/ to <oid>.del/ is indeed atomic ...

... and so DROP DATABASE could become: - lock database (sic) - rename dir - delete from pg_database - flush commit
recordto disk - background? deletion of all files (the command would not return until this last phase finished)
 

If we cancel before flushing, the undo (and recovery after a crash) 
would just rename the directory back in place.
If we flushed, recovery after re-starting postmaster would simply delete 
the remaining files within <oid>.del/ and then rmdir the directory itself.
  ... since essentially any other possibility will just leave an 
inconsistent db around and wastes space unnecesarily.

Just my two cents, of course.

> I'm in favor of modifying DROP DATABASE so that it behaves like DROP
> TABLE. I find it more consistent (two DROP commands should not really do
> such wildly different things).

+1
> ISTM this would allow us to remove the ForceSyncCommit(); and allow DROP
> DATABASE to be executed within a transaction (say, within a management
> script etc.).


Regards,
    / J.L.




Re: PITR failing to stop before DROP DATABASE

From
Christoph Berg
Date:
Re: Heikki Linnakangas 2014-11-25 <5474B848.3060909@vmware.com>
> >db1 is registered in pg_database, but the directory is missing on
> >disk.
> 
> Yeah, DROP DATABASE cheats. It deletes all the files first, and commits the
> transaction only after that. There's this comment at the end of dropdb()
> function:

Oh ok. So this is an artifact of the non-transactionality (is this a
word?) of CREATE DATABASE.

> 
> >    /*
> >     * Force synchronous commit, thus minimizing the window between removal of
> >     * the database files and commital of the transaction. If we crash before
> >     * committing, we'll have a DB that's gone on disk but still there
> >     * according to pg_database, which is not good.
> >     */
> 
> So you could see the same after crash recovery, but it's a lot easier to
> reproduce with PITR.
> 
> This could be fixed by doing DROP DATABASE the same way we do DROP TABLE. At
> the DROP DATABASE command, just memorize the OID of the dropped database,
> but don't delete anything yet. Perform the actual deletion after flushing
> the commit record to disk. But then you would have the opposite problem -
> you might be left with a database that's dropped according to pg_database,
> but the files are still present on disk.

My concern is mostly that "PITR to just before an accidental DROP
DATABASE" is one of the primary use cases for PITR, so it should Just
Work. (I ran into this during a training and had a hard time
explaining why PITR bugs exist :)

I just did another test, and as expected, the problem goes away if I
execute any transaction just before the DROP DATABASE - even a simple
"SELECT txid_current()" is enough.

So my suggestion for a simple fix would be to make DROP DATABASE
execute a short fake transaction before it starts deleting files and
then continue as before. This would serve as a stopping point for
recovery_target_time to run into. (We could still fix this properly
later, but this idea seems like a good fix for a practical problem
that doesn't break anything else.)

Christoph
-- 
cb@df7cb.de | http://www.df7cb.de/



Re: PITR failing to stop before DROP DATABASE

From
Heikki Linnakangas
Date:
On 11/26/2014 11:19 AM, Christoph Berg wrote:
> Re: Heikki Linnakangas 2014-11-25 <5474B848.3060909@vmware.com>
>>> db1 is registered in pg_database, but the directory is missing on
>>> disk.
>>
>> Yeah, DROP DATABASE cheats. It deletes all the files first, and commits the
>> transaction only after that. There's this comment at the end of dropdb()
>> function:
>
> Oh ok. So this is an artifact of the non-transactionality (is this a
> word?) of CREATE DATABASE.

DROP DATABASE. CREATE DATABASE is a different story. It does similar 
non-transactional tricks and has similar issues, but it's a completely 
different codepath and could be fixed independently of DROP DATABASE.

>>>     /*
>>>      * Force synchronous commit, thus minimizing the window between removal of
>>>      * the database files and commital of the transaction. If we crash before
>>>      * committing, we'll have a DB that's gone on disk but still there
>>>      * according to pg_database, which is not good.
>>>      */
>>
>> So you could see the same after crash recovery, but it's a lot easier to
>> reproduce with PITR.
>>
>> This could be fixed by doing DROP DATABASE the same way we do DROP TABLE. At
>> the DROP DATABASE command, just memorize the OID of the dropped database,
>> but don't delete anything yet. Perform the actual deletion after flushing
>> the commit record to disk. But then you would have the opposite problem -
>> you might be left with a database that's dropped according to pg_database,
>> but the files are still present on disk.
>
> My concern is mostly that "PITR to just before an accidental DROP
> DATABASE" is one of the primary use cases for PITR, so it should Just
> Work. (I ran into this during a training and had a hard time
> explaining why PITR bugs exist :)
>
> I just did another test, and as expected, the problem goes away if I
> execute any transaction just before the DROP DATABASE - even a simple
> "SELECT txid_current()" is enough.
>
> So my suggestion for a simple fix would be to make DROP DATABASE
> execute a short fake transaction before it starts deleting files and
> then continue as before. This would serve as a stopping point for
> recovery_target_time to run into. (We could still fix this properly
> later, but this idea seems like a good fix for a practical problem
> that doesn't break anything else.)

Yeah, seems reasonable.

- Heikki




Re: PITR failing to stop before DROP DATABASE

From
Christoph Berg
Date:
Re: Heikki Linnakangas 2014-11-26 <54759BC0.4070505@vmware.com>
> >Oh ok. So this is an artifact of the non-transactionality (is this a
> >word?) of CREATE DATABASE.
> 
> DROP DATABASE. CREATE DATABASE is a different story. It does similar
> non-transactional tricks and has similar issues, but it's a completely
> different codepath and could be fixed independently of DROP DATABASE.

Err right. Too early in the morning...

> >So my suggestion for a simple fix would be to make DROP DATABASE
> >execute a short fake transaction before it starts deleting files and
> >then continue as before. This would serve as a stopping point for
> >recovery_target_time to run into. (We could still fix this properly
> >later, but this idea seems like a good fix for a practical problem
> >that doesn't break anything else.)
> 
> Yeah, seems reasonable.

Here's a first shot at a patch. It's not working yet because I think
the commit isn't doing anything because no work was done in the
transaction yet.

*** a/src/backend/commands/dbcommands.c
--- b/src/backend/commands/dbcommands.c
*************** dropdb(const char *dbname, bool missing_
*** 778,783 ****
--- 778,798 ----                 nslots_active;      /*
+      * Commit now to cause a commit xlog record to be logged.  (We are outside
+      * any transaction so this is safe to do.)  If we don't do this here, doing
+      * a PITR restore to just before DROP DATABASE will cause the files on disk
+      * to be deleted, while PITR stops before removing the database from the
+      * system catalogs, so the database is still visible while it is in fact
+      * already deleted.  It is still possible to get to this intermediate state
+      * by selecting the correct transaction number in recovery.conf, but this
+      * fixes the common use case of specifying a recovery target time just
+      * before DROP DATABASE.
+      */
+     PopActiveSnapshot();
+     CommitTransactionCommand();
+     StartTransactionCommand();
+ 
+     /*      * Look up the target database's OID, and get exclusive lock on it. We      * need this to ensure that no
newbackend starts up in the target      * database while we are deleting it (see postinit.c), and that no one is
 


Christoph
-- 
cb@df7cb.de | http://www.df7cb.de/



Re: PITR failing to stop before DROP DATABASE

From
Bruce Momjian
Date:
On Wed, Nov 26, 2014 at 12:35:27PM +0100, Christoph Berg wrote:
> Re: Heikki Linnakangas 2014-11-26 <54759BC0.4070505@vmware.com>
> > >Oh ok. So this is an artifact of the non-transactionality (is this a
> > >word?) of CREATE DATABASE.
> > 
> > DROP DATABASE. CREATE DATABASE is a different story. It does similar
> > non-transactional tricks and has similar issues, but it's a completely
> > different codepath and could be fixed independently of DROP DATABASE.
> 
> Err right. Too early in the morning...
> 
> > >So my suggestion for a simple fix would be to make DROP DATABASE
> > >execute a short fake transaction before it starts deleting files and
> > >then continue as before. This would serve as a stopping point for
> > >recovery_target_time to run into. (We could still fix this properly
> > >later, but this idea seems like a good fix for a practical problem
> > >that doesn't break anything else.)
> > 
> > Yeah, seems reasonable.
> 
> Here's a first shot at a patch. It's not working yet because I think
> the commit isn't doing anything because no work was done in the
> transaction yet.

Where are we on this?

--  Bruce Momjian  <bruce@momjian.us>        http://momjian.us EnterpriseDB
http://enterprisedb.com
 + Everyone has their own god. +



Re: PITR failing to stop before DROP DATABASE

From
Christoph Berg
Date:
Re: Bruce Momjian 2015-03-20 <20150320223549.GZ6317@momjian.us>
> > > >So my suggestion for a simple fix would be to make DROP DATABASE
> > > >execute a short fake transaction before it starts deleting files and
> > > >then continue as before. This would serve as a stopping point for
> > > >recovery_target_time to run into. (We could still fix this properly
> > > >later, but this idea seems like a good fix for a practical problem
> > > >that doesn't break anything else.)
> > > 
> > > Yeah, seems reasonable.
> > 
> > Here's a first shot at a patch. It's not working yet because I think
> > the commit isn't doing anything because no work was done in the
> > transaction yet.
> 
> Where are we on this?

I guess my patch could be fixed by forcing it to acquire a transaction
id (SELECT txid_current() or whatever seems suitable), but my insight
into the gory backend details is limited, so it'd be better if someone
with more clue tried to fix it.


Re: Heikki Linnakangas 2014-11-25 <5474B848.3060909@vmware.com>
> >db1 is registered in pg_database, but the directory is missing on
> >disk.
> 
> Yeah, DROP DATABASE cheats. It deletes all the files first, and commits the
> transaction only after that. There's this comment at the end of dropdb()
> function:
> 
> >    /*
> >     * Force synchronous commit, thus minimizing the window between removal of
> >     * the database files and commital of the transaction. If we crash before
> >     * committing, we'll have a DB that's gone on disk but still there
> >     * according to pg_database, which is not good.
> >     */
> 
> So you could see the same after crash recovery, but it's a lot easier to
> reproduce with PITR.
> 
> This could be fixed by doing DROP DATABASE the same way we do DROP TABLE. At
> the DROP DATABASE command, just memorize the OID of the dropped database,
> but don't delete anything yet. Perform the actual deletion after flushing
> the commit record to disk. But then you would have the opposite problem -
> you might be left with a database that's dropped according to pg_database,
> but the files are still present on disk.

This seems to be the better idea anyway (and was mentioned again when
I talked to Heikki and Andres about it at FOSDEM).

The "opposite" problem wouldn't be so bad I guess - it might be
visible in practise where you could easily clean up later, but the
original problem is pretty bad if you hit it when trying to do PITR
because something bad happened. (And we treat DROP TABLE the same.)

Christoph
-- 
cb@df7cb.de | http://www.df7cb.de/



Re: PITR failing to stop before DROP DATABASE

From
Andres Freund
Date:
On March 22, 2015 12:17:57 PM GMT+01:00, Christoph Berg <cb@df7cb.de> wrote:
>Re: Bruce Momjian 2015-03-20 <20150320223549.GZ6317@momjian.us>
>> > > >So my suggestion for a simple fix would be to make DROP DATABASE
>> > > >execute a short fake transaction before it starts deleting files
>and
>> > > >then continue as before. This would serve as a stopping point
>for
>> > > >recovery_target_time to run into. (We could still fix this
>properly
>> > > >later, but this idea seems like a good fix for a practical
>problem
>> > > >that doesn't break anything else.)
>> > > 
>> > > Yeah, seems reasonable.
>> > 
>> > Here's a first shot at a patch. It's not working yet because I
>think
>> > the commit isn't doing anything because no work was done in the
>> > transaction yet.
>> 
>> Where are we on this?
>
>I guess my patch could be fixed by forcing it to acquire a transaction
>id (SELECT txid_current() or whatever seems suitable), but my insight
>into the gory backend details is limited, so it'd be better if someone
>with more clue tried to fix it.

It'd need to do a GetTopTransactionId().

>Re: Heikki Linnakangas 2014-11-25 <5474B848.3060909@vmware.com>
>> >db1 is registered in pg_database, but the directory is missing on
>> >disk.
>> 
>> Yeah, DROP DATABASE cheats. It deletes all the files first, and
>commits the
>> transaction only after that. There's this comment at the end of
>dropdb()
>> function:
>> 
>> >    /*
>> >     * Force synchronous commit, thus minimizing the window between
>removal of
>> >     * the database files and commital of the transaction. If we crash
>before
>> >     * committing, we'll have a DB that's gone on disk but still there
>> >     * according to pg_database, which is not good.
>> >     */
>> 
>> So you could see the same after crash recovery, but it's a lot easier
>to
>> reproduce with PITR.
>> 
>> This could be fixed by doing DROP DATABASE the same way we do DROP
>TABLE. At
>> the DROP DATABASE command, just memorize the OID of the dropped
>database,
>> but don't delete anything yet. Perform the actual deletion after
>flushing
>> the commit record to disk. But then you would have the opposite
>problem -
>> you might be left with a database that's dropped according to
>pg_database,
>> but the files are still present on disk.
>
>This seems to be the better idea anyway (and was mentioned again when
>I talked to Heikki and Andres about it at FOSDEM).


Actually allowing for things like this was one of the reasons for the commit/abort extensibility stuff I committed a
fewdays ago.
 

--- 
Please excuse brevity and formatting - I am writing this on my mobile phone.