Thread: [bug fix] PITR corrupts the database cluster
Hello,

I've encountered a bug in PITR that corrupts the database cluster. I'm willing to submit a patch to fix it, but I'm wondering what approach is appropriate. Could you give me your opinions?

[Problem]
I cannot connect to the database after performing the following steps:

1. CREATE DATABASE mydb;
2. Take a base backup with pg_basebackup.
3. DROP DATABASE mydb;
4. Shut down the database server with "pg_ctl stop".
5. Recover the database cluster to the point where the base backup completed, i.e., before dropping mydb. The contents of recovery.conf are:

restore_command = 'cp /arc/dir/%f %p'
recovery_target_timeline = 'latest'
recovery_target_time = 'STOP TIME recorded in the backup history file which was created during base backup'

I expected to be able to connect to mydb because I recovered to a point before dropping mydb. However, I cannot connect to mydb because the directory for mydb does not exist, even though the entry for mydb still exists in pg_database.

[Cause]
DROP DATABASE emits the WAL records below:

1. System catalog changes, including deletion of the tuple for mydb in pg_database
2. Deletion of the directories for the database
3. Transaction commit

During recovery, postgres replays 1 and 2. It ends recovery when it notices that the time recorded in the commit record (3 above) is later than the recovery target time. The commit record is not replayed, so the system catalog changes are effectively undone. The problem is that 2 *is* replayed: it deletes the directory for the database even though the transaction never commits.

[How to fix]
There are two approaches. Which do you think is the way to go?

<Approach 1>
During recovery, when the WAL record for directory deletion is found, just record that fact for later replay (in a hash table keyed by xid). When the corresponding transaction commit record is found, replay the directory deletion record.
<Approach 2>
Like the DROP TABLE/INDEX case, piggyback the directory deletion on the transaction commit record, and eliminate the separate directory deletion record altogether.

I think we need to take approach 1 even if we also do 2, because 1 is necessary anyway for base backups and archived WAL already taken with current PostgreSQL.

Regards,
MauMau
On 2013-07-24 19:30:09 +0900, MauMau wrote:
> I've encountered a bug of PITR that corrupts the database. I'm willing to
> submit the patch to fix it, but I'm wondering what approach is appropriate.
> Could you give me your opinions?
>
> [Problem]
> I cannot connect to the database after performing the following steps:
>
> 1. CREATE DATABASE mydb;
> 2. Take a base backup with pg_basebackup.
> 3. DROP DATABASE mydb;
> 4. Shutdown the database server with "pg_ctl stop".
> 5. Recover the database cluster to the point where the base backup
> completed, i.e., before dropping mydb. The contents of recovery.conf is:
> restore_command = 'cp /arc/dir/%f %p'
> recovery_target_timeline = 'latest'
> recovery_target_time = 'STOP TIME recorded in the backup history file which
> was created during base backup'

For a second I wanted to say that it's a user error because you should just set recovery_target_lsn based on the END WAL location in the backup history file. Unfortunately recovery_target_lsn doesn't exist. It should.

> I expected to be able to connect to mydb because I recovered to the point
> before dropping mydb. However, I cannot connect to mydb because the
> directory for mydb does not exist. The entry for mydb exists in
> pg_database.
>
> [Cause]
> DROP DATABASE emits the below WAL records:
>
> 1. System catalog changes including deletion of a tuple for mydb in
> pg_database
> 2. Deletion of directories for the database
> 3. Transaction commit
>
> <Approach 1>
> During recovery, when the WAL record for directory deletion is found, just
> record that fact for later replay (in a hash table keyed by xid). When the
> corresponding transaction commit record is found, replay the directory
> deletion record.

I think that's too much of a special case implementation.

> <Approach 2>
> Like the DROP TABLE/INDEX case, piggyback the directory deletion record on
> the transaction commit record, and eliminate the directory deletion record
> altogether.
I don't think burdening commit records with that makes sense. It's just not a common enough case.

What we imo could do would be to drop the tablespaces in a *separate* transaction *after* the transaction that removed the pg_tablespace entry. Then an "incomplete actions" logic similar to btree and gin could be used to remove the database directory if we crashed between the two transactions.

So:

TXN1 does:
* remove catalog entries
* drop buffers
* XLogInsert(XLOG_DBASE_DROP_BEGIN)

TXN2:
* remove_dbtablespaces
* XLogInsert(XLOG_DBASE_DROP_FINISH)

The RM_DBASE_ID resource manager would then grow an rm_cleanup callback (which would perform TXN2 if we failed in between) and an rm_safe_restartpoint which would prevent restartpoints from occurring on the standby between the two.

The same should probably be done for CREATE DATABASE, because that can currently result in partially copied databases lying around.

Greetings,

Andres Freund

--
 Andres Freund	                 http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services
On 2013-07-24 12:59:43 +0200, Andres Freund wrote:
> > <Approach 2>
> > Like the DROP TABLE/INDEX case, piggyback the directory deletion record on
> > the transaction commit record, and eliminate the directory deletion record
> > altogether.
>
> I don't think burdening commit records with that makes sense. It's just
> not a common enough case.
>
> What we imo could do would be to drop the tablespaces in a *separate*
> transaction *after* the transaction that removed the pg_tablespace
> entry. Then an "incomplete actions" logic similar to btree and gin could
> be used to remove the database directory if we crashed between the two
> transactions.
>
> SO:
> TXN1 does:
> * remove catalog entries
> * drop buffers
> * XLogInsert(XLOG_DBASE_DROP_BEGIN)
>
> TXN2:
> * remove_dbtablespaces
> * XLogInsert(XLOG_DBASE_DROP_FINISH)
>
> The RM_DBASE_ID resource manager would then grow a rm_cleanup callback
> (which would perform TXN2 if we failed inbetween) and a
> rm_safe_restartpoint which would prevent restartpoints from occuring on
> standby between both.
>
> The same should probably done for CREATE DATABASE because that currently
> can result in partially copied databases lying around.

And CREATE/DROP TABLESPACE.

Not really related, but CREATE DATABASE's implementation makes me itch every time I read parts of it...

Greetings,

Andres Freund

--
 Andres Freund	                 http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services
Andres Freund <andres@2ndquadrant.com> wrote:
>On 2013-07-24 12:59:43 +0200, Andres Freund wrote:
>> > <Approach 2>
>> > Like the DROP TABLE/INDEX case, piggyback the directory deletion record on
>> > the transaction commit record, and eliminate the directory deletion record
>> > altogether.
>>
>> I don't think burdening commit records with that makes sense. It's just
>> not a common enough case.
>>
>> What we imo could do would be to drop the tablespaces in a *separate*
>> transaction *after* the transaction that removed the pg_tablespace
>> entry. Then an "incomplete actions" logic similar to btree and gin could
>> be used to remove the database directory if we crashed between the two
>> transactions.
>>
>> SO:
>> TXN1 does:
>> * remove catalog entries
>> * drop buffers
>> * XLogInsert(XLOG_DBASE_DROP_BEGIN)
>>
>> TXN2:
>> * remove_dbtablespaces
>> * XLogInsert(XLOG_DBASE_DROP_FINISH)
>>
>> The RM_DBASE_ID resource manager would then grow a rm_cleanup callback
>> (which would perform TXN2 if we failed inbetween) and a
>> rm_safe_restartpoint which would prevent restartpoints from occuring on
>> standby between both.
>>
>> The same should probably done for CREATE DATABASE because that currently
>> can result in partially copied databases lying around.
>
>And CREATE/DROP TABLESPACE.
>
>Not really related, but CREATE DATABASE's implementation makes me itch
>everytime I read parts of it...

I've been hoping that we could get rid of the rm_cleanup mechanism entirely. I eliminated it for gist a while back, and I've been thinking of doing the same for gin and btree. The way it works currently is buggy - while we have rm_safe_restartpoint to avoid creating a restartpoint at a bad moment, there is nothing to stop you from running a checkpoint while incomplete actions are pending. It's possible that there are page locks or something that prevent it in practice, but it feels shaky.

So I'd prefer a solution that doesn't rely on rm_cleanup.
Piggybacking on the commit record seems ok to me, though if we're going to have a lot of different things to attach there, maybe we need to generalize it somehow. Like, allow resource managers to attach arbitrary payload to the commit record, and provide a new rm_redo_commit function to replay them.

- Heikki
On 2013-07-24 15:45:52 +0300, Heikki Linnakangas wrote:
> Andres Freund <andres@2ndquadrant.com> wrote:
> >On 2013-07-24 12:59:43 +0200, Andres Freund wrote:
> >> > <Approach 2>
> >> What we imo could do would be to drop the tablespaces in a *separate*
> >> transaction *after* the transaction that removed the pg_tablespace
> >> entry. Then an "incomplete actions" logic similar to btree and gin
> >> could be used to remove the database directory if we crashed between
> >> the two transactions.
> >>
> >> SO:
> >> TXN1 does:
> >> * remove catalog entries
> >> * drop buffers
> >> * XLogInsert(XLOG_DBASE_DROP_BEGIN)
> >>
> >> TXN2:
> >> * remove_dbtablespaces
> >> * XLogInsert(XLOG_DBASE_DROP_FINISH)
> >>
> >> The RM_DBASE_ID resource manager would then grow a rm_cleanup
> >> callback (which would perform TXN2 if we failed inbetween) and a
> >> rm_safe_restartpoint which would prevent restartpoints from occuring
> >> on standby between both.
> >>
> >> The same should probably done for CREATE DATABASE because that
> >> currently can result in partially copied databases lying around.
> >
> >And CREATE/DROP TABLESPACE.
> >
> >Not really related, but CREATE DATABASE's implementation makes me itch
> >everytime I read parts of it...
>
> I've been hoping that we could get rid of the rm_cleanup mechanism
> entirely. I eliminated it for gist a while back, and I've been thinking
> of doing the same for gin and btree. The way it works currently is
> buggy - while we have rm_safe_restartpoint to avoid creating a
> restartpoint at a bad moment, there is nothing to stop you from running
> a checkpoint while incomplete actions are pending. It's possible that
> there are page locks or something that prevent it in practice, but it
> feels shaky.
>
> So I'd prefer a solution that doesn't rely on rm_cleanup. Piggybacking
> on commit record seems ok to me, though if we're going to have a lot of
> different things to attach there, maybe we need to generalize it
> somehow.
> Like, allow resource managers to attach arbitrary payload to the commit
> record, and provide a new rm_redo_commit function to replay them.

The problem is that piggybacking on the commit record doesn't really fix the problem that we end up with a bad state if we crash at a bad moment.

For CREATE DATABASE you will have to copy the template database *before* you commit the pg_database insert. Which means if we abort before that, we have old data in the datadir.

For DROP DATABASE, without something like incomplete actions, piggybacking on the commit record doesn't solve the issue of checkpoints either, because the commit record you piggybacked on could have committed before a checkpoint while you still were busy deleting all the files. Similar things go for CREATE/DROP TABLESPACE.

I think getting rid of something like incomplete actions entirely isn't really realistic. There are several places where we probably should use them but don't. I think we will probably have to change the logic around it so there's something that determines whether a normal checkpoint is safe at the moment. A simple version of this basically exists via GetVirtualXIDsDelayingChkpt(), which is now also used for checksums, but we probably want to extend that at some point.

Greetings,

Andres Freund

--
 Andres Freund	                 http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services
Andres Freund <andres@2ndquadrant.com> wrote:
>On 2013-07-24 15:45:52 +0300, Heikki Linnakangas wrote:
>> Andres Freund <andres@2ndquadrant.com> wrote:
>> >On 2013-07-24 12:59:43 +0200, Andres Freund wrote:
>> >> > <Approach 2>
>> >> What we imo could do would be to drop the tablespaces in a *separate*
>> >> transaction *after* the transaction that removed the pg_tablespace
>> >> entry. Then an "incomplete actions" logic similar to btree and gin
>> >> could be used to remove the database directory if we crashed between
>> >> the two transactions.
>> >>
>> >> SO:
>> >> TXN1 does:
>> >> * remove catalog entries
>> >> * drop buffers
>> >> * XLogInsert(XLOG_DBASE_DROP_BEGIN)
>> >>
>> >> TXN2:
>> >> * remove_dbtablespaces
>> >> * XLogInsert(XLOG_DBASE_DROP_FINISH)
>> >>
>> >> The RM_DBASE_ID resource manager would then grow a rm_cleanup
>> >> callback (which would perform TXN2 if we failed inbetween) and a
>> >> rm_safe_restartpoint which would prevent restartpoints from occuring
>> >> on standby between both.
>> >>
>> >> The same should probably done for CREATE DATABASE because that
>> >> currently can result in partially copied databases lying around.
>> >
>> >And CREATE/DROP TABLESPACE.
>> >
>> >Not really related, but CREATE DATABASE's implementation makes me itch
>> >everytime I read parts of it...
>>
>> I've been hoping that we could get rid of the rm_cleanup mechanism
>> entirely. I eliminated it for gist a while back, and I've been thinking
>> of doing the same for gin and btree. The way it works currently is
>> buggy - while we have rm_safe_restartpoint to avoid creating a
>> restartpoint at a bad moment, there is nothing to stop you from running
>> a checkpoint while incomplete actions are pending. It's possible that
>> there are page locks or something that prevent it in practice, but it
>> feels shaky.
>>
>> So I'd prefer a solution that doesn't rely on rm_cleanup.
>> Piggybacking on commit record seems ok to me, though if we're going to
>> have a lot of different things to attach there, maybe we need to
>> generalize it somehow. Like, allow resource managers to attach
>> arbitrary payload to the commit record, and provide a new
>> rm_redo_commit function to replay them.
>
>The problem is that piggybacking on the commit record doesn't really fix
>the problem that we end up with a bad state if we crash in a bad moment.
>
>For CREATE DATABASE you will have to copy the template database *before*
>you commit the pg_database insert. Which means if we abort before that
>we have old data in the datadir.
>
>For DROP DATABASE, without something like incomplete actions,
>piggybacking on the commit record doesn't solve the issue of checkpoints
>either, because the commit record you piggybacked on could have
>committed before a checkpoint, while you still were busy deleting all
>the files.

That's no different from CREATE TABLE / INDEX and DROP TABLE / INDEX. E.g. if you crash after CREATE TABLE but before COMMIT, the file is leaked. But it's just a waste of space; everything still works.

It would be nice to fix that leak, for tables and indexes too...

- Heikki
Heikki Linnakangas <hlinnakangas@vmware.com> writes:
> That's no different from CREATE TABLE / INDEX and DROP TABLE / INDEX.
> E.g. if you crash after CREATE TABLE but before COMMIT, the file is
> leaked. But it's just a waste of space, everything still works.

Well, it is different, because if you crash partway through dropping a tablespace or database, you have inconsistent state.

> It would be nice to fix that leak, for tables and indexes too...

I'm inclined to think that this wouldn't be a good use of resources, at least not at the individual table/index level. We'd surely be adding some significant amount of overhead to normal operating paths, in order to cover a case that really shouldn't happen in practice.

The only thing here that really bothers me is that a crash during DROP DATABASE/TABLESPACE could leave us with a partially populated db/ts that's still accessible through the system catalogs. We could however do something to ensure that the db/ts is atomically removed from use before we start dropping individual files. Then, if you get a crash, there'd still be system catalog entries but they'd be pointing at nothing, so the behavior would be clean and understandable, whereas right now it's not.

In the case of DROP TABLESPACE this seems relatively easy: drop or rename the symlink before we start flushing individual files. I'm not quite sure how to do it for DROP DATABASE though --- I thought of renaming the database directory, say from "12345" to "12345.dead", but if there are tablespaces in use then we might have a database subdirectory in each one, so we couldn't rename them all atomically.

I guess one thing we could do is create a flag file, say "dead.dont.use", in the database's default-tablespace directory, and make new backends check for that before being willing to start up in that database; then make sure that removal of that file is the last step in DROP DATABASE.

			regards, tom lane
On 2013-07-24 16:15:59 +0300, Heikki Linnakangas wrote:
> >For DROP DATABASE, without something like incomplete actions,
> >piggybacking on the commit record doesn't solve the issue of
> >checkpoints either, because the commit record you piggybacked on could
> >have committed before a checkpoint, while you still were busy deleting
> >all the files.
>
> That's no different from CREATE TABLE / INDEX and DROP TABLE /
> INDEX. E.g. If you crash after CREATE TABLE but before COMMIT, the
> file is leaked. But it's just a waste of space, everything still
> works.

Imo it's not really the same. For one, spurious files left over by CREATE TABLE are handled when assigning future relfilenodes (see GetNewRelFileNode()); we don't do the same for databases and tablespaces afaik.

Furthermore, I don't think there's such a big issue with DROP TABLE, unless maybe you've created the relation in the same transaction that you're dropping it in. Sure, there's the issue with a checkpoint in the wrong place, but other than that we're going to remove the file when replaying the commit record. We also should never end up with an inconsistent state between catalog and filesystem.

> It would be nice to fix that leak, for tables and indexes too...

Agreed.

Greetings,

Andres Freund

--
 Andres Freund	                 http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services
I wrote:
> The only thing here that really bothers me is that a crash during DROP
> DATABASE/TABLESPACE could leave us with a partially populated db/ts
> that's still accessible through the system catalogs. ...
> I guess one thing we could do is create a flag file, say
> "dead.dont.use", in the database's default-tablespace directory, and
> make new backends check for that before being willing to start up in
> that database; then make sure that removal of that file is the last
> step in DROP DATABASE.

After a bit of experimentation, it seems there's a pre-existing way that we could do this: just remove PG_VERSION from the database's default directory as the first filesystem action in DROP DATABASE. If we crash before committing, subsequent attempts to connect to that database would fail like this:

$ psql bogus
psql: FATAL:  "base/176774" is not a valid data directory
DETAIL:  File "base/176774/PG_VERSION" is missing.

which is probably already good enough, though maybe we could add a HINT suggesting that the DB was incompletely dropped.

To ensure this is replayed properly on slave servers, I'd be inclined to mechanize it by (1) changing remove_dbtablespaces to ensure that the DB's default tablespace is the first one dropped, and (2) incorporating remove-PG_VERSION-first into rmtree().

			regards, tom lane
On 2013-07-24 10:57:14 -0400, Tom Lane wrote:
> I wrote:
> > The only thing here that really bothers me is that a crash during DROP
> > DATABASE/TABLESPACE could leave us with a partially populated db/ts
> > that's still accessible through the system catalogs. ...
> > I guess one thing we could do is create a flag file, say
> > "dead.dont.use", in the database's default-tablespace directory, and
> > make new backends check for that before being willing to start up in
> > that database; then make sure that removal of that file is the last
> > step in DROP DATABASE.
>
> After a bit of experimentation, it seems there's a pre-existing way that
> we could do this: just remove PG_VERSION from the database's default
> directory as the first filesystem action in DROP DATABASE. If we
> crash before committing, subsequent attempts to connect to that database
> would fail like this:

Neat idea, especially as it would work on the back branches without problems.

It doesn't address MauMau's original problem, though, of not being able to set a correct stop point, because the datadir is being removed before the commit record (which is the only place we currently can set a recovery_target on in a basebackup scenario).

Wouldn't it make more sense to simply commit the pg_database entry removal separately from the actual removal of the datadir? DROP DATABASE already can't run in a transaction, so that should be easy.

> $ psql bogus
> psql: FATAL:  "base/176774" is not a valid data directory
> DETAIL:  File "base/176774/PG_VERSION" is missing.
>
> which is probably already good enough, though maybe we could add a HINT
> suggesting that the DB was incompletely dropped.

We could rename()/fsync() the file to PG_BEING_DROPPED atomically. Then we could give a useful error instead of an error message with a "maybe" hint. That would require removing the file last, though, so I am not sure if it's worth the bother.
Greetings,

Andres Freund

--
 Andres Freund	                 http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services
On Wed, Jul 24, 2013 at 10:57:14AM -0400, Tom Lane wrote:
> I wrote:
> > The only thing here that really bothers me is that a crash during DROP
> > DATABASE/TABLESPACE could leave us with a partially populated db/ts
> > that's still accessible through the system catalogs. ...
> > I guess one thing we could do is create a flag file, say
> > "dead.dont.use", in the database's default-tablespace directory, and
> > make new backends check for that before being willing to start up in
> > that database; then make sure that removal of that file is the last
> > step in DROP DATABASE.
>
> After a bit of experimentation, it seems there's a pre-existing way that
> we could do this: just remove PG_VERSION from the database's default
> directory as the first filesystem action in DROP DATABASE. If we
> crash before committing, subsequent attempts to connect to that database
> would fail like this:
>
> $ psql bogus
> psql: FATAL:  "base/176774" is not a valid data directory
> DETAIL:  File "base/176774/PG_VERSION" is missing.
>
> which is probably already good enough, though maybe we could add a HINT
> suggesting that the DB was incompletely dropped.
>
> To ensure this is replayed properly on slave servers, I'd be inclined to
> mechanize it by (1) changing remove_dbtablespaces to ensure that the
> DB's default tablespace is the first one dropped, and (2) incorporating
> remove-PG_VERSION-first into rmtree().

Where are we on this?  Is there a TODO here?

--
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

  + Everyone has their own god. +