Thread: [bug fix] PITR corrupts the database cluster

[bug fix] PITR corrupts the database cluster

From
"MauMau"
Date:
Hello,

I've encountered a bug of PITR that corrupts the database.  I'm willing to 
submit the patch to fix it, but I'm wondering what approach is appropriate. 
Could you give me your opinions?

[Problem]
I cannot connect to the database after performing the following steps:

1. CREATE DATABASE mydb;
2. Take a base backup with pg_basebackup.
3. DROP DATABASE mydb;
4. Shutdown the database server with "pg_ctl stop".
5. Recover the database cluster to the point where the base backup 
completed, i.e., before dropping mydb.  The contents of recovery.conf is:
restore_command = 'cp /arc/dir/%f %p'
recovery_target_timeline = 'latest'
recovery_target_time = 'STOP TIME recorded in the backup history file which 
was created during base backup'

I expected to be able to connect to mydb because I recovered to the point 
before dropping mydb.  However, I cannot connect to mydb because the 
directory for mydb does not exist.  The entry for mydb exists in 
pg_database.


[Cause]
DROP DATABASE emits the below WAL records:

1. System catalog changes including deletion of a tuple for mydb in 
pg_database
2. Deletion of directories for the database
3. Transaction commit

During recovery, postgres replays 1 and 2.  It ends the recovery when it 
notices that the time recorded in commit record (3 above) is later than the 
recovery target time.  The commit record is not replayed, thus the system 
catalog changes are virtually undone.

The problem is that 2 is replayed.  This deletes the directory for the 
database although the transaction is not committed.


[How to fix]
There are two approaches.  Which do you think is the way to go?

<Approach 1>
During recovery, when the WAL record for directory deletion is found, just 
record that fact for later replay (in a hash table keyed by xid).  When the 
corresponding transaction commit record is found, replay the directory 
deletion record.

<Approach 2>
Like the DROP TABLE/INDEX case, piggyback the directory deletion record on 
the transaction commit record, and eliminate the directory deletion record 
altogether.

I think we need to take approach 1 even when we also does 2, because 1 is 
necessary when the backup and archive WAL are already taken with the current 
PostgreSQL anyway.


Regards
MauMau




Re: [bug fix] PITR corrupts the database cluster

From
Andres Freund
Date:
On 2013-07-24 19:30:09 +0900, MauMau wrote:
> I've encountered a bug of PITR that corrupts the database.  I'm willing to
> submit the patch to fix it, but I'm wondering what approach is appropriate.
> Could you give me your opinions?
> 
> [Problem]
> I cannot connect to the database after performing the following steps:
> 
> 1. CREATE DATABASE mydb;
> 2. Take a base backup with pg_basebackup.
> 3. DROP DATABASE mydb;
> 4. Shutdown the database server with "pg_ctl stop".
> 5. Recover the database cluster to the point where the base backup
> completed, i.e., before dropping mydb.  The contents of recovery.conf is:
> restore_command = 'cp /arc/dir/%f %p'
> recovery_target_timeline = 'latest'
> recovery_target_time = 'STOP TIME recorded in the backup history file which
> was created during base backup'

For a second I wanted to say that it's a user error because you should
just set recovery_target_lsn based on the END WAL location in the backup
history file. Unfortunately recovery_target_lsn doesn't exist. It
should.

> I expected to be able to connect to mydb because I recovered to the point
> before dropping mydb.  However, I cannot connect to mydb because the
> directory for mydb does not exist.  The entry for mydb exists in
> pg_database.

> [Cause]
> DROP DATABASE emits the below WAL records:
> 
> 1. System catalog changes including deletion of a tuple for mydb in
> pg_database
> 2. Deletion of directories for the database
> 3. Transaction commit

> <Approach 1>
> During recovery, when the WAL record for directory deletion is found, just
> record that fact for later replay (in a hash table keyed by xid).  When the
> corresponding transaction commit record is found, replay the directory
> deletion record.

I think that's too much of a special case implementation.

> <Approach 2>
> Like the DROP TABLE/INDEX case, piggyback the directory deletion record on
> the transaction commit record, and eliminate the directory deletion record
> altogether.

I don't think burdening commit records with that makes sense. It's just
not a common enough case.

What we imo could do would be to drop the tablespaces in a *separate*
transaction *after* the transaction that removed the pg_tablespace
entry. Then an "incomplete actions" logic similar to btree and gin could
be used to remove the database directory if we crashed between the two
transactions.

SO:
TXN1 does:
* remove catalog entries
* drop buffers
* XLogInsert(XLOG_DBASE_DROP_BEGIN)

TXN2:
* remove_dbtablespaces
* XLogInsert(XLOG_DBASE_DROP_FINISH)

The RM_DBASE_ID resource manager would then grow a rm_cleanup callback
(which would perform TXN2 if we failed inbetween) and a
rm_safe_restartpoint which would prevent restartpoints from occuring on
standby between both.

The same should probably done for CREATE DATABASE because that currently
can result in partially copied databases lying around.

Greetings,

Andres Freund

-- Andres Freund                       http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training &
Services



Re: [bug fix] PITR corrupts the database cluster

From
Andres Freund
Date:
On 2013-07-24 12:59:43 +0200, Andres Freund wrote:
> > <Approach 2>
> > Like the DROP TABLE/INDEX case, piggyback the directory deletion record on
> > the transaction commit record, and eliminate the directory deletion record
> > altogether.
> 
> I don't think burdening commit records with that makes sense. It's just
> not a common enough case.
> 
> What we imo could do would be to drop the tablespaces in a *separate*
> transaction *after* the transaction that removed the pg_tablespace
> entry. Then an "incomplete actions" logic similar to btree and gin could
> be used to remove the database directory if we crashed between the two
> transactions.
> 
> SO:
> TXN1 does:
> * remove catalog entries
> * drop buffers
> * XLogInsert(XLOG_DBASE_DROP_BEGIN)
> 
> TXN2:
> * remove_dbtablespaces
> * XLogInsert(XLOG_DBASE_DROP_FINISH)
> 
> The RM_DBASE_ID resource manager would then grow a rm_cleanup callback
> (which would perform TXN2 if we failed inbetween) and a
> rm_safe_restartpoint which would prevent restartpoints from occuring on
> standby between both.
> 
> The same should probably done for CREATE DATABASE because that currently
> can result in partially copied databases lying around.

And CREATE/DROP TABLESPACE.

Not really related, but CREATE DATABASE's implementation makes me itch
everytime I read parts of it...

Greetings,

Andres Freund

-- Andres Freund                       http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training &
Services



Re: [bug fix] PITR corrupts the database cluster

From
Heikki Linnakangas
Date:
Andres Freund <andres@2ndquadrant.com> wrote:
>On 2013-07-24 12:59:43 +0200, Andres Freund wrote:
>> > <Approach 2>
>> > Like the DROP TABLE/INDEX case, piggyback the directory deletion
>record on
>> > the transaction commit record, and eliminate the directory deletion
>record
>> > altogether.
>> 
>> I don't think burdening commit records with that makes sense. It's
>just
>> not a common enough case.
>> 
>> What we imo could do would be to drop the tablespaces in a *separate*
>> transaction *after* the transaction that removed the pg_tablespace
>> entry. Then an "incomplete actions" logic similar to btree and gin
>could
>> be used to remove the database directory if we crashed between the
>two
>> transactions.
>> 
>> SO:
>> TXN1 does:
>> * remove catalog entries
>> * drop buffers
>> * XLogInsert(XLOG_DBASE_DROP_BEGIN)
>> 
>> TXN2:
>> * remove_dbtablespaces
>> * XLogInsert(XLOG_DBASE_DROP_FINISH)
>> 
>> The RM_DBASE_ID resource manager would then grow a rm_cleanup
>callback
>> (which would perform TXN2 if we failed inbetween) and a
>> rm_safe_restartpoint which would prevent restartpoints from occuring
>on
>> standby between both.
>> 
>> The same should probably done for CREATE DATABASE because that
>currently
>> can result in partially copied databases lying around.
>
>And CREATE/DROP TABLESPACE.
>
>Not really related, but CREATE DATABASE's implementation makes me itch
>everytime I read parts of it...

I've been hoping that we could get rid of the rm_cleanup mechanism entirely. I eliminated it for gist a while back, and
I'vebeen thinking of doing the same for gin and btree. The way it works currently is buggy - while we have
rm_safe_restartpointto avoid creating a restartpoint at a bad moment, there is nothing to stop you from running a
checkpointwhile incomplete actions are pending. It's possible that there are page locks or something that prevent it in
practice,but it feels shaky.
 

So I'd prefer a solution that doesn't rely on rm_cleanup. Piggybacking on commit record seems ok to me, though if we're
goingto have a lot of different things to attach there, maybe we need to generalize it somehow. Like, allow resource
managersto attach arbitrary payload to the commit record, and provide a new rm_redo_commit function to replay them.
 

- Heikki



Re: [bug fix] PITR corrupts the database cluster

From
Andres Freund
Date:
On 2013-07-24 15:45:52 +0300, Heikki Linnakangas wrote:
> Andres Freund <andres@2ndquadrant.com> wrote:
> >On 2013-07-24 12:59:43 +0200, Andres Freund wrote:
> >> > <Approach 2>
> >> What we imo could do would be to drop the tablespaces in a *separate*
> >> transaction *after* the transaction that removed the pg_tablespace
> >> entry. Then an "incomplete actions" logic similar to btree and gin
> >could
> >> be used to remove the database directory if we crashed between the
> >two
> >> transactions.
> >> 
> >> SO:
> >> TXN1 does:
> >> * remove catalog entries
> >> * drop buffers
> >> * XLogInsert(XLOG_DBASE_DROP_BEGIN)
> >> 
> >> TXN2:
> >> * remove_dbtablespaces
> >> * XLogInsert(XLOG_DBASE_DROP_FINISH)
> >> 
> >> The RM_DBASE_ID resource manager would then grow a rm_cleanup
> >callback
> >> (which would perform TXN2 if we failed inbetween) and a
> >> rm_safe_restartpoint which would prevent restartpoints from occuring
> >on
> >> standby between both.
> >> 
> >> The same should probably done for CREATE DATABASE because that
> >currently
> >> can result in partially copied databases lying around.
> >
> >And CREATE/DROP TABLESPACE.
> >
> >Not really related, but CREATE DATABASE's implementation makes me itch
> >everytime I read parts of it...
> 
> I've been hoping that we could get rid of the rm_cleanup mechanism entirely. I eliminated it for gist a while back,
andI've been thinking of doing the same for gin and btree. The way it works currently is buggy - while we have
rm_safe_restartpointto avoid creating a restartpoint at a bad moment, there is nothing to stop you from running a
checkpointwhile incomplete actions are pending. It's possible that there are page locks or something that prevent it in
practice,but it feels shaky.
 
> 
> So I'd prefer a solution that doesn't rely on rm_cleanup. Piggybacking on commit record seems ok to me, though if
we'regoing to have a lot of different things to attach there, maybe we need to generalize it somehow. Like, allow
resourcemanagers to attach arbitrary payload to the commit record, and provide a new rm_redo_commit function to replay
them.

The problem is that piggybacking on the commit record doesn't really fix
the problem that we end up with a bad state if we crash in a bad
moment.

For CREATE DATABASE you will have to copy the template database *before*
you commit the pg_database insert. Which means if we abort before that
we have old data in the datadir.

For DROP DATABASE, without something like incomplete actions,
piggybacking on the commit record doesn't solve the issue of CHECKPOINTS
either, because the commit record you piggybacked on could have
committed before a checkpoint, while you still were busy deleting all
the files.

Similar things go for CREATE/DROP TABLESPACE.

I think getting rid of something like incomplete actions entirely isn't
really realistic. There are several places were we probably should use
them but don't. I think we probably will have change the logic around it
so there's something that determines whether a normal checkpoint is safe
atm.
A simple version of this basically exists via
GetVirtualXIDsDelayingChkpt() which is now also used for checksums but
we probably want to extend that at some point.

Greetings,

Andres Freund

-- Andres Freund                       http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training &
Services



Re: [bug fix] PITR corrupts the database cluster

From
Heikki Linnakangas
Date:
Andres Freund <andres@2ndquadrant.com> wrote:
>On 2013-07-24 15:45:52 +0300, Heikki Linnakangas wrote:
>> Andres Freund <andres@2ndquadrant.com> wrote:
>> >On 2013-07-24 12:59:43 +0200, Andres Freund wrote:
>> >> > <Approach 2>
>> >> What we imo could do would be to drop the tablespaces in a
>*separate*
>> >> transaction *after* the transaction that removed the pg_tablespace
>> >> entry. Then an "incomplete actions" logic similar to btree and gin
>> >could
>> >> be used to remove the database directory if we crashed between the
>> >two
>> >> transactions.
>> >> 
>> >> SO:
>> >> TXN1 does:
>> >> * remove catalog entries
>> >> * drop buffers
>> >> * XLogInsert(XLOG_DBASE_DROP_BEGIN)
>> >> 
>> >> TXN2:
>> >> * remove_dbtablespaces
>> >> * XLogInsert(XLOG_DBASE_DROP_FINISH)
>> >> 
>> >> The RM_DBASE_ID resource manager would then grow a rm_cleanup
>> >callback
>> >> (which would perform TXN2 if we failed inbetween) and a
>> >> rm_safe_restartpoint which would prevent restartpoints from
>occuring
>> >on
>> >> standby between both.
>> >> 
>> >> The same should probably done for CREATE DATABASE because that
>> >currently
>> >> can result in partially copied databases lying around.
>> >
>> >And CREATE/DROP TABLESPACE.
>> >
>> >Not really related, but CREATE DATABASE's implementation makes me
>itch
>> >everytime I read parts of it...
>> 
>> I've been hoping that we could get rid of the rm_cleanup mechanism
>entirely. I eliminated it for gist a while back, and I've been thinking
>of doing the same for gin and btree. The way it works currently is
>buggy - while we have rm_safe_restartpoint to avoid creating a
>restartpoint at a bad moment, there is nothing to stop you from running
>a checkpoint while incomplete actions are pending. It's possible that
>there are page locks or something that prevent it in practice, but it
>feels shaky.
>> 
>> So I'd prefer a solution that doesn't rely on rm_cleanup.
>Piggybacking on commit record seems ok to me, though if we're going to
>have a lot of different things to attach there, maybe we need to
>generalize it somehow. Like, allow resource managers to attach
>arbitrary payload to the commit record, and provide a new
>rm_redo_commit function to replay them.
>
>The problem is that piggybacking on the commit record doesn't really
>fix
>the problem that we end up with a bad state if we crash in a bad
>moment.
>
>For CREATE DATABASE you will have to copy the template database
>*before*
>you commit the pg_database insert. Which means if we abort before that
>we have old data in the datadir.
>
>For DROP DATABASE, without something like incomplete actions,
>piggybacking on the commit record doesn't solve the issue of
>CHECKPOINTS
>either, because the commit record you piggybacked on could have
>committed before a checkpoint, while you still were busy deleting all
>the files.

That's no different from CREATE TABLE / INDEX and DROP TABLE / INDEX. E.g. If you crash after CREATE TABLE but before
COMMIT,the file is leaked. But it's just a waste of space, everything still works.
 

It would be  nice to fix that leak, for tables and indexes too...


- Heikki



Re: [bug fix] PITR corrupts the database cluster

From
Tom Lane
Date:
Heikki Linnakangas <hlinnakangas@vmware.com> writes:
> That's no different from CREATE TABLE / INDEX and DROP TABLE / INDEX. E.g. If you crash after CREATE TABLE but before
COMMIT,the file is leaked. But it's just a waste of space, everything still works.
 

Well, it is different, because if you crash partway through dropping a
tablespace or database, you have inconsistent state.

> It would be  nice to fix that leak, for tables and indexes too...

I'm inclined to think that this wouldn't be a good use of resources,
at least not at the individual table/index level.  We'd surely be adding
some significant amount of overhead to normal operating paths, in order
to cover a case that really shouldn't happen in practice.

The only thing here that really bothers me is that a crash during DROP
DATABASE/TABLESPACE could leave us with a partially populated db/ts
that's still accessible through the system catalogs.  We could however
do something to ensure that the db/ts is atomically removed from use
before we start dropping individual files.  Then, if you get a crash,
there'd still be system catalog entries but they'd be pointing at
nothing, so the behavior would be clean and understandable whereas
right now it's not.

In the case of DROP TABLESPACE this seems relatively easy: drop or
rename the symlink before we start flushing individual files.
I'm not quite sure how to do it for DROP DATABASE though --- I thought
of renaming the database directory, say from "12345" to "12345.dead",
but if there are tablespaces in use then we might have a database
subdirectory in each one, so we couldn't rename them all atomically.
I guess one thing we could do is create a flag file, say
"dead.dont.use", in the database's default-tablespace directory, and
make new backends check for that before being willing to start up in
that database; then make sure that removal of that file is the last
step in DROP DATABASE.
        regards, tom lane



Re: [bug fix] PITR corrupts the database cluster

From
Andres Freund
Date:
On 2013-07-24 16:15:59 +0300, Heikki Linnakangas wrote:
> >For DROP DATABASE, without something like incomplete actions,
> >piggybacking on the commit record doesn't solve the issue of
> >CHECKPOINTS
> >either, because the commit record you piggybacked on could have
> >committed before a checkpoint, while you still were busy deleting all
> >the files.
> 
> That's no different from CREATE TABLE / INDEX and DROP TABLE /
> INDEX. E.g. If you crash after CREATE TABLE but before COMMIT, the
> file is leaked. But it's just a waste of space, everything still
> works.

Imo it's not really the same. For one, spurious files left over by
CREATE TABLE are handled when assigning future relfilenodes (see
GetNewRelFileNode()), we don't do the same for databases and
tablespaces afaik.
Furthermore, I don't think there's such a big issue with DROP TABLE
unless maybe you've created the relation in the same transaction that
you're dropping it. Sure, there's the issue with a checkpoint in the
wrong place, but other than that we're going to remove the file when
replaying the commit record.

We also should never end up with an inconsistent state between catalog
and filesystem.

> It would be  nice to fix that leak, for tables and indexes too...

Agreed.

Greetings,

Andres Freund

-- Andres Freund                       http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training &
Services



Re: [bug fix] PITR corrupts the database cluster

From
Tom Lane
Date:
I wrote:
> The only thing here that really bothers me is that a crash during DROP
> DATABASE/TABLESPACE could leave us with a partially populated db/ts
> that's still accessible through the system catalogs.  ...
> I guess one thing we could do is create a flag file, say
> "dead.dont.use", in the database's default-tablespace directory, and
> make new backends check for that before being willing to start up in
> that database; then make sure that removal of that file is the last
> step in DROP DATABASE.

After a bit of experimentation, it seems there's a pre-existing way that
we could do this: just remove PG_VERSION from the database's default
directory as the first filesystem action in DROP DATABASE.  If we
crash before committing, subsequent attempts to connect to that database
would fail like this:

$ psql bogus
psql: FATAL:  "base/176774" is not a valid data directory
DETAIL:  File "base/176774/PG_VERSION" is missing.

which is probably already good enough, though maybe we could add a HINT
suggesting that the DB was incompletely dropped.

To ensure this is replayed properly on slave servers, I'd be inclined to
mechanize it by (1) changing remove_dbtablespaces to ensure that the
DB's default tablespace is the first one dropped, and (2) incorporating
remove-PG_VERSION-first into rmtree().
        regards, tom lane



Re: [bug fix] PITR corrupts the database cluster

From
Andres Freund
Date:
On 2013-07-24 10:57:14 -0400, Tom Lane wrote:
> I wrote:
> > The only thing here that really bothers me is that a crash during DROP
> > DATABASE/TABLESPACE could leave us with a partially populated db/ts
> > that's still accessible through the system catalogs.  ...
> > I guess one thing we could do is create a flag file, say
> > "dead.dont.use", in the database's default-tablespace directory, and
> > make new backends check for that before being willing to start up in
> > that database; then make sure that removal of that file is the last
> > step in DROP DATABASE.
>
> After a bit of experimentation, it seems there's a pre-existing way that
> we could do this: just remove PG_VERSION from the database's default
> directory as the first filesystem action in DROP DATABASE.  If we
> crash before committing, subsequent attempts to connect to that database
> would fail like this:

Neat idea, especially as it would work on the back branches without
problems.

It doesn't address MauMau's original problem of not being able to set a
correct stop point because the datadir is being removed before the
commit record (which is the only place we currently can set a
recovery_target on in a basebackup scenario) though.

Wouldn't it make more sense to simply commit the pg_database entry
removal separately from the actual removal of the datadir? DROP DATABASE
already can't run in a transaction, so that should be easy.

>
> $ psql bogus
> psql: FATAL:  "base/176774" is not a valid data directory
> DETAIL:  File "base/176774/PG_VERSION" is missing.
>
> which is probably already good enough, though maybe we could add a HINT
> suggesting that the DB was incompletely dropped.

We could rename/fsync() the file to PG_BEING_DROPPED atomically. Then we
could give a useful error instead of an error message with a "maybe" hint.

That would require to remove the file last though, so I am not sure if
it's worth the bother.

Greetings,

Andres Freund

--Andres Freund                       http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training &
Services



Re: [bug fix] PITR corrupts the database cluster

From
Bruce Momjian
Date:
On Wed, Jul 24, 2013 at 10:57:14AM -0400, Tom Lane wrote:
> I wrote:
> > The only thing here that really bothers me is that a crash during DROP
> > DATABASE/TABLESPACE could leave us with a partially populated db/ts
> > that's still accessible through the system catalogs.  ...
> > I guess one thing we could do is create a flag file, say
> > "dead.dont.use", in the database's default-tablespace directory, and
> > make new backends check for that before being willing to start up in
> > that database; then make sure that removal of that file is the last
> > step in DROP DATABASE.
> 
> After a bit of experimentation, it seems there's a pre-existing way that
> we could do this: just remove PG_VERSION from the database's default
> directory as the first filesystem action in DROP DATABASE.  If we
> crash before committing, subsequent attempts to connect to that database
> would fail like this:
> 
> $ psql bogus
> psql: FATAL:  "base/176774" is not a valid data directory
> DETAIL:  File "base/176774/PG_VERSION" is missing.
> 
> which is probably already good enough, though maybe we could add a HINT
> suggesting that the DB was incompletely dropped.
> 
> To ensure this is replayed properly on slave servers, I'd be inclined to
> mechanize it by (1) changing remove_dbtablespaces to ensure that the
> DB's default tablespace is the first one dropped, and (2) incorporating
> remove-PG_VERSION-first into rmtree().

Where are we on this?  Is there a TODO here?

--  Bruce Momjian  <bruce@momjian.us>        http://momjian.us EnterpriseDB
http://enterprisedb.com
 + Everyone has their own god. +