Thread: point in time recovery and moving datafiles online

point in time recovery and moving datafiles online

From
Marc Munro
Date:
I'd like to take a crack at implementing point in time recovery.  My
plan is to do this as gently as possible in a number of small
self-contained steps.  I'd appreciate lots of critical feedback.

Alternatively, if someone else is looking into this please let me know
so I can either back off or help out.

Stage 1

Add hooks for begin_backup and end_backup at a data file level.  Between
the calls begin_backup(myfile) and end_backup(myfile), writes to myfile
will be disabled allowing the file to be safely copied.

Unfortunately, it seems that this approach is fundamentally in conflict
with the principle of a checkpoint, which, as I understand it, is
supposed to flush all unwritten changes to disk.

My solution is to allow the checkpoint to complete without flushing
myfile's buffers, and then on end_backup(myfile) either perform another
checkpoint or just flush the buffers for myfile.  As long as you cannot
shut down the database until end_backup has completed its checkpoint, I
think all should be well.

So, have I missed something?  Would it instead be better to block the
checkpoint until end_backup(myfile) is complete?  Any other ideas?

Stage 2

Add a move_file function.  This will, using the begin/end_backup hooks,
make a copy of a datafile, create a symlink to it where the original
file used to be, and close and re-open the file without altering any
buffers.
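
As an aside, a rough sketch of the filesystem side of such a move might
look like the following.  The paths are made up for illustration, and the
begin_backup/end_backup calls and the buffer-preserving close/re-open would
of course have to happen inside the backend, not in a shell script:

    # Hypothetical paths, for illustration only.
    OLD=/var/lib/pgsql/data/base/16384/16385    # original datafile
    NEW=/mnt/newdisk/pgdata/16385                # target location

    # begin_backup(myfile): the backend blocks writes to this file here.
    cp "$OLD" "$NEW"                # copy the datafile while it is stable
    mv "$OLD" "$OLD.orig"           # keep the original until the symlink exists
    ln -s "$NEW" "$OLD"             # symlink where the original file used to be
    rm "$OLD.orig"
    # end_backup(myfile): the backend re-enables writes and closes and
    # re-opens the (now symlinked) file without discarding its buffers.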

There have been a few mutterings on the mailing lists recently that
suggest this would be useful, and it's a relatively easy test case for
the begin/end_backup hooks.

Stage 3

Add a pgtar executable that will create a tar of all data files using
the begin/end_backup hooks.

Stage 4

Provide some sort of archiving mechanism for WAL files.
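
To make the intent concrete, here is a minimal sketch of what such
archiving might do, with made-up paths.  This is not an existing PostgreSQL
facility; a real mechanism would have to archive only completed segments
and make sure none are recycled before they are copied:

    PGDATA=/var/lib/pgsql/data
    ARCHIVE=/backup/wal_archive

    # Copy any WAL segment we have not archived yet.
    for seg in "$PGDATA"/pg_xlog/*; do
        base=$(basename "$seg")
        test -f "$ARCHIVE/$base" || cp "$seg" "$ARCHIVE/$base"
    done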

Stage 5

Provide some control over the recovery process.  This has to deal with
recovery using both up to date control files, and control files from a
backup set.

Sorry for cross-posting this to pgsql-general, but my last post to
hackers never made it.


--
Marc        marc@bloodnok.com

Re: point in time recovery and moving datafiles online

From
Tom Lane
Date:
[ pg-general removed from cc: list, as this is off topic for it ]

Marc Munro <marc@bloodnok.com> writes:
> Add hooks for begin_backup and end_backup at a data file level.  Between
> the calls begin_backup(myfile) and end_backup(myfile), writes to myfile
> will be disabled allowing the file to be safely copied.

And the writes are going to go where instead?  If you intend to just
hold the dirty pages in memory, the system will grind to a halt in no
time, ie as soon as it runs out of spare buffers.  This strikes me as
only marginally better than "shut down the database while you copy the
files".

Perhaps more to the point, I'm not following how this helps achieve
point-in-time recovery.  I suppose what you are after is to get an
instantaneous snapshot of the data files that could be used as a
starting point for replaying the WAL --- but AFAICS you'd need a
snapshot that's instantaneous across the *whole* database, ie,
all the data files are in the state corresponding to the chosen
starting point for the WAL.  Locking and copying one file at a time
doesn't get you there.

It seems to me that you can get the desired results without any
locking.  Assume that you start archiving the WAL just after a
checkpoint record.  Also, start copying data files to your backup
medium.  Some not inconsiderable time later, you are done copying
data files.  You continue copying off and archiving WAL entries.
You cannot say that the copied data files correspond to any particular
point in the WAL, or that they form a consistent set of data at all
--- but if you were to reload them and replay the WAL into them
starting from the checkpoint, then you *would* have a consistent set
of files once you reached the point in the WAL corresponding to the
end-time of the data file backup.  You could stop there, or continue
WAL replay to any later point in time.
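
Operationally, the procedure described above might look roughly like this.
It is a sketch only: the directory and file names are made up, and the
WAL-archiving steps are exactly the mechanism that does not exist yet:

    # Force a checkpoint; note it as the starting point for replay and
    # begin archiving every WAL segment written from here on.
    psql -c "CHECKPOINT" template1

    # Copy the data files.  No locking; they need not be mutually
    # consistent, since WAL replay from the checkpoint will fix them up.
    tar cf /backup/base.tar /var/lib/pgsql/data

    # Keep archiving WAL.  Restoring base.tar and replaying the archived
    # WAL from the checkpoint yields a consistent database at the time the
    # tar finished, or at any later point covered by the WAL.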
        regards, tom lane


Re: point in time recovery and moving datafiles online

From
Marc Munro
Date:
Tom,
Many thanks for your reply.  This is exactly the sort of feedback I
need.

On Thu, 2002-02-21 at 19:52, Tom Lane wrote:
> [ pg-general removed from cc: list, as this is off topic for it ]
> 
> Marc Munro <marc@bloodnok.com> writes:
> > Add hooks for begin_backup and end_backup at a data file level.  Between
> > the calls begin_backup(myfile) and end_backup(myfile), writes to myfile
> > will be disabled allowing the file to be safely copied.
> 
> And the writes are going to go where instead?  If you intend to just
> hold the dirty pages in memory, the system will grind to a halt in no
> time, ie as soon as it runs out of spare buffers.  This strikes me as
> only marginally better than "shut down the database while you copy the
> files".

The intention is to lock only one file at a time, take a clean snapshot
of it, and only then re-enable writes to its buffers before going on to
the next file.  I would hope that the time it takes to copy a single 1G
data file would be considerably less than the time needed to dirty all
in-memory buffers, but maybe that is just wild optimism.

> Perhaps more to the point, I'm not following how this helps achieve
> point-in-time recovery.  I suppose what you are after is to get an
> instantaneous snapshot of the data files that could be used as a
> starting point for replaying the WAL --- but AFAICS you'd need a
> snapshot that's instantaneous across the *whole* database, ie,
> all the data files are in the state corresponding to the chosen
> starting point for the WAL.  Locking and copying one file at a time
> doesn't get you there.

Actually I'm not trying to get a consistent snapshot.  I just want each
file to be internally consistent.  Then, on recovery, we should be able
to replay the WAL to reach first a consistent state, and then any point in
time from that consistent state to the end of the WAL entries.

I guess I should have explained that ;-)

> It seems to me that you can get the desired results without any
> locking.  Assume that you start archiving the WAL just after a
> checkpoint record.  Also, start copying data files to your backup
> medium.  Some not inconsiderable time later, you are done copying
> data files.  You continue copying off and archiving WAL entries.
> You cannot say that the copied data files correspond to any particular
> point in the WAL, or that they form a consistent set of data at all
> --- but if you were to reload them and replay the WAL into them
> starting from the checkpoint, then you *would* have a consistent set
> of files once you reached the point in the WAL corresponding to the
> end-time of the data file backup.  You could stop there, or continue
> WAL replay to any later point in time.

If I understand you correctly this is exactly what I was thinking, based
on Oracle recovery.  But we must still prevent writes to each data file
as we back it up, so that it remains internally consistent.  This is the
point of the begin/end_backup hooks.  Managing the archiving of the WAL
files is down on my list for doing later (one baby step at a time).

-- 
Marc        marc@bloodnok.com


Re: point in time recovery and moving datafiles online

From
Tom Lane
Date:
Marc Munro <marc@bloodnok.com> writes:
> On Thu, 2002-02-21 at 19:52, Tom Lane wrote:
>> It seems to me that you can get the desired results without any
>> locking.  Assume that you start archiving the WAL just after a
>> checkpoint record.  Also, start copying data files to your backup
>> medium.  Some not inconsiderable time later, you are done copying
>> data files.  You continue copying off and archiving WAL entries.
>> You cannot say that the copied data files correspond to any particular
>> point in the WAL, or that they form a consistent set of data at all
>> --- but if you were to reload them and replay the WAL into them
>> starting from the checkpoint, then you *would* have a consistent set
>> of files once you reached the point in the WAL corresponding to the
>> end-time of the data file backup.  You could stop there, or continue
>> WAL replay to any later point in time.

> If I understand you correctly this is exactly what I was thinking, based
> on Oracle recovery.  But we must still prevent writes to each data file
> as we back it up, so that it remains internally consistent.

No, you're missing my point.  You don't need intra-file consistency any
more than you need cross-file consistency.  You merely need to be sure
that you have captured all the state of pages that are not updated
anywhere in the series of WAL entries that you have.

I had originally started to compose email suggesting that locking on a
per-disk-page basis (not a per-file basis) would be better, but I do
not believe you need even that, for two reasons:

1. PG will always write changes to data files in page-size write
operations.  The Unix kernel guarantees that these writes appear atomic
from the point of view of other processes.  So the data-file-backup
process will see pagewise consistent data in any case.

2. Even if the backup process managed to acquire an inconsistent
(partially updated) copy of a page due to a concurrent write by a
Postgres backend, we do not care.  The WAL activity is designed to
ensure recovery from partial-page disk writes, and backing up such an
inconsistent page copy would be isomorphic to a system failure after a
partial page write.  Replay of the WAL will ensure that the page will be
fully written from the WAL data.

In short, all you need is a mechanism for archiving off the WAL data and
locating a checkpoint record in the WAL as a starting point for replay.
Your data-file backup mechanism can be plain ol' tar or cp -r.  No
interlocks needed or wanted.
        regards, tom lane


Re: point in time recovery and moving datafiles online

From
"Zeugswetter Andreas SB SD"
Date:
> It seems to me that you can get the desired results without any
> locking.  Assume that you start archiving the WAL just after a
> checkpoint record.  Also, start copying data files to your backup
> medium.  Some not inconsiderable time later, you are done copying
> data files.  You continue copying off and archiving WAL entries.
> You cannot say that the copied data files correspond to any particular
> point in the WAL, or that they form a consistent set of data at all
> --- but if you were to reload them and replay the WAL into them
> starting from the checkpoint, then you *would* have a consistent set
> of files once you reached the point in the WAL corresponding to the
> end-time of the data file backup.  You could stop there, or continue
> WAL replay to any later point in time.

Yes.  But I think you might have to avoid "vacuum full" during the data
file backup.  And you need the whole WAL, including the "page images".
Since you have those page images, you also need not care about backing up
a concurrently written (and thus probably partial) page (as Tom already
pointed out).  The key issue is restoring or creating a pg_control file
that tells startup recovery the correct checkpoint and WAL position for
the rollforward.

Actually it would be worth simply testing the following:

0. configure WAL for enough room so it does not wrap around
1. start a lengthy pgbench run or something similar
2. force a checkpoint
3. save pg_control
4. tar PGDATA
5. tar the WAL's

Now restore the above files in order 4, 3, 5, start pg, and debug :-)
(a rough shell sketch of this follows).  You might need to fix the last
page in the WAL.
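
A rough shell rendering of the above, with paths assumed for illustration
(step 0, configuring enough WAL segments that none are recycled during the
test, is taken as given):

    PGDATA=/var/lib/pgsql/data

    pgbench -c 10 -t 100000 bench &             # 1. lengthy run in background
    psql -c "CHECKPOINT" template1              # 2. force a checkpoint
    cp "$PGDATA"/global/pg_control /backup/     # 3. save pg_control
    tar cf /backup/pgdata.tar "$PGDATA"         # 4. tar PGDATA
    tar cf /backup/wal.tar "$PGDATA"/pg_xlog    # 5. tar the WALs

    # Restore: untar pgdata.tar (4), copy the saved pg_control back over
    # global/pg_control (3), untar wal.tar over pg_xlog (5), then start
    # the postmaster and watch the rollforward.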

Andreas


Re: point in time recovery and moving datafiles online

From
"Alex Adriaanse"
Date:
Marc Munro <marc@bloodnok.com> writes:
> Add hooks for begin_backup and end_backup at a data file level.  Between
> the calls begin_backup(myfile) and end_backup(myfile), writes to myfile
> will be disabled allowing the file to be safely copied.

I pasted some Oracle documentation below that talks about how this is
implemented in Oracle.  (Oracle leaves the backing up of the data and redo
log files up to the administrator, so tar, cp, or anything else can be used
and you're not restricted to backups in tar format.  All Oracle implements
is the ALTER TABLESPACE ... BEGIN/END BACKUP statements and a way to create
archives of redo log files; the rest is up to the administrator.)
I hope I'm not just pasting the obvious here.

I got this off
http://download-west.oracle.com/otndoc/oracle9i/901_doc/server.901/a90134/osbackup.htm#9419
- they've got lots of neat stuff in this guide in some of the other
chapters as well.

===============
You must put a read/write tablespace in backup mode to make user-managed
datafile backups when the tablespace is online and the database is open. The
ALTER TABLESPACE BEGIN BACKUP statement places a tablespace in backup mode.

Oracle stops recording checkpoints to the datafiles in the tablespace when a
tablespace is in backup mode. Because a block can be partially updated at
the very moment that the operating system backup utility is copying it,
Oracle copies whole changed data blocks into the redo stream while in backup
mode. After you take the tablespace out of backup mode with the ALTER
TABLESPACE ... END BACKUP or ALTER DATABASE END BACKUP statement, Oracle
advances the datafile header to the current database checkpoint.

When you restore a datafile backed up in this way, the datafile header has a
record of the most recent datafile checkpoint that occurred before the
online tablespace backup, not any that occurred during it. As a result,
Oracle asks for the appropriate set of redo log files to apply should
recovery be needed. The redo logs contain all changes required to recover
the datafiles and make them consistent.

...

Caution: If you forget to take the tablespace out of backup mode, then
Oracle continues to write entire copies of data blocks in this tablespace to
the online redo logs, possibly causing performance problems. Also, you will
receive an ORA-01149 error if you attempt to shut down the database with the
tablespaces still in backup mode.
===============

So I guess to sum it up: when you put the database in hot-backup mode, the
database is still completely functional, and all changes are still written
to disk (to the redo log file until the backup of the data files is done,
which is when the changes from the redo log are flushed to the data files).
Once that's done, you archive the redo log, back up the archived logs, and
you're all set.
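
For comparison, the operator-level procedure described above looks roughly
like this on the Oracle side.  The tablespace name, file paths, and archive
destination are made up for illustration:

    echo "ALTER TABLESPACE users BEGIN BACKUP;" | sqlplus -s "/ as sysdba"

    cp /u01/oradata/mydb/users01.dbf /backup/   # any OS copy tool will do

    echo "ALTER TABLESPACE users END BACKUP;"   | sqlplus -s "/ as sysdba"
    echo "ALTER SYSTEM ARCHIVE LOG CURRENT;"    | sqlplus -s "/ as sysdba"

    cp /u01/oradata/mydb/arch/* /backup/arch/   # back up the archived redo logs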

Best regards,

Alex



Re: point in time recovery and moving datafiles online

From
Tom Lane
Date:
"Zeugswetter Andreas SB SD" <ZeugswetterA@spardat.at> writes:
> Yes. But I think you might have to avoid "vacuum full" during data file backup.

Why?  If vacuum is unsafe in this scenario, wouldn't it also be unsafe
in event of a system crash?

I do believe that vacuum should (but presently does not) emit a WAL
record showing its truncation of the file, so that the equivalent
truncation can be repeated during replay.  However, this is needed
in any case --- point-in-time recovery simply means replaying WAL on
a slightly longer timescale than is usual for crash-recovery.

> And you need the whole WAL including the "page images".

Check, you can't compress out the page images.
        regards, tom lane


Re: point in time recovery and moving datafiles online

From
Tom Lane
Date:
"Zeugswetter Andreas SB SD" <ZeugswetterA@spardat.at> writes:
>> I do believe that vacuum should (but presently does not) emit a WAL
>> record showing its truncation of the file, so that the equivalent
>> truncation can be repeated during replay.

> I see, but it has only the effect of not freeing the space to the OS, 
> so it is not really a bug ? Next vacuum will do it anyway. 

[ thinks... ]  Yeah, you're probably right.  The extra pages should
contain only dead tuples (either already dead or moved-off by vacuum),
so a seqscan would find no live data in them anyway.  It should be
fixed, but it's not critical (except possibly for people who are low
on disk space...)
        regards, tom lane


Re: point in time recovery and moving datafiles online

From
"Zeugswetter Andreas SB SD"
Date:
> > Yes. But I think you might have to avoid "vacuum full" during data file backup.
>
> Why?  If vacuum is unsafe in this scenario, wouldn't it also be unsafe
> in event of a system crash?

I was not sure we are 100% crash safe during "vacuum full".
But yes, now recalling your last remark on a crash during vacuum full
I guess we should be.

> I do believe that vacuum should (but presently does not) emit a WAL
> record showing its truncation of the file, so that the equivalent
> truncation can be repeated during replay.

> However, this is needed in any case.

I see, but it has only the effect of not freeing the space to the OS,
so it is not really a bug ? Next vacuum will do it anyway.

Andreas


Re: point in time recovery and moving datafiles online

From
Marc Munro
Date:
Tom,
Again, many thanks for the reply.

On Thu, 2002-02-21 at 21:27, Tom Lane wrote:
> 
> No, you're missing my point.  You don't need intra-file consistency any
> more than you need cross-file consistency.  You merely need to be sure
> that you have captured all the state of pages that are not updated
> anywhere in the series of WAL entries that you have.
> 
> I had originally started to compose email suggesting that locking on a
> per-disk-page basis (not a per-file basis) would be better, but I do
> not believe you need even that, for two reasons:
> 
> 1. PG will always write changes to data files in page-size write
> operations.  The Unix kernel guarantees that these writes appear atomic
> from the point of view of other processes.  So the data-file-backup
> process will see pagewise consistent data in any case.
> 
> 2. Even if the backup process managed to acquire an inconsistent
> (partially updated) copy of a page due to a concurrent write by a
> Postgres backend, we do not care.  The WAL activity is designed to
> ensure recovery from partial-page disk writes, and backing up such an
> inconsistent page copy would be isomorphic to a system failure after a
> partial page write.  Replay of the WAL will ensure that the page will be
> fully written from the WAL data.

Wow.  I hadn't appreciated that.  This is way cooler than I realised.

> In short, all you need is a mechanism for archiving off the WAL data and
> locating a checkpoint record in the WAL as a starting point for replay.
> Your data-file backup mechanism can be plain ol' tar or cp -r.  No
> interlocks needed or wanted.

So, the whole job is much easier than I thought.  This is a good thing.

I will rethink my strategy.  It looks like the tasks now are to manage
the archival of WAL files, provide an interface to manage the recovery
process, produce some guidelines/scripts for managing hot backups, and
write the documentation.


-- 
Marc        marc@bloodnok.com


Re: point in time recovery and moving datafiles online

From
Bruce Momjian
Date:
Zeugswetter Andreas SB SD wrote:
> > > Yes. But I think you might have to avoid "vacuum full" during data file backup.
> > 
> > Why?  If vacuum is unsafe in this scenario, wouldn't it also be unsafe
> > in event of a system crash?
> 
> I was not sure we are 100% crash safe during "vacuum full".
> But yes, now recalling your last remark on a crash during vacuum full
> I guess we should be.
> 
> > I do believe that vacuum should (but presently does not) emit a WAL
> > record showing its truncation of the file, so that the equivalent
> > truncation can be repeated during replay.
> 
> > However, this is needed in any case.
> 
> I see, but it has only the effect of not freeing the space to the OS, 
> so it is not really a bug ? Next vacuum will do it anyway. 

One idea is that replication can give us point-in-time recovery.  A
recovery is nothing more than a replication slave taking the last full
backup and reading the replication traffic to get itself up to date.

--
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 853-3000
  +  If your life is a hard drive,     |  830 Blythe Avenue
  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026


Re: point in time recovery and moving datafiles online

From
Tatsuo Ishii
Date:
> In short, all you need is a mechanism for archiving off the WAL data and
> locating a checkpoint record in the WAL as a starting point for replay.
> Your data-file backup mechanism can be plain ol' tar or cp -r.  No
> interlocks needed or wanted.

What would happen if a table is dropped or truncated while doing tar
on it?  I think we do not perform any logging while doing DROP TABLE or
TRUNCATE TABLE.  Logging the entire table would solve the problem, but that
seems to be impossible for huge tables.

Maybe we need to have some interlock mechanism to prevent DROP
TABLE/TRUNCATE/DROP DATABASE etc. while doing a backup.
--
Tatsuo Ishii





Re: point in time recovery and moving datafiles online

From
Tom Lane
Date:
Tatsuo Ishii <t-ishii@sra.co.jp> writes:
> What would happen if a table is dropped or truncated while doing tar
> on it?

Nothing bad, unless tar itself got confused.  (Which could perhaps
happen in the TRUNCATE case; I dunno if tar is happy when it can't read
as much from the file as it originally expected.)  As far as DROP goes,
any standard Unix system will postpone the physical file delete until
the last open file descriptor closes.

> I think we do not perform any logging while doing DROP TABLE or
> TRUCATE TABLE.

TRUNCATE isn't considered rollback-able anyway (indeed that's its whole
point) so I'm not too excited about whether or not it plays by the WAL
rules.  CREATE/DROP DATABASE the same.

As for DROP TABLE, what should happen is that at commit time we log the
IDs of the tables being deleted in WAL; then on replay we can re-delete
them as needed.  If you look in xact.c you will see comments where this
is supposed to happen --- but evidently Vadim never did get around to
that.  Other file creation and deletion actions need to be logged in WAL
as well.
        regards, tom lane


Re: point in time recovery and moving datafiles online

From
Tatsuo Ishii
Date:
> Tatsuo Ishii <t-ishii@sra.co.jp> writes:
> > What would happen if a table is dropped or truncated while doing tar
> > on it?
> 
> Nothing bad, unless tar itself got confused.  (Which could perhaps
> happen in the TRUNCATE case; I dunno if tar is happy when it can't read
> as much from the file as it originally expected.)  As far as DROP goes,
> any standard Unix system will postpone the physical file delete until
> the last open file descriptor closes.

My concern is what happens after DROP TABLE is committed.  In my
understanding, when the backup process starts, it records an LSN somewhere
(maybe pg_control?) to remember the place where the archive log recovery
could start later.

Consider a scenario:

(1) backup process starts (and records LSN)
(2) DROP TABLE t1 starts
(3) DROP TABLE t1 commits
(4) backup process ends

I think the database status should be able to go back to (1) using
the archive log recovery. No?
--
Tatsuo Ishii


Re: point in time recovery and moving datafiles online

From
Tom Lane
Date:
Tatsuo Ishii <t-ishii@sra.co.jp> writes:
> Consider a scenario:

> (1) backup process starts (and records LSN)
> (2) DROP TABLE t1 starts
> (3) DROP TABLE t1 commits
> (4) backup process ends

> I think the database status should be able to go back to (1) using
> the archive log recovery. No?

No.  It is not reasonable to expect the backup to allow you to recreate
any state occurring before the *end* of the backup process.  After the
backup is complete, you can use the backup and the WAL to duplicate the
state of any later instant.
        regards, tom lane


Re: point in time recovery and moving datafiles online

From
Tatsuo Ishii
Date:
> > (1) backup process starts (and records LSN)
> > (2) DROP TABLE t1 starts
> > (3) DROP TABLE t1 commits
> > (4) backup process ends
> 
> > I think the database status should be able to go back to (1) using
> > the archive log recovery. No?
> 
> No.  It is not reasonable to expect the backup to allow you to recreate
> any state occurring before the *end* of the backup process.  After the
> backup is complete, you can use the backup and the WAL to duplicate the
> state of any later instant.

I see your point.  But is it reasonable?  We cannot know beforehand when
the backup process will complete, and that makes the archive log recovery
less useful in my opinion.

I guess Oracle and other commercial DBMSs declare the start of the backup
process explicitly, and that would be the point to which the archive log
recovery could go back.  I'm interested in how Oracle accomplishes this
(I know DROP TABLE is not rollbackable in Oracle).
--
Tatsuo Ishii


Re: point in time recovery and moving datafiles online

From
Tom Lane
Date:
Tatsuo Ishii <t-ishii@sra.co.jp> writes:
>> No.  It is not reasonable to expect the backup to allow you to recreate
>> any state occurring before the *end* of the backup process.

> I see your point. But is it reasonable? We cannot know when the backup
> process completes beforehand and that makes the archive log recovery
> less usefull in my opinion.

So?  The backup will not let you recover states occurring before it
starts, either.  I don't see the issue.  Anyone using this in a
production situation will *always* have a prior backup and a continuous
WAL trail since the start of the prior backup.  Whether the
synchronization point is the start time or end time of each new backup
isn't a hot issue for them --- at least it shouldn't be if they are
following appropriate procedures.

The only way I can see to make it work the other way is to dump the
entire contents of any to-be-deleted table into the WAL log before
deleting it.  That is way too much overhead to be a reasonable tradeoff
for letting people think of the start time rather than end time of their
backups as the cutover point.
        regards, tom lane


Re: point in time recovery and moving datafiles online

From
Marc Munro
Date:
On Thu, 2002-03-07 at 18:00, Tatsuo Ishii wrote:
> > > (1) backup process starts (and records LSN)
> > > (2) DROP TABLE t1 starts
> > > (3) DROP TABLE t1 commits
> > > (4) backup process ends
> > 
> > > I think the database status should be able to go back to (1) using
> > > the archive log recovery. No?
> > 
> > No.  It is not reasonable to expect the backup to allow you to recreate
> > any state occurring before the *end* of the backup process.  After the
> > backup is complete, you can use the backup and the WAL to duplicate the
> > state of any later instant.

> I guess Oracle and other commercial DBMSs declare the start of backup
> process explicitly and that would be the point where the archive log
> recovery could go back. I'm interested in how Oracle accomplishes this
> (I know DROP TABLE is not rollbackable in Orale).

Actually it is possible to get back to state 1 but for this we need a
backup from earlier.  To restate the scenario:

0) Backup A completes
1) Database activity happens
2) Backup B begins
3) Drop table t1 starts
4) Drop table t1 commits
5) Backup B completes
6) More database activity

We can recover to any point in time for which we have WALs, starting from
any backup that completed before that point.  If we want to restore to a
point in time prior to step 5, we must use backup A.  For points after
step 5 we can use backup B.

This is exactly the way that we would recover databases using Oracle.  I
have known DBAs attempt to restore a backup only to discover media
problems, and then revert to backup N-1, N-2 and so on.  Once a good
backup is finally found, recovery (roll-forward) is performed.  It takes
longer from an older backup and there is more risk of encountering a
lost log file but it works just fine.


-- 
Marc        marc@bloodnok.com