Thread: Issues Outstanding for Point In Time Recovery (PITR)
Hello:

I've got the logging system to the point where I can take a shutdown
consistent copy of a system, and play forward through multiple
checkpoints. It seems to handle CREATE TABLE/DROP TABLE/TRUNCATE
properly, and things are moving forward well. Recovery to an arbitrary
point-in-time should be just as easy, but will need some administrative
interface for it.

At this point, some input would be useful on how I should handle things.

The most important questions that need answering are in sections 2 & 5,
since they impact the most other parts of the system. They will also
require good documentation for sysadmins.

Issues Outstanding for Point In Time Recovery (PITR)

$Date: 2002/07/04 14:23:37 $
$Revision: 1.4 $
J.R. Nield
(Enc: ISO 8859-15 Latin-9)

§0 - Introduction

  This file is where I'm keeping track of all the issues I run into while
  trying to get PITR to work properly. Hopefully it will evolve into a
  description of how PITR actually works once it is implemented.

  I will also try to add feedback as it comes in.

  The big items so-far are:
    §1 - Logging Relation file creation, truncation, and removal
         This is mostly done. Can do infinite play-forward from
         online logs.
    §2 - Partial-Write and Bad Block detection
         Need input before starting. Migration issues.
    §3 - Detecting Shutdown Consistent System Recovery
         Mostly done.
    §4 - Interactive Play-Forward Recovery for an Entire System
         Need input before starting.
    §5 - Individual file consistent recovery
         Need input. Semi-Major changes required.

§1 - Logging Relation file creation, truncation, and removal

§1.1 - Problem:

  Without file creation in the log, we can't replay committed
  transactions that create relations.

  The current code assumes that any transaction reaching commit has
  already ensured its files exist, and that those files will never be
  removed. This is true now, but not for log-replay from an old backup
  database system.

  The current XLOG code silently ignores block-write requests for
  non-existent files, and assumes that the transaction generating those
  requests must have aborted.

  Right now a crash during TRUNCATE TABLE will leave the table in an
  inconsistent state (partially truncated). This would not work when
  doing replay from before the last checkpoint.

§1.1.1 - CREATE DATABASE is also unlogged

  This will cause the same replay problems as above.

§1.2 - Proposal:

  a) Augment the SMGR code to log relation file operations, and to
     handle redo requests properly. This is simple in the case of
     create. Drop must be logged only IN the commit record. For
     truncate see (b).

     The 'struct f_smgr' needs new operations 'smgr_recreate',
     'smgr_reunlink', and 'smgr_retruncate'. smgr_recreate should
     accept a RelFileNode instead of a Relation. (A rough sketch of
     these entry points follows this list.)

     Transactions that abort through system failure (ie. unlogged
     aborts) will simply continue to leak files.

  b) If TRUNCATE TABLE fails, the system must PANIC. Otherwise, the
     table may be used in a future command, and a replay-recovered
     database may end up with different data than the original.

     WAL must be flushed before truncate as well.

     WAL does not need to be flushed before create, if we don't mind
     leaking files sometimes.

  c) Redo code should treat writes to non-existent files as an error.
     Changes affect heap & nbtree AM's. [Check others]

  d) rtree [and GiST? WTF is GiST? ] is not logged. A replay recovery
     of a database should mark all the rtree indices as corrupt.
     [ actually we should be doing that now; are we? ]

  e) CREATE DATABASE must be logged properly, not use system(cp...)
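  As a point of reference for (a), here is a minimal, self-contained
  sketch of how the proposed redo entry points might be declared and
  used. The RelFileNode stand-in, the argument lists, and the dispatch
  function are illustrative guesses for discussion, not the real
  backend declarations from smgr.h.

    /* Sketch only: stand-in types, not the real backend headers. */
    #include <stdint.h>
    #include <stdio.h>

    typedef struct RelFileNode      /* stand-in for the real struct */
    {
        uint32_t tblNode;           /* database */
        uint32_t relNode;           /* relation file number */
    } RelFileNode;

    typedef uint32_t BlockNumber;

    /*
     * Proposed additions to 'struct f_smgr': redo counterparts that
     * take a RelFileNode, since no Relation cache entry exists while
     * replaying the log.
     */
    typedef struct f_smgr_redo
    {
        int (*smgr_recreate)  (RelFileNode rnode);
        int (*smgr_reunlink)  (RelFileNode rnode);
        int (*smgr_retruncate)(RelFileNode rnode, BlockNumber nblocks);
    } f_smgr_redo;

    /*
     * Example redo dispatch for truncate: per item (c), a missing
     * file is now an error instead of being silently ignored.
     */
    static int
    smgr_redo_truncate(const f_smgr_redo *smgr, RelFileNode rnode,
                       BlockNumber nblocks)
    {
        if (smgr->smgr_retruncate(rnode, nblocks) < 0)
        {
            fprintf(stderr, "redo: truncate of missing file %u/%u\n",
                    (unsigned) rnode.tblNode, (unsigned) rnode.relNode);
            return -1;
        }
        return 0;
    }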
§1.3 - Status:

  All logged SMGR operations are now in a START_CRIT_SECTION()/
  END_CRIT_SECTION() pair enclosing the XLogInsert() and the underlying
  fs operations.

  Code has been added to smgr and xact modules to log:
    create (no XLogFlush)
    truncate (XLogFlush)
    pending deletes on commit record
    files to delete on abort record

  Code added to md.c to support redo ops

  Code added to smgr for RMGR redo/desc callbacks

  Code added to xact RMGR callbacks for redo/desc

  Database will do infinite shutdown consistent system recovery from
  the online logs, if you manually munge the control file to set
  state == DB_IN_PRODUCTION instead of DB_SHUTDOWNED.

  Still need to do:
    Item (c), recovery cleanup in all AM's
    Item (d), logging in other index AM's
    Item (e), CREATE DATABASE stuff

§2 - Partial-Write and Bad Block detection

§2.1 - Problem:

  In order to protect against partial writes without logging pages
  twice, we need to detect partial pages in system files and report
  them to the system administrator. We also might want to be able to
  detect damaged pages from other causes, like memory corruption, OS
  errors, etc., or in the case where the disk doesn't report bad
  blocks, but returns bad data.

  We should also decide what should happen when a file is marked as
  containing corrupt pages, and requires log-archive recovery from a
  backup.

§2.2 - Proposal:

  Add a 1 byte 'pd_flags' field to PageHeaderData, with the following
  flag definitions:

    PD_BLOCK_CHECKING           (1)
    PD_BC_METHOD_BIT            (1<<1)

    PageHasBlockChecking(page)  ((page)->pd_flags & PD_BLOCK_CHECKING)
    PageBCMethodIsCRC64(page)   ((page)->pd_flags & PD_BC_METHOD_BIT)
    PageBCMethodIsLSNLast(page) (!PageBCMethodIsCRC64(page))

  The last 64 bits of a page are reserved for use by the block checking
  code.

  [ Is it worth the trouble to allow the last 8 bytes of a page to
    contain data when block checking is turned off for a Page? This
    proposal does not allow that. ]

  If the block checking method is CRC64, then that field will contain
  the CRC64 of the block computed at write time.

  If the block checking method is LSNLast, then the field contains a
  duplicate of the pd_lsn field.

§2.2.1 - Changes to Page handling routines

  All the page handling routines need to understand that
    pd_special == BLCKSZ - (specialSize + 8)

  Change header comment in bufpage.h to reflect this.

§2.2.2 - When Reading a Page

  Block corruption is detected on read in the obvious way with CRC64.

  In the case of LSNLast, we check to see if pd_lsn == the lsn in the
  last 64 bits of the page. If not, we assume the page is corrupt from
  a partial write (although it could be something else). (A small
  sketch of this check appears after §2.3 below.)

  IMPORTANT ASSUMPTION:
    The OS/disk device will never write both the first part and last
    part of a block without writing the middle as well. This might be
    wrong in some cases, but at least it's fast.

§2.2.4 - GUC Variables

  The user should be able to configure what method is used:

    block_checking_write_method = [ checksum | torn_page_flag | none ]

      Which method should be used for blocks we write?

    check_blocks_on_read = [ true | false ]

      When true, verify that the blocks we read are not corrupt, using
      whatever method is in the block header.

      When false, ignore the block checking information.

§2.3 - Status:

  Waiting for input from pgsql-hackers.

  Questions:

    Should we allow the user to have more detailed control over which
    parts of a database use block checking?

    For example: use 'checksum' on all system catalogs in all
    databases, 'torn_page_flag' on the non-catalog parts of the
    production database, and 'none' on everything else?
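  For what it's worth, the LSNLast (torn_page_flag) check in §2.2.2
  amounts to stamping a copy of the head-of-page LSN into the reserved
  last 8 bytes at write time and comparing the two on read. Below is a
  self-contained sketch of just that comparison, using a plain 64-bit
  integer in place of the real PageHeaderData/XLogRecPtr layout, so the
  names and offsets are illustrative rather than the actual page format.

    /* Sketch only: a "page" is BLCKSZ bytes with a 64-bit LSN at the
     * front and the reserved block-checking field in the last 8 bytes. */
    #include <stdint.h>
    #include <string.h>

    #define BLCKSZ 8192

    /* Before the block is handed to write(): duplicate the LSN into
     * the last 8 bytes of the page. */
    static void
    page_stamp_lsn_last(char *page)
    {
        memcpy(page + BLCKSZ - sizeof(uint64_t), page, sizeof(uint64_t));
    }

    /* After read(): if head and tail disagree, the first and last
     * sectors came from different writes -- assume a torn page. */
    static int
    page_is_torn(const char *page)
    {
        uint64_t head;
        uint64_t tail;

        memcpy(&head, page, sizeof(head));
        memcpy(&tail, page + BLCKSZ - sizeof(tail), sizeof(tail));
        return head != tail;
    }

  The CRC64 method would instead store a checksum of the whole block in
  the same 8 bytes and recompute it on read; only the comparison step
  differs.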
§3 - Detecting Shutdown Consistent System Recovery

§3.1 - Problem:

  How to notice that we need to do log-replay for a system backup, when
  the restored control file points to a shutdown checkpoint record that
  is before the most recent checkpoint record in the log, and may point
  into an archived file.

§3.2 - Proposal:

  At startup, after reading the ControlFile, scan the log directory to
  get the list of active log files, and find the lowest logId and
  logSeg of the files. Ensure that the files cover a contiguous range
  of LSN's.

  There are three cases:

    1) ControlFile points to the last valid checkpoint (either
       checkPoint or prevCheckPoint, but one of them is the greatest
       valid checkpoint record in the log stream).

    2) ControlFile points to a valid checkpoint record in an active
       log file, but there are more valid checkpoint records beyond
       it.

    3) ControlFile points to a checkpoint record that should be in the
       archive logs, and is presumably valid.

  Case 1 is what we handle now.

  Cases 2 and 3 would result from restoring an entire system from
  backup in preparation to do a play-forward recovery.

  We need to:

    Detect cases 2 and 3.

    Alert the administrator and abort startup.
    [Question: Is this always the desired behavior? We can handle
     case 2 without intervention. ]

    Let the administrator start a standalone backend, and perform a
    play-forward recovery for the system.

§3.3 - Status:

  In progress.

§4 - Interactive Play-Forward Recovery for an Entire System

  Play-Forward File Recovery from a backup file must be interactive,
  because not all log files that we need are necessarily in the archive
  directory. It may be possible that not all the archive files we need
  can even fit on disk at one time.

  The system needs to be able to prompt the system administrator to
  feed it more log files.

  TODO: More here

§5 - Individual file consistent recovery

§5.1 - Problem:

  If a file detects corruption, and we restore it from backup, how do
  we know what archived files we need for recovery?

  Should file corruption (partial write, bad disk block, etc.) outside
  the system catalog cause us to abort the system, or should we just
  take the relation or database off-line?

  Given a backup file, how do we determine the point in the log where
  we should start recovery for the file? What is the highest LSN we can
  use that will fully recover the file?

§5.2 - Proposal:

  Put a file header on each file, and update that header to the last
  checkpoint LSN at least once every 'file_lsn_time_slack' minutes, or
  at least once every dbsize/'file_lsn_log_slack' megabytes of log
  written, where dbsize is the estimated size of the database. Have
  these values be settable from the config file. These updates would be
  distributed throughout the hour, or interspersed between regular
  amounts of log generation. (A rough sketch of such a header appears
  after the definitions section below.)

  If we have a database backup program or command, it can update the
  header on the file before backup to the greatest value it can assure
  to be safe.

§5.3 - Status:

  Waiting for input from pgsql-hackers.

  Questions:

    There are methods other than using a file header to get a
    known-good LSN lower bound for the starting point to recover a
    backup file. Is this the best way?

A) The Definitions

  This stuff is obtuse, but I need it here to keep track of what I'm
  saying. Someday I should use it consistently in the rest of this
  document.

  "system" or "database system":

      A collection of postgres "databases" in one $PGDATA directory,
      managed by one postmaster instance at a time (and having one WAL
      log, etc.)

      All the files composing such a system, as a group.
"up to date" or "now" or "current" or "current LSN": The most recent durable LSN for the system. "block consistent copy": When referring to a file: A copy of a file, which may be written to during the process of copying, but where each BLCKSZ size block iscopied atomically. When referring to multiple files (in the same system): A copy of all the files, such that each is independently a "block consistent copy" "file consistent copy": When referring to a file: A copy of a file that is not written to between the start and end of the copy operation. When referring to multiple files (in the same system): A copy of all the files, such that each is independently a "file consistent copy" "system consistent copy": When referring to a file: A copy of a file, where the entire system of which it is a member isnot written to during the copy. When referring to multiple files (in the same system): A copy of all the files, where the entire system of which they are members was not written to between the startand end of the copying of all the files, as a group. "shutdown consistent copy": When referring to a file: A copy of a file, where the entire system of which it is a member had been cleanly shutdown before the startof and for the duration of the copy. When referring to multiple files (in the same system): A copy of all the files, where the entire system of which they are members had been cleanly shutdown beforethe start of and for the duration of the copying of all the files, as a group. "consistent copy": A block, file, system, or shutdown consistent copy. "known-good LSN lower bound" or "LSN lower bound" or "LSN-LB": When referring to a group of blocks, a file, or a group of files: An LSN known to be old enough that no log entries before it are needed to bring the blocks or files up-to-date. "known-good LSN greatest lower bound" or "LSN greatest lower bound" or "LSN-GLB": When referring to a group of blocks, a file, or a group of files: The greatest possible LSN that is a known-good LSN lower bound for the group. "backup file": A consistent copy of a data file used by the system, for which we have a known-good LSN lower bound. "optimal backup file": A backup file, for which we have the known-good LSN greatest lower bound. "backup system": "Play-Forward File Recovery" or "PFFR": The process of bringing an individual backup file up to date. -- J. R. Nield jrnield@usol.com
On Thu, 2002-07-04 at 11:45, J. R. Nield wrote:

One other item that should be here:

> The big items so-far are:
>     §1 - Logging Relation file creation, truncation, and removal
>          This is mostly done. Can do infinite play-forward from
>          online logs.
>     §2 - Partial-Write and Bad Block detection
>          Need input before starting. Migration issues.
>     §3 - Detecting Shutdown Consistent System Recovery
>          Mostly done.
>     §4 - Interactive Play-Forward Recovery for an Entire System
>          Need input before starting.
>     §5 - Individual file consistent recovery
>          Need input. Semi-Major changes required.

  §6 - btbuild is not logged

      Not logged because of the same assumptions as for file create.
      Only need to log the index build parameters to recreate the
      index, not each page change. (A hypothetical record layout is
      sketched below.)

-- 
J. R. Nield
jrnield@usol.com
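One way to read "log the index build parameters" in §6 is as a single
WAL record from which replay re-runs the entire build, rather than
logging each page the build writes. A hypothetical record layout along
those lines; the name, the fields, and the trailing key-column array
are all guesses, not an actual PostgreSQL WAL record:

    #include <stdint.h>

    /* Hypothetical "redo the index build" record body. */
    typedef struct xl_index_build
    {
        uint32_t heap_tblNode;     /* heap relation's RelFileNode */
        uint32_t heap_relNode;
        uint32_t index_tblNode;    /* index relation's RelFileNode */
        uint32_t index_relNode;
        uint16_t natts;            /* number of key columns */
        uint16_t unique;           /* non-zero for a unique index */
        /* followed by natts key column numbers and operator classes */
    } xl_index_build;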
I noticed no one has responded to your questions yet.  I think it is
because we are sort of in shock.  We have needed point-in-time recovery
for a long time, but the people who were capable of doing it weren't
able to focus on it.  Then, all of a sudden, we get an email from
someone who is focusing on it and wants to get the job done.  GREAT!

I will give you my short analysis and see how many other questions I
can answer.

We have always known there was a way to do PITR with WAL, but WAL
needed a few extra pieces of information.  Unfortunately, we weren't
able to analyze what was required.  Seems you have gotten very far
here, and that is great.  Also, we thought about having PITR as part of
replication (replay of replication traffic logs) but having it tied to
WAL is much cleaner and has better performance, I bet.

I will do whatever I can to help.  My chat addresses are:

    AIM    bmomjian
    ICQ    151255111
    Yahoo  bmomjian
    MSN    root@candle.pha.pa.us
    IRC    #postgresql via efnet

---------------------------------------------------------------------------

J. R. Nield wrote:
> Hello:
>
> I've got the logging system to the point where I can take a shutdown
> consistent copy of a system, and play forward through multiple
> checkpoints. It seems to handle CREATE TABLE/DROP TABLE/TRUNCATE

Yes, that was always something we knew was lacking in the current WAL
contents.

> properly, and things are moving forward well. Recovery to an arbitrary
> point-in-time should be just as easy, but will need some administrative
> interface for it.

The administrative part can be done easily.  We can get that part done.
It is the low-level stuff we always needed help with.

> At this point, some input would be useful on how I should handle things.
>
> The most important questions that need answering are in sections 2 & 5,
> since they impact the most other parts of the system. They will also
> require good documentation for sysadmins.

> §0 - Introduction
>
> This file is where I'm keeping track of all the issues I run into while
> trying to get PITR to work properly. Hopefully it will evolve into a
> description of how PITR actually works once it is implemented.
>
> I will also try to add feedback as it comes in.
>
> The big items so-far are:
> §1 - Logging Relation file creation, truncation, and removal
>      This is mostly done. Can do infinite play-forward from
>      online logs.

Excellent!

> §2 - Partial-Write and Bad Block detection
>      Need input before starting. Migration issues.

Uh, we do log pre-page writes to WAL to recover from partial page
writes to disk.  Is there something more we need here?

As for bad block detection, we have thought about adding a CRC to each
page header, or at least making it optional.  WAL already has a CRC.

> §3 - Detecting Shutdown Consistent System Recovery
>      Mostly done.

> §4 - Interactive Play-Forward Recovery for an Entire System
>      Need input before starting.

You mean user interface?  My idea would be to just get some
command-line tool working and we can write some GUI app to manage it
and use the command-line tool as an interface into the system.

> §5 - Individual file consistent recovery
>      Need input. Semi-Major changes required.
>
> OK, here are the specific questions.

Got it.

> §1 - Logging Relation file creation, truncation, and removal
>
> §1.1 - Problem:
>
> Without file creation in the log, we can't replay committed
> transactions that create relations.
>
> The current code assumes that any transaction reaching commit has already
> ensured its files exist, and that those files will never be removed. This
> is true now, but not for log-replay from an old backup database system.
>
> The current XLOG code silently ignores block-write requests for
> non-existent files, and assumes that the transaction generating those
> requests must have aborted.
>
> Right now a crash during TRUNCATE TABLE will leave the table in an
> inconsistent state (partially truncated). This would not work when doing
> replay from before the last checkpoint.

Yes, there are a few places where we actually create a file, and if the
server crashes, the file remains out there forever.  We need to track
that better.  I think we may need some utility that compares pg_class
with the files in the directory and cleans out unused files on server
startup.  I started working on such code as part of VACUUM but it made
too many assumptions because it knew other backends were working at the
same time.  On recovery, you don't have that problem and can easily do
almost an 'ls' and clean out just what's left over from the crash.
Seems that would solve several of those problems.

>
> §1.1.1 - CREATE DATABASE is also unlogged
>
> This will cause the same replay problems as above.

Yep.  Again, seems a master cleanup on startup is needed.

> §1.2 - Proposal:
>
> a) Augment the SMGR code to log relation file operations, and to handle
> redo requests properly. This is simple in the case of create. Drop must be
> logged only IN the commit record. For truncate see (b).

Yep, we knew we needed that.

> The 'struct f_smgr' needs new operations 'smgr_recreate', 'smgr_reunlink',
> and 'smgr_retruncate'. smgr_recreate should accept a RelFileNode instead
> of a Relation.

No problem.  Clearly required.

> Transactions that abort through system failure (ie. unlogged aborts)
> will simply continue to leak files.

Yep, need a cleanup process on start.

> b) If TRUNCATE TABLE fails, the system must PANIC. Otherwise, the table
> may be used in a future command, and a replay-recovered database may
> end-up with different data than the original.

We number based on oids.  You mean oid wraparound could cause the file
to be used again?

> WAL must be flushed before truncate as well.
>
> WAL does not need to be flushed before create, if we don't mind
> leaking files sometimes.

Cleanup?

> c) Redo code should treat writes to non-existent files as an error.
> Changes affect heap & nbtree AM's. [Check others]

Yep, once you log create/drop, if something doesn't match, it is an
error, while before, we could ignore it.

> d) rtree [and GiST? WTF is GiST? ] is not logged. A replay recovery of
> a database should mark all the rtree indices as corrupt.
> [ actually we should do that now, are we? ]

Known problem.  Not sure what is being done.  TODO has:

    * Add WAL index reliability improvement to non-btree indexes

so it is a known problem, and we aren't doing anything about it.  What
more can I say?  ;-)

> e) CREATE DATABASE must be logged properly, not use system(cp...)

OK, should be interesting.

> §1.3 - Status:
>
> All logged SMGR operations are now in a START_CRIT_SECTION()/
> END_CRIT_SECTION() pair enclosing the XLogInsert() and the underlying fs
> operations.
>
> Code has been added to smgr and xact modules to log:
>     create (no XLogFlush)
>     truncate (XLogFlush)
>     pending deletes on commit record
>     files to delete on abort record
>
> Code added to md.c to support redo ops
>
> Code added to smgr for RMGR redo/desc callbacks
>
> Code added to xact RMGR callbacks for redo/desc
>
> Database will do infinite shutdown consistent system recovery from the
> online logs, if you manually munge the control file to set state ==
> DB_IN_PRODUCTION instead of DB_SHUTDOWNED.

Wow, how did you get so far?

> Still need to do:
>     Item (c), recovery cleanup in all AM's
>     Item (d), logging in other index AM's
>     Item (e), CREATE DATABASE stuff
>
>
> §2 - Partial-Write and Bad Block detection
>
> §2.1 - Problem:
>
> In order to protect against partial writes without logging pages
> twice, we need to detect partial pages in system files and report them
> to the system administrator. We also might want to be able to detect
> damaged pages from other causes, like memory corruption, OS errors,
> etc. or in the case where the disk doesn't report bad blocks, but
> returns bad data.

Interesting.  We just had a discussion about MSSQL torn-page bits on
every 512-byte block that are set to the same value before the write.
On recovery, if any of the bits in a block are different, they
recommend recovery using PITR.  We don't have PITR (yet) so there was
no need to implement it (we just logged whole pages to WAL before
writing).  I think we may go with just a CRC per page for
partial-write detection.

> We should also decide what should happen when a file is marked as
> containing corrupt pages, and requires log-archive recovery from a
> backup.

We can offer the option of no-wal change-page writing which will
require PITR on partial-write detection, or they can keep the existing
system with the performance hit.

>
> §2.2 - Proposal:
>
> Add a 1 byte 'pd_flags' field to PageHeaderData, with the following
> flag definitions:
>
> PD_BLOCK_CHECKING (1)
> PD_BC_METHOD_BIT (1<<1)
>
> PageHasBlockChecking(page) ((page)->pd_flags & PD_BLOCK_CHECKING)
> PageBCMethodIsCRC64(page) ((page)->pd_flags & PD_BC_METHOD_BIT)
> PageBCMethodIsLSNLast(page) (!PageBCMethodIsCRC64(page))
>
> The last 64 bits of a page are reserved for use by the block checking
> code.

OK, so you already are on the CRC route.

> [ Is it worth the trouble to allow the last 8 bytes of a
> page to contain data when block checking is turned off for a Page?
> This proposal does not allow that. ]

You can leave the 8-byte field empty if there is no CRC.  You may want
to turn CRC on/off without a dump.

> If the block checking method is CRC64, then that field will contain
> the CRC64 of the block computed at write time.

Cool.

> If the block checking method is LSNLast, then the field contains a
> duplicate of the pd_lsn field.
>
> §2.2.1 - Changes to Page handling routines
>
> All the page handling routines need to understand that
> pd_special == BLCKSZ - (specialSize + 8)
>
> Change header comment in bufpage.h to reflect this.

Yes, we should add a format version to the heap page tail anyway like
btree has, i.e. some constant on every page that describes the format
used in that PostgreSQL version.

> §2.2.2 - When Reading a Page
>
> Block corruption is detected on read in the obvious way with CRC64.
>
> In the case of LSNLast, we check to see if pd_lsn == the lsn in the
> last 64 bits of the page. If not, we assume the page is corrupt from
> a partial write (although it could be something else).

LSN?
> IMPORTANT ASSUMPTION: > The OS/disk device will never write both the first part and > last part of a block without writing the middle as well. > This might be wrong in some cases, but at least it's fast. > > ?2.2.4 - GUC Variables > > The user should be able to configure what method is used: > > block_checking_write_method = [ checksum | torn_page_flag | none ] > > Which method should be used for blocks we write? Do we want torn page flag? Seems like a pain to get that on every 512 byte section of the 8k page. > check_blocks_on_read = [ true | false ] > > When true, verify that the blocks we read are not corrupt, using > whatever method is in the block header. > > When false, ignore the block checking information. > Good idea. We always check on crash, but check on read only when set. Good for detecting hardware problems. > ?2.3 - Status: > > Waiting for input from pgsql-hackers. > > Questions: > > Should we allow the user to have more detailed control over > which parts of a database use block checking? I don't think that is needed; installation-wide settings are fine. > For example: use 'checksum' on all system catalogs in all databases, > 'torn_page_flag' on the non-catalog parts of the production database, > and 'none' on everything else? Too complicated. Let's get it implemented and in the field and see what people ask for. > ?3 - Detecting Shutdown Consistent System Recovery > > ?3.1 - Problem: > > How to notice that we need to do log-replay for a system backup, when the > restored control file points to a shutdown checkpoint record that is > before the most recent checkpoint record in the log, and may point into > an archived file. > > ?3.2 - Proposal: > > At startup, after reading the ControlFile, scan the log directory to > get the list of active log files, and find the lowest logId and > logSeg of the files. Ensure that the files cover a contiguous range > of LSN's. > > There are three cases: > > 1) ControlFile points to the last valid checkpoint (either > checkPoint or prevCheckPoint, but one of them is the greatest > valid checkpoint record in the log stream). > > 2) ControlFile points to a valid checkpoint record in an active > log file, but there are more valid checkpoint records beyond > it. > > 3) ControlFile points to a checkpoint record that should be in the > archive logs, and is presumably valid. > > Case 1 is what we handle now. > > Cases 2 and 3 would result from restoring an entire system from > backup in preparation to do a play-forward recovery. > > We need to: > > Detect cases 2 and 3. > > Alert the administrator and abort startup. > [Question: Is this always the desired behavior? We can > handle case 2 without intervention. ] > > Let the administrator start a standalone backend, and > perform a play-forward recovery for the system. > > ?3.3 - Status: > > In progress. Sorry, I was confused by this. > ?4 - Interactive Play-Forward Recovery for an Entire System > > Play-Forward File Recovery from a backup file must be interactive, > because not all log files that we need are necessarily in the > archive directory. It may be possible that not all the archive files > we need can even fit on disk at one time. > > The system needs to be able to prompt the system administrator to feed > it more log files. > > TODO: More here Yes, we can have someone working on the GUI once the command-line interface is defined. 
> ?5 - Individual file consistent recovery > > ?5.1 - Problem: > > If a file detects corruption, and we restore it from backup, how do > we know what archived files we need for recovery? > > Should file corruption (partial write, bad disk block, etc.) outside > the system catalog cause us to abort the system, or should we just > take the relation or database off-line? Offline is often best so they can get in there and recover if needed. We usually allow them in with a special flag or utility like pg_resetxlog. > Given a backup file, how do we determine the point in the log > where we should start recovery for the file? What is the highest LSN > we can use that will fully recover the file? That is tricky. We have discussed it and your backup has to deal with some pretty strange things that can happen while 'tar' is traversing the directory. > ?5.2 - Proposal: > > Put a file header on each file, and update that header to the last > checkpoint LSN at least once every 'file_lsn_time_slack' minutes, or > at least once every dbsize/'file_lsn_log_slack' megabytes of log > written, where dbsize is the estimated size of the database. Have > these values be settable from the config file. These updates would be > distributed throughout the hour, or interspersed between regular > amounts of log generation. > > If we have a database backup program or command, it can update the > header on the file before backup to the greatest value it can assure > to be safe. Not sure. > ?5.3 - Status: > > Waiting for input from pgsql-hackers. > > Questions: > > There are alternate methods than using a file header to get a > known-good LSN lower bound for the starting point to recover a backup > file. Is this the best way? Not sure. I am sure others will chime in with more information. -- Bruce Momjian | http://candle.pha.pa.us pgman@candle.pha.pa.us | (610) 853-3000+ If your life is a hard drive, | 830 Blythe Avenue + Christ can be your backup. | Drexel Hill, Pennsylvania19026
J. R. Nield wrote: > ?5 - Individual file consistent recovery > > ?5.1 - Problem: > > If a file detects corruption, and we restore it from backup, how do > we know what archived files we need for recovery? > > Should file corruption (partial write, bad disk block, etc.) outside > the system catalog cause us to abort the system, or should we just > take the relation or database off-line? > > Given a backup file, how do we determine the point in the log > where we should start recovery for the file? What is the highest LSN > we can use that will fully recover the file? > > ?5.2 - Proposal: > > Put a file header on each file, and update that header to the last > checkpoint LSN at least once every 'file_lsn_time_slack' minutes, or > at least once every dbsize/'file_lsn_log_slack' megabytes of log > written, where dbsize is the estimated size of the database. Have > these values be settable from the config file. These updates would be > distributed throughout the hour, or interspersed between regular > amounts of log generation. > > If we have a database backup program or command, it can update the > header on the file before backup to the greatest value it can assure > to be safe. I know there was discussion about this. The issue was when you are doing the backup, how do you handle changes in the file that happen during the backup? I think there was some idea of remembering the WAL pointer location at the start of the tar backup, and somehow playing forward from that point. Now, the trick is to know how much of the WAL is duplicated in the backup, and how much needs to be applied to roll forward. Because we have a non-overwriting storage manager, I think one idea was to just replay everything, knowing that some of it may be unnecessary. However, VACUUM FULL would complicate that because it does overwrite tuples and you may get into trouble playing all of that back. I am sure someone has analyzed this better than me. -- Bruce Momjian | http://candle.pha.pa.us pgman@candle.pha.pa.us | (610) 853-3000+ If your life is a hard drive, | 830 Blythe Avenue + Christ can be your backup. | Drexel Hill, Pennsylvania19026
On Fri, 2002-07-05 at 01:42, Bruce Momjian wrote: > > We have needed > point-in-time recovery for a long time, Most thanks should go to vadim (and whoever else worked on this), since his WAL code already does most of the work. The key thing is auditing the backend to look for every case where we assume some action is not visible until after commit, and therefore don't log its effects. Those are the main cases that must be changed. > --------------------------------------------------------------------------- > > J. R. Nield wrote: > > Hello: > > > > I've got the logging system to the point where I can take a shutdown > > consistent copy of a system, and play forward through multiple > > checkpoints. It seems to handle CREATE TABLE/DROP TABLE/TRUNCATE But notably not for the btree indexes! It looked like they were working, because the files were there, and all indexes created before the backup would work under insert/delete (including sys catalog indexes). This is because btree insert/delete is logged, just not during build. So I missed that one case. You will end-up with up-to-date table data though, so it is something. Adding logging support to btbuild is the next step, and I don't think it should be too hard. I am working this now. It is also a major advantage that most everything in the system gets stored in the catalog tables, and so is logged already. > Uh, we do log pre-page writes to WAL to recover from partial page > writes to disk. Is there something more we need here? > > As for bad block detection, we have thought about adding a CRC to each > page header, or at least making it optional. WAL already has a CRC. > Yes this should be last to do, because it is not necessary for PITR, only for performance (the option not to write pre-images without fear of data loss). > Yes, there are a few places where we actually create a file, and if the > server crashes, the file remains out there forever. We need to track that > better. OK, there is a bigger problem then just tracking the file though. We sometimes do stuff to that file that we don't log. We assume that if we commit, the file must be OK and will not need replay because the transaction would not have committed if the file was not in a commitable state. If we abort, the system never sees the file, so in a sense we undo everything we did to the file. It is a kind of poor-man's rollback for certain operations, like btbuild, create table, etc. But it means that we cannot recover the file from the log, even after a commit. > > > > > ?1.1.1 - CREATE DATABASE is also unlogged > > > > This will cause the same replay problems as above. > > Yep. Again, seems a master cleanup on startup is needed. The cleanup is not the problem, only a nuisance. Creating the files during replay is the problem. I must recreate CREATE DATABASE from the log exactly as it was done originally. I think just logging the parameters to the command function should be sufficient, but I need to think more about it. > > > b) If TRUNCATE TABLE fails, the system must PANIC. Otherwise, the table > > may be used in a future command, and a replay-recovered database may > > end-up with different data than the original. > > We number based on oids. You mean oid wraparound could cause the file > to be used again? That's not what I meant. Let's say I issue 'TRUNCATE TABLE foo'. Then, right before smgrtruncate is called, I do an XLogInsert saying "Redo a TRUNCATE TABLE on foo to nblocks if we crash". 
Then smgrtruncate fails and we do an elog(ERROR) Now the user decides that since TRUNCATE TABLE didn't work, he might as well use the table, so he inserts some records into it, generating log entries. When I replay this log sequence later, what happens if the TRUNCATE succeeds instead of failing? I admit that there are other ways of handling it than to PANIC if the truncate fails. All the ones I can come up with seem to amount to some kind of ad-hoc UNDO log. > > > WAL must be flushed before truncate as well. > > > > WAL does not need to be flushed before create, if we don't mind > > leaking files sometimes. > > Cleanup? Yes, we could garbage-collect leaked files. XLogFlush is not that expensive though, so I don't have an opinion on this yet. > > > c) Redo code should treat writes to non-existent files as an error. > > Changes affect heap & nbtree AM's. [Check others] > > Yep, once you log create/drop, if something doesn't match, it is an > error, while before, we could ignore it. > > > d) rtree [and GiST? WTF is GiST? ] is not logged. A replay recovery of > > a database should mark all the rtree indices as corrupt. > > [ actually we should do that now, are we? ] > > Known problem. Not sure what is being done. TODO has: > > * Add WAL index reliability improvement to non-btree indexes > > so it is a known problem, and we aren't doing anything about it. What > more can I say? ;-) Once the other stuff works reliably, I will turn to rtree logging, which I have looked at somewhat, although I could really use a copy of the paper it is supposed to be based on. I have not figured out GiST enough to work on it yet. > > > e) CREATE DATABASE must be logged properly, not use system(cp...) > > OK, should be interesting. > > > ?1.3 - Status: > > > > All logged SMGR operations are now in a START_CRIT_SECTION()/ > > END_CRIT_SECTION() pair enclosing the XLogInsert() and the underlying fs > > operations. > > > > Code has been added to smgr and xact modules to log: > > create (no XLogFlush) > > truncate (XLogFlush) > > pending deletes on commit record > > files to delete on abort record > > > > Code added to md.c to support redo ops > > > > Code added to smgr for RMGR redo/desc callbacks > > > > Code added to xact RMGR callbacks for redo/desc > > > > Database will do infinite shutdown consistent system recovery from the > > online logs, if you manually munge the control file to set state == > > DB_IN_PRODUCTION instead of DB_SHUTDOWNED. > > Wow, how did you get so far? Because it was almost there to start with :-) Besides, it sounded better before I realized there was still a remaining problem with btree logging to fix. > > In the case of LSNLast, we check to see if pd_lsn == the lsn in the > > last 64 bits of the page. If not, we assume the page is corrupt from > > a partial write (although it could be something else). > > LSN? Log Sequence Number (XLogRecPtr) > > > IMPORTANT ASSUMPTION: > > The OS/disk device will never write both the first part and > > last part of a block without writing the middle as well. > > This might be wrong in some cases, but at least it's fast. > > > > ?2.2.4 - GUC Variables > > > > The user should be able to configure what method is used: > > > > block_checking_write_method = [ checksum | torn_page_flag | none ] > > > > Which method should be used for blocks we write? > > Do we want torn page flag? Seems like a pain to get that on every 512 > byte section of the 8k page. Ok, this section (2.2) was badly written and hard to understand. 
What I am proposing is that we put a copy of the log sequence number, which is at the head of the page, into the 8 byte field that we are creating at the end of the page, in place of the CRC. The log sequence number increases every time the page is written (it already does this). I have called this method 'LSNLast' internally, and the user would call it the 'torn_page_flag' method. So when we read the page, we compare the Log Sequence Number at the beginning and end of the page, and if they are different we assume a torn page. This version is weaker than the MS one we were talking about, because it is not on every 512 byte section of the page, only the beginning and the end. I'm simply looking for a fast alternative to CRC64, that doesn't require massive reorganization of the page layout code. > > > ?2.3 - Status: > > > > Waiting for input from pgsql-hackers. > > > > Questions: > > > > Should we allow the user to have more detailed control over > > which parts of a database use block checking? > > I don't think that is needed; installation-wide settings are fine. > > > For example: use 'checksum' on all system catalogs in all databases, > > 'torn_page_flag' on the non-catalog parts of the production database, > > and 'none' on everything else? > > Too complicated. Let's get it implemented and in the field and see what > people ask for. Ok. I agree. > > > ?3 - Detecting Shutdown Consistent System Recovery > > > > ?3.3 - Status: > > > > In progress. > > Sorry, I was confused by this. Let me re-write it, and I'll post it in the next version. The section dealt with what to do when you have a valid restored controlfile from a backup system, which is in the DB_SHUTDOWNED state, and that points to a valid shutdown/checkpoint record in the log; only the checkpoint record happens not to be the last one in the log. This is a situation that could never happen now, but would in PITR. > > > ?4 - Interactive Play-Forward Recovery for an Entire System > > > > Play-Forward File Recovery from a backup file must be interactive, > > because not all log files that we need are necessarily in the > > archive directory. It may be possible that not all the archive files > > we need can even fit on disk at one time. > > > > The system needs to be able to prompt the system administrator to feed > > it more log files. > > > > TODO: More here > > Yes, we can have someone working on the GUI once the command-line > interface is defined. Yes, and the system must not allow any concurrent activity during recovery either. So it looks like a standalone backend operation. > > > ?5 - Individual file consistent recovery > > > > ?5.1 - Problem: > > > > If a file detects corruption, and we restore it from backup, how do > > we know what archived files we need for recovery? > > > > Should file corruption (partial write, bad disk block, etc.) outside > > the system catalog cause us to abort the system, or should we just > > take the relation or database off-line? > > Offline is often best so they can get in there and recover if needed. > We usually allow them in with a special flag or utility like > pg_resetxlog. > > > > Given a backup file, how do we determine the point in the log > > where we should start recovery for the file? What is the highest LSN > > we can use that will fully recover the file? > > That is tricky. We have discussed it and your backup has to deal with > some pretty strange things that can happen while 'tar' is traversing the > directory. 
Even if we shut down before we copy the file, we don't want a file that hasn't been written to in 5 weeks before it was backed up to require five weeks of old log files to recover. So we need to track that information somehow, because right now if we scanned the blocks in the file looking at the page LSN's, the greatest LSN we would see might be much older than where it would be safe to recover from. That is the biggest problem, I think. > > > §5.2 - Proposal: > > Put a file header on each file, and update that header to the last > checkpoint LSN at least once every 'file_lsn_time_slack' minutes, or > at least once every dbsize/'file_lsn_log_slack' megabytes of log > written, where dbsize is the estimated size of the database. Have > these values be settable from the config file. These updates would be > distributed throughout the hour, or interspersed between regular > amounts of log generation. > > If we have a database backup program or command, it can update the > header on the file before backup to the greatest value it can assure > to be safe. > > Not sure. > > > §5.3 - Status: > > > > Waiting for input from pgsql-hackers. > > > > Questions: > > > > There are alternate methods than using a file header to get a > > known-good LSN lower bound for the starting point to recover a backup > > file. Is this the best way? > > Not sure. > > I am sure others will chime in with more information. > > -- > Bruce Momjian | http://candle.pha.pa.us > pgman@candle.pha.pa.us | (610) 853-3000 > + If your life is a hard drive, | 830 Blythe Avenue > + Christ can be your backup. | Drexel Hill, Pennsylvania 19026 > -- J. R. Nield jrnield@usol.com
> Let me re-write it, and I'll post it in the next version. The section > dealt with what to do when you have a valid restored controlfile from a > backup system, which is in the DB_SHUTDOWNED state, and that points to a > valid shutdown/checkpoint record in the log; only the checkpoint record > happens not to be the last one in the log. This is a situation that > could never happen now, but would in PITR. But it would need to be restore's responsibility to set the flag to DB_IN_PRODUCTION, no? > Even if we shutdown before we copy the file, we don't want a file that > hasn't been written to in 5 weeks before it was backed up to require > five weeks of old log files to recover. So we need to track that > information somehow, because right now if we scanned the blocks in the > file looking for at the page LSN's, we greatest LSN we would see might > be much older than where it would be safe to recover from. That is the > biggest problem, I think. Well, if you skip a validity test it could be restore's responsibility to know which checkpoint was last before the file backup was taken. (When doing a backup you would need to include the last checkpoint info == pg_control at start of backup) Andreas
J. R. Nield wrote: > On Fri, 2002-07-05 at 01:42, Bruce Momjian wrote: > > > > We have needed > > point-in-time recovery for a long time, > > Most thanks should go to vadim (and whoever else worked on this), since > his WAL code already does most of the work. The key thing is auditing > the backend to look for every case where we assume some action is not > visible until after commit, and therefore don't log its effects. Those > are the main cases that must be changed. Yep. Glad you can focus on that. > > --------------------------------------------------------------------------- > > > > J. R. Nield wrote: > > > Hello: > > > > > > I've got the logging system to the point where I can take a shutdown > > > consistent copy of a system, and play forward through multiple > > > checkpoints. It seems to handle CREATE TABLE/DROP TABLE/TRUNCATE > > But notably not for the btree indexes! It looked like they were working, > because the files were there, and all indexes created before the backup > would work under insert/delete (including sys catalog indexes). This is > because btree insert/delete is logged, just not during build. So I > missed that one case. > > You will end-up with up-to-date table data though, so it is something. > > Adding logging support to btbuild is the next step, and I don't think it > should be too hard. I am working this now. Great. > It is also a major advantage that most everything in the system gets > stored in the catalog tables, and so is logged already. > > > > Uh, we do log pre-page writes to WAL to recover from partial page > > writes to disk. Is there something more we need here? > > > > As for bad block detection, we have thought about adding a CRC to each > > page header, or at least making it optional. WAL already has a CRC. > > > > Yes this should be last to do, because it is not necessary for PITR, > only for performance (the option not to write pre-images without fear of > data loss). Yep. > > Yes, there are a few places where we actually create a file, and if the > > server crashes, the file remains out there forever. We need to track that > > better. > > OK, there is a bigger problem then just tracking the file though. We > sometimes do stuff to that file that we don't log. We assume that if we > commit, the file must be OK and will not need replay because the > transaction would not have committed if the file was not in a commitable > state. If we abort, the system never sees the file, so in a sense we > undo everything we did to the file. It is a kind of poor-man's rollback > for certain operations, like btbuild, create table, etc. But it means > that we cannot recover the file from the log, even after a commit. Yep. > > > > > > > > ?1.1.1 - CREATE DATABASE is also unlogged > > > > > > This will cause the same replay problems as above. > > > > Yep. Again, seems a master cleanup on startup is needed. > > The cleanup is not the problem, only a nuisance. Creating the files > during replay is the problem. I must recreate CREATE DATABASE from the > log exactly as it was done originally. I think just logging the > parameters to the command function should be sufficient, but I need to > think more about it. OK, makes sense. Nice when you can bundle a complex action into the logging of one command and its parameters. > > > > > b) If TRUNCATE TABLE fails, the system must PANIC. Otherwise, the table > > > may be used in a future command, and a replay-recovered database may > > > end-up with different data than the original. > > > > We number based on oids. 
You mean oid wraparound could cause the file > > to be used again? > > That's not what I meant. Let's say I issue 'TRUNCATE TABLE foo'. Then, > right before smgrtruncate is called, I do an XLogInsert saying "Redo a > TRUNCATE TABLE on foo to nblocks if we crash". Then smgrtruncate fails > and we do an elog(ERROR) > > Now the user decides that since TRUNCATE TABLE didn't work, he might as > well use the table, so he inserts some records into it, generating log > entries. > > When I replay this log sequence later, what happens if the TRUNCATE > succeeds instead of failing? You mean the user is now accessing a partially truncated table? That's just too weird. I don't see how the WAL would know how far truncation had gone. I see why you would need the panic and it seems acceptable. > I admit that there are other ways of handling it than to PANIC if the > truncate fails. All the ones I can come up with seem to amount to some > kind of ad-hoc UNDO log. Yea, truncate failure seems so rare/impossible to happen, we can do a panic and see if it ever happens to anyone. I bet it will not. Those are usually cases of an OS crash, so it is the same as a panic. > > > WAL must be flushed before truncate as well. > > > > > > WAL does not need to be flushed before create, if we don't mind > > > leaking files sometimes. > > > > Cleanup? > > Yes, we could garbage-collect leaked files. XLogFlush is not that > expensive though, so I don't have an opinion on this yet. Right now, if we do CREATE TABLE, and the backend crashes, I think it leaves a nonreferenced file around. Not something you need to worry about for replication, I guess. We can address it later. > > > c) Redo code should treat writes to non-existent files as an error. > > > Changes affect heap & nbtree AM's. [Check others] > > > > Yep, once you log create/drop, if something doesn't match, it is an > > error, while before, we could ignore it. > > > > > d) rtree [and GiST? WTF is GiST? ] is not logged. A replay recovery of > > > a database should mark all the rtree indices as corrupt. > > > [ actually we should do that now, are we? ] > > > > Known problem. Not sure what is being done. TODO has: > > > > * Add WAL index reliability improvement to non-btree indexes > > > > so it is a known problem, and we aren't doing anything about it. What > > more can I say? ;-) > > Once the other stuff works reliably, I will turn to rtree logging, which > I have looked at somewhat, although I could really use a copy of the > paper it is supposed to be based on. I have not figured out GiST enough > to work on it yet. There has been talk of retiring rtree and using the GIST version of rtree. I thought it had some advantages/disadvantages. I don't remember for sure. > > > Database will do infinite shutdown consistent system recovery from the > > > online logs, if you manually munge the control file to set state == > > > DB_IN_PRODUCTION instead of DB_SHUTDOWNED. > > > > Wow, how did you get so far? > > Because it was almost there to start with :-) > > Besides, it sounded better before I realized there was still a remaining > problem with btree logging to fix. Our major problem is that we have a very few people who like to work at this level in the code. Glad you are around. > > > In the case of LSNLast, we check to see if pd_lsn == the lsn in the > > > last 64 bits of the page. If not, we assume the page is corrupt from > > > a partial write (although it could be something else). > > > > LSN? > > Log Sequence Number (XLogRecPtr) Yep, I remembered later. 
> > > IMPORTANT ASSUMPTION: > > > The OS/disk device will never write both the first part and > > > last part of a block without writing the middle as well. > > > This might be wrong in some cases, but at least it's fast. > > > > > > ?2.2.4 - GUC Variables > > > > > > The user should be able to configure what method is used: > > > > > > block_checking_write_method = [ checksum | torn_page_flag | none ] > > > > > > Which method should be used for blocks we write? > > > > Do we want torn page flag? Seems like a pain to get that on every 512 > > byte section of the 8k page. > > Ok, this section (2.2) was badly written and hard to understand. What I > am proposing is that we put a copy of the log sequence number, which is > at the head of the page, into the 8 byte field that we are creating at > the end of the page, in place of the CRC. The log sequence number > increases every time the page is written (it already does this). I have > called this method 'LSNLast' internally, and the user would call it the > 'torn_page_flag' method. > > So when we read the page, we compare the Log Sequence Number at the > beginning and end of the page, and if they are different we assume a > torn page. Excellent. > This version is weaker than the MS one we were talking about, because it > is not on every 512 byte section of the page, only the beginning and the > end. I'm simply looking for a fast alternative to CRC64, that doesn't > require massive reorganization of the page layout code. Great idea, and cheap. > > > The system needs to be able to prompt the system administrator to feed > > > it more log files. > > > > > > TODO: More here > > > > Yes, we can have someone working on the GUI once the command-line > > interface is defined. > > Yes, and the system must not allow any concurrent activity during > recovery either. So it looks like a standalone backend operation. OK, we can address this capability. > > That is tricky. We have discussed it and your backup has to deal with > > some pretty strange things that can happen while 'tar' is traversing the > > directory. OK, first, are you thinking of having the nightly backup operate at the file system level, or accessing the pages through the PostgreSQL shared buffers? > Even if we shutdown before we copy the file, we don't want a file that Oh, so you are thinking of some kind of tar while the db is shutdown, and using that as the backup? > hasn't been written to in 5 weeks before it was backed up to require > five weeks of old log files to recover. So we need to track that > information somehow, because right now if we scanned the blocks in the > file looking for at the page LSN's, we greatest LSN we would see might > be much older than where it would be safe to recover from. That is the > biggest problem, I think. You are saying, "How do we know what WAL records go with that backup snapshot of the file?" OK, lets assume we are shutdown. You can grab the WAL log info from pg_control using contrib/pg_controldata and that tells you what WAL logs to roll forward when you need to PIT recover that backup later. If you store that info in the first file you backup, you can have that WAL pointer available for later recovery in case you are restoring from that backup. Is that the issue? What seems more complicated is doing the backup while the database is active, and this may be a requirement for a final PITR solution. Some think we can grab the WAL pointer at 'tar' start and replay that on the backup even if the file changes during backup. 
-- 
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 853-3000
  +  If your life is a hard drive,     |  830 Blythe Avenue
  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026
Bruce Momjian wrote:
> You are saying, "How do we know what WAL records go with that backup
> snapshot of the file?"  OK, lets assume we are shutdown.  You can grab
> the WAL log info from pg_control using contrib/pg_controldata and that
> tells you what WAL logs to roll forward when you need to PIT recover
> that backup later.  If you store that info in the first file you backup,
> you can have that WAL pointer available for later recovery in case you
> are restoring from that backup.  Is that the issue?
>
> What seems more complicated is doing the backup while the database is
> active, and this may be a requirement for a final PITR solution.  Some
> think we can grab the WAL pointer at 'tar' start and replay that on the
> backup even if the file changes during backup.

OK, I think I understand live backups now using tar and PITR.  Someone
explained this to me months ago but now I understand it.

First, a key issue is that PostgreSQL doesn't fiddle with individual
items on disk.  It reads an 8k block, modifies it (writes it to WAL if
it hasn't been written to that WAL segment before), and writes it to
disk.  That is key.  (Are there cases where we don't do this, like
pg_controldata?)

OK, so you do a tar backup of a file.  While you are doing the tar,
certain 8k blocks are being modified in the file.  There is no way to
know what blocks are modified as you are doing the tar, and in fact you
could read partial page writes during the tar.  One solution would be
to read the file using the PostgreSQL page buffer, but even then,
getting a stable snapshot of the file would be difficult.

Now, we could lock the table and prevent writes while it is being
backed up, but there is a better way.  We already have pre-change page
images in WAL.  When we do the backup, any page that was modified while
we were backing up is in the WAL.  On restore, we can recover whatever
tar saw of the file, knowing that the WAL page images will recover any
page changes made during the tar.

Now, you mentioned we may not want pre-change page images in WAL
because, with PITR, we can more easily recover from the WAL rather than
having this performance hit for many page writes.  What I suggest is a
way for the backup tar to turn on pre-change page images while the tar
is happening, and turn it off after the tar is done.

We already have this TODO item:

    * Turn off after-change writes if fsync is disabled (?)

No sense in doing after-change WAL writes without fsync.  We could
extend this so those after-change writes could be turned on and off,
allowing full tar backups and PITR recovery.

In fact, for people with reliable hardware, we should already be giving
them the option of turning off pre-change writes.  We don't have a way
of detecting partial page writes, but then again, we can't detect
failures with fsync off anyway so it seems to be the same
vulnerability.  I guess that's why we were going to wrap the effect
into the same variable, but for PITR, I can see wanting fsync always on
and the ability to turn pre-change writes on and off.

-- 
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 853-3000
  +  If your life is a hard drive,     |  830 Blythe Avenue
  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026
J.R.,

Nice first draft and a good read. Was going to comment in-line but thought this method would be easier to follow.

The comments/suggestions below assume that PIT recovery is being performed at the cluster level with a data backup image created by a tar-like utility.

As noted, one of the main problems is knowing where to begin in the log. This can be handled by having backup processing update the control file with the first lsn and log file required. At the time of the backup, this information is or can be made available. The control file can be the last file added to the tar and can contain information spanning the entire backup process.

For data consistency, since the backup is being performed on an active cluster, we have to make sure to mark the end of the backup. On restore, to make the cluster consistent, you have to force the user to perform forward recovery past the point of the backup completion marker in the (archived) log. This can be handled using a backup end log record. The backup end log record would have to contain an identifier unique to this backup. If a user requests to stop PIT recovery before this log record is encountered, consistency is not guaranteed. PIT should either disallow the action or warn of possible / impending doom.

The necessary logging for rtree (and others) insertions/deletions can be added to the base code. Not much of a worry, but I would expect to encounter other missing log items during testing.

The idea of using the last lsn on the page to detect a partial write is used by other dbms systems. You already have that information available so there is no overhead in computing it. Nothing wrong with CRC though.

As for the DB_SHUTDOWNED state, this could be handled by having the backup processing update the control file field to DB_PIT_REQUIRED (or some such identifier). After a restore, users would be blocked from connecting to the cluster's databases until a forward recovery past the backup end log record has completed successfully.

At the end of normal crash recovery, the user has to go digging to identify in-flight transactions still in the system and abort them manually. It would be nice if PIT recovery automatically aborted all in-flight transactions at the end.

As PostgreSQL heads towards forward recovery functionality, it may be wise to add headers to the log files. As the logs from any cluster are identically named, the header would allow unique identification of the file and contents (cluster name, unique log id, id of the prior log file for chaining purposes, lsn ranges, etc). Most helpful.

Just a few notes from the administrative side. PIT recovery should probably offer the user the following actions:

. forward recover to end of logs [and stop]
  Process log files located in the current directory until you read through the last one. Allow the user the option to stop or not, just in case the logs are archived. Send back the timestamp of the last encountered commit log record and the series of log files scanned.

. forward recover to PIT [and stop]
  Similar to that described above but use the commit timestamps to gauge PIT progress.

. forward recover query
  Send back the log series covered and the last commit timestamp encountered.

. forward recover stop
  Stop the current forward recovery session. Undo all in-flight transactions and bring the databases down in a consistent state.

No other external user actions should be required.

Looking forward to reading draft 2.

Cheers,
Patrick

--
Patrick Macdonald
Red Hat Canada

"J. R.
Nield" wrote: > > Hello: > > I've got the logging system to the point where I can take a shutdown > consistent copy of a system, and play forward through multiple > checkpoints. It seems to handle CREATE TABLE/DROP TABLE/TRUNCATE > properly, and things are moving forward well. Recovery to an arbitrary > point-in-time should be just as easy, but will need some administrative > interface for it. > > At this point, some input would be useful on how I should handle things. > > The most important questions that need answering are in sections 2 & 5, > since they impact the most other parts of the system. They will also > require good documentation for sysadmins. > > Issues Outstanding for Point In Time Recovery (PITR) > > $Date: 2002/07/04 14:23:37 $ > > $Revision: 1.4 $ > > J.R. Nield > > (Enc: ISO 8859-15 Latin-9) > > §0 - Introduction > > This file is where I'm keeping track of all the issues I run into while > trying to get PITR to work properly. Hopefully it will evolve into a > description of how PITR actually works once it is implemented. > > I will also try to add feedback as it comes in. > > The big items so-far are: > §1 - Logging Relation file creation, truncation, and removal > This is mostly done. Can do infinte play-forward from > online logs. > §2 - Partial-Write and Bad Block detection > Need input before starting. Migration issues. > §3 - Detecting Shutdown Consistent System Recovery > Mostly done. > §4 - Interactive Play-Forward Recovery for an Entire System > Need input before starting. > §5 - Individual file consistent recovery > Need input. Semi-Major changes required. > > §1 - Logging Relation file creation, truncation, and removal > > §1.1 - Problem: > > Without file creation in the log, we can't replay committed > transactions that create relations. > > The current code assumes that any transaction reaching commit has already > ensured it's files exist, and that those files will never be removed. This > is true now, but not for log-replay from an old backup database system. > > The current XLOG code silently ignores block-write requests for > non-existent files, and assumes that the transaction generating those > requests must have aborted. > > Right now a crash during TRUNCATE TABLE will leave the table in an > inconsistent state (partially truncated). This would not work when doing > replay from before the last checkpoint. > > §1.1.1 - CREATE DATABASE is also unlogged > > This will cause the same replay problems as above. > > §1.2 - Proposal: > > a) Augment the SMGR code to log relation file operations, and to handle > redo requests properly. This is simple in the case of create. Drop must be > logged only IN the commit record. For truncate see (b). > > The 'struct f_smgr' needs new operations 'smgr_recreate', 'smgr_reunlink', > and 'smgr_retruncate'. smgr_recreate should accept a RelFileNode instead > of a Relation. > > Transactions that abort through system failure (ie. unlogged aborts) > will simply continue to leak files. > > b) If TRUNCATE TABLE fails, the system must PANIC. Otherwise, the table > may be used in a future command, and a replay-recovered database may > end-up with different data than the original. > > WAL must be flushed before truncate as well. > > WAL does not need to be flushed before create, if we don't mind > leaking files sometimes. > > c) Redo code should treat writes to non-existent files as an error. > Changes affect heap & nbtree AM's. [Check others] > > d) rtree [and GiST? WTF is GiST? ] is not logged. 
A replay recovery of > a database should mark all the rtree indices as corrupt. > [ actually we should do that now, are we? ] > > e) CREATE DATABASE must be logged properly, not use system(cp...) > > §1.3 - Status: > > All logged SMGR operations are now in a START_CRIT_SECTION()/ > END_CRIT_SECTION() pair enclosing the XLogInsert() and the underlying fs > operations. > > Code has been added to smgr and xact modules to log: > create (no XLogFlush) > truncate (XLogFlush) > pending deletes on commit record > files to delete on abort record > > Code added to md.c to support redo ops > > Code added to smgr for RMGR redo/desc callbacks > > Code added to xact RMGR callbacks for redo/desc > > Database will do infinite shutdown consistent system recovery from the > online logs, if you manually munge the control file to set state == > DB_IN_PRODUCTION instead of DB_SHUTDOWNED. > > Still need to do: > Item (c), recovery cleanup in all AM's > Item (d), logging in other index AM's > Item (e), CREATE DATABASE stuff > > §2 - Partial-Write and Bad Block detection > > §2.1 - Problem: > > In order to protect against partial writes without logging pages > twice, we need to detect partial pages in system files and report them > to the system administrator. We also might want to be able to detect > damaged pages from other causes, like memory corruption, OS errors, > etc. or in the case where the disk doesn't report bad blocks, but > returns bad data. > > We should also decide what should happen when a file is marked as > containing corrupt pages, and requires log-archive recovery from a > backup. > > §2.2 - Proposal: > > Add a 1 byte 'pd_flags' field to PageHeaderData, with the following > flag definitions: > > PD_BLOCK_CHECKING (1) > PD_BC_METHOD_BIT (1<<1) > > PageHasBlockChecking(page) ((page)->pd_flags & PD_BLOCK_CHECKING) > PageBCMethodIsCRC64(page) ((page)->pd_flags & PD_BC_METHOD_BIT) > PageBCMethodIsLSNLast(page) (!PageBCMethodIsCRC64(page)) > > The last 64 bits of a page are reserved for use by the block checking > code. > > [ Is it worth the trouble to allow the last 8 bytes of a > page to contain data when block checking is turned off for a Page? > > This proposal does not allow that. ] > > If the block checking method is CRC64, then that field will contain > the CRC64 of the block computed at write time. > > If the block checking method is LSNLast, then the field contains a > duplicate of the pd_lsn field. > > §2.2.1 - Changes to Page handling routines > > All the page handling routines need to understand that > pd_special == (pd_special - (specialSize + 8)) > > Change header comment in bufpage.h to reflect this. > > §2.2.2 - When Reading a Page > > Block corruption is detected on read in the obvious way with CRC64. > > In the case of LSNLast, we check to see if pd_lsn == the lsn in the > last 64 bits of the page. If not, we assume the page is corrupt from > a partial write (although it could be something else). > > IMPORTANT ASSUMPTION: > The OS/disk device will never write both the first part and > last part of a block without writing the middle as well. > This might be wrong in some cases, but at least it's fast. > > §2.2.4 - GUC Variables > > The user should be able to configure what method is used: > > block_checking_write_method = [ checksum | torn_page_flag | none ] > > Which method should be used for blocks we write? > > check_blocks_on_read = [ true | false ] > > When true, verify that the blocks we read are not corrupt, using > whatever method is in the block header. 
> > When false, ignore the block checking information. > > §2.3 - Status: > > Waiting for input from pgsql-hackers. > > Questions: > > Should we allow the user to have more detailed control over > which parts of a database use block checking? > > For example: use 'checksum' on all system catalogs in all databases, > 'torn_page_flag' on the non-catalog parts of the production database, > and 'none' on everything else? > > §3 - Detecting Shutdown Consistent System Recovery > > §3.1 - Problem: > > How to notice that we need to do log-replay for a system backup, when the > restored control file points to a shutdown checkpoint record that is > before the most recent checkpoint record in the log, and may point into > an archived file. > > §3.2 - Proposal: > > At startup, after reading the ControlFile, scan the log directory to > get the list of active log files, and find the lowest logId and > logSeg of the files. Ensure that the files cover a contiguous range > of LSN's. > > There are three cases: > > 1) ControlFile points to the last valid checkpoint (either > checkPoint or prevCheckPoint, but one of them is the greatest > valid checkpoint record in the log stream). > > 2) ControlFile points to a valid checkpoint record in an active > log file, but there are more valid checkpoint records beyond > it. > > 3) ControlFile points to a checkpoint record that should be in the > archive logs, and is presumably valid. > > Case 1 is what we handle now. > > Cases 2 and 3 would result from restoring an entire system from > backup in preparation to do a play-forward recovery. > > We need to: > > Detect cases 2 and 3. > > Alert the administrator and abort startup. > [Question: Is this always the desired behavior? We can > handle case 2 without intervention. ] > > Let the administrator start a standalone backend, and > perform a play-forward recovery for the system. > > §3.3 - Status: > > In progress. > > §4 - Interactive Play-Forward Recovery for an Entire System > > Play-Forward File Recovery from a backup file must be interactive, > because not all log files that we need are necessarily in the > archive directory. It may be possible that not all the archive files > we need can even fit on disk at one time. > > The system needs to be able to prompt the system administrator to feed > it more log files. > > TODO: More here > > §5 - Individual file consistent recovery > > §5.1 - Problem: > > If a file detects corruption, and we restore it from backup, how do > we know what archived files we need for recovery? > > Should file corruption (partial write, bad disk block, etc.) outside > the system catalog cause us to abort the system, or should we just > take the relation or database off-line? > > Given a backup file, how do we determine the point in the log > where we should start recovery for the file? What is the highest LSN > we can use that will fully recover the file? > > §5.2 - Proposal: > > Put a file header on each file, and update that header to the last > checkpoint LSN at least once every 'file_lsn_time_slack' minutes, or > at least once every dbsize/'file_lsn_log_slack' megabytes of log > written, where dbsize is the estimated size of the database. Have > these values be settable from the config file. These updates would be > distributed throughout the hour, or interspersed between regular > amounts of log generation. > > If we have a database backup program or command, it can update the > header on the file before backup to the greatest value it can assure > to be safe. 
> > §5.3 - Status: > > Waiting for input from pgsql-hackers. > > Questions: > > There are alternate methods than using a file header to get a > known-good LSN lower bound for the starting point to recover a backup > file. Is this the best way? > > A) The Definitions > > This stuff is obtuse, but I need it here to keep track of what I'm > saying. Someday I should use it consistently in the rest of this > document. > > "system" or "database system": > > A collection of postgres "databases" in one $PGDATA directory, > managed by one postmaster instance at a time (and having one WAL > log, etc.) > > All the files composing such a system, as a group. > > "up to date" or "now" or "current" or "current LSN": > > The most recent durable LSN for the system. > > "block consistent copy": > > When referring to a file: > > A copy of a file, which may be written to during the process of > copying, but where each BLCKSZ size block is copied atomically. > > When referring to multiple files (in the same system): > > A copy of all the files, such that each is independently a "block > consistent copy" > > "file consistent copy": > > When referring to a file: > > A copy of a file that is not written to between the start and end > of the copy operation. > > When referring to multiple files (in the same system): > > A copy of all the files, such that each is independently a "file > consistent copy" > > "system consistent copy": > > When referring to a file: > > A copy of a file, where the entire system of which it is a member > is not written to during the copy. > > When referring to multiple files (in the same system): > > A copy of all the files, where the entire system of which they are > members was not written to between the start and end of the > copying of all the files, as a group. > > "shutdown consistent copy": > > When referring to a file: > > A copy of a file, where the entire system of which it is a member > had been cleanly shutdown before the start of and for the duration > of the copy. > > When referring to multiple files (in the same system): > > A copy of all the files, where the entire system of which they are > members had been cleanly shutdown before the start of and for the > duration of the copying of all the files, as a group. > > "consistent copy": > > A block, file, system, or shutdown consistent copy. > > "known-good LSN lower bound" > or "LSN lower bound" > or "LSN-LB": > > When referring to a group of blocks, a file, or a group of files: > > An LSN known to be old enough that no log entries before it are needed > to bring the blocks or files up-to-date. > > "known-good LSN greatest lower bound" > or "LSN greatest lower bound" > or "LSN-GLB": > > When referring to a group of blocks, a file, or a group of files: > > The greatest possible LSN that is a known-good LSN lower bound for > the group. > > "backup file": > > A consistent copy of a data file used by the system, for which > we have a known-good LSN lower bound. > > "optimal backup file": > > A backup file, for which we have the known-good LSN greatest lower > bound. > > "backup system": > > > "Play-Forward File Recovery" or "PFFR": > > The process of bringing an individual backup file up to date. > > > -- > J. R. Nield > jrnield@usol.com > > ---------------------------(end of broadcast)--------------------------- > TIP 3: if posting/reading through Usenet, please send an appropriate > subscribe-nomail command to majordomo@postgresql.org so that your > message can get through to the mailing list cleanly
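A sketch of Patrick's suggestion earlier in this message: have backup processing capture, at backup start, the first LSN and log file that recovery will require, and add that snapshot as the last file written into the tar. The struct layout, field names, and helpers below are invented for illustration; the real control file carries many more fields.

    #include <stdint.h>
    #include <stdio.h>

    /* Fields a backup would need to anchor a later roll-forward; names are loose. */
    typedef struct
    {
        uint32_t redo_logid;    /* WAL file holding the checkpoint redo record */
        uint32_t redo_recoff;   /* offset of that record within the file       */
        uint64_t backup_id;     /* ties this snapshot to a backup end record   */
    } BackupSnapshot;

    /* Captured when the backup starts (ideally just after a forced checkpoint)... */
    static BackupSnapshot
    snapshot_at_backup_start(uint32_t logid, uint32_t recoff, uint64_t id)
    {
        BackupSnapshot s = { logid, recoff, id };
        return s;
    }

    /* ...and written out as the very last member added to the tar archive. */
    static int
    write_snapshot(const BackupSnapshot *s, const char *path)
    {
        FILE *f = fopen(path, "wb");

        if (f == NULL)
            return -1;
        fwrite(s, sizeof(*s), 1, f);
        return fclose(f);
    }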
Patrick Macdonald wrote:
> The idea of using the last lsn on the page to detect a partial write is used by other dbms systems. You already have that information available so there is no overhead in computing it. Nothing wrong with CRC though.

Agreed. Just thought I would point out that it is not guaranteed. Suppose the 8k block is spread over 16 512-byte sectors in two cylinders. The OS or SCSI tagged queuing could write the second part of the page (sectors 9-16) before the first group (1-8). If it writes 9-16, then writes 1-8 but fails in the middle of 1-8, the LSN will match at the front and back of the page, but the page will be partially written.

-- Bruce Momjian | http://candle.pha.pa.us pgman@candle.pha.pa.us | (610) 853-3000+ If your life is a hard drive, | 830 Blythe Avenue + Christ can be your backup. | Drexel Hill, Pennsylvania19026
> OK, so you do a tar backup of a file. While you are doing the tar, > certain 8k blocks are being modified in the file. There is no way to > know what blocks are modified as you are doing the tar, and in fact you > could read partial page writes during the tar. No, I think all OS's (Unix and NT at least) guard against this, as long as the whole 8k block is written in one call. It is only the physical layer (disk) that is prone to partial writes. > any page that was modified while we were backing up is in the WAL. On > restore, we can recover whatever tar saw of the file, knowing that the > WAL page images will recover any page changes made during the tar. Assuming above, I do not think this is necessary. > What I suggest is a way for the backup tar to turn on pre-change page > images while the tar is happening, and turn it off after the tar is > done. Again, I do not think this is necessary. Andreas
> As noted, one of the main problems is knowing where to begin in the log. This can be handled by having backup processing update the control file with the first lsn and log file required. At the time of the backup, this information is or can be made available. The control file can be the last file added to the tar and can contain information spanning the entire backup process.

The lsn and logfile number (of the latest checkpoints) are already in the control file, thus you need the control file from the start of the backup. (To reduce the number of logs needed to restore an online backup, you could force a checkpoint before starting the file backup.)

You will also need the lsn and logfile number from after the file backup, to know how much log at least needs to be replayed to regain a consistent state.

Andreas
Zeugswetter Andreas SB SD wrote:
> > As noted, one of the main problems is knowing where to begin in the log. This can be handled by having backup processing update the control file with the first lsn and log file required. At the time of the backup, this information is or can be made available. The control file can be the last file added to the tar and can contain information spanning the entire backup process.
>
> The lsn and logfile number (of the latest checkpoints) are already in the control file, thus you need the control file from the start of the backup. (To reduce the number of logs needed to restore an online backup, you could force a checkpoint before starting the file backup.)

Maybe I should have been clearer. The control file snapshot must be taken at backup start (as you mention) but can be stored in cache. The fields can then be modified as we see fit. At the end of backup, we can write this to a temp file and add it to the tar. Therefore, as mentioned, the snapshot spans the entire backup process.

> You will also need the lsn and logfile number from after the file backup, to know how much log at least needs to be replayed to regain a consistent state.

This is a nicety but not a necessity. If you have a backup end log record, you just have to enforce that the PIT recovery encounters that particular log record on forward recovery. Once encountered, you know that you are past the point of backup end.

Cheers,
Patrick
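A sketch of the backup end log record being discussed: a marker carrying an identifier unique to one backup run, which forward recovery must replay past before the restored cluster can be considered consistent. The struct layout and function names are invented for illustration, not an agreed WAL record format.

    #include <stdbool.h>
    #include <stdint.h>

    /* Invented layout for a "backup end" WAL record. */
    typedef struct
    {
        uint64_t backup_id;    /* identifier unique to one backup run       */
        uint32_t end_time;     /* wall-clock time the file copying finished */
    } xl_backup_end;

    /* Recovery may only honor the user's stop request once the matching
     * backup end record has been replayed; stopping earlier would leave the
     * restored cluster inconsistent. */
    static bool
    pitr_stop_allowed(bool backup_end_seen, bool reached_user_target)
    {
        return backup_end_seen && reached_user_target;
    }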
Zeugswetter Andreas SB SD wrote:
> > OK, so you do a tar backup of a file. While you are doing the tar, certain 8k blocks are being modified in the file. There is no way to know what blocks are modified as you are doing the tar, and in fact you could read partial page writes during the tar.
>
> No, I think all OS's (Unix and NT at least) guard against this, as long as the whole 8k block is written in one call. It is only the physical layer (disk) that is prone to partial writes.

Yes, good point. The kernel will present a unified view of the 8k block. Of course, there are still cases where 8k blocks are being changed ahead of or behind tar's current position in the file. Will WAL allow us to re-synchronize that file even if part of it has pages from an earlier point in time than other pages? Uh, I think so. So maybe we don't need the pre-write images in WAL after all.

Can we replay the WAL when some pages in the restored file _have_ the WAL changes and some don't? Maybe the LSN on the pages helps with this.

-- Bruce Momjian | http://candle.pha.pa.us pgman@candle.pha.pa.us | (610) 853-3000+ If your life is a hard drive, | 830 Blythe Avenue + Christ can be your backup. | Drexel Hill, Pennsylvania19026
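A sketch of how the page LSN answers the question above: redo applies a record to a page only if the page has not already seen it, so pages that tar copied late (already containing a change) are skipped and pages copied early are brought forward. The types below are simplified stand-ins for the backend's, used only to show the comparison.

    #include <stdbool.h>
    #include <stdint.h>

    typedef uint64_t Lsn;          /* simplified stand-in for XLogRecPtr */

    typedef struct
    {
        Lsn pd_lsn;                /* LSN of the last record applied to this page */
        /* ... rest of the page header and contents ... */
    } PageSketch;

    /* Replay is idempotent: apply a record only to pages that predate it. */
    static bool
    redo_should_apply(const PageSketch *page, Lsn record_lsn)
    {
        return page->pd_lsn < record_lsn;
    }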
I know that in Oracle there are 'alter database begin backup' and 'alter database end backup' commands that allow you to script your hot backups through a cron job by calling the begin backup command first, then using disk backup method of choice and then finally call the end backup command.

--Barry
On Mon, 2002-07-08 at 21:53, Barry Lind wrote:
> I know that in Oracle there are 'alter database begin backup' and 'alter database end backup' commands that allow you to script your hot backups through a cron job by calling the begin backup command first, then using disk backup method of choice and then finally call the end backup command.

This gave me an idea of a not-too-difficult-to-implement way of doing consistent online backups (thanks to MVCC it is probably much easier than Oracle's):

Backup:

1) record the lowest uncommitted transaction number (LUTN); this may have problems with wraparound, but I guess they are solvable.
2) disallow VACUUM and do a CHECKPOINT ('alter database begin backup')
3) make a file-level (.tar) backup of the data directory.
4) allow VACUUM ('alter database end backup')

Restore:

1) restore the data directory from the file-level backup
2) mark all transactions committed after LUTN as aborted, effectively deleting all tuples inserted and resurrecting those deleted/updated after the start of the backup.
3) make sure that the new transaction number is large enough.

PS. It would be nice if our OID-based filenames had some type indicator in their names - it is usually a waste of time and space to back up indexes and temp tables. The names could be of the form pg_class.relkind:pg_class.relfilenode instead of just the pg_class.relfilenode they are now.

-------------------
Hannu
Hannu Krosing <hannu@tm.ee> writes: > 1) record the lowest uncommitted transaction number (LUTN) , this may > have problems with wraparound, but I guess they are solvable. Disllow > VACUUM. Do a CHECKPOINT ('alter database begin backup') > 3) make a file-level (.tar) backup of data directory. > 4) Allow VACUUM. ('alter database end backup') Transactions don't necessarily commit in sequence number order, so the concept of LUTN seems meaningless. Why is it necessary (or even good) to disallow VACUUM? I really dislike a design that allows the DBA to cripple the database by forgetting the last step in a (long) process. regards, tom lane
On Tue, 2002-07-09 at 17:26, Tom Lane wrote:
> Hannu Krosing <hannu@tm.ee> writes:
> > 1) record the lowest uncommitted transaction number (LUTN); this may have problems with wraparound, but I guess they are solvable.
> > 2) disallow VACUUM and do a CHECKPOINT ('alter database begin backup')
> > 3) make a file-level (.tar) backup of the data directory.
> > 4) allow VACUUM ('alter database end backup')
>
> Transactions don't necessarily commit in sequence number order, so the concept of LUTN seems meaningless.

Not quite. It is the simplest way to be sure that if we invalidate all transactions >= it, we get back to a fairly recent Point-In-Time. The real solution would of course be to remember all committed transactions at this PIT, which can probably be done by remembering LUTN and all individual committed transactions > LUTN.

> Why is it necessary (or even good) to disallow VACUUM?

So that it would be possible to resurrect those tuples that have been deleted/updated during the disk-level backup. I would prefer the ability to tell VACUUM not to touch tuples whose deleting transaction number is >= LUTN. IIRC the original postgres was able to do that.

> I really dislike a design that allows the DBA to cripple the database by forgetting the last step in a (long) process.

There are several ways around it.

1. do it in a script that will not forget.
2. closing the session that did 'alter database begin backup' could do it automatically, but this would make the backup script trickier.
3. VACUUM should not block but report a warning about being restricted from running.
4. the database can be instructed to send a message to the DBA's pager if it has been in 'begin backup' state too long ;)

----------------
Hannu
J.R., just checking to see how PITR recovery is going. Do you need any assistance or have any questions for us? Also, do you have any idea how close you are to having something completed? Are you aware we are closing development of 7.3 at the end of August and start beta September 1? Is there any way we can help you? --------------------------------------------------------------------------- J. R. Nield wrote: > On Fri, 2002-07-05 at 01:42, Bruce Momjian wrote: > > > > We have needed > > point-in-time recovery for a long time, > > Most thanks should go to vadim (and whoever else worked on this), since > his WAL code already does most of the work. The key thing is auditing > the backend to look for every case where we assume some action is not > visible until after commit, and therefore don't log its effects. Those > are the main cases that must be changed. -- Bruce Momjian | http://candle.pha.pa.us pgman@candle.pha.pa.us | (610) 853-3000+ If your life is a hard drive, | 830 Blythe Avenue + Christ can be your backup. | Drexel Hill, Pennsylvania19026
On Tue, 2002-07-16 at 15:36, Bruce Momjian wrote: > > J.R., just checking to see how PITR recovery is going. Do you need any > assistance or have any questions for us? > > Also, do you have any idea how close you are to having something > completed? Are you aware we are closing development of 7.3 at the end > of August and start beta September 1? Is there any way we can help you? > It should be ready to go into CVS by the end of the month. That will include: logging all operations except for rtree and GiST, archival of logfiles (with new postgresql.conf params), headers on the logfiles to verify the system that created them, standalone backend recovery to a point-in-time, and a rudimentary hot backup capability. I could use some advice on the proper way to add tests to configure.in, given that the autoconf output is in CVS. Would you ever want a patch to include the generated 'configure' file? Related to that, the other place I need advice is on adding Ted Tso's LGPL'd UUID library (stolen from e2fsprogs) to the source. Are we allowed to use this? There is a free OSF/DCE spec for UUID's, so I can re-implement the library if required. We also haven't discussed commands for backup/restore, but I will use what I think is appropriate and we can change the grammar if needed. The initial hot-backup capability will require the database to be in read-only mode and use tar for backup, and I will add the ability to allow writes later. Does this sound like a reasonable timeframe/feature-set to make the 7.3 release? -- J. R. Nield jrnield@usol.com
"J. R. Nield" <jrnield@usol.com> writes: > Related to that, the other place I need advice is on adding Ted Tso's > LGPL'd UUID library (stolen from e2fsprogs) to the source. Are we > allowed to use this? Uh, why exactly is UUID essential for this? (The correct answer is "it's not", IMHO.) > We also haven't discussed commands for backup/restore, but I will use > what I think is appropriate and we can change the grammar if needed. The > initial hot-backup capability will require the database to be in > read-only mode and use tar for backup, and I will add the ability to > allow writes later. There is no read-only mode, and I for one will resist adding one. regards, tom lane
J. R. Nield wrote:
> On Tue, 2002-07-16 at 15:36, Bruce Momjian wrote:
> >
> > J.R., just checking to see how PITR recovery is going. Do you need any assistance or have any questions for us?
> >
> > Also, do you have any idea how close you are to having something completed? Are you aware we are closing development of 7.3 at the end of August and start beta September 1? Is there any way we can help you?
>
> It should be ready to go into CVS by the end of the month.
>
> That will include: logging all operations except for rtree and GiST, archival of logfiles (with new postgresql.conf params), headers on the logfiles to verify the system that created them, standalone backend recovery to a point-in-time, and a rudimentary hot backup capability.

Sounds great. That gives us another month to iron out any remaining issues. This will be a great 7.3 feature!

> I could use some advice on the proper way to add tests to configure.in, given that the autoconf output is in CVS. Would you ever want a patch to include the generated 'configure' file?

We only patch configure.in. If you post to hackers, they can give you assistance and I will try to help however I can. I can do some configure.in stuff for you myself.

> Related to that, the other place I need advice is on adding Ted Tso's LGPL'd UUID library (stolen from e2fsprogs) to the source. Are we allowed to use this? There is a free OSF/DCE spec for UUID's, so I can re-implement the library if required.

We talked about this on the replication mailing list. We decided that hostname, properly hashed to an integer, was the proper way to get this value. Also, there should be a postgresql.conf variable so you can override the hostname-generated value if you wish. I think that is sufficient.

> We also haven't discussed commands for backup/restore, but I will use what I think is appropriate and we can change the grammar if needed. The initial hot-backup capability will require the database to be in read-only mode and use tar for backup, and I will add the ability to allow writes later.

Yea, I saw Tom balked at that. I think we have enough manpower and time that we can get hot backup in normal read/write mode working before 7.3 beta, so I would just code it assuming the system is live and we can deal with making it hot-capable once it is in CVS. It doesn't have to work 100% until beta time.

> Does this sound like a reasonable timeframe/feature-set to make the 7.3 release?

Sounds great. This is another killer 7.3 feature, and we really need this for greater enterprise acceptance of PostgreSQL.

-- Bruce Momjian | http://candle.pha.pa.us pgman@candle.pha.pa.us | (610) 853-3000+ If your life is a hard drive, | 830 Blythe Avenue + Christ can be your backup. | Drexel Hill, Pennsylvania19026
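A sketch of the hostname-derived system identifier mentioned above, with a configuration value taking precedence when set. The hash function, the zero-means-unset convention, and the parameter handling are illustrative assumptions, not an agreed design.

    #include <stdint.h>
    #include <unistd.h>

    /* FNV-1a style hash of the hostname; the constants are the usual FNV ones. */
    static uint32_t
    hash_hostname(void)
    {
        char     name[256] = "localhost";
        uint32_t h = 2166136261u;

        (void) gethostname(name, sizeof(name) - 1);
        for (const char *p = name; *p; p++)
            h = (h ^ (uint32_t) (unsigned char) *p) * 16777619u;
        return h;
    }

    /* A (hypothetical) postgresql.conf value overrides the derived hash. */
    static uint32_t
    system_identifier(uint32_t configured_id)
    {
        return configured_id != 0 ? configured_id : hash_hostname();
    }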
We also have implemented a roll forward recovery mechanism. We modified a 7.2.1 version of Postgres. The mechanism is designed to provide a means of recovering from the loss or corruption of media. It provides for duplicating wal_files so that if a wal_file is lost, roll forward recovery can recover the database using the duplicated wal_files. Hooks were also added so that the roll forward recovery mechanism can be used to implement a hot standby database. Along with the roll forward recovery mechanism we have also implemented an online database backup utility which is synchronized with the recovery log so that the backup can be a starting point for a roll forward recovery session.

Roll forward recovery is enabled for a database cluster by specifying one of two postmaster configuration parameters.

wal_file_reuse = false
  This parameter tells the wal system to not reuse wal files. This option is intended for sites wishing to implement a hot standby database, where wal files will be periodically copied to another machine and rolled forward into the standby database.

wal_file_duplicate = <directory_path>
  This parameter tells the system to mirror the files in the pg_xlog directory to the specified directory. This allows for the recovery of a database where a wal file has been damaged or lost. It also allows for a variant of a hot standby database where the duplicate directory is the pg_xlog directory of the standby database.

Since both of these options cause wal files to accumulate indefinitely, the DBA needs a means of purging wal files when they are no longer needed. So an SQL command, "ALTER SYSTEM PURGE WAL_FILES <wal_file_name>", has also been implemented. This command deletes all wal files up to and including the specified <wal_file_name>, as long as those wal files are not needed to recover the database in the event of a system crash.

To find out the status of the wal files, a function has been implemented to return a wal file name. The function is:

Wal_file(<request_type>)
Request_type := [ current | last | checkpoint | oldest ]

Wal_file('current') returns the name of the log file currently being written to.
Wal_file('last') returns the name of the last log file filled.
Wal_file('checkpoint') returns the name of the file containing the current redo position. The current redo position is the position in the recovery log where crash recovery would start if the system were to crash now. Logs prior to this one will not be needed to recover the database cluster and could be safely removed.
Wal_file('oldest') returns the oldest xlog file found in the pg_xlog directory.

To actually perform a roll forward you use the postmaster configuration parameter "roll_forward=yes". This parameter tells the startup process to perform crash recovery even though the state of the database as found in the pg_control file indicates a normal shutdown. This is necessary since the starting point of a roll forward session could be the restore of a database cluster that was shut down in order to back it up. Furthermore, this parameter tells the startup process not to write out a checkpoint record at the end of the roll forward session. This allows the database cluster to receive subsequent wal files and to have those rolled forward as well. When the postmaster is started with the roll_forward=yes option, it shuts down the database as soon as the startup process completes.
So the idea is to restore a backup, copy all of your saved/duplicated wal files into the pg_xlog directory of the restored database, and start the postmaster with the roll_forward option. For point-in-time recovery there is also a roll_forward_until = <time>, which rolls forward through the wal files until the first transaction commit record whose timestamp is greater than or equal to the specified time.

The pg_copy utility performs an online copy of a database cluster. Its syntax is:

pg_copy <backup_directory> [-h host] [-p port] ...

This makes a copy of the database, where backup_directory is what you would set PGDATA to in order to start a postmaster against the backup copy. The database can be being updated while the copy occurs. If you start a postmaster against this copy, it will appear to the startup process as a database that crashed at the instant the pg_copy operation completed. Furthermore, the pg_copy utility automatically removes any wal files not needed to recover the database from either the pg_xlog directory or the wal_file_duplicate directory. So, to protect the database from media loss, a DBA just needs to set the wal_file_duplicate parameter and periodically pg_copy the database.

The BIG THING we have not done is address the issue that add/drop tables and indexes do not propagate through the roll forward recovery mechanism properly.

-regards
Richard Tucker
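A sketch of the roll_forward_until rule described above: replay stops at the first transaction commit record whose timestamp is at or after the requested time. The record layout below is a simplified stand-in for whatever the commit record actually carries; only the comparison is the point.

    #include <stdbool.h>
    #include <time.h>

    /* Simplified view of a WAL record as seen by the roll-forward loop. */
    typedef struct
    {
        bool   is_commit;      /* true for a transaction commit record */
        time_t commit_time;    /* timestamp stored with the commit     */
    } ReplayRecordSketch;

    /* Stop before applying the first commit at or after the requested time. */
    static bool
    stop_before_applying(const ReplayRecordSketch *rec, time_t roll_forward_until)
    {
        return rec->is_commit && rec->commit_time >= roll_forward_until;
    }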
Richard Tucker wrote: > > We also have implemented a roll forward recovery mechanism. We modified a > 7.2.1 version of Postgres. > The mechanism is designed to provide a means of recoverying from the loss or > corruption of media. It provides for duplicating wal_files so that if a > wal_file is lost roll forward recovery can recover the database using the > duplicated wal_files. Hooks were also added so that the roll forward > recovery mechanism can be used to implement a hot standby database. > Along with the roll forward recovery mechanism we have also implemented an > online database backup utility which is synchronized with the recovery log > so that the backup can be a starting point for a roll forward recovery > session. In researching who has done this work, I found that http://www.multera.com/ is Progress Software. I assume this "Distributed but Connected" is partly based on PostgreSQL. -- Bruce Momjian | http://candle.pha.pa.us pgman@candle.pha.pa.us | (610) 853-3000+ If your life is a hard drive, | 830 Blythe Avenue + Christ can be your backup. | Drexel Hill, Pennsylvania19026
Bruce Momjian wrote: > Richard Tucker wrote: > > > > We also have implemented a roll forward recovery mechanism. We modified a > > 7.2.1 version of Postgres. > > The mechanism is designed to provide a means of recoverying from the loss or > > corruption of media. It provides for duplicating wal_files so that if a > > wal_file is lost roll forward recovery can recover the database using the > > duplicated wal_files. Hooks were also added so that the roll forward > > recovery mechanism can be used to implement a hot standby database. > > Along with the roll forward recovery mechanism we have also implemented an > > online database backup utility which is synchronized with the recovery log > > so that the backup can be a starting point for a roll forward recovery > > session. > > In researching who has done this work, I found that > http://www.multera.com/ is Progress Software. I assume this > "Distributed but Connected" is partly based on PostgreSQL. Oh, I see it now, PostgreSQL is right there: * OuterEdge A distributed site technology suite that allows remote users to maintain the integrity, not the congestion and price, of centralized data. The OuterEdge includes the affordable UltraSQL^(TM) database server powered by PostgreSQL, the third most popular database, and a ^^^^^^^^^^^^^^^^^^^^^ Secure Application/web server that supports the industry's most popular Internet languages. Together with the Replication Engine software, the OuterEdge gives remote users seamless access to important data without relying on bandwidth. Wonder why we are "the third most popular database". I think that's good? -- Bruce Momjian | http://candle.pha.pa.us pgman@candle.pha.pa.us | (610) 853-3000+ If your life is a hard drive, | 830 Blythe Avenue + Christ can be your backup. | Drexel Hill, Pennsylvania19026
> server powered by PostgreSQL, the third most popular database, and a > > ^^^^^^^^^^^^^^^^^^^^^ > Wonder why we are "the third most popular database". I think that's > good? You'll notice they didn't qualify where. On this list, it's probably #1. Within Progress software perhaps we're third most popular (whatever two are typically used in the InnerEdge are 1 and 2).
Rod Taylor wrote: > > server powered by PostgreSQL, the third most popular database, and a > > > > ^^^^^^^^^^^^^^^^^^^^^ > > > Wonder why we are "the third most popular database". I think that's > > good? > > You'll notice they didn't qualify where. On this list, it's probably > #1. Within Progress software perhaps we're third most popular (whatever > two are typically used in the InnerEdge are 1 and 2). Yea, using that logic, #1 would be the Progress internal db system, #2 would be MySQL (though that seems doubtful at this point with Nusphere), and PostgreSQL. Actually, PostgreSQL is #1 in my home. -- Bruce Momjian | http://candle.pha.pa.us pgman@candle.pha.pa.us | (610) 853-3000+ If your life is a hard drive, | 830 Blythe Avenue + Christ can be your backup. | Drexel Hill, Pennsylvania19026
On July 17, 2002 05:45 pm, Richard Tucker wrote: > We also have implemented a roll forward recovery mechanism. We modified a > 7.2.1 version of Postgres. > ... Excellent! I can't wait. When will it be in current? > The BIG THING we have not done is address the issue that add/drop tables > and indexes do not propagate through the roll forward recovery mechanism > properly. I can live with that. Schemas shouldn't change so often that you can't just dup any changes to the backup(s). -- D'Arcy J.M. Cain <darcy@{druid|vex}.net> | Democracy is three wolves http://www.druid.net/darcy/ | and a sheep voting on +1 416 425 1212 (DoD#0082) (eNTP) | what's for dinner.
I don't know how our marketing came up third most popular but I think the order is, Oracle, MySQL, and PostgreSQL or maybe Oracle, MSSQL and PostgreSQL. I'm sure there is some criterion by which PostgreSQL is tenth and by some other its number one. Of course, my posting was about Point In Time Recovery and not multera marketing spin.

-regards
richt
On Wed, 2002-07-17 at 01:25, Bruce Momjian wrote:
> We only patch configure.in. If you post to hackers, they can give you assistance and I will try to help however I can. I can do some configure.in stuff for you myself.

Thanks for the offer. The only thing I was changing it for was to test whether and how to get an ethernet MAC address using ioctl, so libuuid could use it if available. That is dropped now.

> > Related to that, the other place I need advice is on adding Ted Tso's LGPL'd UUID library (stolen from e2fsprogs) to the source. Are we allowed to use this? There is a free OSF/DCE spec for UUID's, so I can re-implement the library if required.
>
> We talked about this on the replication mailing list. We decided that hostname, properly hashed to an integer, was the proper way to get this value. Also, there should be a postgresql.conf variable so you can override the hostname-generated value if you wish. I think that is sufficient.

I will do something like this, but reserve 16 bytes for it just in case we change our minds. It needs to be different among systems on the same machine, so there needs to be a time value and a pseudo-random part as well. Also, 'hostname' will likely be the same on many machines (localhost.localdomain or similar).

The only reason I bothered with UUID's before is because they have a standard setup to make the possibility of collision extremely small, and I figured replication will end up using it someday.

> > We also haven't discussed commands for backup/restore, but I will use what I think is appropriate and we can change the grammar if needed. The initial hot-backup capability will require the database to be in read-only mode and use tar for backup, and I will add the ability to allow writes later.
>
> Yea, I saw Tom balked at that. I think we have enough manpower and time that we can get hot backup in normal read/write mode working before 7.3 beta, so I would just code it assuming the system is live and we can deal with making it hot-capable once it is in CVS. It doesn't have to work 100% until beta time.

Hot backup read/write requires that we force an advance in the logfile segment after the backup. We need to save all the logs between backup start and completion. Otherwise the files will be useless as a standalone system if the current logs somehow get destroyed (fire in the machine room, etc.).

The way I would do this is:

  create a checkpoint
  do the block-by-block walk of the files using the bufmgr
  create a second checkpoint
  force the log to advance past the end of the current segment
  save the log segments containing records between the first & second checkpoint with the backup

Then if you restore the backup, you can recover to the point of the second checkpoint, even if the logs since then are all gone.

Right now the log segment size is fixed, so this means that we'd waste 8MB of log space on average to do a backup. Also, the way XLOG reads records right now, we have to write placeholder records into the empty space, because that's how it finds the end of the log stream. So I need to change XLOG to handle "skip records", and then to truncate the file when it gets archived, so we don't have to save up to 16MB of zeros.

Also, if archiving is turned off, then we can't recycle or delete any logs for the duration of the backup, and we have to save them.

So I'll finish the XLOG support for this, and then think about the correct way to walk through all the files.

-- J. R. Nield jrnield@usol.com
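A compilable sketch of the backup sequence listed above. The stub bodies only stand in for work described in the mail (they are not backend routines); the point is the ordering of the steps and which WAL segments end up archived with the backup.

    #include <stdio.h>

    static unsigned log_segment = 3;      /* pretend this is the current WAL segment */

    static void create_checkpoint(void)        { puts("checkpoint"); }
    static unsigned current_log_segment(void)  { return log_segment; }
    static void copy_files_via_bufmgr(void)    { puts("block-by-block copy of relation files"); }
    static void advance_past_segment(void)     { log_segment++; puts("write skip record, start next segment"); }

    static void
    save_log_segments(unsigned first, unsigned last)
    {
        printf("archive WAL segments %u..%u with the backup\n", first, last);
    }

    static void
    hot_backup_sketch(void)
    {
        unsigned first, last;

        create_checkpoint();              /* redo for the backup starts no earlier than here   */
        first = current_log_segment();
        copy_files_via_bufmgr();          /* pages may keep changing underneath the copy       */
        create_checkpoint();              /* bounds the WAL needed to repair those pages       */
        last = current_log_segment();
        advance_past_segment();           /* so the segments saved below are complete          */
        save_log_segments(first, last);   /* backup is self-contained up to the 2nd checkpoint */
    }

    int
    main(void)
    {
        hot_backup_sketch();
        return 0;
    }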
J. R. Nield wrote: > I will do something like this, but reserve 16 bytes for it just in case > we change our minds. It needs to be different among systems on the same > machine, so there needs to be a time value and a pseudo-random part as > well. Also, 'hostname' will likely be the same on many machines > (localhost.localdomain or similar). > > The only reason I bothered with UUID's before is because they have a > standard setup to make the possibility of collision extremely small, and > I figured replication will end up using it someday. Sure. Problem is, we support so many platforms that any trickery is a problem. If they can change it in postgresql.conf, that should be sufficient. > Hot backup read/write requires that we force an advance in the logfile > segment after the backup. We need to save all the logs between backup > start and completion. Otherwise the files will be useless as a > standalone system if the current logs somehow get destroyed (fire in the > machine room, etc.). > > The way I would do this is: > > create a checkpoint > do the block-by-block walk of the files using the bufmgr > create a second checkpoint > force the log to advance past the end of the current segment > save the log segments containing records between the > first & second checkpont with the backup Sounds good. > Then if you restore the backup, you can recover to the point of the > second checkpoint, even if the logs since then are all gone. Good, you put the logs that happened during the backup inside the same backup, make it consistent. Makes sense. > Right now the log segment size is fixed, so this means that we'd waste > 8MB of log space on average to do a backup. Also, the way XLOG reads > records right now, we have to write placeholder records into the empty > space, because that's how it finds the end of the log stream. So I need > to change XLOG to handle "skip records", and then to truncate the file > when it gets archived, so we don't have to save up to 16MB of zeros. > > Also, if archiving is turned off, then we can't recycle or delete any > logs for the duration of the backup, and we have to save them. > > So I'll finish the XLOG support for this, and then think about the > correct way to walk through all the files. Sounds like a good plan. -- Bruce Momjian | http://candle.pha.pa.us pgman@candle.pha.pa.us | (610) 853-3000+ If your life is a hard drive, | 830 Blythe Avenue + Christ can be your backup. | Drexel Hill, Pennsylvania19026
Richard: I can't quite follow this; maybe you sent a draft by accident. If you want to post a patch against 7.2.1, or even better against HEAD in CVS, that would be great. Or if you'd rather point me to your source online, that would be good too. I just want to clarify though: is this work released to the PostgreSQL Development group by Progress and Multera, or do they still claim copyright interest in it? Regards,J.R. Nield On Thu, 2002-07-18 at 12:56, Richard Tucker wrote: > > > -----Original Message----- > From: J. R. Nield [mailto:jrnield@usol.com] > Sent: Wednesday, July 17, 2002 8:13 PM > To: richt@multera.com > Cc: Bruce Momjian > Subject: RE: [HACKERS] Issues Outstanding for Point In Time Recovery > (PITR) > > > On Wed, 2002-07-17 at 19:25, Richard Tucker wrote: > > Regarding hot backup. Our implementation of "pg_copy" does a hot backup. > > It turns off database checkpointing for the duration of the backup. > Backups > > all the files of the database cluster up to the wal file currently being > > logged to. It then acquires the WalInsertLock lock long enough to backup > > the current wal file. > > Does it then allow more writes to that WAL file? It would seem like you > want to advance the log to the next file, so the sysadmin wouldn't have > to choose which one of log-file number 3 he wants to use at restore. > > > Writes to the wal file are allowed during the backup except for the > backing of the wal file current when the > > backup completes. That is the pg_xlog directory is the last directory to > be backed up. The wal_files are backed > > up in the order they were used. Continued wal file logging is allowed > until the backup reaches the current wal > > file being written to. To back up this last wal file the WalInsertLock is > held until the copy of the wal file > > is complete. So the backup stops update activity only long enough to copy > this last 16mb file. > > Also, what do you mean by 'turns off checkpointing'. You have to do a > checkpoint, or at least flush the buffers, when you start the backup. > Otherwise how do you know what LSN to start from at restore? > > > The pg_control file also gets backed up. It contains the point in the log > at which to begin > > the redo/roll forward. > > By not allowing the redo point to advance while the backup goes on means > that the startup processes' crash > > recovery code will capture all the changes made to the database cluster > while the backup was running. > > > Anyway: Yes we'd love to see the code. > > > In what form would you like me to send the code to you e.g. as a patch, > copy our whole source ... > > Since I've pretty-much got create/drop and index stuff working, if your > code does the rest then we should be good to go. > > ;jrnield > > > -- > J. R. Nield > jrnield@usol.com > > > > -- J. R. Nield jrnield@usol.com
Progress/Multera would release the hot backup/roll forward work to the PostgreSQL Development group.

-regards
richt
On Thu, 2002-07-18 at 15:36, Richard Tucker wrote: > Sorry, I didn't delimit my comments correctly before. > I'm offering our pg_copy/roll forward implementation to PostgreSQL.org if it > finds it an acceptable contribution. Progress/Multera would hand over all > rights to any code accepted. > > I'd be willing to post a patch to HEAD -- where can I find instructions on > how to do this? > Instructions for how to get the latest development source are at: http://developer.postgresql.org/TODO/docs/cvs.html If you want to post a context diff against 7.2.1 to pgsql-hackers first, that would let us see what it does. Let me know if there is anything else I can help you with. -- J. R. Nield jrnield@usol.com