Issues Outstanding for Point In Time Recovery (PITR) - Mailing list pgsql-hackers

From J. R. Nield
Subject Issues Outstanding for Point In Time Recovery (PITR)
Date
Msg-id 1025797552.11285.149.camel@localhost.localdomain
Whole thread Raw
Responses Re: Issues Outstanding for Point In Time Recovery (PITR)
Re: Issues Outstanding for Point In Time Recovery (PITR)
Re: Issues Outstanding for Point In Time Recovery (PITR)
List pgsql-hackers
Hello:

I've got the logging system to the point where I can take a shutdown
consistent copy of a system, and play forward through multiple
checkpoints. It seems to handle CREATE TABLE/DROP TABLE/TRUNCATE
properly, and things are moving forward well. Recovery to an arbitrary
point-in-time should be just as easy, but will need some administrative
interface for it.

At this point, some input would be useful on how I should handle things.

The most important questions that need answering are in sections 2 & 5,
since they impact the most other parts of the system. They will also
require good documentation for sysadmins.



Issues Outstanding for Point In Time Recovery (PITR)
   $Date: 2002/07/04 14:23:37 $
   $Revision: 1.4 $
   J.R. Nield
   (Enc: ISO 8859-15 Latin-9)


§0 - Introduction
   This file is where I'm keeping track of all the issues I run into while   trying to get PITR to work properly.
Hopefullyit will evolve into a   description of how PITR actually works once it is implemented. 
   I will also try to add feedback as it comes in.
   The big items so-far are:       §1 - Logging Relation file creation, truncation, and removal            This is
mostlydone. Can do infinte play-forward from           online logs.       §2 - Partial-Write and Bad Block detection
      Need input before starting. Migration issues.       §3 - Detecting Shutdown Consistent System Recovery
Mostlydone.       §4 - Interactive Play-Forward Recovery for an Entire System           Need input before starting.
 §5 - Individual file consistent recovery           Need input. Semi-Major changes required. 

§1 - Logging Relation file creation, truncation, and removal
 §1.1 - Problem:
   Without file creation in the log, we can't replay committed    transactions that create relations.       The current
codeassumes that any transaction reaching commit has already   ensured it's files exist, and that those files will
neverbe removed. This   is true now, but not for log-replay from an old backup database system.       The current XLOG
codesilently ignores block-write requests for   non-existent files, and assumes that the transaction generating those
requestsmust have aborted. 
   Right now a crash during TRUNCATE TABLE will leave the table in an   inconsistent state (partially truncated). This
wouldnot work when doing   replay from before the last checkpoint. 
   §1.1.1 - CREATE DATABASE is also unlogged
     This will cause the same replay problems as above.    §1.2 - Proposal:
   a) Augment the SMGR code to log relation file operations, and to handle   redo requests properly. This is simple in
thecase of create. Drop must be   logged only IN the commit record. For truncate see (b). 
   The 'struct f_smgr' needs new operations 'smgr_recreate', 'smgr_reunlink',   and 'smgr_retruncate'. smgr_recreate
shouldaccept a RelFileNode instead   of a Relation. 
   Transactions that abort through system failure (ie.  unlogged aborts)    will simply continue to leak files.
   b) If TRUNCATE TABLE fails, the system must PANIC. Otherwise, the table   may be used in a future command, and a
replay-recovereddatabase may   end-up with different data than the original. 
   WAL must be flushed before truncate as well.
   WAL does not need to be flushed before create, if we don't mind    leaking files sometimes.
   c) Redo code should treat writes to non-existent files as an error.   Changes affect heap & nbtree AM's. [Check
others]
   d) rtree [and GiST? WTF is GiST? ] is not logged. A replay recovery of     a database should mark all the rtree
indicesas corrupt.     [ actually we should do that now, are we? ] 
   e) CREATE DATABASE must be logged properly, not use system(cp...)
 §1.3 - Status:
   All logged SMGR operations are now in a START_CRIT_SECTION()/   END_CRIT_SECTION() pair enclosing the XLogInsert()
andthe underlying fs   operations. 
   Code has been added to smgr and xact modules to log:       create (no XLogFlush)       truncate (XLogFlush)
pendingdeletes on commit record       files to delete on abort record 
   Code added to md.c to support redo ops
   Code added to smgr for RMGR redo/desc callbacks
   Code added to xact RMGR callbacks for redo/desc
   Database will do infinite shutdown consistent system recovery from the   online logs, if you manually munge the
controlfile to set state ==   DB_IN_PRODUCTION instead of DB_SHUTDOWNED. 
   Still need to do:       Item (c), recovery cleanup in all AM's       Item (d), logging in other index AM's
Item(e), CREATE DATABASE stuff 



§2 - Partial-Write and Bad Block detection
 §2.1 - Problem:
   In order to protect against partial writes without logging pages   twice, we need to detect partial pages in system
filesand report them   to the system administrator. We also might want to be able to detect   damaged pages from other
causes,like memory corruption, OS errors,   etc. or in the case where the disk doesn't report bad blocks, but   returns
baddata. 
   We should also decide what should happen when a file is marked as   containing corrupt pages, and requires
log-archiverecovery from a   backup. 
 §2.2 - Proposal:
   Add a 1 byte 'pd_flags' field to PageHeaderData, with the following   flag definitions:
       PD_BLOCK_CHECKING           (1)       PD_BC_METHOD_BIT            (1<<1)
       PageHasBlockChecking(page)     ((page)->pd_flags & PD_BLOCK_CHECKING)       PageBCMethodIsCRC64(page)
((page)->pd_flags& PD_BC_METHOD_BIT)       PageBCMethodIsLSNLast(page)    (!PageBCMethodIsCRC64(page)) 
   The last 64 bits of a page are reserved for use by the block checking    code.
   [ Is it worth the trouble to allow the last 8 bytes of a     page to contain data when block checking is turned off
fora Page?  
     This proposal does not allow that. ]
   If the block checking method is CRC64, then that field will contain   the CRC64 of the block computed at write time.
   If the block checking method is LSNLast, then the field contains a   duplicate of the pd_lsn field.
   §2.2.1 - Changes to Page handling routines
    All the page handling routines need to understand that     pd_special == (pd_special - (specialSize + 8))
    Change header comment in bufpage.h to reflect this.
   §2.2.2 - When Reading a Page
    Block corruption is detected on read in the obvious way with CRC64.
    In the case of LSNLast, we check to see if pd_lsn == the lsn in the    last 64 bits of the page. If not, we assume
thepage is corrupt from    a partial write (although it could be something else). 
    IMPORTANT ASSUMPTION:       The OS/disk device will never write both the first part and       last part of a block
withoutwriting the middle as well.       This might be wrong in some cases, but at least it's fast. 
   §2.2.4 - GUC Variables
    The user should be able to configure what method is used:
      block_checking_write_method  = [ checksum | torn_page_flag | none ]
         Which method should be used for blocks we write?
      check_blocks_on_read      = [ true | false ]
         When true, verify that the blocks we read are not corrupt, using         whatever method is in the block
header.
         When false, ignore the block checking information.
 §2.3 - Status:
   Waiting for input from pgsql-hackers.
   Questions:
       Should we allow the user to have more detailed control over       which parts of a database use block checking?
       For example: use 'checksum' on all system catalogs in all databases,        'torn_page_flag' on the non-catalog
partsof the production database,       and 'none' on everything else? 

§3 - Detecting Shutdown Consistent System Recovery
   §3.1 - Problem:
    How to notice that we need to do log-replay for a system backup, when the    restored control file points to a
shutdowncheckpoint record that is    before the most recent checkpoint record in the log, and may point into    an
archivedfile. 
   §3.2 - Proposal:
    At startup, after reading the ControlFile, scan the log directory to    get the list of active log files, and find
thelowest logId and    logSeg of the files. Ensure that the files cover a contiguous range    of LSN's. 
    There are three cases:
     1) ControlFile points to the last valid checkpoint (either        checkPoint or prevCheckPoint, but one of them is
thegreatest        valid checkpoint record in the log stream). 
     2) ControlFile points to a valid checkpoint record in an active        log file, but there are more valid
checkpointrecords beyond        it. 
     3) ControlFile points to a checkpoint record that should be in the        archive logs, and is presumably valid.
    Case 1 is what we handle now.
    Cases 2 and 3 would result from restoring an entire system from    backup in preparation to do a play-forward
recovery.
    We need to:
       Detect cases 2 and 3.
       Alert the administrator and abort startup.       [Question: Is this always the desired behavior?  We can
handlecase 2 without intervention. ] 
       Let the administrator start a standalone backend, and       perform a play-forward recovery for the system.
   §3.3 - Status:
      In progress.

§4 - Interactive Play-Forward Recovery for an Entire System
   Play-Forward File Recovery from a backup file must be interactive,   because not all log files that we need are
necessarilyin the    archive directory. It may be possible that not all the archive files   we need can even fit on
diskat one time. 
   The system needs to be able to prompt the system administrator to feed   it more log files.
   TODO: More here

§5 - Individual file consistent recovery
 §5.1 - Problem:
   If a file detects corruption, and we restore it from backup, how do    we know what archived files we need for
recovery?
   Should file corruption (partial write, bad disk block, etc.) outside    the system catalog cause us to abort the
system,or should we just    take the relation or database off-line? 
   Given a backup file, how do we determine the point in the log    where we should start recovery for the file? What
isthe highest LSN   we can use that will fully recover the file? 
 §5.2 - Proposal:       Put a file header on each file, and update that header to the last   checkpoint LSN at least
onceevery 'file_lsn_time_slack' minutes, or   at least once every dbsize/'file_lsn_log_slack' megabytes of log
written,where dbsize is the estimated size of the database. Have   these values be settable from the config file. These
updateswould be   distributed throughout the hour, or interspersed between regular   amounts of log generation. 
   If we have a database backup program or command, it can update the   header on the file before backup to the
greatestvalue it can assure   to be safe. 
 §5.3 - Status:
   Waiting for input from pgsql-hackers.
   Questions:
       There are alternate methods than using a file header to get a       known-good LSN lower bound for the starting
pointto recover a backup       file. Is this the best way? 

A) The Definitions
   This stuff is obtuse, but I need it here to keep track of what I'm    saying. Someday I should use it consistently
inthe rest of this   document. 
   "system" or "database system":
       A collection of postgres "databases" in one $PGDATA directory,       managed by one postmaster instance at a
time(and having one WAL       log, etc.) 
       All the files composing such a system, as a group.
   "up to date" or "now" or "current" or "current LSN":
       The most recent durable LSN for the system.
   "block consistent copy":
       When referring to a file:
       A copy of a file, which may be written to during the process of       copying, but where each BLCKSZ size block
iscopied atomically. 
       When referring to multiple files (in the same system):
       A copy of all the files, such that each is independently a "block       consistent copy"
   "file consistent copy":
       When referring to a file:
       A copy of a file that is not written to between the start and end       of the copy operation.
       When referring to multiple files (in the same system):
       A copy of all the files, such that each is independently a "file       consistent copy"
   "system consistent copy":
       When referring to a file:              A copy of a file, where the entire system of which it is a member
isnot written to during the copy. 
       When referring to multiple files (in the same system):
       A copy of all the files, where the entire system of which they are       members was not written to between the
startand end of the       copying of all the files, as a group. 
   "shutdown consistent copy":
       When referring to a file:
       A copy of a file, where the entire system of which it is a member       had been cleanly shutdown before the
startof and for the duration       of the copy. 
       When referring to multiple files (in the same system):
       A copy of all the files, where the entire system of which they are       members had been cleanly shutdown
beforethe start of and for the       duration of the copying of all the files, as a group. 
   "consistent copy":
       A block, file, system, or shutdown consistent copy.
   "known-good LSN lower bound"    or "LSN lower bound"   or "LSN-LB":
       When referring to a group of blocks, a file, or a group of files:
       An LSN known to be old enough that no log entries before it are needed       to bring the blocks or files
up-to-date.
   "known-good LSN greatest lower bound"    or "LSN greatest lower bound"    or "LSN-GLB":
       When referring to a group of blocks, a file, or a group of files:
       The greatest possible LSN that is a known-good LSN lower bound for       the group.
   "backup file":
       A consistent copy of a data file used by the system, for which        we have a known-good LSN lower bound.
   "optimal backup file":
       A backup file, for which we have the known-good LSN greatest lower       bound.
   "backup system":

   "Play-Forward File Recovery" or "PFFR":
       The process of bringing an individual backup file up to date.

--
J. R. Nield
jrnield@usol.com







pgsql-hackers by date:

Previous
From: Thomas Lockhart
Date:
Subject: IS DISTINCT FROM and TREAT() committed
Next
From: Tom Lane
Date:
Subject: Re: BETWEEN Node & DROP COLUMN