Thread: Issues Outstanding for Point In Time Recovery (PITR)
Hello:

I've got the logging system to the point where I can take a shutdown
consistent copy of a system, and play forward through multiple
checkpoints. It seems to handle CREATE TABLE/DROP TABLE/TRUNCATE
properly, and things are moving forward well. Recovery to an arbitrary
point-in-time should be just as easy, but will need some administrative
interface for it.

At this point, some input would be useful on how I should handle things.

The most important questions that need answering are in sections 2 & 5,
since they impact the most other parts of the system. They will also
require good documentation for sysadmins.

Issues Outstanding for Point In Time Recovery (PITR)

$Date: 2002/07/04 14:23:37 $
$Revision: 1.4 $
J.R. Nield
(Enc: ISO 8859-15 Latin-9)

§0 - Introduction

  This file is where I'm keeping track of all the issues I run into while
  trying to get PITR to work properly. Hopefully it will evolve into a
  description of how PITR actually works once it is implemented.

  I will also try to add feedback as it comes in.

  The big items so-far are:
    §1 - Logging Relation file creation, truncation, and removal
         This is mostly done. Can do infinite play-forward from
         online logs.
    §2 - Partial-Write and Bad Block detection
         Need input before starting. Migration issues.
    §3 - Detecting Shutdown Consistent System Recovery
         Mostly done.
    §4 - Interactive Play-Forward Recovery for an Entire System
         Need input before starting.
    §5 - Individual file consistent recovery
         Need input. Semi-Major changes required.

§1 - Logging Relation file creation, truncation, and removal

§1.1 - Problem:

  Without file creation in the log, we can't replay committed
  transactions that create relations.

  The current code assumes that any transaction reaching commit has
  already ensured its files exist, and that those files will never be
  removed. This is true now, but not for log-replay from an old backup
  database system.

  The current XLOG code silently ignores block-write requests for
  non-existent files, and assumes that the transaction generating those
  requests must have aborted.

  Right now a crash during TRUNCATE TABLE will leave the table in an
  inconsistent state (partially truncated). This would not work when
  doing replay from before the last checkpoint.

§1.1.1 - CREATE DATABASE is also unlogged

  This will cause the same replay problems as above.

§1.2 - Proposal:

  a) Augment the SMGR code to log relation file operations, and to
     handle redo requests properly. This is simple in the case of
     create. Drop must be logged only IN the commit record. For
     truncate see (b).

     The 'struct f_smgr' needs new operations 'smgr_recreate',
     'smgr_reunlink', and 'smgr_retruncate'. smgr_recreate should
     accept a RelFileNode instead of a Relation. (A rough sketch of
     these entry points follows this list.)

     Transactions that abort through system failure (ie. unlogged
     aborts) will simply continue to leak files.

  b) If TRUNCATE TABLE fails, the system must PANIC. Otherwise, the
     table may be used in a future command, and a replay-recovered
     database may end up with different data than the original.

     WAL must be flushed before truncate as well.

     WAL does not need to be flushed before create, if we don't mind
     leaking files sometimes.

  c) Redo code should treat writes to non-existent files as an error.
     Changes affect heap & nbtree AM's. [Check others]

  d) rtree [and GiST? WTF is GiST? ] is not logged. A replay recovery
     of a database should mark all the rtree indices as corrupt.
     [ actually we should be doing that now; are we? ]

  e) CREATE DATABASE must be logged properly, not use system(cp...)
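  As a point of reference for (a), here is a minimal, self-contained
  sketch of how the proposed redo entry points might be declared and
  used. The RelFileNode stand-in, the argument lists, and the dispatch
  function are illustrative guesses for discussion, not the real
  backend declarations from smgr.h.

    /* Sketch only: stand-in types, not the real backend headers. */
    #include <stdint.h>
    #include <stdio.h>

    typedef struct RelFileNode      /* stand-in for the real struct */
    {
        uint32_t tblNode;           /* database */
        uint32_t relNode;           /* relation file number */
    } RelFileNode;

    typedef uint32_t BlockNumber;

    /*
     * Proposed additions to 'struct f_smgr': redo counterparts that
     * take a RelFileNode, since no Relation cache entry exists while
     * replaying the log.
     */
    typedef struct f_smgr_redo
    {
        int (*smgr_recreate)  (RelFileNode rnode);
        int (*smgr_reunlink)  (RelFileNode rnode);
        int (*smgr_retruncate)(RelFileNode rnode, BlockNumber nblocks);
    } f_smgr_redo;

    /*
     * Example redo dispatch for truncate: per item (c), a missing
     * file is now an error instead of being silently ignored.
     */
    static int
    smgr_redo_truncate(const f_smgr_redo *smgr, RelFileNode rnode,
                       BlockNumber nblocks)
    {
        if (smgr->smgr_retruncate(rnode, nblocks) < 0)
        {
            fprintf(stderr, "redo: truncate of missing file %u/%u\n",
                    (unsigned) rnode.tblNode, (unsigned) rnode.relNode);
            return -1;
        }
        return 0;
    }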
§1.3 - Status:

  All logged SMGR operations are now in a START_CRIT_SECTION()/
  END_CRIT_SECTION() pair enclosing the XLogInsert() and the underlying
  fs operations.

  Code has been added to smgr and xact modules to log:
    create (no XLogFlush)
    truncate (XLogFlush)
    pending deletes on commit record
    files to delete on abort record

  Code added to md.c to support redo ops

  Code added to smgr for RMGR redo/desc callbacks

  Code added to xact RMGR callbacks for redo/desc

  Database will do infinite shutdown consistent system recovery from
  the online logs, if you manually munge the control file to set
  state == DB_IN_PRODUCTION instead of DB_SHUTDOWNED.

  Still need to do:
    Item (c), recovery cleanup in all AM's
    Item (d), logging in other index AM's
    Item (e), CREATE DATABASE stuff

§2 - Partial-Write and Bad Block detection

§2.1 - Problem:

  In order to protect against partial writes without logging pages
  twice, we need to detect partial pages in system files and report
  them to the system administrator. We also might want to be able to
  detect damaged pages from other causes, like memory corruption, OS
  errors, etc., or in the case where the disk doesn't report bad
  blocks, but returns bad data.

  We should also decide what should happen when a file is marked as
  containing corrupt pages, and requires log-archive recovery from a
  backup.

§2.2 - Proposal:

  Add a 1 byte 'pd_flags' field to PageHeaderData, with the following
  flag definitions:

    PD_BLOCK_CHECKING           (1)
    PD_BC_METHOD_BIT            (1<<1)

    PageHasBlockChecking(page)  ((page)->pd_flags & PD_BLOCK_CHECKING)
    PageBCMethodIsCRC64(page)   ((page)->pd_flags & PD_BC_METHOD_BIT)
    PageBCMethodIsLSNLast(page) (!PageBCMethodIsCRC64(page))

  The last 64 bits of a page are reserved for use by the block checking
  code.

  [ Is it worth the trouble to allow the last 8 bytes of a page to
    contain data when block checking is turned off for a Page? This
    proposal does not allow that. ]

  If the block checking method is CRC64, then that field will contain
  the CRC64 of the block computed at write time.

  If the block checking method is LSNLast, then the field contains a
  duplicate of the pd_lsn field.

§2.2.1 - Changes to Page handling routines

  All the page handling routines need to understand that
    pd_special == BLCKSZ - (specialSize + 8)

  Change header comment in bufpage.h to reflect this.

§2.2.2 - When Reading a Page

  Block corruption is detected on read in the obvious way with CRC64.

  In the case of LSNLast, we check to see if pd_lsn == the lsn in the
  last 64 bits of the page. If not, we assume the page is corrupt from
  a partial write (although it could be something else). (A small
  sketch of this check appears after §2.3 below.)

  IMPORTANT ASSUMPTION:
    The OS/disk device will never write both the first part and last
    part of a block without writing the middle as well. This might be
    wrong in some cases, but at least it's fast.

§2.2.4 - GUC Variables

  The user should be able to configure what method is used:

    block_checking_write_method = [ checksum | torn_page_flag | none ]

      Which method should be used for blocks we write?

    check_blocks_on_read = [ true | false ]

      When true, verify that the blocks we read are not corrupt, using
      whatever method is in the block header.

      When false, ignore the block checking information.

§2.3 - Status:

  Waiting for input from pgsql-hackers.

  Questions:

    Should we allow the user to have more detailed control over which
    parts of a database use block checking?

    For example: use 'checksum' on all system catalogs in all
    databases, 'torn_page_flag' on the non-catalog parts of the
    production database, and 'none' on everything else?
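  For what it's worth, the LSNLast (torn_page_flag) check in §2.2.2
  amounts to stamping a copy of the head-of-page LSN into the reserved
  last 8 bytes at write time and comparing the two on read. Below is a
  self-contained sketch of just that comparison, using a plain 64-bit
  integer in place of the real PageHeaderData/XLogRecPtr layout, so the
  names and offsets are illustrative rather than the actual page format.

    /* Sketch only: a "page" is BLCKSZ bytes with a 64-bit LSN at the
     * front and the reserved block-checking field in the last 8 bytes. */
    #include <stdint.h>
    #include <string.h>

    #define BLCKSZ 8192

    /* Before the block is handed to write(): duplicate the LSN into
     * the last 8 bytes of the page. */
    static void
    page_stamp_lsn_last(char *page)
    {
        memcpy(page + BLCKSZ - sizeof(uint64_t), page, sizeof(uint64_t));
    }

    /* After read(): if head and tail disagree, the first and last
     * sectors came from different writes -- assume a torn page. */
    static int
    page_is_torn(const char *page)
    {
        uint64_t head;
        uint64_t tail;

        memcpy(&head, page, sizeof(head));
        memcpy(&tail, page + BLCKSZ - sizeof(tail), sizeof(tail));
        return head != tail;
    }

  The CRC64 method would instead store a checksum of the whole block in
  the same 8 bytes and recompute it on read; only the comparison step
  differs.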
§3 - Detecting Shutdown Consistent System Recovery

§3.1 - Problem:

  How to notice that we need to do log-replay for a system backup, when
  the restored control file points to a shutdown checkpoint record that
  is before the most recent checkpoint record in the log, and may point
  into an archived file.

§3.2 - Proposal:

  At startup, after reading the ControlFile, scan the log directory to
  get the list of active log files, and find the lowest logId and
  logSeg of the files. Ensure that the files cover a contiguous range
  of LSN's.

  There are three cases:

    1) ControlFile points to the last valid checkpoint (either
       checkPoint or prevCheckPoint, but one of them is the greatest
       valid checkpoint record in the log stream).

    2) ControlFile points to a valid checkpoint record in an active
       log file, but there are more valid checkpoint records beyond
       it.

    3) ControlFile points to a checkpoint record that should be in the
       archive logs, and is presumably valid.

  Case 1 is what we handle now.

  Cases 2 and 3 would result from restoring an entire system from
  backup in preparation to do a play-forward recovery.

  We need to:

    Detect cases 2 and 3.

    Alert the administrator and abort startup.
    [Question: Is this always the desired behavior? We can handle
     case 2 without intervention. ]

    Let the administrator start a standalone backend, and perform a
    play-forward recovery for the system.

§3.3 - Status:

  In progress.

§4 - Interactive Play-Forward Recovery for an Entire System

  Play-Forward File Recovery from a backup file must be interactive,
  because not all log files that we need are necessarily in the archive
  directory. It may be possible that not all the archive files we need
  can even fit on disk at one time.

  The system needs to be able to prompt the system administrator to
  feed it more log files.

  TODO: More here

§5 - Individual file consistent recovery

§5.1 - Problem:

  If a file detects corruption, and we restore it from backup, how do
  we know what archived files we need for recovery?

  Should file corruption (partial write, bad disk block, etc.) outside
  the system catalog cause us to abort the system, or should we just
  take the relation or database off-line?

  Given a backup file, how do we determine the point in the log where
  we should start recovery for the file? What is the highest LSN we can
  use that will fully recover the file?

§5.2 - Proposal:

  Put a file header on each file, and update that header to the last
  checkpoint LSN at least once every 'file_lsn_time_slack' minutes, or
  at least once every dbsize/'file_lsn_log_slack' megabytes of log
  written, where dbsize is the estimated size of the database. Have
  these values be settable from the config file. These updates would be
  distributed throughout the hour, or interspersed between regular
  amounts of log generation. (A rough sketch of such a header appears
  after the definitions section below.)

  If we have a database backup program or command, it can update the
  header on the file before backup to the greatest value it can assure
  to be safe.

§5.3 - Status:

  Waiting for input from pgsql-hackers.

  Questions:

    There are methods other than using a file header to get a
    known-good LSN lower bound for the starting point to recover a
    backup file. Is this the best way?

A) The Definitions

  This stuff is obtuse, but I need it here to keep track of what I'm
  saying. Someday I should use it consistently in the rest of this
  document.

  "system" or "database system":

      A collection of postgres "databases" in one $PGDATA directory,
      managed by one postmaster instance at a time (and having one WAL
      log, etc.)

      All the files composing such a system, as a group.
"up to date" or "now" or "current" or "current LSN": The most recent durable LSN for the system. "block consistent copy": When referring to a file: A copy of a file, which may be written to during the process of copying, but where each BLCKSZ size block iscopied atomically. When referring to multiple files (in the same system): A copy of all the files, such that each is independently a "block consistent copy" "file consistent copy": When referring to a file: A copy of a file that is not written to between the start and end of the copy operation. When referring to multiple files (in the same system): A copy of all the files, such that each is independently a "file consistent copy" "system consistent copy": When referring to a file: A copy of a file, where the entire system of which it is a member isnot written to during the copy. When referring to multiple files (in the same system): A copy of all the files, where the entire system of which they are members was not written to between the startand end of the copying of all the files, as a group. "shutdown consistent copy": When referring to a file: A copy of a file, where the entire system of which it is a member had been cleanly shutdown before the startof and for the duration of the copy. When referring to multiple files (in the same system): A copy of all the files, where the entire system of which they are members had been cleanly shutdown beforethe start of and for the duration of the copying of all the files, as a group. "consistent copy": A block, file, system, or shutdown consistent copy. "known-good LSN lower bound" or "LSN lower bound" or "LSN-LB": When referring to a group of blocks, a file, or a group of files: An LSN known to be old enough that no log entries before it are needed to bring the blocks or files up-to-date. "known-good LSN greatest lower bound" or "LSN greatest lower bound" or "LSN-GLB": When referring to a group of blocks, a file, or a group of files: The greatest possible LSN that is a known-good LSN lower bound for the group. "backup file": A consistent copy of a data file used by the system, for which we have a known-good LSN lower bound. "optimal backup file": A backup file, for which we have the known-good LSN greatest lower bound. "backup system": "Play-Forward File Recovery" or "PFFR": The process of bringing an individual backup file up to date. -- J. R. Nield jrnield@usol.com
On Thu, 2002-07-04 at 11:45, J. R. Nield wrote:

One other item that should be here:

> The big items so-far are:
>     §1 - Logging Relation file creation, truncation, and removal
>          This is mostly done. Can do infinite play-forward from
>          online logs.
>     §2 - Partial-Write and Bad Block detection
>          Need input before starting. Migration issues.
>     §3 - Detecting Shutdown Consistent System Recovery
>          Mostly done.
>     §4 - Interactive Play-Forward Recovery for an Entire System
>          Need input before starting.
>     §5 - Individual file consistent recovery
>          Need input. Semi-Major changes required.

  §6 - btbuild is not logged

      Not logged because of the same assumptions as for file create.
      Only need to log the index build parameters to recreate the
      index, not each page change. (A hypothetical record layout is
      sketched below.)

-- 
J. R. Nield
jrnield@usol.com
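One way to read "log the index build parameters" in §6 is as a single
WAL record from which replay re-runs the entire build, rather than
logging each page the build writes. A hypothetical record layout along
those lines; the name, the fields, and the trailing key-column array
are all guesses, not an actual PostgreSQL WAL record:

    #include <stdint.h>

    /* Hypothetical "redo the index build" record body. */
    typedef struct xl_index_build
    {
        uint32_t heap_tblNode;     /* heap relation's RelFileNode */
        uint32_t heap_relNode;
        uint32_t index_tblNode;    /* index relation's RelFileNode */
        uint32_t index_relNode;
        uint16_t natts;            /* number of key columns */
        uint16_t unique;           /* non-zero for a unique index */
        /* followed by natts key column numbers and operator classes */
    } xl_index_build;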
I noticed no one has responded to your questions yet.  I think it is
because we are sort of in shock.  We have needed point-in-time recovery
for a long time, but the people who were capable of doing it weren't
able to focus on it.  Then, all of a sudden, we get an email from
someone who is focusing on it and wants to get the job done.  GREAT!

I will give you my short analysis and see how many other questions I
can answer.

We have always known there was a way to do PITR with WAL, but WAL
needed a few extra pieces of information.  Unfortunately, we weren't
able to analyze what was required.  Seems you have gotten very far
here, and that is great.  Also, we thought about having PITR as part of
replication (replay of replication traffic logs) but having it tied to
WAL is much cleaner and has better performance, I bet.

I will do whatever I can to help.  My chat addresses are:

    AIM    bmomjian
    ICQ    151255111
    Yahoo  bmomjian
    MSN    root@candle.pha.pa.us
    IRC    #postgresql via efnet

---------------------------------------------------------------------------

J. R. Nield wrote:
> Hello:
>
> I've got the logging system to the point where I can take a shutdown
> consistent copy of a system, and play forward through multiple
> checkpoints. It seems to handle CREATE TABLE/DROP TABLE/TRUNCATE

Yes, that was always something we knew was lacking in the current WAL
contents.

> properly, and things are moving forward well. Recovery to an arbitrary
> point-in-time should be just as easy, but will need some administrative
> interface for it.

The administrative part can be done easily.  We can get that part done.
It is the low-level stuff we always needed help with.

> At this point, some input would be useful on how I should handle things.
>
> The most important questions that need answering are in sections 2 & 5,
> since they impact the most other parts of the system. They will also
> require good documentation for sysadmins.

> §0 - Introduction
>
> This file is where I'm keeping track of all the issues I run into while
> trying to get PITR to work properly. Hopefully it will evolve into a
> description of how PITR actually works once it is implemented.
>
> I will also try to add feedback as it comes in.
>
> The big items so-far are:
> §1 - Logging Relation file creation, truncation, and removal
>      This is mostly done. Can do infinite play-forward from
>      online logs.

Excellent!

> §2 - Partial-Write and Bad Block detection
>      Need input before starting. Migration issues.

Uh, we do log pre-page writes to WAL to recover from partial page
writes to disk.  Is there something more we need here?

As for bad block detection, we have thought about adding a CRC to each
page header, or at least making it optional.  WAL already has a CRC.

> §3 - Detecting Shutdown Consistent System Recovery
>      Mostly done.

> §4 - Interactive Play-Forward Recovery for an Entire System
>      Need input before starting.

You mean user interface?  My idea would be to just get some
command-line tool working and we can write some GUI app to manage it
and use the command-line tool as an interface into the system.

> §5 - Individual file consistent recovery
>      Need input. Semi-Major changes required.
>
> OK, here are the specific questions.

Got it.

> §1 - Logging Relation file creation, truncation, and removal
>
> §1.1 - Problem:
>
> Without file creation in the log, we can't replay committed
> transactions that create relations.
>
> The current code assumes that any transaction reaching commit has already
> ensured its files exist, and that those files will never be removed. This
> is true now, but not for log-replay from an old backup database system.
>
> The current XLOG code silently ignores block-write requests for
> non-existent files, and assumes that the transaction generating those
> requests must have aborted.
>
> Right now a crash during TRUNCATE TABLE will leave the table in an
> inconsistent state (partially truncated). This would not work when doing
> replay from before the last checkpoint.

Yes, there are a few places where we actually create a file, and if the
server crashes, the file remains out there forever.  We need to track
that better.  I think we may need some utility that compares pg_class
with the files in the directory and cleans out unused files on server
startup.  I started working on such code as part of VACUUM but it made
too many assumptions because it knew other backends were working at the
same time.  On recovery, you don't have that problem and can easily do
almost an 'ls' and clean out just what's left over from the crash.
Seems that would solve several of those problems.

>
> §1.1.1 - CREATE DATABASE is also unlogged
>
> This will cause the same replay problems as above.

Yep.  Again, seems a master cleanup on startup is needed.

> §1.2 - Proposal:
>
> a) Augment the SMGR code to log relation file operations, and to handle
> redo requests properly. This is simple in the case of create. Drop must be
> logged only IN the commit record. For truncate see (b).

Yep, we knew we needed that.

> The 'struct f_smgr' needs new operations 'smgr_recreate', 'smgr_reunlink',
> and 'smgr_retruncate'. smgr_recreate should accept a RelFileNode instead
> of a Relation.

No problem.  Clearly required.

> Transactions that abort through system failure (ie. unlogged aborts)
> will simply continue to leak files.

Yep, need a cleanup process on start.

> b) If TRUNCATE TABLE fails, the system must PANIC. Otherwise, the table
> may be used in a future command, and a replay-recovered database may
> end-up with different data than the original.

We number based on oids.  You mean oid wraparound could cause the file
to be used again?

> WAL must be flushed before truncate as well.
>
> WAL does not need to be flushed before create, if we don't mind
> leaking files sometimes.

Cleanup?

> c) Redo code should treat writes to non-existent files as an error.
> Changes affect heap & nbtree AM's. [Check others]

Yep, once you log create/drop, if something doesn't match, it is an
error, while before, we could ignore it.

> d) rtree [and GiST? WTF is GiST? ] is not logged. A replay recovery of
> a database should mark all the rtree indices as corrupt.
> [ actually we should do that now, are we? ]

Known problem.  Not sure what is being done.  TODO has:

    * Add WAL index reliability improvement to non-btree indexes

so it is a known problem, and we aren't doing anything about it.  What
more can I say?  ;-)

> e) CREATE DATABASE must be logged properly, not use system(cp...)

OK, should be interesting.

> §1.3 - Status:
>
> All logged SMGR operations are now in a START_CRIT_SECTION()/
> END_CRIT_SECTION() pair enclosing the XLogInsert() and the underlying fs
> operations.
>
> Code has been added to smgr and xact modules to log:
>     create (no XLogFlush)
>     truncate (XLogFlush)
>     pending deletes on commit record
>     files to delete on abort record
>
> Code added to md.c to support redo ops
>
> Code added to smgr for RMGR redo/desc callbacks
>
> Code added to xact RMGR callbacks for redo/desc
>
> Database will do infinite shutdown consistent system recovery from the
> online logs, if you manually munge the control file to set state ==
> DB_IN_PRODUCTION instead of DB_SHUTDOWNED.

Wow, how did you get so far?

> Still need to do:
>     Item (c), recovery cleanup in all AM's
>     Item (d), logging in other index AM's
>     Item (e), CREATE DATABASE stuff
>
>
> §2 - Partial-Write and Bad Block detection
>
> §2.1 - Problem:
>
> In order to protect against partial writes without logging pages
> twice, we need to detect partial pages in system files and report them
> to the system administrator. We also might want to be able to detect
> damaged pages from other causes, like memory corruption, OS errors,
> etc. or in the case where the disk doesn't report bad blocks, but
> returns bad data.

Interesting.  We just had a discussion about MSSQL torn-page bits on
every 512-byte block that are set to the same value before the write.
On recovery, if any of the bits in a block are different, they
recommend recovery using PITR.  We don't have PITR (yet) so there was
no need to implement it (we just logged whole pages to WAL before
writing).  I think we may go with just a CRC per page for
partial-write detection.

> We should also decide what should happen when a file is marked as
> containing corrupt pages, and requires log-archive recovery from a
> backup.

We can offer the option of no-wal change-page writing which will
require PITR on partial-write detection, or they can keep the existing
system with the performance hit.

>
> §2.2 - Proposal:
>
> Add a 1 byte 'pd_flags' field to PageHeaderData, with the following
> flag definitions:
>
> PD_BLOCK_CHECKING (1)
> PD_BC_METHOD_BIT (1<<1)
>
> PageHasBlockChecking(page) ((page)->pd_flags & PD_BLOCK_CHECKING)
> PageBCMethodIsCRC64(page) ((page)->pd_flags & PD_BC_METHOD_BIT)
> PageBCMethodIsLSNLast(page) (!PageBCMethodIsCRC64(page))
>
> The last 64 bits of a page are reserved for use by the block checking
> code.

OK, so you already are on the CRC route.

> [ Is it worth the trouble to allow the last 8 bytes of a
> page to contain data when block checking is turned off for a Page?
> This proposal does not allow that. ]

You can leave the 8-byte field empty if there is no CRC.  You may want
to turn CRC on/off without a dump.

> If the block checking method is CRC64, then that field will contain
> the CRC64 of the block computed at write time.

Cool.

> If the block checking method is LSNLast, then the field contains a
> duplicate of the pd_lsn field.
>
> §2.2.1 - Changes to Page handling routines
>
> All the page handling routines need to understand that
> pd_special == BLCKSZ - (specialSize + 8)
>
> Change header comment in bufpage.h to reflect this.

Yes, we should add a format version to the heap page tail anyway like
btree has, i.e. some constant on every page that describes the format
used in that PostgreSQL version.

> §2.2.2 - When Reading a Page
>
> Block corruption is detected on read in the obvious way with CRC64.
>
> In the case of LSNLast, we check to see if pd_lsn == the lsn in the
> last 64 bits of the page. If not, we assume the page is corrupt from
> a partial write (although it could be something else).

LSN?
> IMPORTANT ASSUMPTION: > The OS/disk device will never write both the first part and > last part of a block without writing the middle as well. > This might be wrong in some cases, but at least it's fast. > > ?2.2.4 - GUC Variables > > The user should be able to configure what method is used: > > block_checking_write_method = [ checksum | torn_page_flag | none ] > > Which method should be used for blocks we write? Do we want torn page flag? Seems like a pain to get that on every 512 byte section of the 8k page. > check_blocks_on_read = [ true | false ] > > When true, verify that the blocks we read are not corrupt, using > whatever method is in the block header. > > When false, ignore the block checking information. > Good idea. We always check on crash, but check on read only when set. Good for detecting hardware problems. > ?2.3 - Status: > > Waiting for input from pgsql-hackers. > > Questions: > > Should we allow the user to have more detailed control over > which parts of a database use block checking? I don't think that is needed; installation-wide settings are fine. > For example: use 'checksum' on all system catalogs in all databases, > 'torn_page_flag' on the non-catalog parts of the production database, > and 'none' on everything else? Too complicated. Let's get it implemented and in the field and see what people ask for. > ?3 - Detecting Shutdown Consistent System Recovery > > ?3.1 - Problem: > > How to notice that we need to do log-replay for a system backup, when the > restored control file points to a shutdown checkpoint record that is > before the most recent checkpoint record in the log, and may point into > an archived file. > > ?3.2 - Proposal: > > At startup, after reading the ControlFile, scan the log directory to > get the list of active log files, and find the lowest logId and > logSeg of the files. Ensure that the files cover a contiguous range > of LSN's. > > There are three cases: > > 1) ControlFile points to the last valid checkpoint (either > checkPoint or prevCheckPoint, but one of them is the greatest > valid checkpoint record in the log stream). > > 2) ControlFile points to a valid checkpoint record in an active > log file, but there are more valid checkpoint records beyond > it. > > 3) ControlFile points to a checkpoint record that should be in the > archive logs, and is presumably valid. > > Case 1 is what we handle now. > > Cases 2 and 3 would result from restoring an entire system from > backup in preparation to do a play-forward recovery. > > We need to: > > Detect cases 2 and 3. > > Alert the administrator and abort startup. > [Question: Is this always the desired behavior? We can > handle case 2 without intervention. ] > > Let the administrator start a standalone backend, and > perform a play-forward recovery for the system. > > ?3.3 - Status: > > In progress. Sorry, I was confused by this. > ?4 - Interactive Play-Forward Recovery for an Entire System > > Play-Forward File Recovery from a backup file must be interactive, > because not all log files that we need are necessarily in the > archive directory. It may be possible that not all the archive files > we need can even fit on disk at one time. > > The system needs to be able to prompt the system administrator to feed > it more log files. > > TODO: More here Yes, we can have someone working on the GUI once the command-line interface is defined. 
> ?5 - Individual file consistent recovery > > ?5.1 - Problem: > > If a file detects corruption, and we restore it from backup, how do > we know what archived files we need for recovery? > > Should file corruption (partial write, bad disk block, etc.) outside > the system catalog cause us to abort the system, or should we just > take the relation or database off-line? Offline is often best so they can get in there and recover if needed. We usually allow them in with a special flag or utility like pg_resetxlog. > Given a backup file, how do we determine the point in the log > where we should start recovery for the file? What is the highest LSN > we can use that will fully recover the file? That is tricky. We have discussed it and your backup has to deal with some pretty strange things that can happen while 'tar' is traversing the directory. > ?5.2 - Proposal: > > Put a file header on each file, and update that header to the last > checkpoint LSN at least once every 'file_lsn_time_slack' minutes, or > at least once every dbsize/'file_lsn_log_slack' megabytes of log > written, where dbsize is the estimated size of the database. Have > these values be settable from the config file. These updates would be > distributed throughout the hour, or interspersed between regular > amounts of log generation. > > If we have a database backup program or command, it can update the > header on the file before backup to the greatest value it can assure > to be safe. Not sure. > ?5.3 - Status: > > Waiting for input from pgsql-hackers. > > Questions: > > There are alternate methods than using a file header to get a > known-good LSN lower bound for the starting point to recover a backup > file. Is this the best way? Not sure. I am sure others will chime in with more information. -- Bruce Momjian | http://candle.pha.pa.us pgman@candle.pha.pa.us | (610) 853-3000+ If your life is a hard drive, | 830 Blythe Avenue + Christ can be your backup. | Drexel Hill, Pennsylvania19026
J. R. Nield wrote: > ?5 - Individual file consistent recovery > > ?5.1 - Problem: > > If a file detects corruption, and we restore it from backup, how do > we know what archived files we need for recovery? > > Should file corruption (partial write, bad disk block, etc.) outside > the system catalog cause us to abort the system, or should we just > take the relation or database off-line? > > Given a backup file, how do we determine the point in the log > where we should start recovery for the file? What is the highest LSN > we can use that will fully recover the file? > > ?5.2 - Proposal: > > Put a file header on each file, and update that header to the last > checkpoint LSN at least once every 'file_lsn_time_slack' minutes, or > at least once every dbsize/'file_lsn_log_slack' megabytes of log > written, where dbsize is the estimated size of the database. Have > these values be settable from the config file. These updates would be > distributed throughout the hour, or interspersed between regular > amounts of log generation. > > If we have a database backup program or command, it can update the > header on the file before backup to the greatest value it can assure > to be safe. I know there was discussion about this. The issue was when you are doing the backup, how do you handle changes in the file that happen during the backup? I think there was some idea of remembering the WAL pointer location at the start of the tar backup, and somehow playing forward from that point. Now, the trick is to know how much of the WAL is duplicated in the backup, and how much needs to be applied to roll forward. Because we have a non-overwriting storage manager, I think one idea was to just replay everything, knowing that some of it may be unnecessary. However, VACUUM FULL would complicate that because it does overwrite tuples and you may get into trouble playing all of that back. I am sure someone has analyzed this better than me. -- Bruce Momjian | http://candle.pha.pa.us pgman@candle.pha.pa.us | (610) 853-3000+ If your life is a hard drive, | 830 Blythe Avenue + Christ can be your backup. | Drexel Hill, Pennsylvania19026
On Fri, 2002-07-05 at 01:42, Bruce Momjian wrote: > > We have needed > point-in-time recovery for a long time, Most thanks should go to vadim (and whoever else worked on this), since his WAL code already does most of the work. The key thing is auditing the backend to look for every case where we assume some action is not visible until after commit, and therefore don't log its effects. Those are the main cases that must be changed. > --------------------------------------------------------------------------- > > J. R. Nield wrote: > > Hello: > > > > I've got the logging system to the point where I can take a shutdown > > consistent copy of a system, and play forward through multiple > > checkpoints. It seems to handle CREATE TABLE/DROP TABLE/TRUNCATE But notably not for the btree indexes! It looked like they were working, because the files were there, and all indexes created before the backup would work under insert/delete (including sys catalog indexes). This is because btree insert/delete is logged, just not during build. So I missed that one case. You will end-up with up-to-date table data though, so it is something. Adding logging support to btbuild is the next step, and I don't think it should be too hard. I am working this now. It is also a major advantage that most everything in the system gets stored in the catalog tables, and so is logged already. > Uh, we do log pre-page writes to WAL to recover from partial page > writes to disk. Is there something more we need here? > > As for bad block detection, we have thought about adding a CRC to each > page header, or at least making it optional. WAL already has a CRC. > Yes this should be last to do, because it is not necessary for PITR, only for performance (the option not to write pre-images without fear of data loss). > Yes, there are a few places where we actually create a file, and if the > server crashes, the file remains out there forever. We need to track that > better. OK, there is a bigger problem then just tracking the file though. We sometimes do stuff to that file that we don't log. We assume that if we commit, the file must be OK and will not need replay because the transaction would not have committed if the file was not in a commitable state. If we abort, the system never sees the file, so in a sense we undo everything we did to the file. It is a kind of poor-man's rollback for certain operations, like btbuild, create table, etc. But it means that we cannot recover the file from the log, even after a commit. > > > > > ?1.1.1 - CREATE DATABASE is also unlogged > > > > This will cause the same replay problems as above. > > Yep. Again, seems a master cleanup on startup is needed. The cleanup is not the problem, only a nuisance. Creating the files during replay is the problem. I must recreate CREATE DATABASE from the log exactly as it was done originally. I think just logging the parameters to the command function should be sufficient, but I need to think more about it. > > > b) If TRUNCATE TABLE fails, the system must PANIC. Otherwise, the table > > may be used in a future command, and a replay-recovered database may > > end-up with different data than the original. > > We number based on oids. You mean oid wraparound could cause the file > to be used again? That's not what I meant. Let's say I issue 'TRUNCATE TABLE foo'. Then, right before smgrtruncate is called, I do an XLogInsert saying "Redo a TRUNCATE TABLE on foo to nblocks if we crash". 
Then smgrtruncate fails and we do an elog(ERROR) Now the user decides that since TRUNCATE TABLE didn't work, he might as well use the table, so he inserts some records into it, generating log entries. When I replay this log sequence later, what happens if the TRUNCATE succeeds instead of failing? I admit that there are other ways of handling it than to PANIC if the truncate fails. All the ones I can come up with seem to amount to some kind of ad-hoc UNDO log. > > > WAL must be flushed before truncate as well. > > > > WAL does not need to be flushed before create, if we don't mind > > leaking files sometimes. > > Cleanup? Yes, we could garbage-collect leaked files. XLogFlush is not that expensive though, so I don't have an opinion on this yet. > > > c) Redo code should treat writes to non-existent files as an error. > > Changes affect heap & nbtree AM's. [Check others] > > Yep, once you log create/drop, if something doesn't match, it is an > error, while before, we could ignore it. > > > d) rtree [and GiST? WTF is GiST? ] is not logged. A replay recovery of > > a database should mark all the rtree indices as corrupt. > > [ actually we should do that now, are we? ] > > Known problem. Not sure what is being done. TODO has: > > * Add WAL index reliability improvement to non-btree indexes > > so it is a known problem, and we aren't doing anything about it. What > more can I say? ;-) Once the other stuff works reliably, I will turn to rtree logging, which I have looked at somewhat, although I could really use a copy of the paper it is supposed to be based on. I have not figured out GiST enough to work on it yet. > > > e) CREATE DATABASE must be logged properly, not use system(cp...) > > OK, should be interesting. > > > ?1.3 - Status: > > > > All logged SMGR operations are now in a START_CRIT_SECTION()/ > > END_CRIT_SECTION() pair enclosing the XLogInsert() and the underlying fs > > operations. > > > > Code has been added to smgr and xact modules to log: > > create (no XLogFlush) > > truncate (XLogFlush) > > pending deletes on commit record > > files to delete on abort record > > > > Code added to md.c to support redo ops > > > > Code added to smgr for RMGR redo/desc callbacks > > > > Code added to xact RMGR callbacks for redo/desc > > > > Database will do infinite shutdown consistent system recovery from the > > online logs, if you manually munge the control file to set state == > > DB_IN_PRODUCTION instead of DB_SHUTDOWNED. > > Wow, how did you get so far? Because it was almost there to start with :-) Besides, it sounded better before I realized there was still a remaining problem with btree logging to fix. > > In the case of LSNLast, we check to see if pd_lsn == the lsn in the > > last 64 bits of the page. If not, we assume the page is corrupt from > > a partial write (although it could be something else). > > LSN? Log Sequence Number (XLogRecPtr) > > > IMPORTANT ASSUMPTION: > > The OS/disk device will never write both the first part and > > last part of a block without writing the middle as well. > > This might be wrong in some cases, but at least it's fast. > > > > ?2.2.4 - GUC Variables > > > > The user should be able to configure what method is used: > > > > block_checking_write_method = [ checksum | torn_page_flag | none ] > > > > Which method should be used for blocks we write? > > Do we want torn page flag? Seems like a pain to get that on every 512 > byte section of the 8k page. Ok, this section (2.2) was badly written and hard to understand. 
What I am proposing is that we put a copy of the log sequence number, which is at the head of the page, into the 8 byte field that we are creating at the end of the page, in place of the CRC. The log sequence number increases every time the page is written (it already does this). I have called this method 'LSNLast' internally, and the user would call it the 'torn_page_flag' method. So when we read the page, we compare the Log Sequence Number at the beginning and end of the page, and if they are different we assume a torn page. This version is weaker than the MS one we were talking about, because it is not on every 512 byte section of the page, only the beginning and the end. I'm simply looking for a fast alternative to CRC64, that doesn't require massive reorganization of the page layout code. > > > ?2.3 - Status: > > > > Waiting for input from pgsql-hackers. > > > > Questions: > > > > Should we allow the user to have more detailed control over > > which parts of a database use block checking? > > I don't think that is needed; installation-wide settings are fine. > > > For example: use 'checksum' on all system catalogs in all databases, > > 'torn_page_flag' on the non-catalog parts of the production database, > > and 'none' on everything else? > > Too complicated. Let's get it implemented and in the field and see what > people ask for. Ok. I agree. > > > ?3 - Detecting Shutdown Consistent System Recovery > > > > ?3.3 - Status: > > > > In progress. > > Sorry, I was confused by this. Let me re-write it, and I'll post it in the next version. The section dealt with what to do when you have a valid restored controlfile from a backup system, which is in the DB_SHUTDOWNED state, and that points to a valid shutdown/checkpoint record in the log; only the checkpoint record happens not to be the last one in the log. This is a situation that could never happen now, but would in PITR. > > > ?4 - Interactive Play-Forward Recovery for an Entire System > > > > Play-Forward File Recovery from a backup file must be interactive, > > because not all log files that we need are necessarily in the > > archive directory. It may be possible that not all the archive files > > we need can even fit on disk at one time. > > > > The system needs to be able to prompt the system administrator to feed > > it more log files. > > > > TODO: More here > > Yes, we can have someone working on the GUI once the command-line > interface is defined. Yes, and the system must not allow any concurrent activity during recovery either. So it looks like a standalone backend operation. > > > ?5 - Individual file consistent recovery > > > > ?5.1 - Problem: > > > > If a file detects corruption, and we restore it from backup, how do > > we know what archived files we need for recovery? > > > > Should file corruption (partial write, bad disk block, etc.) outside > > the system catalog cause us to abort the system, or should we just > > take the relation or database off-line? > > Offline is often best so they can get in there and recover if needed. > We usually allow them in with a special flag or utility like > pg_resetxlog. > > > > Given a backup file, how do we determine the point in the log > > where we should start recovery for the file? What is the highest LSN > > we can use that will fully recover the file? > > That is tricky. We have discussed it and your backup has to deal with > some pretty strange things that can happen while 'tar' is traversing the > directory. 
Even if we shut down before we copy the file, we don't want a file that hasn't been written to in 5 weeks before it was backed up to require five weeks of old log files to recover. So we need to track that information somehow, because right now if we scanned the blocks in the file looking at the page LSN's, the greatest LSN we would see might be much older than where it would be safe to recover from. That is the biggest problem, I think. > > > §5.2 - Proposal: > > Put a file header on each file, and update that header to the last > checkpoint LSN at least once every 'file_lsn_time_slack' minutes, or > at least once every dbsize/'file_lsn_log_slack' megabytes of log > written, where dbsize is the estimated size of the database. Have > these values be settable from the config file. These updates would be > distributed throughout the hour, or interspersed between regular > amounts of log generation. > > If we have a database backup program or command, it can update the > header on the file before backup to the greatest value it can assure > to be safe. > > Not sure. > > > §5.3 - Status: > > > > Waiting for input from pgsql-hackers. > > > > Questions: > > > > There are alternate methods than using a file header to get a > > known-good LSN lower bound for the starting point to recover a backup > > file. Is this the best way? > > Not sure. > > I am sure others will chime in with more information. > > -- > Bruce Momjian | http://candle.pha.pa.us > pgman@candle.pha.pa.us | (610) 853-3000 > + If your life is a hard drive, | 830 Blythe Avenue > + Christ can be your backup. | Drexel Hill, Pennsylvania 19026 > -- J. R. Nield jrnield@usol.com
> Let me re-write it, and I'll post it in the next version. The section > dealt with what to do when you have a valid restored controlfile from a > backup system, which is in the DB_SHUTDOWNED state, and that points to a > valid shutdown/checkpoint record in the log; only the checkpoint record > happens not to be the last one in the log. This is a situation that > could never happen now, but would in PITR. But it would need to be restore's responsibility to set the flag to DB_IN_PRODUCTION, no? > Even if we shutdown before we copy the file, we don't want a file that > hasn't been written to in 5 weeks before it was backed up to require > five weeks of old log files to recover. So we need to track that > information somehow, because right now if we scanned the blocks in the > file looking for at the page LSN's, we greatest LSN we would see might > be much older than where it would be safe to recover from. That is the > biggest problem, I think. Well, if you skip a validity test it could be restore's responsibility to know which checkpoint was last before the file backup was taken. (When doing a backup you would need to include the last checkpoint info == pg_control at start of backup) Andreas
J. R. Nield wrote: > On Fri, 2002-07-05 at 01:42, Bruce Momjian wrote: > > > > We have needed > > point-in-time recovery for a long time, > > Most thanks should go to vadim (and whoever else worked on this), since > his WAL code already does most of the work. The key thing is auditing > the backend to look for every case where we assume some action is not > visible until after commit, and therefore don't log its effects. Those > are the main cases that must be changed. Yep. Glad you can focus on that. > > --------------------------------------------------------------------------- > > > > J. R. Nield wrote: > > > Hello: > > > > > > I've got the logging system to the point where I can take a shutdown > > > consistent copy of a system, and play forward through multiple > > > checkpoints. It seems to handle CREATE TABLE/DROP TABLE/TRUNCATE > > But notably not for the btree indexes! It looked like they were working, > because the files were there, and all indexes created before the backup > would work under insert/delete (including sys catalog indexes). This is > because btree insert/delete is logged, just not during build. So I > missed that one case. > > You will end-up with up-to-date table data though, so it is something. > > Adding logging support to btbuild is the next step, and I don't think it > should be too hard. I am working this now. Great. > It is also a major advantage that most everything in the system gets > stored in the catalog tables, and so is logged already. > > > > Uh, we do log pre-page writes to WAL to recover from partial page > > writes to disk. Is there something more we need here? > > > > As for bad block detection, we have thought about adding a CRC to each > > page header, or at least making it optional. WAL already has a CRC. > > > > Yes this should be last to do, because it is not necessary for PITR, > only for performance (the option not to write pre-images without fear of > data loss). Yep. > > Yes, there are a few places where we actually create a file, and if the > > server crashes, the file remains out there forever. We need to track that > > better. > > OK, there is a bigger problem then just tracking the file though. We > sometimes do stuff to that file that we don't log. We assume that if we > commit, the file must be OK and will not need replay because the > transaction would not have committed if the file was not in a commitable > state. If we abort, the system never sees the file, so in a sense we > undo everything we did to the file. It is a kind of poor-man's rollback > for certain operations, like btbuild, create table, etc. But it means > that we cannot recover the file from the log, even after a commit. Yep. > > > > > > > > ?1.1.1 - CREATE DATABASE is also unlogged > > > > > > This will cause the same replay problems as above. > > > > Yep. Again, seems a master cleanup on startup is needed. > > The cleanup is not the problem, only a nuisance. Creating the files > during replay is the problem. I must recreate CREATE DATABASE from the > log exactly as it was done originally. I think just logging the > parameters to the command function should be sufficient, but I need to > think more about it. OK, makes sense. Nice when you can bundle a complex action into the logging of one command and its parameters. > > > > > b) If TRUNCATE TABLE fails, the system must PANIC. Otherwise, the table > > > may be used in a future command, and a replay-recovered database may > > > end-up with different data than the original. > > > > We number based on oids. 
You mean oid wraparound could cause the file > > to be used again? > > That's not what I meant. Let's say I issue 'TRUNCATE TABLE foo'. Then, > right before smgrtruncate is called, I do an XLogInsert saying "Redo a > TRUNCATE TABLE on foo to nblocks if we crash". Then smgrtruncate fails > and we do an elog(ERROR) > > Now the user decides that since TRUNCATE TABLE didn't work, he might as > well use the table, so he inserts some records into it, generating log > entries. > > When I replay this log sequence later, what happens if the TRUNCATE > succeeds instead of failing? You mean the user is now accessing a partially truncated table? That's just too weird. I don't see how the WAL would know how far truncation had gone. I see why you would need the panic and it seems acceptable. > I admit that there are other ways of handling it than to PANIC if the > truncate fails. All the ones I can come up with seem to amount to some > kind of ad-hoc UNDO log. Yea, truncate failure seems so rare/impossible to happen, we can do a panic and see if it ever happens to anyone. I bet it will not. Those are usually cases of an OS crash, so it is the same as a panic. > > > WAL must be flushed before truncate as well. > > > > > > WAL does not need to be flushed before create, if we don't mind > > > leaking files sometimes. > > > > Cleanup? > > Yes, we could garbage-collect leaked files. XLogFlush is not that > expensive though, so I don't have an opinion on this yet. Right now, if we do CREATE TABLE, and the backend crashes, I think it leaves a nonreferenced file around. Not something you need to worry about for replication, I guess. We can address it later. > > > c) Redo code should treat writes to non-existent files as an error. > > > Changes affect heap & nbtree AM's. [Check others] > > > > Yep, once you log create/drop, if something doesn't match, it is an > > error, while before, we could ignore it. > > > > > d) rtree [and GiST? WTF is GiST? ] is not logged. A replay recovery of > > > a database should mark all the rtree indices as corrupt. > > > [ actually we should do that now, are we? ] > > > > Known problem. Not sure what is being done. TODO has: > > > > * Add WAL index reliability improvement to non-btree indexes > > > > so it is a known problem, and we aren't doing anything about it. What > > more can I say? ;-) > > Once the other stuff works reliably, I will turn to rtree logging, which > I have looked at somewhat, although I could really use a copy of the > paper it is supposed to be based on. I have not figured out GiST enough > to work on it yet. There has been talk of retiring rtree and using the GIST version of rtree. I thought it had some advantages/disadvantages. I don't remember for sure. > > > Database will do infinite shutdown consistent system recovery from the > > > online logs, if you manually munge the control file to set state == > > > DB_IN_PRODUCTION instead of DB_SHUTDOWNED. > > > > Wow, how did you get so far? > > Because it was almost there to start with :-) > > Besides, it sounded better before I realized there was still a remaining > problem with btree logging to fix. Our major problem is that we have a very few people who like to work at this level in the code. Glad you are around. > > > In the case of LSNLast, we check to see if pd_lsn == the lsn in the > > > last 64 bits of the page. If not, we assume the page is corrupt from > > > a partial write (although it could be something else). > > > > LSN? > > Log Sequence Number (XLogRecPtr) Yep, I remembered later. 
> > > IMPORTANT ASSUMPTION: > > > The OS/disk device will never write both the first part and > > > last part of a block without writing the middle as well. > > > This might be wrong in some cases, but at least it's fast. > > > > > > ?2.2.4 - GUC Variables > > > > > > The user should be able to configure what method is used: > > > > > > block_checking_write_method = [ checksum | torn_page_flag | none ] > > > > > > Which method should be used for blocks we write? > > > > Do we want torn page flag? Seems like a pain to get that on every 512 > > byte section of the 8k page. > > Ok, this section (2.2) was badly written and hard to understand. What I > am proposing is that we put a copy of the log sequence number, which is > at the head of the page, into the 8 byte field that we are creating at > the end of the page, in place of the CRC. The log sequence number > increases every time the page is written (it already does this). I have > called this method 'LSNLast' internally, and the user would call it the > 'torn_page_flag' method. > > So when we read the page, we compare the Log Sequence Number at the > beginning and end of the page, and if they are different we assume a > torn page. Excellent. > This version is weaker than the MS one we were talking about, because it > is not on every 512 byte section of the page, only the beginning and the > end. I'm simply looking for a fast alternative to CRC64, that doesn't > require massive reorganization of the page layout code. Great idea, and cheap. > > > The system needs to be able to prompt the system administrator to feed > > > it more log files. > > > > > > TODO: More here > > > > Yes, we can have someone working on the GUI once the command-line > > interface is defined. > > Yes, and the system must not allow any concurrent activity during > recovery either. So it looks like a standalone backend operation. OK, we can address this capability. > > That is tricky. We have discussed it and your backup has to deal with > > some pretty strange things that can happen while 'tar' is traversing the > > directory. OK, first, are you thinking of having the nightly backup operate at the file system level, or accessing the pages through the PostgreSQL shared buffers? > Even if we shutdown before we copy the file, we don't want a file that Oh, so you are thinking of some kind of tar while the db is shutdown, and using that as the backup? > hasn't been written to in 5 weeks before it was backed up to require > five weeks of old log files to recover. So we need to track that > information somehow, because right now if we scanned the blocks in the > file looking for at the page LSN's, we greatest LSN we would see might > be much older than where it would be safe to recover from. That is the > biggest problem, I think. You are saying, "How do we know what WAL records go with that backup snapshot of the file?" OK, lets assume we are shutdown. You can grab the WAL log info from pg_control using contrib/pg_controldata and that tells you what WAL logs to roll forward when you need to PIT recover that backup later. If you store that info in the first file you backup, you can have that WAL pointer available for later recovery in case you are restoring from that backup. Is that the issue? What seems more complicated is doing the backup while the database is active, and this may be a requirement for a final PITR solution. Some think we can grab the WAL pointer at 'tar' start and replay that on the backup even if the file changes during backup. 
-- 
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 853-3000
  +  If your life is a hard drive,     |  830 Blythe Avenue
  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026
Bruce Momjian wrote:
> You are saying, "How do we know what WAL records go with that backup
> snapshot of the file?"  OK, lets assume we are shutdown.  You can grab
> the WAL log info from pg_control using contrib/pg_controldata and that
> tells you what WAL logs to roll forward when you need to PIT recover
> that backup later.  If you store that info in the first file you backup,
> you can have that WAL pointer available for later recovery in case you
> are restoring from that backup.  Is that the issue?
>
> What seems more complicated is doing the backup while the database is
> active, and this may be a requirement for a final PITR solution.  Some
> think we can grab the WAL pointer at 'tar' start and replay that on the
> backup even if the file changes during backup.

OK, I think I understand live backups now using tar and PITR.  Someone
explained this to me months ago but now I understand it.

First, a key issue is that PostgreSQL doesn't fiddle with individual
items on disk.  It reads an 8k block, modifies it (writes it to WAL if
it hasn't been written to that WAL segment before), and writes it to
disk.  That is key.  (Are there cases where we don't do this, like
pg_controldata?)

OK, so you do a tar backup of a file.  While you are doing the tar,
certain 8k blocks are being modified in the file.  There is no way to
know what blocks are modified as you are doing the tar, and in fact you
could read partial page writes during the tar.  One solution would be
to read the file using the PostgreSQL page buffer, but even then,
getting a stable snapshot of the file would be difficult.

Now, we could lock the table and prevent writes while it is being
backed up, but there is a better way.  We already have pre-change page
images in WAL.  When we do the backup, any page that was modified while
we were backing up is in the WAL.  On restore, we can recover whatever
tar saw of the file, knowing that the WAL page images will recover any
page changes made during the tar.

Now, you mentioned we may not want pre-change page images in WAL
because, with PITR, we can more easily recover from the WAL rather than
having this performance hit for many page writes.  What I suggest is a
way for the backup tar to turn on pre-change page images while the tar
is happening, and turn it off after the tar is done.

We already have this TODO item:

    * Turn off after-change writes if fsync is disabled (?)

No sense in doing after-change WAL writes without fsync.  We could
extend this so those after-change writes could be turned on and off,
allowing full tar backups and PITR recovery.

In fact, for people with reliable hardware, we should already be giving
them the option of turning off pre-change writes.  We don't have a way
of detecting partial page writes, but then again, we can't detect
failures with fsync off anyway so it seems to be the same
vulnerability.  I guess that's why we were going to wrap the effect
into the same variable, but for PITR, I can see wanting fsync always on
and the ability to turn pre-change writes on and off.

-- 
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 853-3000
  +  If your life is a hard drive,     |  830 Blythe Avenue
  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026
J.R.,

Nice first draft and a good read. Was going to comment in-line but thought this method would be easier to follow.

The comments/suggestions below assume that PIT recovery is being performed at the cluster level with a data backup image created by a tar-like utility.

As noted, one of the main problems is knowing where to begin in the log. This can be handled by having backup processing update the control file with the first lsn and log file required. At the time of the backup, this information is or can be made available. The control file can be the last file added to the tar and can contain information spanning the entire backup process.

For data consistency, since the backup is being performed on an active cluster, we have to make sure to mark the end of the backup. On restore, to make the cluster consistent, you have to force the user to perform forward recovery past the point of the backup completion marker in the (archived) log. This can be handled using a backup end log record. The backup end log record would have to contain an identifier unique to this backup. If a user requests to stop PIT recovery before this log record is encountered, consistency is not guaranteed. PIT should either disallow the action or warn of possible / impending doom.

The necessary logging for rtree (and others) insertions/deletions can be added to the base code. Not much of a worry, but I would expect to encounter other missing log items during testing.

The idea of using the last lsn on the page to detect a partial write is used by other dbms systems. You already have that information available so there is no overhead in computing it. Nothing wrong with CRC though.

As for the DB_SHUTDOWNED state, this could be handled by having the backup processing update the control file field to DB_PIT_REQUIRED (or some such identifier). After a restore, users would be blocked from connecting to the cluster's databases until a forward recovery past the backup end log record has completed successfully.

At the end of normal crash recovery, the user has to go digging to identify in-flight transactions still in the system and abort them manually. It would be nice if PIT recovery automatically aborted all in-flight transactions at the end.

As PostgreSQL heads towards forward recovery functionality, it may be wise to add headers to the log files. As the logs from any cluster are identically named, the header would allow unique identification of the file and contents (cluster name, unique log id, id of the prior log file for chaining purposes, lsn ranges, etc). Most helpful.

Just a few notes from the administrative side. PIT recovery should probably offer the user the following actions:

. forward recover to end of logs [and stop]
  Process log files located in the current directory until you read through the last one. Allow the user the option to stop or not, just in case the logs are archived. Send back the timestamp of the last encountered commit log record and the series of log files scanned.

. forward recover to PIT [and stop]
  Similar to that described above but use the commit timestamps to gauge PIT progress.

. forward recover query
  Send back the log series covered and the last commit timestamp encountered.

. forward recover stop
  Stop the current forward recovery session. Undo all in-flight transactions and bring the databases down in a consistent state.

No other external user actions should be required.

Looking forward to reading draft 2.

Cheers,
Patrick

--
Patrick Macdonald
Red Hat Canada

"J. R.
Nield" wrote: > > Hello: > > I've got the logging system to the point where I can take a shutdown > consistent copy of a system, and play forward through multiple > checkpoints. It seems to handle CREATE TABLE/DROP TABLE/TRUNCATE > properly, and things are moving forward well. Recovery to an arbitrary > point-in-time should be just as easy, but will need some administrative > interface for it. > > At this point, some input would be useful on how I should handle things. > > The most important questions that need answering are in sections 2 & 5, > since they impact the most other parts of the system. They will also > require good documentation for sysadmins. > > Issues Outstanding for Point In Time Recovery (PITR) > > $Date: 2002/07/04 14:23:37 $ > > $Revision: 1.4 $ > > J.R. Nield > > (Enc: ISO 8859-15 Latin-9) > > §0 - Introduction > > This file is where I'm keeping track of all the issues I run into while > trying to get PITR to work properly. Hopefully it will evolve into a > description of how PITR actually works once it is implemented. > > I will also try to add feedback as it comes in. > > The big items so-far are: > §1 - Logging Relation file creation, truncation, and removal > This is mostly done. Can do infinte play-forward from > online logs. > §2 - Partial-Write and Bad Block detection > Need input before starting. Migration issues. > §3 - Detecting Shutdown Consistent System Recovery > Mostly done. > §4 - Interactive Play-Forward Recovery for an Entire System > Need input before starting. > §5 - Individual file consistent recovery > Need input. Semi-Major changes required. > > §1 - Logging Relation file creation, truncation, and removal > > §1.1 - Problem: > > Without file creation in the log, we can't replay committed > transactions that create relations. > > The current code assumes that any transaction reaching commit has already > ensured it's files exist, and that those files will never be removed. This > is true now, but not for log-replay from an old backup database system. > > The current XLOG code silently ignores block-write requests for > non-existent files, and assumes that the transaction generating those > requests must have aborted. > > Right now a crash during TRUNCATE TABLE will leave the table in an > inconsistent state (partially truncated). This would not work when doing > replay from before the last checkpoint. > > §1.1.1 - CREATE DATABASE is also unlogged > > This will cause the same replay problems as above. > > §1.2 - Proposal: > > a) Augment the SMGR code to log relation file operations, and to handle > redo requests properly. This is simple in the case of create. Drop must be > logged only IN the commit record. For truncate see (b). > > The 'struct f_smgr' needs new operations 'smgr_recreate', 'smgr_reunlink', > and 'smgr_retruncate'. smgr_recreate should accept a RelFileNode instead > of a Relation. > > Transactions that abort through system failure (ie. unlogged aborts) > will simply continue to leak files. > > b) If TRUNCATE TABLE fails, the system must PANIC. Otherwise, the table > may be used in a future command, and a replay-recovered database may > end-up with different data than the original. > > WAL must be flushed before truncate as well. > > WAL does not need to be flushed before create, if we don't mind > leaking files sometimes. > > c) Redo code should treat writes to non-existent files as an error. > Changes affect heap & nbtree AM's. [Check others] > > d) rtree [and GiST? WTF is GiST? ] is not logged. 
A replay recovery of > a database should mark all the rtree indices as corrupt. > [ actually we should do that now, are we? ] > > e) CREATE DATABASE must be logged properly, not use system(cp...) > > §1.3 - Status: > > All logged SMGR operations are now in a START_CRIT_SECTION()/ > END_CRIT_SECTION() pair enclosing the XLogInsert() and the underlying fs > operations. > > Code has been added to smgr and xact modules to log: > create (no XLogFlush) > truncate (XLogFlush) > pending deletes on commit record > files to delete on abort record > > Code added to md.c to support redo ops > > Code added to smgr for RMGR redo/desc callbacks > > Code added to xact RMGR callbacks for redo/desc > > Database will do infinite shutdown consistent system recovery from the > online logs, if you manually munge the control file to set state == > DB_IN_PRODUCTION instead of DB_SHUTDOWNED. > > Still need to do: > Item (c), recovery cleanup in all AM's > Item (d), logging in other index AM's > Item (e), CREATE DATABASE stuff > > §2 - Partial-Write and Bad Block detection > > §2.1 - Problem: > > In order to protect against partial writes without logging pages > twice, we need to detect partial pages in system files and report them > to the system administrator. We also might want to be able to detect > damaged pages from other causes, like memory corruption, OS errors, > etc. or in the case where the disk doesn't report bad blocks, but > returns bad data. > > We should also decide what should happen when a file is marked as > containing corrupt pages, and requires log-archive recovery from a > backup. > > §2.2 - Proposal: > > Add a 1 byte 'pd_flags' field to PageHeaderData, with the following > flag definitions: > > PD_BLOCK_CHECKING (1) > PD_BC_METHOD_BIT (1<<1) > > PageHasBlockChecking(page) ((page)->pd_flags & PD_BLOCK_CHECKING) > PageBCMethodIsCRC64(page) ((page)->pd_flags & PD_BC_METHOD_BIT) > PageBCMethodIsLSNLast(page) (!PageBCMethodIsCRC64(page)) > > The last 64 bits of a page are reserved for use by the block checking > code. > > [ Is it worth the trouble to allow the last 8 bytes of a > page to contain data when block checking is turned off for a Page? > > This proposal does not allow that. ] > > If the block checking method is CRC64, then that field will contain > the CRC64 of the block computed at write time. > > If the block checking method is LSNLast, then the field contains a > duplicate of the pd_lsn field. > > §2.2.1 - Changes to Page handling routines > > All the page handling routines need to understand that > pd_special == (pd_special - (specialSize + 8)) > > Change header comment in bufpage.h to reflect this. > > §2.2.2 - When Reading a Page > > Block corruption is detected on read in the obvious way with CRC64. > > In the case of LSNLast, we check to see if pd_lsn == the lsn in the > last 64 bits of the page. If not, we assume the page is corrupt from > a partial write (although it could be something else). > > IMPORTANT ASSUMPTION: > The OS/disk device will never write both the first part and > last part of a block without writing the middle as well. > This might be wrong in some cases, but at least it's fast. > > §2.2.4 - GUC Variables > > The user should be able to configure what method is used: > > block_checking_write_method = [ checksum | torn_page_flag | none ] > > Which method should be used for blocks we write? > > check_blocks_on_read = [ true | false ] > > When true, verify that the blocks we read are not corrupt, using > whatever method is in the block header. 
> > When false, ignore the block checking information. > > §2.3 - Status: > > Waiting for input from pgsql-hackers. > > Questions: > > Should we allow the user to have more detailed control over > which parts of a database use block checking? > > For example: use 'checksum' on all system catalogs in all databases, > 'torn_page_flag' on the non-catalog parts of the production database, > and 'none' on everything else? > > §3 - Detecting Shutdown Consistent System Recovery > > §3.1 - Problem: > > How to notice that we need to do log-replay for a system backup, when the > restored control file points to a shutdown checkpoint record that is > before the most recent checkpoint record in the log, and may point into > an archived file. > > §3.2 - Proposal: > > At startup, after reading the ControlFile, scan the log directory to > get the list of active log files, and find the lowest logId and > logSeg of the files. Ensure that the files cover a contiguous range > of LSN's. > > There are three cases: > > 1) ControlFile points to the last valid checkpoint (either > checkPoint or prevCheckPoint, but one of them is the greatest > valid checkpoint record in the log stream). > > 2) ControlFile points to a valid checkpoint record in an active > log file, but there are more valid checkpoint records beyond > it. > > 3) ControlFile points to a checkpoint record that should be in the > archive logs, and is presumably valid. > > Case 1 is what we handle now. > > Cases 2 and 3 would result from restoring an entire system from > backup in preparation to do a play-forward recovery. > > We need to: > > Detect cases 2 and 3. > > Alert the administrator and abort startup. > [Question: Is this always the desired behavior? We can > handle case 2 without intervention. ] > > Let the administrator start a standalone backend, and > perform a play-forward recovery for the system. > > §3.3 - Status: > > In progress. > > §4 - Interactive Play-Forward Recovery for an Entire System > > Play-Forward File Recovery from a backup file must be interactive, > because not all log files that we need are necessarily in the > archive directory. It may be possible that not all the archive files > we need can even fit on disk at one time. > > The system needs to be able to prompt the system administrator to feed > it more log files. > > TODO: More here > > §5 - Individual file consistent recovery > > §5.1 - Problem: > > If a file detects corruption, and we restore it from backup, how do > we know what archived files we need for recovery? > > Should file corruption (partial write, bad disk block, etc.) outside > the system catalog cause us to abort the system, or should we just > take the relation or database off-line? > > Given a backup file, how do we determine the point in the log > where we should start recovery for the file? What is the highest LSN > we can use that will fully recover the file? > > §5.2 - Proposal: > > Put a file header on each file, and update that header to the last > checkpoint LSN at least once every 'file_lsn_time_slack' minutes, or > at least once every dbsize/'file_lsn_log_slack' megabytes of log > written, where dbsize is the estimated size of the database. Have > these values be settable from the config file. These updates would be > distributed throughout the hour, or interspersed between regular > amounts of log generation. > > If we have a database backup program or command, it can update the > header on the file before backup to the greatest value it can assure > to be safe. 
> > §5.3 - Status: > > Waiting for input from pgsql-hackers. > > Questions: > > There are alternate methods than using a file header to get a > known-good LSN lower bound for the starting point to recover a backup > file. Is this the best way? > > A) The Definitions > > This stuff is obtuse, but I need it here to keep track of what I'm > saying. Someday I should use it consistently in the rest of this > document. > > "system" or "database system": > > A collection of postgres "databases" in one $PGDATA directory, > managed by one postmaster instance at a time (and having one WAL > log, etc.) > > All the files composing such a system, as a group. > > "up to date" or "now" or "current" or "current LSN": > > The most recent durable LSN for the system. > > "block consistent copy": > > When referring to a file: > > A copy of a file, which may be written to during the process of > copying, but where each BLCKSZ size block is copied atomically. > > When referring to multiple files (in the same system): > > A copy of all the files, such that each is independently a "block > consistent copy" > > "file consistent copy": > > When referring to a file: > > A copy of a file that is not written to between the start and end > of the copy operation. > > When referring to multiple files (in the same system): > > A copy of all the files, such that each is independently a "file > consistent copy" > > "system consistent copy": > > When referring to a file: > > A copy of a file, where the entire system of which it is a member > is not written to during the copy. > > When referring to multiple files (in the same system): > > A copy of all the files, where the entire system of which they are > members was not written to between the start and end of the > copying of all the files, as a group. > > "shutdown consistent copy": > > When referring to a file: > > A copy of a file, where the entire system of which it is a member > had been cleanly shutdown before the start of and for the duration > of the copy. > > When referring to multiple files (in the same system): > > A copy of all the files, where the entire system of which they are > members had been cleanly shutdown before the start of and for the > duration of the copying of all the files, as a group. > > "consistent copy": > > A block, file, system, or shutdown consistent copy. > > "known-good LSN lower bound" > or "LSN lower bound" > or "LSN-LB": > > When referring to a group of blocks, a file, or a group of files: > > An LSN known to be old enough that no log entries before it are needed > to bring the blocks or files up-to-date. > > "known-good LSN greatest lower bound" > or "LSN greatest lower bound" > or "LSN-GLB": > > When referring to a group of blocks, a file, or a group of files: > > The greatest possible LSN that is a known-good LSN lower bound for > the group. > > "backup file": > > A consistent copy of a data file used by the system, for which > we have a known-good LSN lower bound. > > "optimal backup file": > > A backup file, for which we have the known-good LSN greatest lower > bound. > > "backup system": > > > "Play-Forward File Recovery" or "PFFR": > > The process of bringing an individual backup file up to date. > > > -- > J. R. Nield > jrnield@usol.com > > ---------------------------(end of broadcast)--------------------------- > TIP 3: if posting/reading through Usenet, please send an appropriate > subscribe-nomail command to majordomo@postgresql.org so that your > message can get through to the mailing list cleanly
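A sketch of Patrick's suggestion earlier in this message: have backup processing capture, at backup start, the first LSN and log file that recovery will require, and add that snapshot as the last file written into the tar. The struct layout, field names, and helpers below are invented for illustration; the real control file carries many more fields.

    #include <stdint.h>
    #include <stdio.h>

    /* Fields a backup would need to anchor a later roll-forward; names are loose. */
    typedef struct
    {
        uint32_t redo_logid;    /* WAL file holding the checkpoint redo record */
        uint32_t redo_recoff;   /* offset of that record within the file       */
        uint64_t backup_id;     /* ties this snapshot to a backup end record   */
    } BackupSnapshot;

    /* Captured when the backup starts (ideally just after a forced checkpoint)... */
    static BackupSnapshot
    snapshot_at_backup_start(uint32_t logid, uint32_t recoff, uint64_t id)
    {
        BackupSnapshot s = { logid, recoff, id };
        return s;
    }

    /* ...and written out as the very last member added to the tar archive. */
    static int
    write_snapshot(const BackupSnapshot *s, const char *path)
    {
        FILE *f = fopen(path, "wb");

        if (f == NULL)
            return -1;
        fwrite(s, sizeof(*s), 1, f);
        return fclose(f);
    }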
Patrick Macdonald wrote:
> The idea of using the last lsn on the page to detect a partial write is used by other dbms systems. You already have that information available so there is no overhead in computing it. Nothing wrong with CRC though.

Agreed. Just thought I would point out that it is not guaranteed. Suppose the 8k block is spread over 16 512-byte sectors in two cylinders. The OS or SCSI tagged queuing could write the second part of the page (sectors 9-16) before the first group (1-8). If it writes 9-16, then writes 1-8 but fails in the middle of 1-8, the LSN will match at the front and back of the page, but the page will be partially written.

-- Bruce Momjian | http://candle.pha.pa.us pgman@candle.pha.pa.us | (610) 853-3000+ If your life is a hard drive, | 830 Blythe Avenue + Christ can be your backup. | Drexel Hill, Pennsylvania19026
> OK, so you do a tar backup of a file. While you are doing the tar, > certain 8k blocks are being modified in the file. There is no way to > know what blocks are modified as you are doing the tar, and in fact you > could read partial page writes during the tar. No, I think all OS's (Unix and NT at least) guard against this, as long as the whole 8k block is written in one call. It is only the physical layer (disk) that is prone to partial writes. > any page that was modified while we were backing up is in the WAL. On > restore, we can recover whatever tar saw of the file, knowing that the > WAL page images will recover any page changes made during the tar. Assuming above, I do not think this is necessary. > What I suggest is a way for the backup tar to turn on pre-change page > images while the tar is happening, and turn it off after the tar is > done. Again, I do not think this is necessary. Andreas
> As noted, one of the main problems is knowing where to begin in the log. This can be handled by having backup processing update the control file with the first lsn and log file required. At the time of the backup, this information is or can be made available. The control file can be the last file added to the tar and can contain information spanning the entire backup process.

The lsn and logfile number (of the latest checkpoints) are already in the control file, thus you need the control file from the start of the backup. (To reduce the number of logs needed to restore an online backup, you could force a checkpoint before starting the file backup.)

You will also need the lsn and logfile number from after the file backup, to know how much log at least needs to be replayed to regain a consistent state.

Andreas
Zeugswetter Andreas SB SD wrote:
> > As noted, one of the main problems is knowing where to begin in the log. This can be handled by having backup processing update the control file with the first lsn and log file required. At the time of the backup, this information is or can be made available. The control file can be the last file added to the tar and can contain information spanning the entire backup process.
>
> The lsn and logfile number (of the latest checkpoints) are already in the control file, thus you need the control file from the start of the backup. (To reduce the number of logs needed to restore an online backup, you could force a checkpoint before starting the file backup.)

Maybe I should have been clearer. The control file snapshot must be taken at backup start (as you mention) but can be stored in cache. The fields can then be modified as we see fit. At the end of backup, we can write this to a temp file and add it to the tar. Therefore, as mentioned, the snapshot spans the entire backup process.

> You will also need the lsn and logfile number from after the file backup, to know how much log at least needs to be replayed to regain a consistent state.

This is a nicety but not a necessity. If you have a backup end log record, you just have to enforce that the PIT recovery encounters that particular log record on forward recovery. Once encountered, you know that you are past the point of backup end.

Cheers,
Patrick
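A sketch of the backup end log record being discussed: a marker carrying an identifier unique to one backup run, which forward recovery must replay past before the restored cluster can be considered consistent. The struct layout and function names are invented for illustration, not an agreed WAL record format.

    #include <stdbool.h>
    #include <stdint.h>

    /* Invented layout for a "backup end" WAL record. */
    typedef struct
    {
        uint64_t backup_id;    /* identifier unique to one backup run       */
        uint32_t end_time;     /* wall-clock time the file copying finished */
    } xl_backup_end;

    /* Recovery may only honor the user's stop request once the matching
     * backup end record has been replayed; stopping earlier would leave the
     * restored cluster inconsistent. */
    static bool
    pitr_stop_allowed(bool backup_end_seen, bool reached_user_target)
    {
        return backup_end_seen && reached_user_target;
    }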
Zeugswetter Andreas SB SD wrote:
> > OK, so you do a tar backup of a file. While you are doing the tar, certain 8k blocks are being modified in the file. There is no way to know what blocks are modified as you are doing the tar, and in fact you could read partial page writes during the tar.
>
> No, I think all OS's (Unix and NT at least) guard against this, as long as the whole 8k block is written in one call. It is only the physical layer (disk) that is prone to partial writes.

Yes, good point. The kernel will present a unified view of the 8k block. Of course, there are still cases where 8k blocks are being changed ahead of or behind tar's current position in the file. Will WAL allow us to re-synchronize that file even if part of it has pages from an earlier point in time than other pages? Uh, I think so. So maybe we don't need the pre-write images in WAL after all.

Can we replay the WAL when some pages in the restored file _have_ the WAL changes and some don't? Maybe the LSN on the pages helps with this.

-- Bruce Momjian | http://candle.pha.pa.us pgman@candle.pha.pa.us | (610) 853-3000+ If your life is a hard drive, | 830 Blythe Avenue + Christ can be your backup. | Drexel Hill, Pennsylvania19026
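A sketch of how the page LSN answers the question above: redo applies a record to a page only if the page has not already seen it, so pages that tar copied late (already containing a change) are skipped and pages copied early are brought forward. The types below are simplified stand-ins for the backend's, used only to show the comparison.

    #include <stdbool.h>
    #include <stdint.h>

    typedef uint64_t Lsn;          /* simplified stand-in for XLogRecPtr */

    typedef struct
    {
        Lsn pd_lsn;                /* LSN of the last record applied to this page */
        /* ... rest of the page header and contents ... */
    } PageSketch;

    /* Replay is idempotent: apply a record only to pages that predate it. */
    static bool
    redo_should_apply(const PageSketch *page, Lsn record_lsn)
    {
        return page->pd_lsn < record_lsn;
    }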
I know that in Oracle there are 'alter database begin backup' and 'alter database end backup' commands that allow you to script your hot backups through a cron job by calling the begin backup command first, then using disk backup method of choice and then finally call the end backup command.

--Barry
On Mon, 2002-07-08 at 21:53, Barry Lind wrote:
> I know that in Oracle there are 'alter database begin backup' and 'alter database end backup' commands that allow you to script your hot backups through a cron job by calling the begin backup command first, then using disk backup method of choice and then finally call the end backup command.

This gave me an idea of a not-too-difficult-to-implement way of doing consistent online backups (thanks to MVCC it is probably much easier than Oracle's):

Backup:

1) record the lowest uncommitted transaction number (LUTN); this may have problems with wraparound, but I guess they are solvable.
2) disallow VACUUM and do a CHECKPOINT ('alter database begin backup')
3) make a file-level (.tar) backup of the data directory.
4) allow VACUUM ('alter database end backup')

Restore:

1) restore the data directory from the file-level backup
2) mark all transactions committed after LUTN as aborted, effectively deleting all tuples inserted and resurrecting those deleted/updated after the start of the backup.
3) make sure that the new transaction number is large enough.

PS. It would be nice if our OID-based filenames had some type indicator in their names - it is usually a waste of time and space to back up indexes and temp tables. The names could be of the form pg_class.relkind:pg_class.relfilenode instead of just the pg_class.relfilenode they are now.

-------------------
Hannu
Hannu Krosing <hannu@tm.ee> writes: > 1) record the lowest uncommitted transaction number (LUTN) , this may > have problems with wraparound, but I guess they are solvable. Disllow > VACUUM. Do a CHECKPOINT ('alter database begin backup') > 3) make a file-level (.tar) backup of data directory. > 4) Allow VACUUM. ('alter database end backup') Transactions don't necessarily commit in sequence number order, so the concept of LUTN seems meaningless. Why is it necessary (or even good) to disallow VACUUM? I really dislike a design that allows the DBA to cripple the database by forgetting the last step in a (long) process. regards, tom lane
On Tue, 2002-07-09 at 17:26, Tom Lane wrote:
> Hannu Krosing <hannu@tm.ee> writes:
> > 1) record the lowest uncommitted transaction number (LUTN); this may have problems with wraparound, but I guess they are solvable.
> > 2) disallow VACUUM and do a CHECKPOINT ('alter database begin backup')
> > 3) make a file-level (.tar) backup of the data directory.
> > 4) allow VACUUM ('alter database end backup')
>
> Transactions don't necessarily commit in sequence number order, so the concept of LUTN seems meaningless.

Not quite. It is the simplest way to be sure that if we invalidate all transactions >= it, we get back to a fairly recent Point-In-Time. The real solution would of course be to remember all committed transactions at this PIT, which can probably be done by remembering LUTN and all individual committed transactions > LUTN.

> Why is it necessary (or even good) to disallow VACUUM?

So that it would be possible to resurrect those tuples that have been deleted/updated during the disk-level backup. I would prefer the ability to tell VACUUM not to touch tuples whose deleting transaction number is >= LUTN. IIRC the original postgres was able to do that.

> I really dislike a design that allows the DBA to cripple the database by forgetting the last step in a (long) process.

There are several ways around it.

1. do it in a script that will not forget.
2. closing the session that did 'alter database begin backup' could do it automatically, but this would make the backup script trickier.
3. VACUUM should not block but report a warning about being restricted from running.
4. the database can be instructed to send a message to the DBA's pager if it has been in 'begin backup' state too long ;)

----------------
Hannu
J.R., just checking to see how PITR recovery is going. Do you need any assistance or have any questions for us? Also, do you have any idea how close you are to having something completed? Are you aware we are closing development of 7.3 at the end of August and start beta September 1? Is there any way we can help you? --------------------------------------------------------------------------- J. R. Nield wrote: > On Fri, 2002-07-05 at 01:42, Bruce Momjian wrote: > > > > We have needed > > point-in-time recovery for a long time, > > Most thanks should go to vadim (and whoever else worked on this), since > his WAL code already does most of the work. The key thing is auditing > the backend to look for every case where we assume some action is not > visible until after commit, and therefore don't log its effects. Those > are the main cases that must be changed. -- Bruce Momjian | http://candle.pha.pa.us pgman@candle.pha.pa.us | (610) 853-3000+ If your life is a hard drive, | 830 Blythe Avenue + Christ can be your backup. | Drexel Hill, Pennsylvania19026
On Tue, 2002-07-16 at 15:36, Bruce Momjian wrote: > > J.R., just checking to see how PITR recovery is going. Do you need any > assistance or have any questions for us? > > Also, do you have any idea how close you are to having something > completed? Are you aware we are closing development of 7.3 at the end > of August and start beta September 1? Is there any way we can help you? > It should be ready to go into CVS by the end of the month. That will include: logging all operations except for rtree and GiST, archival of logfiles (with new postgresql.conf params), headers on the logfiles to verify the system that created them, standalone backend recovery to a point-in-time, and a rudimentary hot backup capability. I could use some advice on the proper way to add tests to configure.in, given that the autoconf output is in CVS. Would you ever want a patch to include the generated 'configure' file? Related to that, the other place I need advice is on adding Ted Tso's LGPL'd UUID library (stolen from e2fsprogs) to the source. Are we allowed to use this? There is a free OSF/DCE spec for UUID's, so I can re-implement the library if required. We also haven't discussed commands for backup/restore, but I will use what I think is appropriate and we can change the grammar if needed. The initial hot-backup capability will require the database to be in read-only mode and use tar for backup, and I will add the ability to allow writes later. Does this sound like a reasonable timeframe/feature-set to make the 7.3 release? -- J. R. Nield jrnield@usol.com
"J. R. Nield" <jrnield@usol.com> writes: > Related to that, the other place I need advice is on adding Ted Tso's > LGPL'd UUID library (stolen from e2fsprogs) to the source. Are we > allowed to use this? Uh, why exactly is UUID essential for this? (The correct answer is "it's not", IMHO.) > We also haven't discussed commands for backup/restore, but I will use > what I think is appropriate and we can change the grammar if needed. The > initial hot-backup capability will require the database to be in > read-only mode and use tar for backup, and I will add the ability to > allow writes later. There is no read-only mode, and I for one will resist adding one. regards, tom lane
J. R. Nield wrote:
> On Tue, 2002-07-16 at 15:36, Bruce Momjian wrote:
> >
> > J.R., just checking to see how PITR recovery is going. Do you need any assistance or have any questions for us?
> >
> > Also, do you have any idea how close you are to having something completed? Are you aware we are closing development of 7.3 at the end of August and start beta September 1? Is there any way we can help you?
>
> It should be ready to go into CVS by the end of the month.
>
> That will include: logging all operations except for rtree and GiST, archival of logfiles (with new postgresql.conf params), headers on the logfiles to verify the system that created them, standalone backend recovery to a point-in-time, and a rudimentary hot backup capability.

Sounds great. That gives us another month to iron out any remaining issues. This will be a great 7.3 feature!

> I could use some advice on the proper way to add tests to configure.in, given that the autoconf output is in CVS. Would you ever want a patch to include the generated 'configure' file?

We only patch configure.in. If you post to hackers, they can give you assistance and I will try to help however I can. I can do some configure.in stuff for you myself.

> Related to that, the other place I need advice is on adding Ted Tso's LGPL'd UUID library (stolen from e2fsprogs) to the source. Are we allowed to use this? There is a free OSF/DCE spec for UUID's, so I can re-implement the library if required.

We talked about this on the replication mailing list. We decided that hostname, properly hashed to an integer, was the proper way to get this value. Also, there should be a postgresql.conf variable so you can override the hostname-generated value if you wish. I think that is sufficient.

> We also haven't discussed commands for backup/restore, but I will use what I think is appropriate and we can change the grammar if needed. The initial hot-backup capability will require the database to be in read-only mode and use tar for backup, and I will add the ability to allow writes later.

Yea, I saw Tom balked at that. I think we have enough manpower and time that we can get hot backup in normal read/write mode working before 7.3 beta, so I would just code it assuming the system is live and we can deal with making it hot-capable once it is in CVS. It doesn't have to work 100% until beta time.

> Does this sound like a reasonable timeframe/feature-set to make the 7.3 release?

Sounds great. This is another killer 7.3 feature, and we really need this for greater enterprise acceptance of PostgreSQL.

-- Bruce Momjian | http://candle.pha.pa.us pgman@candle.pha.pa.us | (610) 853-3000+ If your life is a hard drive, | 830 Blythe Avenue + Christ can be your backup. | Drexel Hill, Pennsylvania19026
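A sketch of the hostname-derived system identifier mentioned above, with a configuration value taking precedence when set. The hash function, the zero-means-unset convention, and the parameter handling are illustrative assumptions, not an agreed design.

    #include <stdint.h>
    #include <unistd.h>

    /* FNV-1a style hash of the hostname; the constants are the usual FNV ones. */
    static uint32_t
    hash_hostname(void)
    {
        char     name[256] = "localhost";
        uint32_t h = 2166136261u;

        (void) gethostname(name, sizeof(name) - 1);
        for (const char *p = name; *p; p++)
            h = (h ^ (uint32_t) (unsigned char) *p) * 16777619u;
        return h;
    }

    /* A (hypothetical) postgresql.conf value overrides the derived hash. */
    static uint32_t
    system_identifier(uint32_t configured_id)
    {
        return configured_id != 0 ? configured_id : hash_hostname();
    }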
We also have implemented a roll forward recovery mechanism. We modified a 7.2.1 version of Postgres. The mechanism is designed to provide a means of recovering from the loss or corruption of media. It provides for duplicating wal_files so that if a wal_file is lost, roll forward recovery can recover the database using the duplicated wal_files. Hooks were also added so that the roll forward recovery mechanism can be used to implement a hot standby database. Along with the roll forward recovery mechanism we have also implemented an online database backup utility which is synchronized with the recovery log so that the backup can be a starting point for a roll forward recovery session.

Roll forward recovery is enabled for a database cluster by specifying one of two postmaster configuration parameters.

wal_file_reuse = false
  This parameter tells the wal system to not reuse wal files. This option is intended for sites wishing to implement a hot standby database, where wal files will be periodically copied to another machine and rolled forward into the standby database.

wal_file_duplicate = <directory_path>
  This parameter tells the system to mirror the files in the pg_xlog directory to the specified directory. This allows for the recovery of a database where a wal file has been damaged or lost. It also allows for a variant of a hot standby database where the duplicate directory is the pg_xlog directory of the standby database.

Since both of these options cause wal files to accumulate indefinitely, the DBA needs a means of purging wal files when they are no longer needed. So an SQL command, "ALTER SYSTEM PURGE WAL_FILES <wal_file_name>", has also been implemented. This command deletes all wal files up to and including the specified <wal_file_name>, as long as those wal files are not needed to recover the database in the event of a system crash.

To find out the status of the wal files, a function has been implemented to return a wal file name. The function is:

Wal_file(<request_type>)
Request_type := [ current | last | checkpoint | oldest ]

Wal_file('current') returns the name of the log file currently being written to.
Wal_file('last') returns the name of the last log file filled.
Wal_file('checkpoint') returns the name of the file containing the current redo position. The current redo position is the position in the recovery log where crash recovery would start if the system were to crash now. Logs prior to this one will not be needed to recover the database cluster and could be safely removed.
Wal_file('oldest') returns the oldest xlog file found in the pg_xlog directory.

To actually perform a roll forward you use the postmaster configuration parameter "roll_forward=yes". This parameter tells the startup process to perform crash recovery even though the state of the database as found in the pg_control file indicates a normal shutdown. This is necessary since the starting point of a roll forward session could be the restore of a database cluster that was shut down in order to back it up. Furthermore, this parameter tells the startup process not to write out a checkpoint record at the end of the roll forward session. This allows the database cluster to receive subsequent wal files and to have those rolled forward as well. When the postmaster is started with the roll_forward=yes option, it shuts down the database as soon as the startup process completes.
So the idea is to restore a backup, copy all of your saved/duplicated wal files into the pg_xlog directory of the restored database, and start the postmaster with the roll_forward option. For point-in-time recovery there is also a roll_forward_until = <time>, which rolls forward through the wal files until the first transaction commit record whose timestamp is greater than or equal to the specified time.

The pg_copy utility performs an online copy of a database cluster. Its syntax is:

pg_copy <backup_directory> [-h host] [-p port] ...

This makes a copy of the database, where backup_directory is what you would set PGDATA to in order to start a postmaster against the backup copy. The database can be being updated while the copy occurs. If you start a postmaster against this copy, it will appear to the startup process as a database that crashed at the instant the pg_copy operation completed. Furthermore, the pg_copy utility automatically removes any wal files not needed to recover the database from either the pg_xlog directory or the wal_file_duplicate directory. So, to protect the database from media loss, a DBA just needs to set the wal_file_duplicate parameter and periodically pg_copy the database.

The BIG THING we have not done is address the issue that add/drop tables and indexes do not propagate through the roll forward recovery mechanism properly.

-regards
Richard Tucker
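A sketch of the roll_forward_until rule described above: replay stops at the first transaction commit record whose timestamp is at or after the requested time. The record layout below is a simplified stand-in for whatever the commit record actually carries; only the comparison is the point.

    #include <stdbool.h>
    #include <time.h>

    /* Simplified view of a WAL record as seen by the roll-forward loop. */
    typedef struct
    {
        bool   is_commit;      /* true for a transaction commit record */
        time_t commit_time;    /* timestamp stored with the commit     */
    } ReplayRecordSketch;

    /* Stop before applying the first commit at or after the requested time. */
    static bool
    stop_before_applying(const ReplayRecordSketch *rec, time_t roll_forward_until)
    {
        return rec->is_commit && rec->commit_time >= roll_forward_until;
    }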
Richard Tucker wrote: > > We also have implemented a roll forward recovery mechanism. We modified a > 7.2.1 version of Postgres. > The mechanism is designed to provide a means of recoverying from the loss or > corruption of media. It provides for duplicating wal_files so that if a > wal_file is lost roll forward recovery can recover the database using the > duplicated wal_files. Hooks were also added so that the roll forward > recovery mechanism can be used to implement a hot standby database. > Along with the roll forward recovery mechanism we have also implemented an > online database backup utility which is synchronized with the recovery log > so that the backup can be a starting point for a roll forward recovery > session. In researching who has done this work, I found that http://www.multera.com/ is Progress Software. I assume this "Distributed but Connected" is partly based on PostgreSQL. -- Bruce Momjian | http://candle.pha.pa.us pgman@candle.pha.pa.us | (610) 853-3000+ If your life is a hard drive, | 830 Blythe Avenue + Christ can be your backup. | Drexel Hill, Pennsylvania19026
Bruce Momjian wrote: > Richard Tucker wrote: > > > > We also have implemented a roll forward recovery mechanism. We modified a > > 7.2.1 version of Postgres. > > The mechanism is designed to provide a means of recoverying from the loss or > > corruption of media. It provides for duplicating wal_files so that if a > > wal_file is lost roll forward recovery can recover the database using the > > duplicated wal_files. Hooks were also added so that the roll forward > > recovery mechanism can be used to implement a hot standby database. > > Along with the roll forward recovery mechanism we have also implemented an > > online database backup utility which is synchronized with the recovery log > > so that the backup can be a starting point for a roll forward recovery > > session. > > In researching who has done this work, I found that > http://www.multera.com/ is Progress Software. I assume this > "Distributed but Connected" is partly based on PostgreSQL. Oh, I see it now, PostgreSQL is right there: * OuterEdge A distributed site technology suite that allows remote users to maintain the integrity, not the congestion and price, of centralized data. The OuterEdge includes the affordable UltraSQL^(TM) database server powered by PostgreSQL, the third most popular database, and a ^^^^^^^^^^^^^^^^^^^^^ Secure Application/web server that supports the industry's most popular Internet languages. Together with the Replication Engine software, the OuterEdge gives remote users seamless access to important data without relying on bandwidth. Wonder why we are "the third most popular database". I think that's good? -- Bruce Momjian | http://candle.pha.pa.us pgman@candle.pha.pa.us | (610) 853-3000+ If your life is a hard drive, | 830 Blythe Avenue + Christ can be your backup. | Drexel Hill, Pennsylvania19026
> server powered by PostgreSQL, the third most popular database, and a > > ^^^^^^^^^^^^^^^^^^^^^ > Wonder why we are "the third most popular database". I think that's > good? You'll notice they didn't qualify where. On this list, it's probably #1. Within Progress software perhaps we're third most popular (whatever two are typically used in the InnerEdge are 1 and 2).
Rod Taylor wrote: > > server powered by PostgreSQL, the third most popular database, and a > > > > ^^^^^^^^^^^^^^^^^^^^^ > > > Wonder why we are "the third most popular database". I think that's > > good? > > You'll notice they didn't qualify where. On this list, it's probably > #1. Within Progress software perhaps we're third most popular (whatever > two are typically used in the InnerEdge are 1 and 2). Yea, using that logic, #1 would be the Progress internal db system, #2 would be MySQL (though that seems doubtful at this point with Nusphere), and PostgreSQL. Actually, PostgreSQL is #1 in my home. -- Bruce Momjian | http://candle.pha.pa.us pgman@candle.pha.pa.us | (610) 853-3000+ If your life is a hard drive, | 830 Blythe Avenue + Christ can be your backup. | Drexel Hill, Pennsylvania19026
On July 17, 2002 05:45 pm, Richard Tucker wrote: > We also have implemented a roll forward recovery mechanism. We modified a > 7.2.1 version of Postgres. > ... Excellent! I can't wait. When will it be in current? > The BIG THING we have not done is address the issue that add/drop tables > and indexes do not propagate through the roll forward recovery mechanism > properly. I can live with that. Schemas shouldn't change so often that you can't just dup any changes to the backup(s). -- D'Arcy J.M. Cain <darcy@{druid|vex}.net> | Democracy is three wolves http://www.druid.net/darcy/ | and a sheep voting on +1 416 425 1212 (DoD#0082) (eNTP) | what's for dinner.
I don't know how our marketing came up third most popular but I think the order is, Oracle, MySQL, and PostgreSQL or maybe Oracle, MSSQL and PostgreSQL. I'm sure there is some criterion by which PostgreSQL is tenth and by some other its number one. Of course, my posting was about Point In Time Recovery and not multera marketing spin.

-regards
richt
On Wed, 2002-07-17 at 01:25, Bruce Momjian wrote:
> We only patch configure.in. If you post to hackers, they can give you assistance and I will try to help however I can. I can do some configure.in stuff for you myself.

Thanks for the offer. The only thing I was changing it for was to test whether and how to get an ethernet MAC address using ioctl, so libuuid could use it if available. That is dropped now.

> > Related to that, the other place I need advice is on adding Ted Tso's LGPL'd UUID library (stolen from e2fsprogs) to the source. Are we allowed to use this? There is a free OSF/DCE spec for UUID's, so I can re-implement the library if required.
>
> We talked about this on the replication mailing list. We decided that hostname, properly hashed to an integer, was the proper way to get this value. Also, there should be a postgresql.conf variable so you can override the hostname-generated value if you wish. I think that is sufficient.

I will do something like this, but reserve 16 bytes for it just in case we change our minds. It needs to be different among systems on the same machine, so there needs to be a time value and a pseudo-random part as well. Also, 'hostname' will likely be the same on many machines (localhost.localdomain or similar).

The only reason I bothered with UUID's before is because they have a standard setup to make the possibility of collision extremely small, and I figured replication will end up using it someday.

> > We also haven't discussed commands for backup/restore, but I will use what I think is appropriate and we can change the grammar if needed. The initial hot-backup capability will require the database to be in read-only mode and use tar for backup, and I will add the ability to allow writes later.
>
> Yea, I saw Tom balked at that. I think we have enough manpower and time that we can get hot backup in normal read/write mode working before 7.3 beta, so I would just code it assuming the system is live and we can deal with making it hot-capable once it is in CVS. It doesn't have to work 100% until beta time.

Hot backup read/write requires that we force an advance in the logfile segment after the backup. We need to save all the logs between backup start and completion. Otherwise the files will be useless as a standalone system if the current logs somehow get destroyed (fire in the machine room, etc.).

The way I would do this is:

  create a checkpoint
  do the block-by-block walk of the files using the bufmgr
  create a second checkpoint
  force the log to advance past the end of the current segment
  save the log segments containing records between the first & second checkpoint with the backup

Then if you restore the backup, you can recover to the point of the second checkpoint, even if the logs since then are all gone.

Right now the log segment size is fixed, so this means that we'd waste 8MB of log space on average to do a backup. Also, the way XLOG reads records right now, we have to write placeholder records into the empty space, because that's how it finds the end of the log stream. So I need to change XLOG to handle "skip records", and then to truncate the file when it gets archived, so we don't have to save up to 16MB of zeros.

Also, if archiving is turned off, then we can't recycle or delete any logs for the duration of the backup, and we have to save them.

So I'll finish the XLOG support for this, and then think about the correct way to walk through all the files.

-- J. R. Nield jrnield@usol.com
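A compilable sketch of the backup sequence listed above. The stub bodies only stand in for work described in the mail (they are not backend routines); the point is the ordering of the steps and which WAL segments end up archived with the backup.

    #include <stdio.h>

    static unsigned log_segment = 3;      /* pretend this is the current WAL segment */

    static void create_checkpoint(void)        { puts("checkpoint"); }
    static unsigned current_log_segment(void)  { return log_segment; }
    static void copy_files_via_bufmgr(void)    { puts("block-by-block copy of relation files"); }
    static void advance_past_segment(void)     { log_segment++; puts("write skip record, start next segment"); }

    static void
    save_log_segments(unsigned first, unsigned last)
    {
        printf("archive WAL segments %u..%u with the backup\n", first, last);
    }

    static void
    hot_backup_sketch(void)
    {
        unsigned first, last;

        create_checkpoint();              /* redo for the backup starts no earlier than here   */
        first = current_log_segment();
        copy_files_via_bufmgr();          /* pages may keep changing underneath the copy       */
        create_checkpoint();              /* bounds the WAL needed to repair those pages       */
        last = current_log_segment();
        advance_past_segment();           /* so the segments saved below are complete          */
        save_log_segments(first, last);   /* backup is self-contained up to the 2nd checkpoint */
    }

    int
    main(void)
    {
        hot_backup_sketch();
        return 0;
    }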
J. R. Nield wrote: > I will do something like this, but reserve 16 bytes for it just in case > we change our minds. It needs to be different among systems on the same > machine, so there needs to be a time value and a pseudo-random part as > well. Also, 'hostname' will likely be the same on many machines > (localhost.localdomain or similar). > > The only reason I bothered with UUID's before is because they have a > standard setup to make the possibility of collision extremely small, and > I figured replication will end up using it someday. Sure. Problem is, we support so many platforms that any trickery is a problem. If they can change it in postgresql.conf, that should be sufficient. > Hot backup read/write requires that we force an advance in the logfile > segment after the backup. We need to save all the logs between backup > start and completion. Otherwise the files will be useless as a > standalone system if the current logs somehow get destroyed (fire in the > machine room, etc.). > > The way I would do this is: > > create a checkpoint > do the block-by-block walk of the files using the bufmgr > create a second checkpoint > force the log to advance past the end of the current segment > save the log segments containing records between the > first & second checkpont with the backup Sounds good. > Then if you restore the backup, you can recover to the point of the > second checkpoint, even if the logs since then are all gone. Good, you put the logs that happened during the backup inside the same backup, make it consistent. Makes sense. > Right now the log segment size is fixed, so this means that we'd waste > 8MB of log space on average to do a backup. Also, the way XLOG reads > records right now, we have to write placeholder records into the empty > space, because that's how it finds the end of the log stream. So I need > to change XLOG to handle "skip records", and then to truncate the file > when it gets archived, so we don't have to save up to 16MB of zeros. > > Also, if archiving is turned off, then we can't recycle or delete any > logs for the duration of the backup, and we have to save them. > > So I'll finish the XLOG support for this, and then think about the > correct way to walk through all the files. Sounds like a good plan. -- Bruce Momjian | http://candle.pha.pa.us pgman@candle.pha.pa.us | (610) 853-3000+ If your life is a hard drive, | 830 Blythe Avenue + Christ can be your backup. | Drexel Hill, Pennsylvania19026
Richard: I can't quite follow this; maybe you sent a draft by accident. If you want to post a patch against 7.2.1, or even better against HEAD in CVS, that would be great. Or if you'd rather point me to your source online, that would be good too. I just want to clarify though: is this work released to the PostgreSQL Development group by Progress and Multera, or do they still claim copyright interest in it? Regards,J.R. Nield On Thu, 2002-07-18 at 12:56, Richard Tucker wrote: > > > -----Original Message----- > From: J. R. Nield [mailto:jrnield@usol.com] > Sent: Wednesday, July 17, 2002 8:13 PM > To: richt@multera.com > Cc: Bruce Momjian > Subject: RE: [HACKERS] Issues Outstanding for Point In Time Recovery > (PITR) > > > On Wed, 2002-07-17 at 19:25, Richard Tucker wrote: > > Regarding hot backup. Our implementation of "pg_copy" does a hot backup. > > It turns off database checkpointing for the duration of the backup. > Backups > > all the files of the database cluster up to the wal file currently being > > logged to. It then acquires the WalInsertLock lock long enough to backup > > the current wal file. > > Does it then allow more writes to that WAL file? It would seem like you > want to advance the log to the next file, so the sysadmin wouldn't have > to choose which one of log-file number 3 he wants to use at restore. > > > Writes to the wal file are allowed during the backup except for the > backing of the wal file current when the > > backup completes. That is the pg_xlog directory is the last directory to > be backed up. The wal_files are backed > > up in the order they were used. Continued wal file logging is allowed > until the backup reaches the current wal > > file being written to. To back up this last wal file the WalInsertLock is > held until the copy of the wal file > > is complete. So the backup stops update activity only long enough to copy > this last 16mb file. > > Also, what do you mean by 'turns off checkpointing'. You have to do a > checkpoint, or at least flush the buffers, when you start the backup. > Otherwise how do you know what LSN to start from at restore? > > > The pg_control file also gets backed up. It contains the point in the log > at which to begin > > the redo/roll forward. > > By not allowing the redo point to advance while the backup goes on means > that the startup processes' crash > > recovery code will capture all the changes made to the database cluster > while the backup was running. > > > Anyway: Yes we'd love to see the code. > > > In what form would you like me to send the code to you e.g. as a patch, > copy our whole source ... > > Since I've pretty-much got create/drop and index stuff working, if your > code does the rest then we should be good to go. > > ;jrnield > > > -- > J. R. Nield > jrnield@usol.com > > > > -- J. R. Nield jrnield@usol.com
Progress/Multera would release the hot backup/roll forward work to the PostgreSQL Development group.

-regards
richt
On Thu, 2002-07-18 at 15:36, Richard Tucker wrote: > Sorry, I didn't delimit my comments correctly before. > I'm offering our pg_copy/roll forward implementation to PostgreSQL.org if it > finds it an acceptable contribution. Progress/Multera would hand over all > rights to any code accepted. > > I'd be willing to post a patch to HEAD -- where can I find instructions on > how to do this? > Instructions for how to get the latest development source are at: http://developer.postgresql.org/TODO/docs/cvs.html If you want to post a context diff against 7.2.1 to pgsql-hackers first, that would let us see what it does. Let me know if there is anything else I can help you with. -- J. R. Nield jrnield@usol.com