RE: WAL documentation - Mailing list pgsql-hackers
From:           Christopher Kings-Lynne
Subject:        RE: WAL documentation
Date:
Msg-id:         NEBBIODEHDOLHLJPJCDDGEBGCAAA.chriskl@familyhealth.com.au
In response to: Re: WAL documentation (Barry Lind <barry@xythos.com>)
Responses:      Re: WAL documentation
List:           pgsql-hackers
Also, what happens with the size of the WAL logs? Do they just grow forever, eventually filling up your hard drive, or should they reach a stable point where they tend not to grow any further? I.e., will we sysadmins have to put cron jobs in to tar/gz old WAL logs, or what?

Chris

> -----Original Message-----
> From: pgsql-hackers-owner@postgresql.org
> [mailto:pgsql-hackers-owner@postgresql.org] On Behalf Of Barry Lind
> Sent: Wednesday, January 24, 2001 12:32 PM
> To: pgsql-hackers@postgresql.org
> Cc: postgresql-docs@postgresql.org
> Subject: Re: [HACKERS] WAL documentation
>
> Not knowing much about WAL, but understanding a good deal about Oracle's logs, I read the WAL documentation below. While it is good, after reading it I am still left with a couple of questions and therefore believe the doc could be improved a bit.
>
> The two questions I am left with after reading the WAL doc are:
>
> 1) In the 'WAL Parameters' section, paragraph 3, there is the following sentence: "After a checkpoint has been made, any log segments written before the redo record may be removed/archived..." What does the 'may' mean? Does the database administrator need to go into the directory and remove the no-longer-necessary log files? What does archiving have to do with this? If I archived all log files, could I roll forward a backup made previously? That is the only reason I can think of that you would archive log files (at least, that is why you archive log files in Oracle).
>
> 2) The doc doesn't seem to explain how, on database recovery, the database knows which log file to start with. I think walking through an example of how, after a database crash, the log file is used for recovery would be useful. At least it would make me, as a user of Postgres, feel better if I understood how crashes are recovered from.
> thanks,
> --Barry
>
> Oliver Elphick wrote:
> >
> > Here is documentation for WAL, as text for immediate review and as SGML source, generated from Vadim's original text with my editing.
> >
> > Please review for correctness.
> >
> > =========================== WAL chapter ==========================
> >
> > Write-Ahead Logging (WAL) in Postgres
> >
> > Author: Written by Vadim Mikheev and Oliver Elphick.
> >
> > General description
> >
> > Write-Ahead Logging (WAL) is a standard approach to transaction logging. Its detailed description may be found in most (if not all) books about transaction processing. Briefly, WAL's central concept is that changes to data files (where tables and indices reside) must be written only after those changes have been logged - that is, when log records have been flushed to permanent storage. When we follow this procedure, we do not need to flush data pages to disk on every transaction commit, because we know that in the event of a crash we will be able to recover the database using the log: any changes that have not been applied to the data pages will first be redone from the log records (this is roll-forward recovery, also known as REDO), and then changes made by uncommitted transactions will be removed from the data pages (roll-backward recovery - UNDO).
> >
> > Immediate benefits of WAL
> >
> > The first obvious benefit of using WAL is a significantly reduced number of disk writes, since only the log file needs to be flushed to disk at the time of transaction commit; in multi-user environments, commits of many transactions may be accomplished with a single fsync() of the log file. Furthermore, the log file is written sequentially, and so the cost of syncing the log is much less than the cost of syncing the data pages.
> >
> > The next benefit is consistency of the data pages.
> > The truth is that, before WAL, PostgreSQL was never able to guarantee consistency in the case of a crash. Before WAL, any crash during writing could result in:
> >
> > 1. index tuples pointing to non-existent table rows;
> > 2. index tuples lost in split operations;
> > 3. totally corrupted table or index page content, because of partially written data pages.
> >
> > (Actually, the first two cases could even be caused by use of the "pg_ctl -m {fast | immediate} stop" command.) Problems with indices (problems 1 and 2) might have been fixable by additional fsync() calls, but it is not obvious how to handle the last case without WAL; WAL saves the entire data page content in the log if that is required to ensure page consistency for after-crash recovery.
> >
> > Future benefits
> >
> > In this first release of WAL, the UNDO operation is not implemented, because of lack of time. This means that changes made by aborted transactions will still occupy disk space, and that we still need a permanent pg_log file to hold the status of transactions, since we are not able to re-use transaction identifiers. Once UNDO is implemented, pg_log will no longer be required to be permanent; it will be possible to remove pg_log at shutdown, split it into segments and remove old segments.
> >
> > With UNDO, it will also be possible to implement SAVEPOINTs to allow partial rollback of invalid transaction operations (parser errors caused by mistyping commands, insertion of duplicate primary/unique keys and so on) with the ability to continue or commit valid operations made by the transaction before the error. At present, any error will invalidate the whole transaction and require a transaction abort.
> >
> > WAL offers the opportunity for a new method for database on-line backup and restore (BAR).
> > To use this method, one would have to make periodic saves of data files to another disk, a tape or another host, and also archive the WAL log files. The database file copy and the archived log files could be used to restore just as if one were restoring after a crash. Each time a new database file copy was made, the old log files could be removed. Implementing this facility will require the logging of data file and index creation and deletion; it will also require development of a method for copying the data files (O/S copy commands are not suitable).
> >
> > Implementation
> >
> > WAL is automatically enabled from release 7.1 onwards. No action is required from the administrator except ensuring that the additional disk-space requirements of the WAL logs are met, and that any necessary tuning is done (see below).
> >
> > WAL logs are stored in $PGDATA/pg_xlog, as a set of segment files, each 16Mb in size. Each segment is divided into 8Kb pages. The log record headers are described in access/xlog.h; record content is dependent on the type of event that is being logged. Segment files are given sequential numbers as names, starting at 0000000000000000. The numbers do not wrap, at present, but it should take a very long time to exhaust the available stock of numbers.
> >
> > The WAL buffers and control structure are in shared memory, and are handled by the backends; they are protected by spinlocks. The demand on shared memory is dependent on the number of buffers; the default size of the WAL buffers is 64Kb.
> >
> > It is desirable for the log to be located on a different disk from the main database files. This may be achieved by moving the directory, pg_xlog, to another filesystem (while the postmaster is shut down, of course) and creating a symbolic link from $PGDATA to the new location.
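The relocation just described can be sketched in shell. This is a sketch only, not a definitive procedure: temporary directories stand in for the real $PGDATA and the second disk, and the postmaster is assumed to be stopped before moving anything.

```shell
# Sketch of relocating the WAL directory; the postmaster must be stopped first.
# Temporary directories stand in for the real data directory and second disk.
PGDATA=$(mktemp -d)             # stand-in for the real $PGDATA
OTHERDISK=$(mktemp -d)          # stand-in for a filesystem on another disk

mkdir "$PGDATA/pg_xlog"         # pretend this holds the existing segments

mv "$PGDATA/pg_xlog" "$OTHERDISK/pg_xlog"     # move the log directory
ln -s "$OTHERDISK/pg_xlog" "$PGDATA/pg_xlog"  # symlink back into $PGDATA
```

After this, the server still opens $PGDATA/pg_xlog as before, but the writes land on the other filesystem.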
> > The aim of WAL, to ensure that the log is written before database records are altered, may be subverted by disk drives that falsely report a successful write to the kernel when, in fact, they have only cached the data and not yet stored it on the disk. A power failure in such a situation may still lead to irrecoverable data corruption; administrators should try to ensure that disks holding PostgreSQL's data and log files do not make such false reports.
> >
> > WAL parameters
> >
> > There are several WAL-related parameters that affect database performance. This section explains their use.
> >
> > There are two commonly used WAL functions: LogInsert and LogFlush. LogInsert is used to place a new record into the WAL buffers in shared memory. If there is no space for the new record, LogInsert will have to write (move to OS cache) a few filled WAL buffers. This is undesirable, because LogInsert is used on every low-level database modification (for example, tuple insertion) at a time when an exclusive lock is held on the affected data pages, and the operation is supposed to be as fast as possible; what is worse, writing WAL buffers may also cause the creation of a new log segment, which takes even more time. Normally, WAL buffers should be written and flushed by a LogFlush request, which is made, for the most part, at transaction commit time to ensure that transaction records are flushed to permanent storage. On systems with high log output, LogFlush requests may not occur often enough to prevent WAL buffers being written by LogInsert. On such systems one should increase the number of WAL buffers by modifying the "WAL_BUFFERS" parameter. The default number of WAL buffers is 8. Increasing this value will have an impact on shared memory usage.
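As a back-of-envelope check on the figures quoted above, the default shared-memory demand follows from 8 WAL buffers of one 8Kb log page each:

```shell
# Default WAL buffer memory: 8 buffers x 8Kb log pages = 64Kb,
# matching the default WAL buffer size quoted earlier in the chapter.
WAL_BUFFERS=8
PAGE_KB=8
echo "$(( WAL_BUFFERS * PAGE_KB ))Kb"   # prints 64Kb
```

Raising WAL_BUFFERS grows this shared-memory footprint proportionally.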
> > Checkpoints are points in the sequence of transactions at which it is guaranteed that the data files have been updated with all information logged before the checkpoint. At checkpoint time, all dirty data pages are flushed to disk and a special checkpoint record is written to the log file. As a result, in the event of a crash, the recoverer knows from what record in the log (known as the redo record) it should start the REDO operation, since any changes made to data files before that record are already on disk. After a checkpoint has been made, any log segments written before the redo record may be removed/archived, so checkpoints are used to free disk space in the WAL directory. The checkpoint maker is also able to create a few log segments for future use, so as to avoid the need for LogInsert or LogFlush to spend time in creating them.
> >
> > The WAL log is held on the disk as a set of 16Mb files called segments. By default, a new segment is created only if more than 75% of the current segment is used. One can instruct the server to create up to 64 log segments at checkpoint time by modifying the "WAL_FILES" parameter.
> >
> > For faster after-crash recovery, it would be better to create checkpoints more often. However, one should balance this against the cost of flushing dirty data pages; in addition, to ensure data page consistency, the first modification of a data page after each checkpoint results in logging the entire page content, thus increasing output to the log and the log's size.
> >
> > By default, the postmaster spawns a special backend process to create the next checkpoint 300 seconds after the previous checkpoint's creation. One can change this interval by modifying the "CHECKPOINT_TIMEOUT" parameter. It is also possible to force a checkpoint by using the SQL command, CHECKPOINT.
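Putting the tuning knobs above together, a postgresql.conf fragment might look like the following. The values are purely illustrative, not recommendations, and the exact parameter spelling should be checked against the postgresql.conf shipped with your release:

```
# Illustrative postgresql.conf fragment; values are examples only.
wal_buffers = 8             # shared WAL buffers (default 8)
wal_files = 8               # extra log segments created at checkpoint time (up to 64)
checkpoint_timeout = 300    # seconds between automatic checkpoints (default 300)
```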
> > Setting the "WAL_DEBUG" parameter to any non-zero value will result in each LogInsert and LogFlush WAL call being logged to standard error. At present, it makes no difference what the non-zero value is.
> >
> > The "COMMIT_DELAY" parameter defines for how long the backend will be forced to sleep after writing a commit record to the log with a LogInsert call, but before performing a LogFlush. This delay allows other backends to add their commit records to the log so as to have all of them flushed with a single log sync. Unfortunately, this mechanism is not fully implemented at release 7.1, so there is at present no point in changing this parameter from its default value of 5 microseconds.
> >
> > ===================== CHECKPOINT manual page ======================
> >
> > CHECKPOINT -- Forces a checkpoint in the transaction log
> >
> > Synopsis
> >
> > CHECKPOINT
> >
> > Inputs
> >
> > None
> >
> > Outputs
> >
> > CHECKPOINT
> >
> > This signifies that a checkpoint has been placed into the transaction log.
> >
> > Description
> >
> > Write-Ahead Logging (WAL) puts a checkpoint in the log every 300 seconds by default. (This may be changed by the parameter CHECKPOINT_TIMEOUT in postgresql.conf.)
> >
> > The CHECKPOINT command forces a checkpoint at the point at which the command is issued. The next automatic checkpoint will happen the default interval after the forced checkpoint.
> >
> > Restrictions
> >
> > Use of the CHECKPOINT command is restricted to users with administrative access.
> >
> > Usage
> >
> > To force a checkpoint in the transaction log:
> >
> > CHECKPOINT;
> >
> > Compatibility
> >
> > SQL92
> >
> > CHECKPOINT is a Postgres language extension. There is no CHECKPOINT command in SQL92.
> >
> > Note: The handling of database storage and logging is a matter that the standard leaves to the implementation.
> > ------------------------------------------------------------------------
> > Name: checkpoint.sgml
> > Type: text/x-sgml
> > Description: checkpoint.sgml
> >
> > Name: wal.sgml
> > Type: text/x-sgml
> > Description: wal.sgml