RE: WAL documentation - Mailing list pgsql-hackers

From Christopher Kings-Lynne
Subject RE: WAL documentation
Date
Msg-id NEBBIODEHDOLHLJPJCDDGEBGCAAA.chriskl@familyhealth.com.au
Whole thread Raw
In response to Re: WAL documentation  (Barry Lind <barry@xythos.com>)
Responses Re: WAL documentation
List pgsql-hackers
Also, what happens with the size of the WAL logs?  Do they just grow forever eventually filling up your hard drive, or
shouldthey reach a stable point where they tend not to grow any further?
 

ie. Will we sysadmins have to put cron jobs in to tar/gz old WAL logs or what???

Chris

> -----Original Message-----
> From: pgsql-hackers-owner@postgresql.org
> [mailto:pgsql-hackers-owner@postgresql.org]On Behalf Of Barry Lind
> Sent: Wednesday, January 24, 2001 12:32 PM
> To: pgsql-hackers@postgresql.org
> Cc: postgresql-docs@postgresql.org
> Subject: Re: [HACKERS] WAL documentation
> 
> 
> Not knowing much about WAL, but understanding a good deal about Oracle's
> logs, I read the WAL documentation below.  While it is good, after
> reading it I am still left with a couple of questions and therefore
> believe the doc could be improved a bit.
> 
> The two questions I am left with after reading the WAL doc are:
> 
> 1) In the 'WAL Parameters' section, paragraph 3 there is the following
> sentence:
> "After a checkpoint has been made, any log segments written before the
> redo record may be removed/archived..."  What does the 'may' refer
> mean?  Does the database administrator need to go into the directory and
> remove the no longer necessary log files?  What does archiving have to
> do with this?  If I archived all log files, could I roll forward a
> backup made previously?  That is the only reason I can think of that you
> would archive log files (at least that is why you archive log files in
> Oracle).
> 
> 2) The doc doesn't seem to explain how on database recovery the database
> knows which log file to start with.  I think walking through an example
> of how after a database crash, the log file is used for recovery, would
> be useful.  At least it would make me as a user of postgres feel better
> if I understood how crashes are recovered from.
> 
> thanks,
> --Barry
> 
> 
> 
> 
> Oliver Elphick wrote:
> > 
> > Here is documentation for WAL, as text for immediate review and as SGML
> > source, generated from Vadim's original text with my editing.
> > 
> > Please review for correctness.
> > 
> > =========================== WAL chapter ==========================
> > 
> > Write-Ahead Logging (WAL) in Postgres
> > 
> >     Author: Written by Vadim Mikheev and Oliver Elphick.
> > 
> > General description
> > 
> > Write Ahead Logging (WAL) is a standard approach to transaction logging.
> > Its detailed description may be found in most (if not all) books about
> > transaction processing. Briefly, WAL's central concept is that 
> changes to
> > data files (where tables and indices reside) must be written only after
> > those changes have been logged - that is, when log records have been
> > flushed to permanent storage. When we follow this procedure, we do not
> > need to flush data pages to disk on every transaction commit, because we
> > know that in the event of a crash we will be able to recover 
> the database
> > using the log: any changes that have not been applied to the data pages
> > will first be redone from the log records (this is roll-forward 
> recovery,
> > also known as REDO) and then changes made by uncommitted transactions
> > will be removed from the data pages (roll-backward recovery - UNDO).
> > 
> > Immediate benefits of WAL
> > 
> > The first obvious benefit of using WAL is a significantly reduced number
> > of disk writes, since only the log file needs to be flushed to disk at
> > the time of transaction commit; in multi-user environments, commits of
> > many transactions may be accomplished with a single fsync() of the log
> > file. Furthermore, the log file is written sequentially, and so the cost
> > of syncing the log is much less than the cost of syncing the data pages.
> > 
> > The next benefit is consistency of the data pages. The truth is that,
> > before WAL, PostgreSQL was never able to guarantee consistency in the
> > case of a crash.  Before WAL, any crash during writing could result in:
> > 
> > 1. index tuples pointing to non-existent table rows;
> > 2. index tuples lost in split operations;
> > 3. totally corrupted table or index page content, because of
> >    partially written data pages.
> > 
> > (Actually, the first two cases could even be caused by use of 
> the "pg_ctl
> > -m {fast | immediate} stop" command.)  Problems with indices (problems
> > 1 and 2) might have been capable of being fixed by additional fsync()
> > calls, but it is not obvious how to handle the last case without WAL;
> > WAL saves the entire data page content in the log if that is required
> > to ensure page consistency for after-crash recovery.
> > 
> > Future benefits
> > 
> > In this first release of WAL, UNDO operation is not implemented, because
> > of lack of time. This means that changes made by aborted transactions
> > will still occupy disk space and that we still need a permanent pg_log
> > file to hold the status of transactions, since we are not able to re-use
> > transaction identifiers.  Once UNDO is implemented, pg_log will 
> no longer
> > be required to be permanent; it will be possible to remove pg_log at
> > shutdown, split it into segments and remove old segments.
> > 
> > With UNDO, it will also be possible to implement SAVEPOINTs to allow
> > partial rollback of invalid transaction operations (parser errors caused
> > by mistyping commands, insertion of duplicate primary/unique keys and
> > so on) with the ability to continue or commit valid operations made by
> > the transaction before the error.  At present, any error will invalidate
> > the whole transaction and require a transaction abort.
> > 
> > WAL offers the opportunity for a new method for database on-line backup
> > and restore (BAR).  To use this method, one would have to make periodic
> > saves of data files to another disk, a tape or another host and also
> > archive the WAL log files.  The database file copy and the archived
> > log files could be used to restore just as if one were restoring after a
> > crash. Each time a new database file copy was made the old log 
> files could
> > be removed.  Implementing this facility will require the logging of data
> > file and index creation and deletion; it will also require 
> development of
> > a method for copying the data files (O/S copy commands are not 
> suitable).
> > 
> > Implementation
> > 
> > WAL is automatically enabled from release 7.1 onwards.  No action is
> > required from the administrator with the exception of ensuring that the
> > additional disk-space requirements of the WAL logs are met, and that
> > any necessary tuning is done (see below).
> > 
> > WAL logs are stored in $PGDATA/pg_xlog, as a set of segment files, each
> > 16Mb in size.  Each segment is divided into 8Kb pages.  The log record
> > headers are described in access/xlog.h; record content is 
> dependent on the
> > type of event that is being logged.  Segment files are given sequential
> > numbers as names, starting at 0000000000000000.  The numbers do 
> not wrap,
> > at present, but it should take a very long time to exhaust the available
> > stock of numbers.
> > 
> > The WAL buffers and control structure are in shared memory, and are
> > handled by the backends; they are protected by spinlocks.  The demand
> > on shared memory is dependent on the number of buffers; the default size
> > of the WAL buffers is 64Kb.
> > 
> > It is desirable for the log to be located on another disk than the main
> > database files.  This may be achieved by moving the directory, pg_xlog,
> > to another filesystem (while the postmaster is shut down, of course)
> > and creating a symbolic link from $PGDATA to the new location.
> > 
> > The aim of WAL, to ensure that the log is written before database
> > records are altered, may be subverted by disk drives that falsely report
> > a successful write to the kernel, when, in fact, they have only cached
> > the data and not yet stored it on the disk.  A power failure in such a
> > situation may still lead to irrecoverable data corruption;
> administrators
> > should try to ensure that disks holding PostgreSQL's data and log files
> > do not make such false reports.
> > 
> > WAL parameters
> > 
> > There are several WAL-related parameters that affect database
> > performance. This section explains their use.
> > 
> > There are two commonly used WAL functions - LogInsert and LogFlush.
> > LogInsert is used to place a new record into the WAL buffers in shared
> > memory. If there is no space for the new record, LogInsert will have to
> > write (move to OS cache) a few filled WAL buffers. This is undesirable
> > because LogInsert is used on every database low level modification
> > (for example, tuple insertion) at a time when an exclusive lock is held
> > on affected data pages and the operation is supposed to be as fast as
> > possible; what is worse, writing WAL buffers may also cause the creation
> > of a new log segment, which takes even more time. Normally, WAL buffers
> > should be written and flushed by a LogFlush request, which is made,
> > for the most part, at transaction commit time to ensure that transaction
> > records are flushed to permanent storage. On systems with high 
> log output,
> > LogFlush requests may not occur often enough to prevent WAL buffers'
> > being written by LogInsert. On such systems one should increase the
> > number of WAL buffers by modifying the "WAL_BUFFERS" parameter. The
> > default number of WAL buffers is 8.  Increasing this value will have an
> > impact on shared memory usage.
> > 
> > Checkpoints are points in the sequence of transactions at which it is
> > guaranteed that the data files have been updated with all information
> > logged before the checkpoint.  At checkpoint time, all dirty data pages
> > are flushed to disk and a special checkpoint record is written to the
> > log file. As result, in the event of a crash, the recoverer knows from
> > what record in the log (known as the redo record) it should start the
> > REDO operation, since any changes made to data files before that record
> > are already on disk. After a checkpoint has been made, any log segments
> > written before the redo record may be removed/archived, so checkpoints
> > are used to free disk space in the WAL directory. The checkpoint maker
> > is also able to create a few log segments for future use, so as to avoid
> > the need for LogInsert or LogFlush to spend time in creating them.
> > 
> > The WAL log is held on the disk as a set of 16Mb files called segments.
> > By default a new segment is created only if more than 75% of the current
> > segment is used. One can instruct the server to create up to 64 log
> > segments at checkpoint time by modifying the "WAL_FILES" parameter.
> > 
> > For faster after-crash recovery, it would be better to create 
> checkpoints
> > more often.  However, one should balance this against the cost 
> of flushing
> > dirty data pages; in addition, to ensure data page consistency,the first
> > modification of a data page after each checkpoint results in logging
> > the entire page content, thus increasing output to log and the 
> log's size.
> > 
> > By default, the postmaster spawns a special backend process to 
> create the
> > next checkpoint 300 seconds after the previous checkpoint's creation.
> > One can change this interval by modifying the "CHECKPOINT_TIMEOUT"
> > parameter.  It is also possible to force a checkpoint by using the SQL
> > command, CHECKPOINT.
> > 
> > Setting the "WAL_DEBUG" parameter to any non-zero value will result in
> > each LogInsert and LogFlush WAL call's being logged to standard error.
> > At present, it makes no difference what the non-zero value is.
> > 
> > The "COMMIT_DELAY" parameter defines for how long the backend will be
> > forced to sleep after writing a commit record to the log with LogInsert
> > call but before performing a LogFlush. This delay allows other backends
> > to add their commit records to the log so as to have all of them flushed
> > with a single log sync. Unfortunately, this mechanism is not fully
> > implemented at release 7.1, so there is at present no point in changing
> > this parameter from its default value of 5 microseconds.
> > 
> > ===================== CHECKPOINT manual page ======================
> > 
> > CHECKPOINT -- Forces a checkpoint in the transaction log
> > 
> > Synopsis
> > 
> > CHECKPOINT
> > 
> > Inputs
> > 
> > None
> > 
> > Outputs
> > 
> > CHECKPOINT
> > 
> > This signifies that a checkpoint has been placed into the 
> transaction log.
> > 
> > Description
> > 
> > Write-Ahead Logging (WAL) puts a checkpoint in the log every 300 seconds
> > by default. (This may be changed by the parameter CHECKPOINT_TIMEOUT
> > in postgresql.conf.)
> > 
> > The CHECKPOINT command forces a checkpoint at the point at which the
> > command is issued. The next automatic checkpoint will happen the default
> > time after the forced checkpoint.
> > 
> > Restrictions
> > 
> > Use of the CHECKPOINT command is restricted to users with administrative
> > access.
> > 
> > Usage
> > 
> > To force a checkpoint in the transaction log:
> > 
> > CHECKPOINT;
> > 
> > Compatibility
> > 
> > SQL92
> > 
> > CHECKPOINT is a Postgres language extension. There is no CHECKPOINT
> > command in SQL92.
> > 
> > Note: The handling of database storage and logging is a matter that the
> > standard leaves to the implementation.
> > 
> >   
> ------------------------------------------------------------------------
> >                          Name: checkpoint.sgml
> >    checkpoint.sgml       Type: text/x-sgml
> >                   Description: checkpoint.sgml
> > 
> >                   Name: wal.sgml
> >    wal.sgml       Type: text/x-sgml
> >            Description: wal.sgml
> > 
> >    Part 1.4Type: Plain Text (text/plain)
> 



pgsql-hackers by date:

Previous
From: Tom Lane
Date:
Subject: Re: This script will crash the connection
Next
From: Zeugswetter Andreas SB
Date:
Subject: AW: Re: AW: Re: MySQL and BerkleyDB (fwd)