Thread: Documentation Update: WAL & Checkpoints

Documentation Update: WAL & Checkpoints

From: Michael Renner
Hi,

this is a small update to the first paragraph of the WAL configuration
chapter, going into more detail WRT redo vs. checkpoint records, since
the underlying behavior is currently only deducible from the source. I'm
not perfectly sure if I got everything right, so feel free to change as
necessary.

I think it'd be more appropriate to split the chapter and separate
basics from implementation details and tunables, but for the time being
this ought to suffice. Is somebody "in charge" of the documentation and
overall structure, or is it a community effort like everything else?


Best regards,
Michael Renner
diff --git a/doc/src/sgml/wal.sgml b/doc/src/sgml/wal.sgml
index cff6fde..69b8b0a 100644
--- a/doc/src/sgml/wal.sgml
+++ b/doc/src/sgml/wal.sgml
@@ -322,19 +322,24 @@
   </para>

   <para>
-   <firstterm>Checkpoints</firstterm><indexterm><primary>checkpoint</></>
-   are points in the sequence of transactions at which it is guaranteed
-   that the data files have been updated with all information written before
-   the checkpoint.  At checkpoint time, all dirty data pages are flushed to
-   disk and a special checkpoint record is written to the log file.
-   In the event of a crash, the crash recovery procedure looks at the latest
-   checkpoint record to determine the point in the log (known as the redo
-   record) from which it should start the REDO operation.  Any changes made to
-   data files before that point are known to be already on disk.  Hence, after
-   a checkpoint has been made, any log segments preceding the one containing
-   the redo record are no longer needed and can be recycled or removed. (When
-   <acronym>WAL</acronym> archiving is being done, the log segments must be
-   archived before being recycled or removed.)
+   <firstterm>Checkpoints</firstterm><indexterm><primary>checkpoint</></> are
+   points in the logical sequence of transactions at which it is guaranteed
+   that the data files have been updated with all information created before
+   the start of the checkpoint command.  Since flushing all dirty data (meaning
+   "changed only in the WAL") to disk can take a while on databases with
+   write-heavy loads, checkpoints are not a single operation but rather a
+   series of events.  When a checkpoint starts, a redo record is written to the
+   WAL and PostgreSQL starts writing out dirty data which has accumulated up to
+   the redo record.  At checkpoint completion time, all changed files are
+   fsynced and a special checkpoint record is written to the log file. In the
+   event of a crash, the crash recovery procedure looks at the latest
+   checkpoint record to determine from which redo record it should start the
+   REDO operation.  Any changes made to data files before that point are known
+   to be already on disk.  Hence, after a checkpoint has been made, any log
+   segments preceding the one containing the redo record are no longer needed
+   and can be recycled or removed. (When <acronym>WAL</acronym> archiving is
+   being done, the log segments must be archived before being recycled or
+   removed.)
   </para>

   <para>
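
To make the distinction between the redo record and the checkpoint record
concrete, here is a minimal C sketch of the sequence the patch describes
(redo record, dirty-page flush, fsync, checkpoint record, then recovery
replaying from the redo pointer). This is a toy model; every type name,
field, and lsn value in it is invented for illustration and has nothing
to do with PostgreSQL's actual WAL format or code.

/*
 * Toy model of the checkpoint sequence: a checkpoint begins with a redo
 * record, dirty pages are flushed while normal WAL traffic continues,
 * and the checkpoint record written at completion points back at the
 * redo record.  Recovery replays from that redo pointer.
 */
#include <stdio.h>

enum rec_type { REC_DATA, REC_REDO, REC_CHECKPOINT };

struct wal_rec {
    enum rec_type type;
    long          lsn;       /* position in the log */
    long          redo_lsn;  /* REC_CHECKPOINT only: its redo record */
};

int main(void)
{
    /* Checkpoint starts at lsn 2 (redo record) and completes at lsn 5,
     * once everything dirtied before lsn 2 is flushed and fsynced.
     * Note that data records keep arriving during the flush. */
    struct wal_rec wal[] = {
        { REC_DATA,       1, 0 },
        { REC_REDO,       2, 0 },
        { REC_DATA,       3, 0 },
        { REC_DATA,       4, 0 },
        { REC_CHECKPOINT, 5, 2 },
        { REC_DATA,       6, 0 },
    };
    int  nrec = sizeof(wal) / sizeof(wal[0]);
    long redo_lsn = 0;

    /* Crash recovery: locate the latest checkpoint record ... */
    for (int i = 0; i < nrec; i++)
        if (wal[i].type == REC_CHECKPOINT)
            redo_lsn = wal[i].redo_lsn;

    /* ... and replay data records from its redo pointer onward;
     * segments wholly before redo_lsn could be recycled. */
    printf("REDO starts at lsn %ld\n", redo_lsn);
    for (int i = 0; i < nrec; i++)
        if (wal[i].type == REC_DATA && wal[i].lsn >= redo_lsn)
            printf("replaying record at lsn %ld\n", wal[i].lsn);
    return 0;
}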

Re: Documentation Update: WAL & Checkpoints

From: Bruce Momjian
Michael Renner wrote:
> Hi,
>
> this is a small update to the first paragraph of the WAL configuration
> chapter, going into more detail WRT redo vs. checkpoint records, since
> the underlying behavior is currently only deducible from the source. I'm
> not perfectly sure if I got everything right, so feel free to change as
> necessary.
>
> I think it'd be more appropriate to split the chapter and separate
> basics from implementation details and tunables, but for the time being
> this ought to suffice. Is somebody "in charge" of the documentation and
> overall structure, or is it a community effort like everything else?
>

I read over your patch and I was afraid it was trying to put too much
information into a single paragraph, so I added a second paragraph that
just talks about checkpoint smoothing.  I did not address the issue of
when the REDO WAL entry is written --- that is probably too much detail
for our documentation.

New patch attached, and applied.

---------------------------------------------------------------------------


>
> Best regards,
> Michael Renner

> diff --git a/doc/src/sgml/wal.sgml b/doc/src/sgml/wal.sgml
> index cff6fde..69b8b0a 100644
> --- a/doc/src/sgml/wal.sgml
> +++ b/doc/src/sgml/wal.sgml
> @@ -322,19 +322,24 @@
>    </para>
>
>    <para>
> -   <firstterm>Checkpoints</firstterm><indexterm><primary>checkpoint</></>
> -   are points in the sequence of transactions at which it is guaranteed
> -   that the data files have been updated with all information written before
> -   the checkpoint.  At checkpoint time, all dirty data pages are flushed to
> -   disk and a special checkpoint record is written to the log file.
> -   In the event of a crash, the crash recovery procedure looks at the latest
> -   checkpoint record to determine the point in the log (known as the redo
> -   record) from which it should start the REDO operation.  Any changes made to
> -   data files before that point are known to be already on disk.  Hence, after
> -   a checkpoint has been made, any log segments preceding the one containing
> -   the redo record are no longer needed and can be recycled or removed. (When
> -   <acronym>WAL</acronym> archiving is being done, the log segments must be
> -   archived before being recycled or removed.)
> +   <firstterm>Checkpoints</firstterm><indexterm><primary>checkpoint</></> are
> +   points in the logical sequence of transactions at which it is guaranteed
> +   that the data files have been updated with all information created before
> +   the start of the checkpoint command.  Since flushing all dirty data (meaning
> +   "changed only in the WAL") to disk can take a while on databases with
> +   write-heavy loads, checkpoints are not a single operation but rather a
> +   series of events.  When a checkpoint starts, a redo record is written to the
> +   WAL and PostgreSQL starts writing out dirty data which has accumulated up to
> +   the redo record.  At checkpoint completion time, all changed files are
> +   fsynced and a special checkpoint record is written to the log file. In the
> +   event of a crash, the crash recovery procedure looks at the latest
> +   checkpoint record to determine from which redo record it should start the
> +   REDO operation.  Any changes made to data files before that point are known
> +   to be already on disk.  Hence, after a checkpoint has been made, any log
> +   segments preceding the one containing the redo record are no longer needed
> +   and can be recycled or removed. (When <acronym>WAL</acronym> archiving is
> +   being done, the log segments must be archived before being recycled or
> +   removed.)
>    </para>
>
>    <para>


--
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

  + If your life is a hard drive, Christ can be your backup. +
Index: doc/src/sgml/wal.sgml
===================================================================
RCS file: /cvsroot/pgsql/doc/src/sgml/wal.sgml,v
retrieving revision 1.58
diff -c -c -r1.58 wal.sgml
*** doc/src/sgml/wal.sgml    15 Jan 2009 00:34:25 -0000    1.58
--- doc/src/sgml/wal.sgml    9 Apr 2009 16:19:18 -0000
***************
*** 326,343 ****
     are points in the sequence of transactions at which it is guaranteed
     that the data files have been updated with all information written before
     the checkpoint.  At checkpoint time, all dirty data pages are flushed to
!    disk and a special checkpoint record is written to the log file.
     In the event of a crash, the crash recovery procedure looks at the latest
     checkpoint record to determine the point in the log (known as the redo
     record) from which it should start the REDO operation.  Any changes made to
!    data files before that point are known to be already on disk.  Hence, after
!    a checkpoint has been made, any log segments preceding the one containing
     the redo record are no longer needed and can be recycled or removed. (When
     <acronym>WAL</acronym> archiving is being done, the log segments must be
     archived before being recycled or removed.)
    </para>

    <para>
     The server's background writer process will automatically perform
     a checkpoint every so often.  A checkpoint is created every <xref
     linkend="guc-checkpoint-segments"> log segments, or every <xref
--- 326,352 ----
     are points in the sequence of transactions at which it is guaranteed
     that the data files have been updated with all information written before
     the checkpoint.  At checkpoint time, all dirty data pages are flushed to
!    disk and a special checkpoint record is written to the log file.
!    (The changes were previously flushed to the <acronym>WAL</acronym> files.)
     In the event of a crash, the crash recovery procedure looks at the latest
     checkpoint record to determine the point in the log (known as the redo
     record) from which it should start the REDO operation.  Any changes made to
!    data files before that point are guaranteed to be already on disk.  Hence, after
!    a checkpoint, log segments preceding the one containing
     the redo record are no longer needed and can be recycled or removed. (When
     <acronym>WAL</acronym> archiving is being done, the log segments must be
     archived before being recycled or removed.)
    </para>

    <para>
+    The checkpoint requirement of flushing all dirty data pages to disk
+    can cause a significant I/O load.  For this reason, checkpoint
+    activity is throttled so I/O begins at checkpoint start and completes
+    before the next checkpoint starts;  this minimizes performance
+    degradation during checkpoints.
+   </para>
+
+   <para>
     The server's background writer process will automatically perform
     a checkpoint every so often.  A checkpoint is created every <xref
     linkend="guc-checkpoint-segments"> log segments, or every <xref

Re: Documentation Update: WAL & Checkpoints

From: Michael Renner
Bruce Momjian wrote:
> Michael Renner wrote:
>> Hi,
>>
>> this is a small update to the first paragraph of the WAL configuration 
>> chapter, going into more detail WRT redo vs. checkpoint records, since 
>> the underlying behavior is currently only deducible from the source. I'm 
>> not perfectly sure if I got everything right, so feel free to change as 
>> necessary.

[..]

> I read over your patch and I was afraid it was trying to put too much
> information into a single paragraph, so I added a second paragraph that
> just talks about checkpoint smoothing.  I did not address the issue of
> when the REDO WAL entry is written --- that is probably too much detail
> for our documentation.

Too bad; understanding how this works is necessary to properly implement
more complex log shipping setups. Maybe /backend/access/transam/README
instead? Or specific "under the hood" paragraphs for selected areas of
the DBMS?

best regards,
Michael


Re: Documentation Update: WAL & Checkpoints

From: Bruce Momjian
Michael Renner wrote:
> Bruce Momjian wrote:
> > Michael Renner wrote:
> >> Hi,
> >>
> >> this is a small update to the first paragraph of the WAL configuration 
> >> chapter, going into more detail WRT redo vs. checkpoint records, since 
> >> the underlying behavior is currently only deducible from the source. I'm 
> >> not perfectly sure if I got everything right, so feel free to change as 
> >> necessary.
> 
> [..]
> 
> > I read over your patch and I was afraid it was trying to put too much
> > information into a single paragraph, so I added a second paragraph that
> > just talks about checkpoint smoothing.  I did not address the issue of
> > when the REDO WAL entry is written --- that is probably too much detail
> > for our documentation.
> 
> Too bad; understanding how this works is necessary to properly implement
> more complex log shipping setups. Maybe /backend/access/transam/README
> instead? Or specific "under the hood" paragraphs for selected areas of
> the DBMS?

Let's back up and let me ask why it is important for a user to know when
the REDO record is written vs. when the checkpoint completes, and how
that affects more complex log shipping setups.

This detail is certainly appropriate for /backend/access/transam/README,
so if you could send in a patch, that would be great.

--
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

  + If your life is a hard drive, Christ can be your backup. +