Thread: WAL file utilization question

WAL file utilization question

From
Keaton Adams
Date:
I have a question related to the WAL log. Our organization is looking at using log-shipping to build a high-availability standby server. In our tests the amount of WAL log data generated per minute is significant. On our busiest database during a WAL archive test, eight 16 MB logs were being used/archived per minute, which translated into having to move roughly 7.6 GB of archived WAL log data across a WAN link from one data center to another.

I read this statement from the PostgreSQL 8.1 manual and have a question related to it:

"If full_page_writes is set (as is the default), there is another factor to consider. To ensure data page consistency, the first modification of a data page after each checkpoint results in logging the entire page content. In that case, a smaller checkpoint interval increases the volume of output to the WAL log, partially negating the goal of using a smaller interval, and in any case causing more disk I/O."

On most database systems I am used to, there is a physical log and separate logical log. The physical log holds any “before images” of data pages that have been modified since the last checkpoint. The logical log holds the individual rows of data that have been inserted, updated, deleted as well as checkpoint records and the like. If the server crashes, fast recovery is initiated where (1) any pages in the physical log are put back in the data files (tablespaces/dbspaces) on disk to get back to a state of physical consistency and (2) individual transactions in the logical log since the last checkpoint are rolled forward / rolled back to get to a point of logical consistency.

Even with full_page_writes set to false and checkpoints taking place every 60 seconds or so, the amount of WAL log data generated per minute seems to be significant.

So my question is this: If I kick off a transaction that loads records with a size of 100 bytes, does the insert for that record take 100 bytes in the WAL file, or is the data archived in the WAL log in page size (8k) portions? So with piggyback commits if I can only stack up 2K worth of data before the next LogFlush, will 2K be written to the WAL file, or will 8K be written each time regardless of the amount of actual transaction data that is available to flush?

Since there is no separate physical log to keep track of dirty/modified pages since the last checkpoint I would assume that the WAL log is serving a dual purpose of being able to get back to the point of physical and logical database consistency, but I need to know for certain that there is not a way to reduce the amount of WAL data being written for the amount of transaction information we are actually writing to the database at any given point in time.

Thank you,

Keaton

Re: WAL file utilization question

From
"Jim C. Nasby"
Date:
On Wed, May 16, 2007 at 10:28:29AM -0600, Keaton Adams wrote:
> On most database systems I am used to, there is a physical log and
> separate logical log. The physical log holds any ???before images??? of data
> pages that have been modified since the last checkpoint. The logical log
> holds the individual rows of data that have been inserted, updated,
> deleted as well as checkpoint records and the like. If the server
> crashes, fast recovery is initiated where (1) any pages in the physical
> log are put back in the data files (tablespaces/dbspaces) on disk to get
> back to a state of physical consistency and (2) individual transactions
> in the logical log since the last checkpoint are rolled forward / rolled
> back to get to a point of logical consistency.

PostgreSQL combines the two, though there has been recent discussion
about changing that. There may be some improvements in this regard in
8.3 (I don't remember if the patches were accepted or not).

> Even with full_page_writes set to false and checkpoints taking place
> every 60 seconds or so, the amount of WAL log data generated per minute
> seems to be significant.

Wow, do you really want to be checkpointing every 60 seconds? That's
going to greatly increase your WAL volume, as well as the size of WAL.

> So my question is this: If I kick off a transaction that loads records
> with a size of 100 bytes, does the insert for that record take 100 bytes
> in the WAL file, or is the data archived in the WAL log in page size
> (8k) portions? So with piggyback commits if I can only stack up 2K worth
> of data before the next LogFlush, will 2K be written to the WAL file, or
> will 8K be written each time regardless of the amount of actual
> transaction data that is available to flush?

(Generally) WAL only records differences. The exception is that the
first time a page is modified after a checkpoint, the entire page is
written out to WAL.

> Since there is no separate physical log to keep track of dirty/modified
> pages since the last checkpoint I would assume that the WAL log is
> serving a dual purpose of being able to get back to the point of
> physical and logical database consistency, but I need to know for
> certain that there is not a way to reduce the amount of WAL data being
> written for the amount of transaction information we are actually
> writing to the database at any given point in time.

The only way to do it right now is to reduce the frequency of your
checkpoints. IIRC you can't actually disable full page writes if you're
using PITR.
--
Jim Nasby                                      decibel@decibel.org
EnterpriseDB      http://enterprisedb.com      512.569.9461 (cell)

Re: WAL file utilization question

From
Alvaro Herrera
Date:
Keaton Adams wrote:

> Even with full_page_writes set to false and checkpoints taking place
> every 60 seconds or so, the amount of WAL log data generated per minute
> seems to be significant.

Stop right there.  Having checkpoints every 60 seconds is insane.  Did
you look into increasing checkpoint_segments a lot?

> So my question is this: If I kick off a transaction that loads records
> with a size of 100 bytes, does the insert for that record take 100 bytes
> in the WAL file, or is the data archived in the WAL log in page size
> (8k) portions?

Vicinity of 100 bytes, except right after checkpoints as you noted.

--
Alvaro Herrera                                http://www.CommandPrompt.com/
The PostgreSQL Company - Command Prompt, Inc.

Re: WAL file utilization question

From
Keaton Adams
Date:
Well, no, I'm not that paranoid where I expect checkpoints to be taking place so often.  I do find it interesting that controlling checkpoint frequency is a factor of the number of WAL files available.  In order to get up to 60 second checkpoints I had to set checkpoint_segments to 25, which resulted in the creation of 52 log files in data/pg_xlog.  So for 120 second checkpoints I will need to have up to 104 log files in data/pg_xlog available?  That's nearly 2 GB of space for WAL logs, and it only nets on average a 2 minute checkpoint.  That would appear to be the pattern (2 * checkpoint_segments at the busiest time):

# - Checkpoints -

#3 - 8 seconds - 3 log files
#6 - 16 seconds - 13 log files
#9 - 29 seconds  - 19 log files
#20 - 57 seconds - 41 log files
#25 - > 60 sec - 52 log files

checkpoint_segments = 25        # in logfile segments, min 1, 16MB each
checkpoint_timeout = 300        # range 30-3600, in seconds
#checkpoint_warning = 60        # in seconds, 0 is off

So, if I do set the checkpoint_segments parameter to 50, 75, or even 100 I will have a considerable number of WAL files in data/pg_xlog, but they will be used more efficiently, so I will be archiving less?  That's what I understand from the comments received so far.

On PITR it makes sense, given the nature of the WAL file, that only full pages can be used.  So I won't be able to set full_page_writes to false in this case.

-Keaton



On Wed, 2007-05-16 at 11:35 -0500, Jim C. Nasby wrote:
On Wed, May 16, 2007 at 10:28:29AM -0600, Keaton Adams wrote:
> On most database systems I am used to, there is a physical log and
> separate logical log. The physical log holds any ???before images??? of data
> pages that have been modified since the last checkpoint. The logical log
> holds the individual rows of data that have been inserted, updated,
> deleted as well as checkpoint records and the like. If the server
> crashes, fast recovery is initiated where (1) any pages in the physical
> log are put back in the data files (tablespaces/dbspaces) on disk to get
> back to a state of physical consistency and (2) individual transactions
> in the logical log since the last checkpoint are rolled forward / rolled
> back to get to a point of logical consistency.
 
PostgreSQL combines the two, though there has been recent discussion
about changing that. There may be some improvements in this regard in
8.3 (I don't remember if the patches were accepted or not).

> Even with full_page_writes set to false and checkpoints taking place
> every 60 seconds or so, the amount of WAL log data generated per minute
> seems to be significant.
 
Wow, do you really want to be checkpointing every 60 seconds? That's
going to greatly increase your WAL volume, as well as the size of WAL.

> So my question is this: If I kick off a transaction that loads records
> with a size of 100 bytes, does the insert for that record take 100 bytes
> in the WAL file, or is the data archived in the WAL log in page size
> (8k) portions? So with piggyback commits if I can only stack up 2K worth
> of data before the next LogFlush, will 2K be written to the WAL file, or
> will 8K be written each time regardless of the amount of actual
> transaction data that is available to flush?
 
(Generally) WAL only records differences. The exception is that the
first time a page is modified after a checkpoint, the entire page is
written out to WAL.

> Since there is no separate physical log to keep track of dirty/modified
> pages since the last checkpoint I would assume that the WAL log is
> serving a dual purpose of being able to get back to the point of
> physical and logical database consistency, but I need to know for
> certain that there is not a way to reduce the amount of WAL data being
> written for the amount of transaction information we are actually
> writing to the database at any given point in time.

The only way to do it right now is to reduce the frequency of your
checkpoints. IIRC you can't actually disable full page writes if you're
using PITR.

Re: WAL file utilization question

From
"Jim C. Nasby"
Date:
On Wed, May 16, 2007 at 10:49:04AM -0600, Keaton Adams wrote:
> So, if I do set the checkpoint_segments parameter to 50, 75, or even 100
> I will have a considerable number of WAL files in data/pg_xlog, but they
> will be used more efficiently, so I will be archiving less?  That's what
> I understand from the comments received so far.

Yes. Let it create 500 or 1000 WAL files if it wants... it's much more
important to limit the frequency of checkpoints than to reduce the
number of WAL files (which has virtually no impact on performance).
--
Jim Nasby                                      decibel@decibel.org
EnterpriseDB      http://enterprisedb.com      512.569.9461 (cell)

Re: WAL file utilization question

From
Keaton Adams
Date:
That's considerably more transaction log space to allocate than what I am used to (up to 16 GB), but I'll set the parameter and see just how many logs PostgreSQL creates and what the new rate of WAL file archiving is over my test period.

Thanks, this information was very helpful,

Keaton



On Wed, 2007-05-16 at 11:51 -0500, Jim C. Nasby wrote:
On Wed, May 16, 2007 at 10:49:04AM -0600, Keaton Adams wrote:
> So, if I do set the checkpoint_segments parameter to 50, 75, or even 100
> I will have a considerable number of WAL files in data/pg_xlog, but they
> will be used more efficiently, so I will be archiving less?  That's what
> I understand from the comments received so far.

Yes. Let it create 500 or 1000 WAL files if it wants... it's much more
important to limit the frequency of checkpoints than to reduce the
number of WAL files (which has virtually no impact on performance).

Re: WAL file utilization question

From
Keaton Adams
Date:
Given these postgresql.conf settings:

#---------------------------------------------------------------------------
# WRITE AHEAD LOG
#---------------------------------------------------------------------------

# - Settings -

fsync = on                              # turns forced synchronization on or off
wal_sync_method = fsync                 # the default is the first option
                                        # supported by the operating system:
                                        #   open_datasync
                                        #   fdatasync
                                        #   fsync
                                        #   fsync_writethrough
                                        #   open_sync
full_page_writes = on                   # recover from partial page writes
wal_buffers = 32                        # min 4, 8KB each
commit_delay = 100000                   # range 0-100000, in microseconds
commit_siblings = 1000                  # range 1-1000

# - Checkpoints -

checkpoint_segments = 500       # in logfile segments, min 1, 16MB each
checkpoint_timeout = 300        # range 30-3600, in seconds
checkpoint_warning = 120        # in seconds, 0 is off

# - Archiving -
archive_command = '/mnt/logship/scripts/archivemaster.sh %p %f'
# command to use to archive a logfile
# segment



And these tables to load data into:

           List of relations
Schema |   Name    | Type  |  Owner  
--------+-----------+-------+----------
public | testload  | table | postgres
public | testload2 | table | postgres
public | testload3 | table | postgres
(3 rows)

postgres=# \d testload
       Table "public.testload"
Column |      Type      | Modifiers
--------+----------------+-----------
name   | character(100) |

postgres=# \d testload2
      Table "public.testload2"
Column |      Type      | Modifiers
--------+----------------+-----------
name   | character(100) |

postgres=# \d testload3
      Table "public.testload3"
Column |      Type      | Modifiers
--------+----------------+-----------
name   | character(100) |

There are no indexes on the tables.


Using an 8K data page:

8K data page (8192 bytes)
Less page header and row overhead leaves ~8000 bytes
At 100 bytes per row = ~80 rows/page
Rows loaded: 250,000 / 80 = 3125 data pages * 8192 = 25,600,000 bytes / 1048576 = ~ 24.4 MB of data page space.

The test file is shown here (250,000 rows all the same):
-bash-3.1$ more datafile.txt
AAAAAAAAAABBBBBBBBBBCCCCCCCCCCDDDDDDDDDDEEEEEEEEEEFFFFFFFFFFGGGGGGGGGGHHHHHHHHHHIIIIIIIIIIJJJJJJJJJJ
AAAAAAAAAABBBBBBBBBBCCCCCCCCCCDDDDDDDDDDEEEEEEEEEEFFFFFFFFFFGGGGGGGGGGHHHHHHHHHHIIIIIIIIIIJJJJJJJJJJ
AAAAAAAAAABBBBBBBBBBCCCCCCCCCCDDDDDDDDDDEEEEEEEEEEFFFFFFFFFFGGGGGGGGGGHHHHHHHHHHIIIIIIIIIIJJJJJJJJJJ

The load script:
-bash-3.1$ more loaddata.sql
copy testload from '/home/kadams/logship/datafile.txt' delimiter '|';
copy testload2 from '/home/kadams/logship/datafile.txt' delimiter '|';
copy testload3 from '/home/kadams/logship/datafile.txt' delimiter '|';

So the one load process does a COPY into the three tables.  24.4 MB * 3 tables = ~ 73.2 MB of data page space.

This is the only process running on the database.  No other loads/users are on the system.

psql -f sql/loaddata.sql  >/dev/null 2>&1 &

It seems that 112 MB of WAL file space (16 MB * 7) is required for 73.2 MB of loaded data, which is an extra 34.8% of disk space to log/archive the COPY commands:

First pass:
LOG:  transaction ID wrap limit is 2147484146, limited by database "postgres"
LOG:  archived transaction log file "00000001000000010000005E"
LOG:  archived transaction log file "00000001000000010000005F"
LOG:  archived transaction log file "000000010000000100000060"
LOG:  archived transaction log file "000000010000000100000061"
LOG:  archived transaction log file "000000010000000100000062"
LOG:  archived transaction log file "000000010000000100000063"
LOG:  archived transaction log file "000000010000000100000064"

# of logs in pg_xlog: 9

Second pass:
LOG:  archived transaction log file "000000010000000100000065"
LOG:  archived transaction log file "000000010000000100000066"
LOG:  archived transaction log file "000000010000000100000067"
LOG:  archived transaction log file "000000010000000100000068"
LOG:  archived transaction log file "000000010000000100000069"
LOG:  archived transaction log file "00000001000000010000006A"
LOG:  archived transaction log file "00000001000000010000006B"

# of logs in pg_xlog: 15

Third pass:
LOG:  archived transaction log file "00000001000000010000006C"
LOG:  archived transaction log file "00000001000000010000006D"
LOG:  archived transaction log file "00000001000000010000006E"
LOG:  archived transaction log file "00000001000000010000006F"
LOG:  archived transaction log file "000000010000000100000070"
LOG:  archived transaction log file "000000010000000100000071"
LOG:  archived transaction log file "000000010000000100000072"

# of logs in pg_xlog: 22

Fourth pass:
LOG:  archived transaction log file "000000010000000100000073"
LOG:  archived transaction log file "000000010000000100000074"
LOG:  archived transaction log file "000000010000000100000075"
LOG:  archived transaction log file "000000010000000100000076"
LOG:  archived transaction log file "000000010000000100000077"
LOG:  archived transaction log file "000000010000000100000078"
LOG:  archived transaction log file "000000010000000100000079"

# of logs in pg_xlog: 29

PostgreSQL continued to add log files in pg_xlog, so my assumption is that checkpoints did not come into play during the load process, correct? (Frequent checkpoints would have added even more to the WAL file overhead, is my understanding.)

So is there anything I can do to reduce the 34.8% overhead in WAL file space when loading data? Do you see any glaring mistakes in the calculations themselves, and would you agree with this overhead figure? (I make no claim to perfection by any means.)

We are running on PostgreSQL 8.1.4 and are planning to move to 8.3 when it becomes available.  Are there space utilization improvements in WAL logging in the upcoming release?

Thanks,

Keaton


















On Wed, 2007-05-16 at 11:51 -0500, Jim C. Nasby wrote:
On Wed, May 16, 2007 at 10:49:04AM -0600, Keaton Adams wrote:
> So, if I do set the checkpoint_segments parameter to 50, 75, or even 100
> I will have a considerable number of WAL files in data/pg_xlog, but they
> will be used more efficiently, so I will be archiving less?  That's what
> I understand from the comments received so far.

Yes. Let it create 500 or 1000 WAL files if it wants... it's much more
important to limit the frequency of checkpoints than to reduce the
number of WAL files (which has virtually no impact on performance).

Re: WAL file utilization question

From
Tom Lane
Date:
Keaton Adams <kadams@mxlogic.com> writes:
> Well, no, I'm not that paranoid where I expect checkpoints to be taking
> place so often.  I do find it interesting that controlling checkpoint
> frequency is a factor of the number of WAL files available.  In order to
> get up to 60 second checkpoints I had to set checkpoint_segments to 25,
> which resulted in the creation of 52 log files in data/pg_xlog.  So for
> 120 second checkpoints I will need to have up to 104 log files in
> data/pg_xlog available?

No, it's nonlinear, precisely because of the
checkpoints-increase-the-log-size effect.  Given the size of modern
disks, I don't see why a couple Gb of WAL space is a problem anyhow
for a large database.

            regards, tom lane