Home > mailing lists

Re: Maximum number of WAL files in the pg_xlog directory - Mailing list pgsql-hackers

From	Jeff Janes
Subject	Re: Maximum number of WAL files in the pg_xlog directory
Date	November 3, 2014 20:39:37
Msg-id	CAMkU=1xmwNE2CZ2Yg8aYb71j9sLwa-JPFpnaPXNxwJoeVOfnxQ@mail.gmail.com Whole thread
In response to	Re: Maximum number of WAL files in the pg_xlog directory (Jeff Janes <jeff.janes@gmail.com>)
Responses	Re: Maximum number of WAL files in the pg_xlog directory Re: Maximum number of WAL files in the pg_xlog directory
List	pgsql-hackers

Tree view

On Wed, Oct 15, 2014 at 1:11 PM, Jeff Janes <jeff.janes@gmail.com> wrote:

On Fri, Aug 8, 2014 at 12:08 AM, Guillaume Lelarge <guillaume@lelarge.info> wrote:
Hi,

As part of our monitoring work for our customers, we stumbled upon an issue with our customers' servers who have a wal_keep_segments setting higher than 0.

We have a monitoring script that checks the number of WAL files in the pg_xlog directory, according to the setting of three parameters (checkpoint_completion_target, checkpoint_segments, and wal_keep_segments). We usually add a percentage to the usual formula:

greatest(
(2 + checkpoint_completion_target) * checkpoint_segments + 1,
checkpoint_segments + wal_keep_segments + 1
)

I think the first bug is even having this formula in the documentation to start with, and in trying to use it.

"and will normally not be more than..."

This may be "normal" for a toy system. I think that the normal state for any system worth monitoring is that it has had load spikes at some point in the past.

So it is the next part of the doc, which describes how many segments it climbs back down to upon recovering from a spike, which is the important one. And that doesn't mention wal_keep_segments at all, which surely cannot be correct.

I will try to independently derive the correct formula from the code, as you did, without looking too much at your derivation first, and see if we get the same answer.

It looked to me that the formula, when descending from a previously stressed state, would be:

greatest(1 + checkpoint_completion_target) * checkpoint_segments, wal_keep_segments) + 1 +

2 * checkpoint_segments + 1

This assumes logs are filled evenly over a checkpoint cycle, which is probably not true because there is a spike in full page writes right after a checkpoint starts.

But I didn't have a great deal of confidence in my analysis.

The first line reflects the number of WAL that will be retained as-is, the second is the number that will be recycled for future use before starting to delete them.

My reading of the code is that wal_keep_segments is computed from the current end of WAL (i.e the checkpoint record), not from the checkpoint redo point. If I distribute the part outside the 'greatest' into both branches of the 'greatest', I don't get the same answer as you do for either branch.

Then I started wondering if the number we keep for recycling is a good choice, anyway. 2 * checkpoint_segments + 1 seems pretty large. But then again, given that we've reached the high-water-mark once, how unlikely are we to reach it again?

Cheers,

Jeff

pgsql-hackers by date:

From: Alvaro Herrera
Date: 03 November 2014, 20:37:03
Subject: Re: tracking commit timestamps

From: Abhijit Menon-Sen
Date: 03 November 2014, 20:41:04
Subject: Re: pgaudit - an auditing extension for PostgreSQL

Re: Maximum number of WAL files in the pg_xlog directory - Mailing list pgsql-hackers

Previous

Next