Thread: Maximum number of WAL files in the pg_xlog directory
Hi,
As part of our monitoring work for our customers, we stumbled upon an issue with customers' servers that have a wal_keep_segments setting higher than 0.
We have a monitoring script that checks the number of WAL files in the pg_xlog directory, according to the setting of three parameters (checkpoint_completion_target, checkpoint_segments, and wal_keep_segments). We usually add a percentage to the usual formula:
    greatest(
      (2 + checkpoint_completion_target) * checkpoint_segments + 1,
      checkpoint_segments + wal_keep_segments + 1
    )
And we have lots of alerts from the script for customers who set their wal_keep_segments setting higher than 0.
So we started to question this sentence of the documentation:
There will always be at least one WAL segment file, and will normally not be more than (2 + checkpoint_completion_target) * checkpoint_segments + 1 or checkpoint_segments + wal_keep_segments + 1 files.
(http://www.postgresql.org/docs/9.3/static/wal-configuration.html)
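For readers who want to reproduce the check, here is a minimal sketch of the documented bound as quoted above. The function names and the 10% margin are assumptions for illustration only; the actual monitoring script is not shown in this thread.

```python
import math


def documented_max_wal_files(checkpoint_segments,
                             checkpoint_completion_target,
                             wal_keep_segments):
    """Upper bound on pg_xlog files according to the 9.3 documentation."""
    by_checkpoints = (2 + checkpoint_completion_target) * checkpoint_segments + 1
    by_keep = checkpoint_segments + wal_keep_segments + 1
    return max(by_checkpoints, by_keep)


def alert_threshold(bound, margin=0.10):
    """Hypothetical alert threshold: the bound plus a percentage margin."""
    return math.ceil(bound * (1 + margin))


if __name__ == "__main__":
    # Example settings chosen for illustration only.
    bound = documented_max_wal_files(10, 0.5, 100)
    print(bound, alert_threshold(bound))
```

With wal_keep_segments = 0 the first term dominates; as soon as wal_keep_segments grows past (1 + checkpoint_completion_target) * checkpoint_segments, the second term takes over.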
While doing some tests, it appears it would be more something like:
wal_keep_segments + (2 + checkpoint_completion_target) * checkpoint_segments + 1
But after reading the source code (src/backend/access/transam/xlog.c), the right formula seems to be:
wal_keep_segments + 2 * checkpoint_segments + 1
Here is how we arrived at this formula.
CreateCheckPoint(..) is responsible, among other things, for deleting and recycling old WAL files. From src/backend/access/transam/xlog.c, master branch, line 8363:
    /*
     * Delete old log files (those no longer needed even for previous
     * checkpoint or the standbys in XLOG streaming).
     */
    if (_logSegNo)
    {
        KeepLogSeg(recptr, &_logSegNo);
        _logSegNo--;
        RemoveOldXlogFiles(_logSegNo, recptr);
    }
KeepLogSeg(...) function takes care of wal_keep_segments. From src/backend/access/transam/xlog.c, master branch, line 8792:
    /* compute limit for wal_keep_segments first */
    if (wal_keep_segments > 0)
    {
        /* avoid underflow, don't go below 1 */
        if (segno <= wal_keep_segments)
            segno = 1;
        else
            segno = segno - wal_keep_segments;
    }
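The clamp in the excerpt above can be mirrored in a few lines of Python to try out values. This is only a sketch of the quoted fragment, not of the whole KeepLogSeg() function; the name keep_log_seg is made up for the example.

```python
def keep_log_seg(segno, wal_keep_segments):
    """Mirror of the quoted fragment: move the cutoff segment back by
    wal_keep_segments, without going below segment 1."""
    if wal_keep_segments > 0:
        if segno <= wal_keep_segments:
            # avoid underflow, don't go below 1
            return 1
        return segno - wal_keep_segments
    # wal_keep_segments = 0: the cutoff is left unchanged
    return segno


if __name__ == "__main__":
    for segno, keep in [(500, 100), (50, 100), (500, 0)]:
        print(segno, keep, keep_log_seg(segno, keep))
```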
IOW, the segment number (segno) is decremented according to the setting of wal_keep_segments. segno is then sent back to CreateCheckPoint(...) via _logSegNo. The RemoveOldXlogFiles() gets this segment number so that it can remove or recycle all files before this segment number. This function gets the number of WAL files to recycle with the XLOGfileslop constant, which is defined as:
    /*
     * XLOGfileslop is the maximum number of preallocated future XLOG segments.
     * When we are done with an old XLOG segment file, we will recycle it as a
     * future XLOG segment as long as there aren't already XLOGfileslop future
     * segments; else we'll delete it.  This could be made a separate GUC
     * variable, but at present I think it's sufficient to hardwire it as
     * 2*CheckPointSegments+1.  Under normal conditions, a checkpoint will free
     * no more than 2*CheckPointSegments log segments, and we want to recycle all
     * of them; the +1 allows boundary cases to happen without wasting a
     * delete/create-segment cycle.
     */
    #define XLOGfileslop (2*CheckPointSegments + 1)
(in src/backend/access/transam/xlog.c, master branch, line 100)
IOW, PostgreSQL will keep wal_keep_segments WAL files before the current WAL file, and then there may be 2*CheckPointSegments + 1 recycled ones. Hence the formula:
wal_keep_segments + 2 * checkpoint_segments + 1
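To make the gap concrete, here is a small sketch that evaluates the formula derived from the source code next to the documented one. The settings (checkpoint_segments = 10, checkpoint_completion_target = 0.5, wal_keep_segments = 100) are example values picked for illustration, not figures from a customer server.

```python
def derived_wal_files(checkpoint_segments, wal_keep_segments):
    """Formula derived from xlog.c in this message:
    wal_keep_segments kept files plus XLOGfileslop recycled ones."""
    xlog_file_slop = 2 * checkpoint_segments + 1
    return wal_keep_segments + xlog_file_slop


if __name__ == "__main__":
    checkpoint_segments = 10
    checkpoint_completion_target = 0.5
    wal_keep_segments = 100

    # Documented bound, written out inline for comparison.
    documented = max(
        (2 + checkpoint_completion_target) * checkpoint_segments + 1,
        checkpoint_segments + wal_keep_segments + 1,
    )
    derived = derived_wal_files(checkpoint_segments, wal_keep_segments)
    print("documented:", documented, "derived:", derived)
```

With these values the derived count is checkpoint_segments files above the documented bound, which is the kind of excess that would trigger the alerts described above.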
And this is what we usually find on our customers' servers. We may find more WAL files, depending on the write activity of the cluster, but on average, we get this number of WAL files.
AFAICT, the documentation is wrong about the usual number of WAL files in the pg_xlog directory. But I may be wrong, in which case, the documentation isn't clear enough for me, and should be fixed so that others can't misinterpret it like I may have done.
Any comments? Did I miss something, or should we fix the documentation?
Thanks.
--
On 8 August 2014 at 09:08, "Guillaume Lelarge" <guillaume@lelarge.info> wrote:
> [...]
> Any comments? did I miss something, or should we fix the documentation?
>
> Thanks.

Ping?
On Mon, Aug 25, 2014 at 07:12:33AM +0200, Guillaume Lelarge wrote:
> On 8 August 2014 at 09:08, "Guillaume Lelarge" <guillaume@lelarge.info> wrote:
> > [...]
> > Any comments? did I miss something, or should we fix the documentation?

I looked into this, and came up with more questions. Why is checkpoint_completion_target involved in the total number of WAL segments? If checkpoint_completion_target is 0.5 (the default), the calculation is:

    (2 + 0.5) * checkpoint_segments + 1

while if it is 0.9, it is:

    (2 + 0.9) * checkpoint_segments + 1

Is this trying to estimate how many WAL files are going to be created during the checkpoint? If so, wouldn't it be (1 + checkpoint_completion_target), not "2 +". My logic is you have the old WAL files being checkpointed (that's the "1"), plus you have new WAL files being created during the checkpoint, which would be checkpoint_completion_target * checkpoint_segments, plus one for the current WAL file.

The original calculation is summarized in this email:

    http://www.postgresql.org/message-id/AANLkTi=e=oR54OuxAw88=dtV4wt0e5edMiGaeZtBVcKO@mail.gmail.com

However, in my reading of this, it appears to be double-counting the WAL files during the checkpoint, e.g. the checkpoint_completion_target * checkpoint_segments WAL files are also part of the later checkpoint_segments number.

I also don't see how that can be equivalent to:

    checkpoint_segments + wal_keep_segments + 1

because wal_keep_segments isn't used in the first calculation. Is the user supposed to compute the maximum of those two? Seems easier to just give one expression.

Is the right answer:

    max(checkpoint_segments, wal_keep_segments) + checkpoint_segments + 1

or, if you want to use checkpoint_completion_target, it would be:

    max(checkpoint_segments * checkpoint_completion_target, wal_keep_segments) + checkpoint_segments + 1

Is checkpoint_completion_target accurate enough to define a maximum number of files?

I think I need Masao Fujii's comments on this. The fact the user is seeing something different from what is documented means something probably needs updating.

--
Bruce Momjian <bruce@momjian.us> http://momjian.us
EnterpriseDB http://enterprisedb.com
+ Everyone has their own god. +
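The two candidate expressions raised as questions in this message can be evaluated side by side. This is only a sketch for experimenting with settings; neither expression is confirmed anywhere in the thread, and the function names are invented for the example.

```python
def candidate_without_target(checkpoint_segments, wal_keep_segments):
    """max(checkpoint_segments, wal_keep_segments) + checkpoint_segments + 1"""
    return max(checkpoint_segments, wal_keep_segments) + checkpoint_segments + 1


def candidate_with_target(checkpoint_segments,
                          checkpoint_completion_target,
                          wal_keep_segments):
    """max(checkpoint_segments * checkpoint_completion_target,
    wal_keep_segments) + checkpoint_segments + 1"""
    during_checkpoint = checkpoint_segments * checkpoint_completion_target
    return max(during_checkpoint, wal_keep_segments) + checkpoint_segments + 1


if __name__ == "__main__":
    # Example settings for illustration only.
    for keep in (0, 100):
        print(keep,
              candidate_without_target(10, keep),
              candidate_with_target(10, 0.5, keep))
```

The two candidates agree whenever wal_keep_segments is at least checkpoint_segments and differ only when it is smaller.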
On Mon, Oct 13, 2014 at 12:11 PM, Bruce Momjian <bruce@momjian.us> wrote:
> I looked into this, and came up with more questions. Why is
> checkpoint_completion_target involved in the total number of WAL
> segments? If checkpoint_completion_target is 0.5 (the default), the
> calculation is:
>
>     (2 + 0.5) * checkpoint_segments + 1
>
> while if it is 0.9, it is:
>
>     (2 + 0.9) * checkpoint_segments + 1
>
> Is this trying to estimate how many WAL files are going to be created
> during the checkpoint? If so, wouldn't it be (1 +
> checkpoint_completion_target), not "2 +". My logic is you have the old
> WAL files being checkpointed (that's the "1"), plus you have new WAL
> files being created during the checkpoint, which would be
> checkpoint_completion_target * checkpoint_segments, plus one for the
> current WAL file.
WAL is not eligible to be recycled until there have been 2 successful checkpoints.
So at the end of a checkpoint, you have 1 cycle of WAL which has just become eligible for recycling, 1 cycle of WAL which is now expendable but which is kept anyway, and checkpoint_completion_target worth of WAL which has occurred while the checkpoint was occurring and is still needed for crash recovery.
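As a worked example of that accounting, the sketch below tallies the three groups of segments, assuming one "cycle" equals checkpoint_segments files. That equivalence and the function name are assumptions made for the illustration; the trailing "+ 1" of the documented formula is not part of this breakdown.

```python
def wal_cycles_at_checkpoint_end(checkpoint_segments,
                                 checkpoint_completion_target):
    """Break down the WAL present at the end of a checkpoint into the
    three groups described above, in segments."""
    breakdown = {
        # one cycle that has just become eligible for recycling
        "eligible_for_recycling": checkpoint_segments,
        # one cycle that is expendable but kept anyway
        "expendable_but_kept": checkpoint_segments,
        # WAL written while the checkpoint was running
        "written_during_checkpoint": checkpoint_completion_target * checkpoint_segments,
    }
    breakdown["total"] = sum(breakdown.values())
    return breakdown


if __name__ == "__main__":
    print(wal_cycles_at_checkpoint_end(10, 0.5))
```

The total is (2 + checkpoint_completion_target) * checkpoint_segments, which is where the "2 +" in the documented formula would come from under this reading.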
I don't really understand the point of this way of doing things. I guess it is because the control file contains two redo pointers, one for the last checkpoint, and one for the previous to that checkpoint, and if recovery finds that it can't use the most recent one it tries the ones before that. Why? Beats me. If we are worried about the control file getting a corrupt redo pointer, it seems like we would record the last one twice, rather than recording two different ones once each. And if the in-memory version got corrupted before being written to the file, I really doubt anything is going to save your bacon at that point.
I've never seen a case where recovery couldn't use the last recorded good checkpoint, so instead used the previous one, and was successful at it. But then again I haven't seen all possible crashes.
This is based on memory from the last time I looked into this; I haven't re-verified it, so it could be wrong or obsolete.
Cheers,
Jeff
On Tue, Oct 14, 2014 at 09:20:22AM -0700, Jeff Janes wrote:
> [...]
> WAL is not eligible to be recycled until there have been 2 successful
> checkpoints.
>
> So at the end of a checkpoint, you have 1 cycle of WAL which has just become
> eligible for recycling,
> 1 cycle of WAL which is now expendable but which is kept anyway, and
> checkpoint_completion_target worth of WAL which has occurred while the
> checkpoint was occurring and is still needed for crash recovery.

OK, so based on this analysis, what is the right calculation? This?

    (1 + checkpoint_completion_target) * checkpoint_segments + 1 +
    max(wal_keep_segments, checkpoint_segments)

--
Bruce Momjian <bruce@momjian.us>
On Fri, Aug 8, 2014 at 4:08 PM, Guillaume Lelarge <guillaume@lelarge.info> wrote:
> [...]
> AFAICT, the documentation is wrong about the usual number of WAL files in
> the pg_xlog directory. But I may be wrong, in which case, the documentation
> isn't clear enough for me, and should be fixed so that others can't
> misinterpret it like I may have done.
>
> Any comments? did I miss something, or should we fix the documentation?

I think you're right. The correct formula of the number of WAL files in pg_xlog seems to be

    (3 + checkpoint_completion_target) * checkpoint_segments + 1

or

    wal_keep_segments + 2 * checkpoint_segments + 1

Why? At the end of checkpoint, the WAL files which were generated since the start of previous checkpoint cannot be removed and must remain in pg_xlog. The number of them is

    (1 + checkpoint_completion_target) * checkpoint_segments

or

    wal_keep_segments

Also, at the end of checkpoint, as you pointed out, if there are *many* enough old WAL files, 2 * checkpoint_segments + 1 WAL files will be recycled. Then checkpoint_segments WAL files will be consumed till the end of next checkpoint. But since there are already 2 * checkpoint_segments + 1 recycled WAL files, no more files are increased. So, WAL files that we cannot remove and can recycle at the end of checkpoint can exist in pg_xlog, and the num of them can be calculated by the above formula.

If my understanding is right, we need to change the formula at the document.

Regards,

--
Fujii Masao
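The two-part reasoning in this reply (files that cannot be removed, plus recycled files) can be written out as a small sketch. Reading the "or" as "whichever is larger" is an assumption made for this example, as are the function and variable names.

```python
def proposed_wal_files(checkpoint_segments,
                       checkpoint_completion_target,
                       wal_keep_segments):
    """Files that must remain plus files that may be recycled, following
    the reasoning in the reply above."""
    # Generated since the start of the previous checkpoint, or retained
    # because of wal_keep_segments (assumed: whichever is larger).
    not_removable = max(
        (1 + checkpoint_completion_target) * checkpoint_segments,
        wal_keep_segments,
    )
    # Recycled segments, capped by XLOGfileslop.
    recycled = 2 * checkpoint_segments + 1
    return not_removable + recycled


if __name__ == "__main__":
    # Example settings for illustration only.
    print(proposed_wal_files(10, 0.5, 0))
    print(proposed_wal_files(10, 0.5, 100))
```

When wal_keep_segments dominates, the result matches wal_keep_segments + 2 * checkpoint_segments + 1 from the first message; otherwise it matches (3 + checkpoint_completion_target) * checkpoint_segments + 1.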
On Fri, Aug 8, 2014 at 12:08 AM, Guillaume Lelarge <guillaume@lelarge.info> wrote:
> Hi,
> As part of our monitoring work for our customers, we stumbled upon an issue with our customers' servers who have a wal_keep_segments setting higher than 0.
> We have a monitoring script that checks the number of WAL files in the pg_xlog directory, according to the setting of three parameters (checkpoint_completion_target, checkpoint_segments, and wal_keep_segments). We usually add a percentage to the usual formula:
>     greatest(
>       (2 + checkpoint_completion_target) * checkpoint_segments + 1,
>       checkpoint_segments + wal_keep_segments + 1
>     )
I think the first bug is even having this formula in the documentation to start with, and in trying to use it.
"and will normally not be more than..."
This may be "normal" for a toy system. I think that the normal state for any system worth monitoring is that it has had load spikes at some point in the past.
So it is the next part of the doc, which describes how many segments it climbs back down to upon recovering from a spike, which is the important one. And that doesn't mention wal_keep_segments at all, which surely cannot be correct.
I will try to independently derive the correct formula from the code, as you did, without looking too much at your derivation first, and see if we get the same answer.
Cheers,
Jeff
2014-10-15 22:11 GMT+02:00 Jeff Janes <jeff.janes@gmail.com>:
> On Fri, Aug 8, 2014 at 12:08 AM, Guillaume Lelarge <guillaume@lelarge.info> wrote:
>> Hi,
>> As part of our monitoring work for our customers, we stumbled upon an issue with our customers' servers who have a wal_keep_segments setting higher than 0.
>> We have a monitoring script that checks the number of WAL files in the pg_xlog directory, according to the setting of three parameters (checkpoint_completion_target, checkpoint_segments, and wal_keep_segments). We usually add a percentage to the usual formula:
>>     greatest(
>>       (2 + checkpoint_completion_target) * checkpoint_segments + 1,
>>       checkpoint_segments + wal_keep_segments + 1
>>     )
> I think the first bug is even having this formula in the documentation to start with, and in trying to use it.
I agree. But we have customers asking how to compute the right size for their WAL file system partitions. Right size is usually a euphemism for smallest size, and they usually tend to get it wrong, leading to huge issues. And I'm not even speaking of monitoring and alerting.
A way to avoid this issue is probably to erase the formula from the documentation, and find a new way to explain to them how to size their partitions for WALs.
Monitoring is another matter, and I don't really think a monitoring solution should count the WAL files. What really matters is database availability, and that is covered by having enough disk space in the WAL partition.
> "and will normally not be more than..."
> This may be "normal" for a toy system. I think that the normal state for any system worth monitoring is that it has had load spikes at some point in the past.
Agreed.
> So it is the next part of the doc, which describes how many segments it climbs back down to upon recovering from a spike, which is the important one. And that doesn't mention wal_keep_segments at all, which surely cannot be correct.
Agreed too.
> I will try to independently derive the correct formula from the code, as you did, without looking too much at your derivation first, and see if we get the same answer.
Thanks. I look forward to reading what you find.
What seems clear to me right now is that no one has a sane explanation of the formula. Though yours definitely made sense, it didn't seem to be what the code does.
--
On 10/15/2014 01:25 PM, Guillaume Lelarge wrote:
> Monitoring is another matter, and I don't really think a monitoring
> solution should count the WAL files. What actually really matters is the
> database availability, and that is covered with having enough disk space in
> the WALs partition.

If we don't count the WAL files, though, that eliminates the best way to detecting when archiving is failing.

--
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com
2014-10-15 23:12 GMT+02:00 Josh Berkus <josh@agliodbs.com>:
On 10/15/2014 01:25 PM, Guillaume Lelarge wrote:
> Monitoring is another matter, and I don't really think a monitoring
> solution should count the WAL files. What actually really matters is the
> database availability, and that is covered with having enough disk space in
> the WALs partition.
If we don't count the WAL files, though, that eliminates the best way to
detecting when archiving is failing.
WAL files don't give you this directly. You may think it's an issue to get a lot of WAL files, but it can just be a spike of changes. Counting .ready files makes more sense when you're trying to see if wal archiving is failing. And now, using pg_stat_archiver is the way to go (thanks Gabriele :) ).
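A minimal sketch of the .ready-counting idea mentioned above, assuming the archiver status files live in pg_xlog/archive_status under the data directory. The helper name count_ready_files is made up for illustration, and the demo runs against a throwaway directory rather than a real cluster.

```python
import os
import tempfile


def count_ready_files(archive_status_dir):
    """Count WAL segments still waiting to be archived (.ready markers)."""
    return sum(
        1 for name in os.listdir(archive_status_dir) if name.endswith(".ready")
    )


# Demo: a temporary directory stands in for pg_xlog/archive_status.
with tempfile.TemporaryDirectory() as status_dir:
    for name in (
        "000000010000000000000001.ready",
        "000000010000000000000002.ready",
        "000000010000000000000003.done",
    ):
        open(os.path.join(status_dir, name), "w").close()
    pending = count_ready_files(status_dir)

print(pending, "segment(s) waiting to be archived")
```

A steadily growing count suggests archiving is failing or falling behind, whereas the total number of WAL files can also grow from a simple spike of changes.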
--
On 10/15/2014 02:17 PM, Guillaume Lelarge wrote:
>> If we don't count the WAL files, though, that eliminates the best way to
>> detecting when archiving is failing.
>
> WAL files don't give you this directly. You may think it's an issue to get
> a lot of WAL files, but it can just be a spike of changes. Counting .ready
> files makes more sense when you're trying to see if wal archiving is
> failing. And now, using pg_stat_archiver is the way to go (thanks Gabriele
> :) ).
Yeah, a situation where we can't give our users any kind of reasonable monitoring threshold at all sucks though. Also, it makes it kind of hard to allocate a wal partition if it could be 10X the minimum size, you know?
What happened to the work Heikki was doing on making transaction log disk usage sane?
--
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com
Hi,
On Oct 15, 2014 at 22:25, "Guillaume Lelarge" <guillaume@lelarge.info> wrote:
> 2014-10-15 22:11 GMT+02:00 Jeff Janes <jeff.janes@gmail.com>:
>> I will try to independently derive the correct formula from the code, as you did, without looking too much at your derivation first, and see if we get the same answer.
>
> Thanks. I look forward reading what you found.
>
> What seems clear to me right now is that no one has a sane explanation of the formula. Though yours definitely made sense, it didn't seem to be what the code does.
Did you find time to work on this? Any news?
Thanks.
On Wed, Oct 15, 2014 at 1:11 PM, Jeff Janes <jeff.janes@gmail.com> wrote:
On Fri, Aug 8, 2014 at 12:08 AM, Guillaume Lelarge <guillaume@lelarge.info> wrote:
Hi,
As part of our monitoring work for our customers, we stumbled upon an issue with our customers' servers who have a wal_keep_segments setting higher than 0.
We have a monitoring script that checks the number of WAL files in the pg_xlog directory, according to the setting of three parameters (checkpoint_completion_target, checkpoint_segments, and wal_keep_segments). We usually add a percentage to the usual formula:
greatest(
(2 + checkpoint_completion_target) * checkpoint_segments + 1,
checkpoint_segments + wal_keep_segments + 1
)
I think the first bug is even having this formula in the documentation to start with, and in trying to use it.
"and will normally not be more than..."
This may be "normal" for a toy system. I think that the normal state for any system worth monitoring is that it has had load spikes at some point in the past.
So it is the next part of the doc, which describes how many segments it climbs back down to upon recovering from a spike, which is the important one. And that doesn't mention wal_keep_segments at all, which surely cannot be correct.
I will try to independently derive the correct formula from the code, as you did, without looking too much at your derivation first, and see if we get the same answer.
It looked to me that the formula, when descending from a previously stressed state, would be:
greatest(1 + checkpoint_completion_target) * checkpoint_segments, wal_keep_segments) + 1 +
2 * checkpoint_segments + 1
This assumes logs are filled evenly over a checkpoint cycle, which is probably not true because there is a spike in full page writes right after a checkpoint starts.
But I didn't have a great deal of confidence in my analysis.
The first line reflects the number of WAL that will be retained as-is, the second is the number that will be recycled for future use before starting to delete them.
My reading of the code is that wal_keep_segments is computed from the current end of WAL (i.e. the checkpoint record), not from the checkpoint redo point. If I distribute the part outside the 'greatest' into both branches of the 'greatest', I don't get the same answer as you do for either branch.
Then I started wondering if the number we keep for recycling is a good choice, anyway. 2 * checkpoint_segments + 1 seems pretty large. But then again, given that we've reached the high-water-mark once, how unlikely are we to reach it again?
Cheers,
Jeff
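To make the two parts of the formula above concrete, here is a small sketch. It assumes the missing closing parenthesis belongs right after checkpoint_segments, so that greatest() takes two arguments; that reading, the function name, and the sample settings are assumptions, not something confirmed by the code in xlog.c.

```python
def jeff_estimate(checkpoint_segments, checkpoint_completion_target,
                  wal_keep_segments):
    """WAL file count when descending from a previously stressed state."""
    # First line of the formula: files retained as-is.
    retained = max(
        (1 + checkpoint_completion_target) * checkpoint_segments,
        wal_keep_segments,
    ) + 1
    # Second line: files kept around for recycling before deletion starts.
    recycled = 2 * checkpoint_segments + 1
    return retained + recycled


# Hypothetical settings: checkpoint_segments = 10, target = 0.5.
print(jeff_estimate(10, 0.5, 0))    # wal_keep_segments disabled
print(jeff_estimate(10, 0.5, 100))  # wal_keep_segments dominates
```

With wal_keep_segments larger than (1 + checkpoint_completion_target) * checkpoint_segments, only the first term changes; the recycling term is unaffected.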
On Mon, Nov 3, 2014 at 12:39:26PM -0800, Jeff Janes wrote:
> It looked to me that the formula, when descending from a previously stressed
> state, would be:
>
> greatest(1 + checkpoint_completion_target) * checkpoint_segments,
> wal_keep_segments) + 1 +
> 2 * checkpoint_segments + 1
I don't think we can assume checkpoint_completion_target is at all reliable enough to base a maximum calculation on, assuming anything above the maximum is cause of concern and something to inform the admins about.
Assuming checkpoint_completion_target is 1 for maximum purposes, how about:
max(2 * checkpoint_segments, wal_keep_segments) + 2 * checkpoint_segments + 2
--
Bruce Momjian <bruce@momjian.us> http://momjian.us
EnterpriseDB http://enterprisedb.com
+ Everyone has their own god. +
Sorry for my very late answer. It's been a tough month.
--
2014-11-27 0:00 GMT+01:00 Bruce Momjian <bruce@momjian.us>:
On Mon, Nov 3, 2014 at 12:39:26PM -0800, Jeff Janes wrote:
> It looked to me that the formula, when descending from a previously stressed
> state, would be:
>
> greatest(1 + checkpoint_completion_target) * checkpoint_segments,
> wal_keep_segments) + 1 +
> 2 * checkpoint_segments + 1
I don't think we can assume checkpoint_completion_target is at all
reliable enough to base a maximum calculation on, assuming anything
above the maximum is cause of concern and something to inform the admins
about.
Assuming checkpoint_completion_target is 1 for maximum purposes, how
about:
max(2 * checkpoint_segments, wal_keep_segments) + 2 * checkpoint_segments + 2
Seems something I could agree on. At least, it makes sense, and it works for my customers. Although I'm wondering why "+ 2", and not "+ 1". It seems Jeff and you agree on this, so I may have misunderstood something.
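As a sketch of how the bound proposed above could feed a monitoring check: the function below implements the max(...) expression as written, and the percentage margin mirrors the "we usually add a percentage" habit from the start of the thread. The helper names and the default 10% margin are hypothetical.

```python
def bruce_upper_bound(checkpoint_segments, wal_keep_segments):
    """Proposed maximum, taking checkpoint_completion_target as 1."""
    return (max(2 * checkpoint_segments, wal_keep_segments)
            + 2 * checkpoint_segments + 2)


def alert_threshold(bound, margin_percent=10):
    """Hypothetical helper: add an integer percentage margin to the bound."""
    return bound * (100 + margin_percent) // 100


def too_many_wal_files(n_files, checkpoint_segments, wal_keep_segments,
                       margin_percent=10):
    """True when the observed file count exceeds the padded bound."""
    bound = bruce_upper_bound(checkpoint_segments, wal_keep_segments)
    return n_files > alert_threshold(bound, margin_percent)


# Hypothetical server: checkpoint_segments = 10, wal_keep_segments = 0.
print(bruce_upper_bound(10, 0))
print(too_many_wal_files(47, 10, 0))
```

Integer arithmetic is used for the margin so the threshold stays a whole number of files.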
--
On Tue, Dec 30, 2014 at 12:35 AM, Guillaume Lelarge <guillaume@lelarge.info> wrote:
Sorry for my very late answer. It's been a tough month.
2014-11-27 0:00 GMT+01:00 Bruce Momjian <bruce@momjian.us>:
On Mon, Nov 3, 2014 at 12:39:26PM -0800, Jeff Janes wrote:
> It looked to me that the formula, when descending from a previously stressed
> state, would be:
>
> greatest(1 + checkpoint_completion_target) * checkpoint_segments,
> wal_keep_segments) + 1 +
> 2 * checkpoint_segments + 1
I don't think we can assume checkpoint_completion_target is at all
reliable enough to base a maximum calculation on, assuming anything
above the maximum is cause of concern and something to inform the admins
about.
Assuming checkpoint_completion_target is 1 for maximum purposes, how
about:
max(2 * checkpoint_segments, wal_keep_segments) + 2 * checkpoint_segments + 2
Seems something I could agree on. At least, it makes sense, and it works for my customers. Although I'm wondering why "+ 2", and not "+ 1". It seems Jeff and you agree on this, so I may have misunderstood something.
From hazy memory, one +1 comes from the currently active WAL file, which exists but is not counted towards either wal_keep_segments or recycled files. And the other +1 comes from the formula for how many recycled files to retain, which explicitly has a +1 in it.
Cheers,
Jeff
2014-12-30 18:45 GMT+01:00 Jeff Janes <jeff.janes@gmail.com>:
OK, that seems much better. Thanks, Jeff.
--
On Tue, Oct 14, 2014 at 01:21:53PM -0400, Bruce Momjian wrote:
> On Tue, Oct 14, 2014 at 09:20:22AM -0700, Jeff Janes wrote:
> > On Mon, Oct 13, 2014 at 12:11 PM, Bruce Momjian <bruce@momjian.us> wrote:
> >
> > I looked into this, and came up with more questions. Why is
> > checkpoint_completion_target involved in the total number of WAL
> > segments? If checkpoint_completion_target is 0.5 (the default), the
> > calculation is:
> >
> > (2 + 0.5) * checkpoint_segments + 1
> >
> > while if it is 0.9, it is:
> >
> > (2 + 0.9) * checkpoint_segments + 1
> >
> > Is this trying to estimate how many WAL files are going to be created
> > during the checkpoint? If so, wouldn't it be (1 +
> > checkpoint_completion_target), not "2 +". My logic is you have the old
> > WAL files being checkpointed (that's the "1"), plus you have new WAL
> > files being created during the checkpoint, which would be
> > checkpoint_completion_target * checkpoint_segments, plus one for the
> > current WAL file.
> >
> >
> > WAL is not eligible to be recycled until there have been 2 successful
> > checkpoints.
> >
> > So at the end of a checkpoint, you have 1 cycle of WAL which has just become
> > eligible for recycling,
> > 1 cycle of WAL which is now expendable but which is kept anyway, and
> > checkpoint_completion_target worth of WAL which has occurred while the
> > checkpoint was occurring and is still needed for crash recovery.
>
> OK, so based on this analysis, what is the right calculation? This?
>
> (1 + checkpoint_completion_target) * checkpoint_segments + 1 +
> max(wal_keep_segments, checkpoint_segments)
Now that we have min_wal_size and max_wal_size in 9.5, I don't see any value to figuring out the proper formula for backpatching.
--
Bruce Momjian <bruce@momjian.us> http://momjian.us
EnterpriseDB http://enterprisedb.com
+ Everyone has their own god. +
On Tue, 3 Mar 2015 11:15:13 -0500
Bruce Momjian <bruce@momjian.us> wrote:
> Now that we have min_wal_size and max_wal_size in 9.5, I don't see any
> value to figuring out the proper formula for backpatching.
I guess it worth backpatching the documentation as 9.4 -> 9.1 will be supported for somes the next 4 years
--
Jehan-Guillaume de Rorthais
Dalibo
http://www.dalibo.com
On Tue, 31 Mar 2015 08:24:15 +0200
Jehan-Guillaume de Rorthais <jgdr@dalibo.com> wrote:
> On Tue, 3 Mar 2015 11:15:13 -0500
> Bruce Momjian <bruce@momjian.us> wrote:
> > Now that we have min_wal_size and max_wal_size in 9.5, I don't see any
> > value to figuring out the proper formula for backpatching.
>
> I guess it worth backpatching the documentation as 9.4 -> 9.1 will be
> supported for somes the next 4 years
Sorry, lack of caffeine this morning. Fired the mail before correcting and finishing it:
I guess it is worth backpatching the documentation, as 9.4 -> 9.1 will be supported for some more years.
I'll give it a try this week.
Regards,
--
Jehan-Guillaume de Rorthais
Dalibo
http://www.dalibo.com
Hi,
As I'm writing a doc patch for 9.4 -> 9.0, I'll discuss this formula below, as it is the last one accepted by most of you.
On Mon, 3 Nov 2014 12:39:26 -0800
Jeff Janes <jeff.janes@gmail.com> wrote:
> It looked to me that the formula, when descending from a previously
> stressed state, would be:
>
> greatest(1 + checkpoint_completion_target) * checkpoint_segments,
> wal_keep_segments) + 1 +
> 2 * checkpoint_segments + 1
It lacks a closing parenthesis. I guess the formula is:
greatest (
(1 + checkpoint_completion_target) * checkpoint_segments,
wal_keep_segments
)
+ 1 + 2 * checkpoint_segments + 1
> This assumes logs are filled evenly over a checkpoint cycle, which is
> probably not true because there is a spike in full page writes right after
> a checkpoint starts.
>
> But I didn't have a great deal of confidence in my analysis.
The only problem I have with this formula is that, considering checkpoint_completion_target ~ 1 and wal_keep_segments = 0, it becomes:
4 * checkpoint_segments + 2
which violates the well-known, observed, and default one:
3 * checkpoint_segments + 1
A value above this formula means the system cannot cope with the number of files to flush. The doc is right about that:
If, due to a short-term peak of log output rate, there
are more than 3 * <varname>checkpoint_segments</varname> + 1
segment files, the unneeded segment files will be deleted
The formula in the doc is wrong when wal_keep_segments <> 0.
> The first line reflects the number of WAL that will be retained as-is,
I agree that these files MUST be retained: the set of checkpoint_segments WALs being flushed and the checkpoint_completion_target ones written in the meantime.
> the second is the number that will be recycled for future use before starting
> to delete them.
I disagree, because the WAL files being written are actually consuming recycled WALs in the meantime.
Your formula expects that written files are created and recycled ones are never touched, leading to this checkpoint_segments + 1 difference between the formulas.
> My reading of the code is that wal_keep_segments is computed from the
> current end of WAL (i.e the checkpoint record), not from the checkpoint
> redo point. If I distribute the part outside the 'greatest' into both
> branches of the 'greatest', I don't get the same answer as you do for
> either branch.
So the formula, using checkpoint_completion_target=1, should be:
greatest (
checkpoint_segments,
wal_keep_segments
)
+ 2 * checkpoint_segments + 1
Please find attached to this email a documentation patch for 9.4 using this formula.
Regards,
--
Jehan-Guillaume de Rorthais
Dalibo
http://www.dalibo.com
Attachment
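The disagreement in the message above can be checked with a little arithmetic. This sketch compares the parenthesis-corrected version of Jeff's formula with the one proposed for the doc patch, for checkpoint_completion_target = 1 and wal_keep_segments = 0. The function names and the sample value checkpoint_segments = 10 are assumptions made for illustration.

```python
def jeff_estimate(checkpoint_segments, checkpoint_completion_target,
                  wal_keep_segments):
    """Jeff's formula, with the closing parenthesis restored."""
    return (max((1 + checkpoint_completion_target) * checkpoint_segments,
                wal_keep_segments)
            + 1 + 2 * checkpoint_segments + 1)


def doc_patch_estimate(checkpoint_segments, wal_keep_segments):
    """Formula proposed for the doc patch (completion target taken as 1)."""
    return (max(checkpoint_segments, wal_keep_segments)
            + 2 * checkpoint_segments + 1)


cs = 10  # hypothetical checkpoint_segments
a = jeff_estimate(cs, 1, 0)     # 4 * cs + 2
b = doc_patch_estimate(cs, 0)   # 3 * cs + 1
print(a, b, a - b)              # the gap is cs + 1
```

The gap of checkpoint_segments + 1 is exactly the difference the message attributes to counting recycled files that are consumed by new WAL in the meantime.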
On Wed, Apr 1, 2015 at 7:00 PM, Jehan-Guillaume de Rorthais <jgdr@dalibo.com> wrote:
> So The formula, using checkpoint_completion_target=1, should be:
>
> greatest (
> checkpoint_segments,
> wal_keep_segments
> )
> + 2 * checkpoint_segments + 1
No. Please imagine how many WAL files can exist at the end of a checkpoint.
At the end of a checkpoint, we have to keep all the WAL files which were generated since the starting point of the previous checkpoint, for future crash recovery. The number of these WAL files is
(1 + checkpoint_completion_target) * checkpoint_segments
or
wal_keep_segments
In addition to these files, at the end of a checkpoint, old WAL files which were generated before the starting point of the previous checkpoint are recycled. The number of these WAL files is at most
2 * checkpoint_segments + 1
Note that *usually* there are not so many WAL files at the end of a checkpoint. But this can happen after a peak of WAL logging generates too many WAL files.
So the sum of those is the right formula, i.e.,
(3 + checkpoint_completion_target) * checkpoint_segments + 1
or
wal_keep_segments + 2 * checkpoint_segments + 1
If checkpoint_completion_target is 1 and wal_keep_segments is 0, it can become 4 * checkpoint_segments + 1.
Regards,
--
Fujii Masao
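A short sketch of the sum described in this last message. It reads the "or" between the two alternatives as "whichever is larger", which is an assumption in line with how the documentation phrases its own formula; the function name and sample settings are hypothetical.

```python
def fujii_estimate(checkpoint_segments, checkpoint_completion_target,
                   wal_keep_segments):
    """Files kept for crash recovery plus files kept for recycling."""
    # WAL generated since the start of the previous checkpoint, or the
    # wal_keep_segments floor, whichever is larger (assumed reading of "or").
    kept = max(
        (1 + checkpoint_completion_target) * checkpoint_segments,
        wal_keep_segments,
    )
    # Older files that may be recycled rather than deleted.
    recycled_at_most = 2 * checkpoint_segments + 1
    return kept + recycled_at_most


# Hypothetical settings with checkpoint_segments = 10.
print(fujii_estimate(10, 1, 0))      # 4 * checkpoint_segments + 1
print(fujii_estimate(10, 0.5, 100))  # wal_keep_segments + 2 * cs + 1
```

Compared with the parenthesis-corrected formula from earlier in the thread, this one differs only by a constant 1.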