Re: Maximum number of WAL files in the pg_xlog directory - Mailing list pgsql-hackers

From Guillaume Lelarge
Subject Re: Maximum number of WAL files in the pg_xlog directory
Date
Msg-id CAECtzeWXY_v8-eBuC+mZRLs7y94z0ppLSHN2+2t3sJDkhyhb6g@mail.gmail.com
In response to Re: Maximum number of WAL files in the pg_xlog directory  (Guillaume Lelarge <guillaume@lelarge.info>)
List pgsql-hackers
Hi,

On 15 Oct 2014 at 22:25, "Guillaume Lelarge" <guillaume@lelarge.info> wrote:
>
> 2014-10-15 22:11 GMT+02:00 Jeff Janes <jeff.janes@gmail.com>:
>>
>> On Fri, Aug 8, 2014 at 12:08 AM, Guillaume Lelarge <guillaume@lelarge.info> wrote:
>>>
>>> Hi,
>>>
>>> As part of our monitoring work for our customers, we stumbled upon an issue with our customers' servers that have a wal_keep_segments setting higher than 0.
>>>
>>> We have a monitoring script that checks the number of WAL files in the pg_xlog directory, according to the setting of three parameters (checkpoint_completion_target, checkpoint_segments, and wal_keep_segments). We usually add a percentage to the usual formula:
>>>
>>> greatest(
>>>   (2 + checkpoint_completion_target) * checkpoint_segments + 1,
>>>   checkpoint_segments + wal_keep_segments + 1
>>> )
>>
>>
>> I think the first bug is even having this formula in the documentation to start with, and in trying to use it.
>>
>
> I agree. But we have customers asking how to compute the right size for their WAL file system partitions. Right size is usually a euphemism for smallest size, and they usually tend to get it wrong, leading to huge issues. And I'm not even speaking of monitoring and alerting.
>
> A way to avoid this issue is probably to erase the formula from the documentation, and find a new way to explain to them how to size their partitions for WALs.
>
> Monitoring is another matter, and I don't really think a monitoring solution should count the WAL files. What really matters is database availability, and that is covered by having enough disk space in the WAL partition.
>
>> "and will normally not be more than..."
>>
>> This may be "normal" for a toy system.  I think that the normal state for any system worth monitoring is that it has had load spikes at some point in the past.
>>
>
> Agreed.
>
>>
>> So it is the next part of the doc, which describes how many segments it climbs back down to upon recovering from a spike, which is the important one.  And that doesn't mention wal_keep_segments at all, which surely cannot be correct.
>>
>
> Agreed too.
>
>>
>> I will try to independently derive the correct formula from the code, as you did, without looking too much at your derivation first, and see if we get the same answer.
>>
>
> Thanks. I look forward to reading what you found.
>
> What seems clear to me right now is that no one has a sane explanation of the formula. Though yours definitely made sense, it didn't seem to be what the code does.
>

Did you find time to work on this? Any news?

Thanks.
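
For readers who want to try the numbers, here is a minimal sketch of the formula quoted above, written in Python. It only restates the documented formula that this thread is questioning; it is not derived from the server code, so treat its result as the documented estimate and not as a guaranteed upper bound. The function names and the margin_pct parameter (standing in for the "percentage" the monitoring script adds) are hypothetical, and the 16 MB segment size assumes a default build.

```python
import math

# Assumption: default WAL segment size of 16 MB (can differ on custom builds).
WAL_SEGMENT_MB = 16


def documented_wal_file_estimate(checkpoint_segments,
                                 checkpoint_completion_target,
                                 wal_keep_segments):
    """Evaluate the formula quoted in the thread:

    greatest(
      (2 + checkpoint_completion_target) * checkpoint_segments + 1,
      checkpoint_segments + wal_keep_segments + 1
    )
    """
    by_checkpoints = (2 + checkpoint_completion_target) * checkpoint_segments + 1
    by_keep_segments = checkpoint_segments + wal_keep_segments + 1
    return max(by_checkpoints, by_keep_segments)


def alert_threshold(estimate, margin_pct):
    """Hypothetical helper: add a percentage margin and round up to whole files."""
    return math.ceil(estimate * (1 + margin_pct / 100))


def estimate_size_mb(estimate):
    """Disk space implied by the estimate, rounded up to whole segments."""
    return math.ceil(estimate) * WAL_SEGMENT_MB


if __name__ == "__main__":
    # Example settings chosen for illustration only.
    files = documented_wal_file_estimate(3, 0.5, 0)
    print(files, alert_threshold(files, 10), estimate_size_mb(files))
```

With wal_keep_segments above 0, the second term of greatest() quickly dominates, which is exactly the case where the monitoring script in the original report ran into trouble.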
