Thread: WAL archiving idle database
I have a test PG 8.2.5 installation that has been left idle with no connections to it whatsoever for the last 24 hours plus. WALs are being archived exactly 5 minutes apart, even though archive_timeout is set to 60. Is this the expected behavior for a database with no changes? Brian
On Oct 26, 2007, at 4:08 PM, Brian Wipf wrote:
> I have a test PG 8.2.5 installation that has been left idle with no
> connections to it whatsoever for the last 24 hours plus. WALs are
> being archived exactly 5 minutes apart, even though archive_timeout
> is set to 60. Is this the expected behavior for a database with no
> changes?
>
> Brian

In the absence of activity, WAL are archived every checkpoint_timeout seconds. archive_timeout is how long postgres will wait for the archive_command to return before declaring it failed.

Erik Jones

Software Developer | Emma®
erik@myemma.com
800.595.4401 or 615.292.5888
615.292.0777 (fax)

Emma helps organizations everywhere communicate & market in style.
Visit us online at http://www.myemma.com
On Fri, 2007-10-26 at 15:08 -0600, Brian Wipf wrote:
> I have a test PG 8.2.5 installation that has been left idle with no
> connections to it whatsoever for the last 24 hours plus. WALs are
> being archived exactly 5 minutes apart, even though archive_timeout
> is set to 60. Is this the expected behavior for a database with no
> changes?

If it's set to just "60", that means 60 seconds.

What's happening is that you have a checkpoint_timeout of 5 minutes, and that checkpoint must write a checkpoint record in the WAL, prompting the archiving. If you want it to happen less frequently, it's often safe to set checkpoint_timeout to something larger by a reasonable amount.

Anyone using archive_timeout is going to end up with quite a few mostly-empty 16MB files to deal with. Someone wrote a utility to zero out the empty space in WAL segments; you might look at "pg_clearxlogtail" written by Kevin Grittner (search the archives or pgfoundry). This allows you to gzip the files to basically nothing.

Regards,
Jeff Davis
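As an aside on the zero-then-compress trick: the effect is easy to demonstrate without PostgreSQL at all. The sketch below (plain shell, not pg_clearxlogtail itself) builds a fake 16MB segment with one used 8KB page and a zeroed tail, plus one full of leftover bytes, and compares how they gzip.

```shell
# Rough illustration only -- not pg_clearxlogtail itself.
# A 16MB "segment" with one used 8KB page and a zeroed tail,
# vs. the same size file full of stale (random) bytes.
dd if=/dev/urandom of=seg_zeroed bs=8192 count=1 2>/dev/null
dd if=/dev/zero bs=8192 count=2047 2>/dev/null >> seg_zeroed

dd if=/dev/urandom of=seg_dirty bs=8192 count=2048 2>/dev/null

gzip -c seg_zeroed > seg_zeroed.gz   # compresses to a few tens of KB
gzip -c seg_dirty  > seg_dirty.gz    # barely shrinks at all
ls -l seg_zeroed.gz seg_dirty.gz
```

The zero-tailed copy gzips down to next to nothing, which is the point of clearing the unused tail before shipping segments off to a standby.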
On Fri, 2007-10-26 at 16:31 -0500, Erik Jones wrote:
> In the absence of activity, WAL are archived every checkpoint_timeout
> seconds. archive_timeout is how long postgres will wait for the
> archive_command to return before declaring it failed.

http://www.postgresql.org/docs/current/static/runtime-config-wal.html

"When this parameter is greater than zero, the server will switch to a new segment file whenever this many seconds have elapsed since the last segment file switch."

Regards,
Jeff Davis
Jeff Davis <pgsql@j-davis.com> writes:
> What's happening is that you have a checkpoint_timeout of 5 minutes, and
> that checkpoint must write a checkpoint record in the WAL, prompting the
> archiving.
>
> If you want it to happen less frequently, it's often safe to have
> checkpoint timeout set to something larger by a reasonable amount.

I think you're confusing checkpoint_timeout and archive_timeout...

I seem to recall this behavior having been discussed before, but I can't find it in the archives right now. What is happening is that after each checkpoint_timeout, we test to see if we need to write a new checkpoint; which is determined by whether anything's been inserted into WAL since the start of the last checkpoint. And after each archive_timeout, we test to see if we need to flush the current WAL segment out to the archive; which is determined by whether the write pointer is currently exactly at the start of a segment or not.

Which would be fine, except that the "has anything been inserted since last checkpoint" test is actually done by seeing if the WAL insert pointer has moved. Which it will have, if we did an archive flush. And that means that each of these activities makes it look to the other one like something has happened, and so you get a checkpoint record every checkpoint_timeout seconds, and then we flush the entire WAL file (containing only that record), even if the database is in reality completely idle. Lather, rinse, repeat.

In the prior discussion that I seem to remember, we didn't think of a decent solution, and it kinda fell off the radar since zero-activity isn't too interesting to a lot of folks. However, chewing on it again I think I've come up with a good idea that will fix this and actually simplify the code a bit:

* Add a boolean flag insertedXLog to XLogCtlInsert, which means "at least one WAL record has been inserted since start of last checkpoint". Also add a flag completedCkpt somewhere in XLogCtlData, which means "checkpoint successfully completed"; this second flag is only used by checkpoint so it can be considered as being protected by the CheckpointLock. At startup we can initialize insertedXLog = false, completedCkpt = true.

* XLogInsert sets insertedXLog to true while holding WALInsertLock, *except* when inserting either a checkpoint record or an xlog switch record; in those cases it doesn't change the flag.

* CreateCheckpoint replaces its current rather complex test (lines 5693-5703 in CVS-tip xlog.c) with "if insertedXLog is clear and completedCkpt is set, we need not checkpoint". If it does have to perform a checkpoint, it clears both flags before releasing WALInsertLock.

* After successful completion of a checkpoint, completedCkpt gets set.

Because insertedXLog is cleared at the same time the checkpoint's REDO pointer is determined, this will correctly implement the requirement of detecting whether anything has been inserted since the last REDO point. This replaces the current indirect test involving comparing the last checkpoint's REDO pointer to its own address. However we have to not set insertedXLog when we finally do insert the checkpoint record, thus the special case is needed in XLogInsert. The other special case of ignoring xlog switch is what's needed to fix the bug, and is obviously OK because an xlog switch doesn't represent a checkpointable change.

The reason we need the completedCkpt flag is that if a checkpoint fails partway through, it would nonetheless have cleared insertedXLog, and we don't want that to cause us to not retry the checkpoint next time. This is slightly warty but it certainly seems a lot clearer than the current test in lines 5693-5703. The couple of lines to be added to XLogInsert should have negligible performance impact.

Comments?

			regards, tom lane
>>> On Fri, Oct 26, 2007 at 5:47 PM, in message <695.1193438855@sss.pgh.pa.us>, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> And after
> each archive_timeout, we test to see if we need to flush the current WAL
> segment out to the archive; which is determined by whether the write
> pointer is currently exactly at the start of a segment or not.

Hmmm... We would actually prefer to get the WAL file at the specified interval. We have software to ensure that the warm standby instances are not getting stale, and that's pretty simple with the current behavior. We don't have a bandwidth or storage space issue because we zero out the unused portion of the WAL file and gzip it -- an empty file's about 16 KB. Checking that the whole system is healthy gets a lot more complicated if we stop sending empty WAL files.

Could this at least be a configurable option?

-Kevin
"Kevin Grittner" <Kevin.Grittner@wicourts.gov> writes: > On Fri, Oct 26, 2007 at 5:47 PM, in message <695.1193438855@sss.pgh.pa.us>, > Tom Lane <tgl@sss.pgh.pa.us> wrote: >> And after >> each archive_timeout, we test to see if we need to flush the current WAL >> segment out to the archive; which is determined by whether the write >> pointer is currently exactly at the start of a segment or not. > Hmmm... We would actually prefer to get the WAL file at the > specified interval. Well, if it's a feature not a bug, that's fine with me. I wonder though how predictable the behavior will really be with 8.3's distributed checkpoints ... you might need to find another way anyhow. regards, tom lane
On Fri, 2007-10-26 at 18:06 -0500, Kevin Grittner wrote:
> >>> On Fri, Oct 26, 2007 at 5:47 PM, in message <695.1193438855@sss.pgh.pa.us>, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> > And after
> > each archive_timeout, we test to see if we need to flush the current WAL
> > segment out to the archive; which is determined by whether the write
> > pointer is currently exactly at the start of a segment or not.
>
> Hmmm... We would actually prefer to get the WAL file at the
> specified interval. We have software to ensure that the warm
> standby instances are not getting stale, and that's pretty simple
> with the current behavior. We don't have a bandwidth or storage
> space issue because we zero out the unused portion of the WAL file
> and gzip it -- an empty file's about 16 KB. Checking that the whole
> system is healthy gets a lot more complicated if we stop sending
> empty WAL files.
>
> Could this at least be a configurable option?

A good point. Keep in mind that even in the current system, your configuration is variable based on the checkpoint_timeout setting.

Regards,
Jeff Davis
Jeff Davis <pgsql@j-davis.com> writes:
> Keep in mind that even in the current system, your configuration is
> variable based on the checkpoint_timeout setting.

Yeah, and he has to keep this less than archive_timeout in order for it to work the way he wants, which is probably not good for performance. (Sane settings of checkpoint_timeout are probably higher, not lower, than what people are likely to use for archive_timeout.)

I think my recommendation to Kevin would be to force some trivial transaction to occur a little before each expected archive_timeout, so that there will be something to be archived. This would have the additional advantage that the monitor is checking that the database is actually responding to queries, whereas just noting that it's spitting out WAL files doesn't really prove that --- especially not if mere no-op checkpoints can cause WAL files to be emitted.

			regards, tom lane
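A minimal version of that recommendation, sketched as a cron entry; the "monitoring" database and "heartbeat" table here are hypothetical, not anything from the thread:

```
# Hypothetical heartbeat: force a WAL-logged change a little before
# each expected archive_timeout (here assumed to be one hour).
# Database name and table are illustrative assumptions.
55 * * * *  psql -d monitoring -c "UPDATE heartbeat SET last_beat = now()"
```

Any WAL-logged write will do; the point is that the segment switch then always has a real record to ship, and a failing database stops producing files.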
On Fri, 2007-10-26 at 18:47 -0400, Tom Lane wrote:
> Jeff Davis <pgsql@j-davis.com> writes:
> > What's happening is that you have a checkpoint_timeout of 5 minutes, and
> > that checkpoint must write a checkpoint record in the WAL, prompting the
> > archiving.
>
> > If you want it to happen less frequently, it's often safe to have
> > checkpoint timeout set to something larger by a reasonable amount.
>
> I think you're confusing checkpoint_timeout and archive_timeout...

Thanks for clarifying it. The user-visible behavior, as I understand it, is that the time between archivings on an idle database is:

    MAX(archive_timeout, checkpoint_timeout)

[ of course, there's no guarantee that the archive_command succeeds in that time ]

It looks like checkpoint_timeout was the limiting factor in his case.

Regards,
Jeff Davis
On Fri, 2007-10-26 at 18:06 -0500, Kevin Grittner wrote:
> Hmmm... We would actually prefer to get the WAL file at the
> specified interval. We have software to ensure that the warm
> standby instances are not getting stale, and that's pretty simple
> with the current behavior. We don't have a bandwidth or storage

Another thought: when you say it's "pretty simple", what do you do now? My monitoring scripts for this particular situation employ some pretty ugly code.

I think if this did get changed, I would change my script to monitor the pg_current_xlog_location() of the primary database and compare it to the last "restored log file..." entry in the standby database's log. I would think that if the current location does not end in all zeros, you should expect a new WAL segment to be archived soon. Although this assumes that an idle database would not advance that location at all, and I'm still trying to understand Tom's proposal well enough to know whether that would be true or not.

If this doesn't get changed, I think we should archive every archive_timeout seconds, rather than every MAX(archive_timeout, checkpoint_timeout) seconds, which is less obvious.

Regards,
Jeff Davis
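The "ends in all zeros" check Jeff describes is simple hex arithmetic; a sketch, assuming the default 16MB segment size and the "logid/offset" string format that pg_current_xlog_location() returns:

```shell
# Does an xlog location like "16/B374D848" sit exactly on a
# 16MB segment boundary?  (Default segment size; offset is hex.)
at_segment_boundary() {
    local off=${1#*/}                    # strip the "logid/" part
    [ $(( 0x$off % 0x1000000 )) -eq 0 ]  # 0x1000000 = 16MB
}

# Against a live primary this would be used something like:
#   loc=$(psql -Atc "SELECT pg_current_xlog_location()")
#   at_segment_boundary "$loc" && echo "idle at a segment boundary"
```

Under Tom's proposal an idle primary would park exactly on a boundary, so this predicate staying true for a long stretch would mean "idle", not "stuck".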
Jeff Davis <pgsql@j-davis.com> writes: > I would think if the current location does not end in all zeros, you > should expect a new WAL segment to be archived soon. Although this > assumes that an idle database would not advance that location at all, > and I'm still trying to understand Tom's proposal well enough to know > whether that would be true or not. With my proposal, after the last activity, you'd get a checkpoint, and then at the next archive_timeout we'd advance the pointer to a segment boundary and archive the old segment, and then nothing more would happen until the next WAL-loggable update. So yeah, the master's pg_current_xlog_location could be expected to sit at a segment boundary while it was idle. regards, tom lane
On Oct 26, 2007, at 4:46 PM, Jeff Davis wrote:
> On Fri, 2007-10-26 at 16:31 -0500, Erik Jones wrote:
>> In the absence of activity, WAL are archived every checkpoint_timeout
>> seconds. archive_timeout is how long postgres will wait for the
>> archive_command to return before declaring it failed.
>
> http://www.postgresql.org/docs/current/static/runtime-config-wal.html
>
> "When this parameter is greater than zero, the server will switch to a
> new segment file whenever this many seconds have elapsed since the last
> segment file switch."
>
> Regards,
> Jeff Davis

Ah, my bad :)

Erik Jones
On Fri, 2007-10-26 at 14:39 -0700, Jeff Davis wrote:
> On Fri, 2007-10-26 at 15:08 -0600, Brian Wipf wrote:
> > I have a test PG 8.2.5 installation that has been left idle with no
> > connections to it whatsoever for the last 24 hours plus. WALs are
> > being archived exactly 5 minutes apart, even though archive_timeout
> > is set to 60. Is this the expected behavior for a database with no
> > changes?
>
> If it's set to just "60" that means 60 seconds.
>
> What's happening is that you have a checkpoint_timeout of 5 minutes, and
> that checkpoint must write a checkpoint record in the WAL, prompting the
> archiving.

archive_timeout is the maximum time to wait for a log switch once new WAL has been written. That interval is not the same thing as how often WAL records are written.

On the idle server a checkpoint record is written every checkpoint_timeout. Then archive_timeout kicks in 60 seconds later and switches the log, which then archives that file. Then four minutes later a checkpoint occurs, sees that there is no immediately preceding checkpoint because of the log switch, and writes a new checkpoint record. Then 60 seconds later...

Overall this produces one WAL file every checkpoint_timeout during idle times, yet without relaxing the guarantee that WAL will be archived every archive_timeout seconds.

We *could* force it to perform a log switch whether or not new WAL has been written, if that's what people think is wanted. I'd seen the behaviour as beneficial up til now.

--
Simon Riggs
2ndQuadrant  http://www.2ndQuadrant.com
Simon Riggs <simon@2ndquadrant.com> writes: > We *could* force it to perform a log switch whether or not new WAL has > been written. That would be a truly awful idea. I think the right answer for Kevin's situation really is to be doing a dummy transaction every few minutes as one component of his system monitoring. That way he gets the behavior he wants, and it needn't preclude us from fixing the default behavior to be less wasteful. regards, tom lane
>>> On Fri, Oct 26, 2007 at 6:39 PM, in message <1193441976.7624.128.camel@dogma.ljc.laika.com>, Jeff Davis <pgsql@j-davis.com> wrote:
> On Fri, 2007-10-26 at 18:06 -0500, Kevin Grittner wrote:
>> Hmmm... We would actually prefer to get the WAL file at the
>> specified interval. We have software to ensure that the warm
>> standby instances are not getting stale, and that's pretty simple
>> with the current behavior.
>
> Another thought: when you say it's "pretty simple", what do you do now?
> My monitoring scripts for this particular situation employ some pretty
> ugly code.

Here's our script:

#! /bin/bash
if [ "$1" == "" ] ; then
  savepwd=$PWD
  cd /var/pgsql/data/county/
  find * -maxdepth 0 -type d | xargs -idirname $0 dirname
  cd $savepwd
  exit 0
fi
for countyName ; do
  echo County: $countyName
  /usr/local/pgsql/bin/pg_controldata /var/pgsql/data/county/$countyName/data \
    | grep -E '(Database cluster state|pg_control last modified)'
  /etc/init.d/postgresql-${countyName}-cc status
  grep basebackup /var/pgsql/data/county/$countyName/data/basebackup-of-this-instance
  echo ''
done

Here's an example of running it (although the opcenter usually runs it without a parameter, to get all counties):

opcenter@PGBACKUP:~> sudo pgstatus.sh iowa
County: iowa
Database cluster state:               in archive recovery
pg_control last modified:             Mon 29 Oct 2007 09:03:16 AM CDT
pg_ctl: server is running (PID: 15902)
/usr/local/pgsql-8.2.4/bin/postgres -D /var/pgsql/data/county/iowa/data
basebackup cc-2007-10-26_190001

This gets parsed by a script in our monitor (python, I think) and winds up feeding a status display. It's probably a bit crude, but it has worked well for us, with very little time required to get it going.

This thread has made me aware that it is dependent on the checkpoint frequency as well as the archive frequency. Our checkpoint_timeout setting is 30min and our archive_timeout is the default (one hour).
The monitor shows red if the cluster state isn't "in archive recovery", or pg_ctl doesn't report "server is running", or the last modified time is older than 75 minutes. We are OK with a one hour archive interval because we have a separate application-oriented transaction stream (independent of the database product) coming back real-time, which we can replay to "top off" a database backup.

-Kevin
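Kevin's 75-minute rule can be expressed directly as a small shell check; a sketch assuming GNU stat, with an illustrative path (pg_control lives under $PGDATA/global/):

```shell
# Flag a standby as stale when pg_control hasn't been modified in
# 75 minutes (one-hour archive_timeout plus slack).  Assumes GNU
# stat; the example path is illustrative, not from the thread.
is_stale() {
    local ctl=$1
    local limit=$(( 75 * 60 ))
    local age=$(( $(date +%s) - $(stat -c %Y "$ctl") ))
    [ "$age" -gt "$limit" ]
}

# e.g.: is_stale /var/pgsql/data/county/iowa/data/global/pg_control
```

Note this is exactly the check that breaks if an idle primary stops emitting WAL files, which is why the thread's dummy-transaction heartbeat matters for this style of monitoring.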
>>> On Fri, Oct 26, 2007 at 6:28 PM, in message <1193441286.7624.117.camel@dogma.ljc.laika.com>, Jeff Davis <pgsql@j-davis.com> wrote: > [ of course, there's no guarantee that the archive_command succeeds in > that time ] Which is one of the things we would want to cause an alert. -Kevin
On Mon, 2007-10-29 at 09:56 -0500, Kevin Grittner wrote: > Here's our script: Thanks, I think that is better than what I'm doing. One minor thing: I think it's still dependent on locale though, because the output of pg_controldata is locale-dependent, right? It would work fine for me, but it would be nice if there was something that could be released that anyone could use, including non-english installations. Also, did you publish your pg_clearxlogtail program anywhere? I think that would be helpful to many people, but I don't see it on pgfoundry. Regards, Jeff Davis
>>> On Mon, Oct 29, 2007 at 11:50 AM, in message <1193676615.7624.140.camel@dogma.ljc.laika.com>, Jeff Davis <pgsql@j-davis.com> wrote:
> Also, did you publish your pg_clearxlogtail program anywhere? I think
> that would be helpful to many people, but I don't see it on pgfoundry.

So far I've just included it with the email on the hackers list. I have made one fix since: I found that an explicit close of stdout speeds the worst-case situation to break-even. (I'm guessing that's safer, too.) So in all of my tests it is now as fast or faster to pipe through this on the way to gzip than to just pipe through gzip. I'll see about getting that onto pgfoundry soon.

-Kevin
Jeff Davis <pgsql@j-davis.com> writes: > One minor thing: I think it's still dependent on locale though, because > the output of pg_controldata is locale-dependent, right? It would work > fine for me, but it would be nice if there was something that could be > released that anyone could use, including non-english installations. Wouldn't "export LANG=C" fix that? regards, tom lane
On Mon, 2007-10-29 at 14:20 -0400, Tom Lane wrote: > Jeff Davis <pgsql@j-davis.com> writes: > > One minor thing: I think it's still dependent on locale though, because > > the output of pg_controldata is locale-dependent, right? It would work > > fine for me, but it would be nice if there was something that could be > > released that anyone could use, including non-english installations. > > Wouldn't "export LANG=C" fix that? > Ah, of course. Thanks, Jeff Davis