Thread: Re: [GENERAL] WAL archiving idle database

Re: [GENERAL] WAL archiving idle database

From

Tom Lane

Date:

26 October 2007, 22:47:55

Jeff Davis <pgsql@j-davis.com> writes:
> What's happening is that you have a checkpoint_timeout of 5 minutes, and
> that checkpoint must write a checkpoint record in the WAL, prompting the
> archiving.

> If you want it to happen less frequently, it's often safe to have
> checkpoint timeout set to something larger by a reasonable amount.

I think you're confusing checkpoint_timeout and archive_timeout...

I seem to recall this behavior having been discussed before, but I
can't find it in the archives right now.  What is happening is that
after each checkpoint_timeout, we test to see if we need to write
a new checkpoint; which is determined by whether anything's been
inserted into WAL since the start of the last checkpoint.  And after
each archive_timeout, we test to see if we need to flush the current WAL
segment out to the archive; which is determined by whether the write
pointer is currently exactly at the start of a segment or not.

Which would be fine, except that the "has anything been inserted
since last checkpoint" test is actually done by seeing if the WAL
insert pointer has moved.  Which it will have, if we did an archive
flush.  And that means that each of these activities makes it look
to the other one like something has happened, and so you get a
checkpoint record every checkpoint_timeout seconds, and then we flush
the entire WAL file (containing only that record), even if the database
is in reality completely idle.  Lather, rinse, repeat.

In the prior discussion that I seem to remember, we didn't think of
a decent solution, and it kinda fell off the radar since zero-activity
isn't too interesting to a lot of folks.  However, chewing on it again
I think I've come up with a good idea that will fix this and actually
simplify the code a bit:

* Add a boolean flag insertedXLog to XLogCtlInsert, which means "at
least one WAL record has been inserted since start of last checkpoint".
Also add a flag completedCkpt somewhere in XLogCtlData, which means
"checkpoint successfully completed"; this second flag is only used by
checkpoint so it can be considered as being protected by the
CheckpointLock.  At startup we can initialize insertedXLog = false,
completedCkpt = true.

* XLogInsert sets insertedXLog to true while holding WALInsertLock,
*except* when inserting either a checkpoint record or an xlog switch
record; in those cases it doesn't change the flag.

* CreateCheckpoint replaces its current rather complex test (lines
5693-5703 in CVS-tip xlog.c) with "if insertedXLog is clear and
completedCkpt is set, we need not checkpoint".  If it does have
to perform a checkpoint, it clears both flags before releasing
WALInsertLock.

* After successful completion of a checkpoint, completedCkpt gets set.

Because insertedXLog is cleared at the same time the checkpoint's REDO
pointer is determined, this will correctly implement the requirement of
detecting whether anything has been inserted since the last REDO point.
This replaces the current indirect test involving comparing the last
checkpoint's REDO pointer to its own address.  However we have to not
set insertedXLog when we finally do insert the checkpoint record, thus
the special case is needed in XLogInsert.  The other special case of
ignoring xlog switch is what's needed to fix the bug, and is obviously
OK because an xlog switch doesn't represent a checkpointable change.

The reason we need the completedCkpt flag is that if a checkpoint fails
partway through, it would nonetheless have cleared insertedXLog, and
we don't want that to cause us to not retry the checkpoint next time.

This is slightly warty but it certainly seems a lot clearer than the
current test in lines 5693-5703.  The couple of lines to be added to
XLogInsert should have negligible performance impact.

Comments?

            regards, tom lane

Re: [GENERAL] WAL archiving idle database

From

"Kevin Grittner"

Date:

26 October 2007, 23:06:27

>>> On Fri, Oct 26, 2007 at  5:47 PM, in message <695.1193438855@sss.pgh.pa.us>,
Tom Lane <tgl@sss.pgh.pa.us> wrote:

> And after
> each archive_timeout, we test to see if we need to flush the current WAL
> segment out to the archive; which is determined by whether the write
> pointer is currently exactly at the start of a segment or not.

Hmmm...  We would actually prefer to get the WAL file at the
specified interval.  We have software to ensure that the warm
standby instances are not getting stale, and that's pretty simple
with the current behavior.  We don't have a bandwidth or storage
space issue because we zero out the unused portion of the WAL file
and gzip it -- an empty file's about 16 KB.  Checking that the whole
system is healthy gets a lot more complicated if we stop sending
empty WAL files.

Could this at least be a configurable option?

-Kevin

Re: [GENERAL] WAL archiving idle database

From

Jeff Davis

Date:

26 October 2007, 23:16:35

On Fri, 2007-10-26 at 18:06 -0500, Kevin Grittner wrote:
> >>> On Fri, Oct 26, 2007 at  5:47 PM, in message <695.1193438855@sss.pgh.pa.us>,
> Tom Lane <tgl@sss.pgh.pa.us> wrote:
>
> > And after
> > each archive_timeout, we test to see if we need to flush the current WAL
> > segment out to the archive; which is determined by whether the write
> > pointer is currently exactly at the start of a segment or not.
>
> Hmmm...  We would actually prefer to get the WAL file at the
> specified interval.  We have software to ensure that the warm
> standby instances are not getting stale, and that's pretty simple
> with the current behavior.  We don't have a bandwidth or storage
> space issue because we zero out the unused portion of the WAL file
> and gzip it -- an empty file's about 16 KB.  Checking that the whole
> system is healthy gets a lot more complicated if we stop sending
> empty WAL files.
>
> Could this at least be a configurable option?
>

A good point.

Keep in mind that even in the current system, your configuration is
variable based on the checkpoint_timeout setting.

Regards,
    Jeff Davis

Re: [GENERAL] WAL archiving idle database

From

Tom Lane

Date:

26 October 2007, 23:16:42

"Kevin Grittner" <Kevin.Grittner@wicourts.gov> writes:
> On Fri, Oct 26, 2007 at  5:47 PM, in message <695.1193438855@sss.pgh.pa.us>,
> Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> And after
>> each archive_timeout, we test to see if we need to flush the current WAL
>> segment out to the archive; which is determined by whether the write
>> pointer is currently exactly at the start of a segment or not.

> Hmmm...  We would actually prefer to get the WAL file at the
> specified interval.

Well, if it's a feature not a bug, that's fine with me.  I wonder though
how predictable the behavior will really be with 8.3's distributed
checkpoints ... you might need to find another way anyhow.

            regards, tom lane

Re: [GENERAL] WAL archiving idle database

From

Tom Lane

Date:

26 October 2007, 23:26:14

Jeff Davis <pgsql@j-davis.com> writes:
> Keep in mind that even in the current system, your configuration is
> variable based on the checkpoint_timeout setting.

Yeah, and he has to keep this less than archive_timeout in order for
it to work the way he wants, which is probably not good for performance.
(Sane settings of checkpoint_timeout are probably higher, not lower,
than what people are likely to use for archive_timeout.)

I think my recommendation to Kevin would be to force some trivial
transaction to occur a little before each expected archive_timeout,
so that there will be something to be archived.  This would have the
additional advantage that the monitor is checking that the database is
actually responding to queries, whereas just noting that it's spitting
out WAL files doesn't really prove that --- especially not if mere
no-op checkpoints can cause WAL files to be emitted.

            regards, tom lane

Re: [GENERAL] WAL archiving idle database

From

Jeff Davis

Date:

26 October 2007, 23:28:41

On Fri, 2007-10-26 at 18:47 -0400, Tom Lane wrote:
> Jeff Davis <pgsql@j-davis.com> writes:
> > What's happening is that you have a checkpoint_timeout of 5 minutes, and
> > that checkpoint must write a checkpoint record in the WAL, prompting the
> > archiving.
>
> > If you want it to happen less frequently, it's often safe to have
> > checkpoint timeout set to something larger by a reasonable amount.
>
> I think you're confusing checkpoint_timeout and archive_timeout...

Thanks for clarifying it. The user-visible behavior, as I understand it,
is that the time between archiving on an idle database is:
MAX(archive_timeout,checkpoint_timeout)

[ of course, there's no guarantee that the archive_command succeeds in
that time ]

It looks like checkpoint_timeout was the limiting factor, in his case.

Regards,
    Jeff Davis

Re: [GENERAL] WAL archiving idle database

From

Jeff Davis

Date:

26 October 2007, 23:39:52

On Fri, 2007-10-26 at 18:06 -0500, Kevin Grittner wrote:
> Hmmm...  We would actually prefer to get the WAL file at the
> specified interval.  We have software to ensure that the warm
> standby instances are not getting stale, and that's pretty simple
> with the current behavior.  We don't have a bandwidth or storage

Another thought: when you say it's "pretty simple", what do you do now?
My monitoring scripts for this particular situation employ some pretty
ugly code.

I think if this did get changed, I would change my script to monitor the
pg_current_xlog_location() of the primary database and compare to the
last "restored log file..." entry in the standby database's log.

I would think if the current location does not end in all zeros, you
should expect a new WAL segment to be archived soon. Although this
assumes that an idle database would not advance that location at all,
and I'm still trying to understand Tom's proposal well enough to know
whether that would be true or not.

If this doesn't get changed, I think we should archive every
archive_timeout seconds, rather than
MAX(archive_timeout,checkpoint_timeout), which is less obvious.

Regards,
    Jeff Davis

Re: [GENERAL] WAL archiving idle database

From

Tom Lane

Date:

26 October 2007, 23:48:07

Jeff Davis <pgsql@j-davis.com> writes:
> I would think if the current location does not end in all zeros, you
> should expect a new WAL segment to be archived soon. Although this
> assumes that an idle database would not advance that location at all,
> and I'm still trying to understand Tom's proposal well enough to know
> whether that would be true or not.

With my proposal, after the last activity, you'd get a checkpoint, and
then at the next archive_timeout we'd advance the pointer to a segment
boundary and archive the old segment, and then nothing more would happen
until the next WAL-loggable update.  So yeah, the master's
pg_current_xlog_location could be expected to sit at a segment boundary
while it was idle.

            regards, tom lane

Re: [GENERAL] WAL archiving idle database

From

"Kevin Grittner"

Date:

29 October 2007, 15:24:42

>>> On Fri, Oct 26, 2007 at  6:39 PM, in message
<1193441976.7624.128.camel@dogma.ljc.laika.com>, Jeff Davis <pgsql@j-davis.com>
wrote:
> On Fri, 2007-10-26 at 18:06 -0500, Kevin Grittner wrote:
>> Hmmm...  We would actually prefer to get the WAL file at the
>> specified interval.  We have software to ensure that the warm
>> standby instances are not getting stale, and that's pretty simple
>> with the current behavior.
>
> Another thought: when you say it's "pretty simple", what do you do now?
> My monitoring scripts for this particular situation employ some pretty
> ugly code.

Here's our script:

#! /bin/bash

if [ "$1" == "" ] ; then
  savepwd=$PWD
  cd /var/pgsql/data/county/
  find * -maxdepth 0 -type d | xargs -idirname $0 dirname
  cd $savepwd
  exit 0
fi

for countyName ; do
  echo County: $countyName
  /usr/local/pgsql/bin/pg_controldata /var/pgsql/data/county/$countyName/data | grep -E '(Database cluster
state|pg_controllast modified)' 
  /etc/init.d/postgresql-${countyName}-cc status
  grep basebackup /var/pgsql/data/county/$countyName/data/basebackup-of-this-instance
  echo ''
done

Here's an example of running it (although the opcenter usually runs
it without a parameter, to get all counties):

opcenter@PGBACKUP:~> sudo pgstatus.sh iowa
County: iowa
Database cluster state:               in archive recovery
pg_control last modified:             Mon 29 Oct 2007 09:03:16 AM CDT
pg_ctl: server is running (PID: 15902)
/usr/local/pgsql-8.2.4/bin/postgres -D /var/pgsql/data/county/iowa/data
basebackup    cc-2007-10-26_190001

This gets parsed by a script in our monitor (python, I think) and
winds up feeding a status display.  It's probably a bit crude, but
it has worked well for us, with very little time required to get it
going.  This thread has made me aware that it is dependent on the
checkpoint frequency as well as the archive frequency.  Our
checkpoint_timeout setting is 30min and our archive_timeout is the
default (one hour).  The monitor shows red if the cluster state
isn't "in archvie recovery" or pg_ctl doesn't report "server is
running" or the last modified is older than 75 minutes.

We are OK with a one hour archive interval because we have a
separate application-oriented transaction stream (independent of
the database product) coming back real-time, which we can replay
to "top off" a database backup.

-Kevin

Re: [GENERAL] WAL archiving idle database

From

"Kevin Grittner"

Date:

29 October 2007, 15:55:45

>>> On Fri, Oct 26, 2007 at  6:28 PM, in message
<1193441286.7624.117.camel@dogma.ljc.laika.com>, Jeff Davis <pgsql@j-davis.com>
wrote:

> [ of course, there's no guarantee that the archive_command succeeds in
> that time ]

Which is one of the things we would want to cause an alert.

-Kevin

Re: [GENERAL] WAL archiving idle database

From

Jeff Davis

Date:

29 October 2007, 16:50:34

On Mon, 2007-10-29 at 09:56 -0500, Kevin Grittner wrote:
> Here's our script:

Thanks, I think that is better than what I'm doing.

One minor thing: I think it's still dependent on locale though, because
the output of pg_controldata is locale-dependent, right? It would work
fine for me, but it would be nice if there was something that could be
released that anyone could use, including non-english installations.

Also, did you publish your pg_clearxlogtail program anywhere? I think
that would be helpful to many people, but I don't see it on pgfoundry.

Regards,
    Jeff Davis

Re: [GENERAL] WAL archiving idle database

From

"Kevin Grittner"

Date:

29 October 2007, 17:18:19

>>> On Mon, Oct 29, 2007 at 11:50 AM, in message
<1193676615.7624.140.camel@dogma.ljc.laika.com>, Jeff Davis <pgsql@j-davis.com>
wrote:

> Also, did you publish your pg_clearxlogtail program anywhere? I think
> that would be helpful to many people, but I don't see it on pgfoundry.

So far I've just included with the email on the hackers list.  I
have made one fix since: I found that an explicit close of stdout
speeds the worst-case situation to break-even.  (I'm guessing that's
safer, too.)  So in all of my tests it is now as fast or faster to
pipe through this on the way to gzip than to just pipe through gzip.

I'll see about getting that onto pgfoundry soon.

-Kevin

Re: [GENERAL] WAL archiving idle database

From

Tom Lane

Date:

29 October 2007, 18:21:06

Jeff Davis <pgsql@j-davis.com> writes:
> One minor thing: I think it's still dependent on locale though, because
> the output of pg_controldata is locale-dependent, right? It would work
> fine for me, but it would be nice if there was something that could be
> released that anyone could use, including non-english installations.

Wouldn't "export LANG=C" fix that?

            regards, tom lane

Re: [GENERAL] WAL archiving idle database

From

Jeff Davis

Date:

29 October 2007, 18:26:44

On Mon, 2007-10-29 at 14:20 -0400, Tom Lane wrote:
> Jeff Davis <pgsql@j-davis.com> writes:
> > One minor thing: I think it's still dependent on locale though, because
> > the output of pg_controldata is locale-dependent, right? It would work
> > fine for me, but it would be nice if there was something that could be
> > released that anyone could use, including non-english installations.
>
> Wouldn't "export LANG=C" fix that?
>

Ah, of course.

Thanks,
    Jeff Davis