Re: warm standby server stops doing checkpoints after awhile - Mailing list pgsql-general

From Simon Riggs
Subject Re: warm standby server stops doing checkpoints after awhile
Date
Msg-id 1180695498.26297.97.camel@silverbirch.site
In response to Re: warm standby server stops doing checkpoints after a while  (Tom Lane <tgl@sss.pgh.pa.us>)
Responses Re: warm standby server stops doing checkpoints after awhile  (Frank Wittig <fw@weisshuhn.de>)
List pgsql-general
On Thu, 2007-05-31 at 10:23 -0400, Tom Lane wrote:
> Frank Wittig <fw@weisshuhn.de> writes:
> > The problem is that the slave server stops checkpointing after some
> > hours of working (about 24 to 48 hours of continued log replay).
>
> Hm ... look at RecoveryRestartPoint() in xlog.c.  Could there be
> something wrong with this logic?
>
>     /*
>      * Do nothing if the elapsed time since the last restartpoint is less than
>      * half of checkpoint_timeout.    (We use a value less than
>      * checkpoint_timeout so that variations in the timing of checkpoints on
>      * the master, or speed of transmission of WAL segments to a slave, won't
>      * make the slave skip a restartpoint once it's synced with the master.)
>      * Checking true elapsed time keeps us from doing restartpoints too often
>      * while rapidly scanning large amounts of WAL.
>      */
>     elapsed_secs = time(NULL) - ControlFile->time;
>     if (elapsed_secs < CheckPointTimeout / 2)
>         return;
>
> The idea is that the slave (once in sync with the master) ought to
> checkpoint every time it sees a checkpoint record in the master's
> output.  I'm not seeing a flaw but maybe there is one here, or somewhere
> nearby.  Are you sure the master is checkpointing?
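
For reference, the time gate quoted above can be isolated into a small testable function. This is only a sketch: restartpoint_due() is a hypothetical helper rather than anything in xlog.c, and the 300-second value is an assumed checkpoint_timeout.

```c
#include <stdbool.h>
#include <time.h>

/* Assumed stand-in for the checkpoint_timeout setting, in seconds. */
static int CheckPointTimeout = 300;

/*
 * Hypothetical helper: returns true once at least half of
 * CheckPointTimeout has elapsed since the last restartpoint,
 * i.e. the inverse of the "elapsed_secs < CheckPointTimeout / 2" test.
 */
static bool
restartpoint_due(time_t now, time_t last_restartpoint)
{
    long elapsed_secs = (long) (now - last_restartpoint);

    return elapsed_secs >= CheckPointTimeout / 2;
}
```

On its own this gate can only delay a restartpoint, not suppress it indefinitely: as long as the wall clock keeps advancing and checkpoint records keep arriving, the test eventually passes.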

Hmmm. This can happen if a backend crashes halfway through any set of
changes that causes safe_restartpoint() to return false. Or it might be
that one of the index AMs doesn't correctly clear its multi-WAL actions
in some corner cases.
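
To make that scenario concrete, here is a simplified sketch of how a per-resource-manager safety callback can block every restartpoint. The struct, the table, and the pending_index_splits counter are all hypothetical simplifications, not the real definitions from the PostgreSQL source.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified resource-manager entry. */
typedef struct
{
    const char *name;
    bool      (*safe_restartpoint) (void);  /* NULL means "always safe" */
} ResourceManager;

/* Hypothetical count of multi-record actions whose final record never arrived. */
static int pending_index_splits = 0;

static bool
index_safe_restartpoint(void)
{
    return pending_index_splits == 0;
}

static const ResourceManager rmgr_table[] = {
    {"heap", NULL},
    {"index", index_safe_restartpoint},
};

/* A restartpoint may proceed only if no resource manager objects. */
static bool
all_rmgrs_safe(void)
{
    for (size_t i = 0; i < sizeof(rmgr_table) / sizeof(rmgr_table[0]); i++)
    {
        if (rmgr_table[i].safe_restartpoint != NULL &&
            !rmgr_table[i].safe_restartpoint())
            return false;
    }
    return true;
}
```

In this model, if the record that completes a multi-record action is never written (for example because the backend died first), the counter never returns to zero and all_rmgrs_safe() stays false from then on.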

Or it could be that the mdsync looping problem has been worse than we
thought and checkpoints have been avoided completely for some time.

Frank,

This is repeatable, yes?
Has anything crashed on your server?
Are you using GIN or GIST indexes?

I'll look at putting some debug information in there that logs whether
multi-WAL actions remain unresolved for any length of time.
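
A rough sketch of what such a check might look like follows. Every name here (PendingAction, report_stale_action, the message text) is hypothetical and chosen for illustration; none of it comes from xlog.c.

```c
#include <stdbool.h>
#include <stdio.h>
#include <time.h>

/* Hypothetical record of one unfinished multi-WAL action. */
typedef struct
{
    bool   pending;      /* is the action still unresolved? */
    time_t first_seen;   /* when replay first saw it */
} PendingAction;

/*
 * If the action has been pending for more than max_age_secs, format a
 * debug message into buf and return true; otherwise return false and
 * leave buf untouched.
 */
static bool
report_stale_action(const PendingAction *action, time_t now,
                    long max_age_secs, char *buf, size_t buflen)
{
    long age;

    if (!action->pending)
        return false;

    age = (long) (now - action->first_seen);
    if (age <= max_age_secs)
        return false;

    snprintf(buf, buflen,
             "multi-WAL action unresolved for %ld s; restartpoint skipped",
             age);
    return true;
}
```

Calling something like this each time a restartpoint is skipped would show in the log whether a stuck action, rather than the time gate, is what keeps the standby from checkpointing.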

Continuing to think about this one....

--
  Simon Riggs
  EnterpriseDB   http://www.enterprisedb.com


