
Re: warm standby server stops doing checkpoints after awhile - Mailing list pgsql-general

From	Simon Riggs
Subject	Re: warm standby server stops doing checkpoints after awhile
Date	June 1, 2007 08:00:25
Msg-id	1180695498.26297.97.camel@silverbirch.site
In response to	Re: warm standby server stops doing checkpoints after a while (Tom Lane <tgl@sss.pgh.pa.us>)
Responses	Re: warm standby server stops doing checkpoints after awhile
List	pgsql-general


On Thu, 2007-05-31 at 10:23 -0400, Tom Lane wrote:
> Frank Wittig <fw@weisshuhn.de> writes:
> > The problem is that the slave server stops checkpointing after some
> > hours of working (about 24 to 48 hours of continued log replay).
>
> Hm ... look at RecoveryRestartPoint() in xlog.c.  Could there be
> something wrong with this logic?
>
>     /*
>      * Do nothing if the elapsed time since the last restartpoint is less than
>      * half of checkpoint_timeout.    (We use a value less than
>      * checkpoint_timeout so that variations in the timing of checkpoints on
>      * the master, or speed of transmission of WAL segments to a slave, won't
>      * make the slave skip a restartpoint once it's synced with the master.)
>      * Checking true elapsed time keeps us from doing restartpoints too often
>      * while rapidly scanning large amounts of WAL.
>      */
>     elapsed_secs = time(NULL) - ControlFile->time;
>     if (elapsed_secs < CheckPointTimeout / 2)
>         return;
>
> The idea is that the slave (once in sync with the master) ought to
> checkpoint every time it sees a checkpoint record in the master's
> output.  I'm not seeing a flaw but maybe there is one here, or somewhere
> nearby.  Are you sure the master is checkpointing?

Hmmm. This can happen if a backend crashes while half-way through any
set of changes that causes safe_restartpoint() to be true. Or it might
be that one of the index AMs doesn't correctly clear the multi-WAL actions
in some corner cases.
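
To make that failure mode concrete, here is a minimal, self-contained sketch
of the kind of gating involved. It is not the xlog.c code: the names
(should_restartpoint, rmgr_safe_fn, the two toy callbacks) are made up for
illustration, and it assumes a restartpoint is skipped both when too little
time has elapsed and when any resource manager reports an unresolved
multi-WAL action.

```c
#include <stdbool.h>
#include <stddef.h>
#include <time.h>

/* Hypothetical callback type: returns true when the resource manager
 * has no half-finished multi-record action pending. */
typedef bool (*rmgr_safe_fn)(void);

/* Illustrative only; not the real RecoveryRestartPoint(). */
static bool
should_restartpoint(time_t now, time_t last_restartpoint,
                    int checkpoint_timeout_secs,
                    const rmgr_safe_fn *rmgrs, size_t nrmgrs)
{
    /* Throttle: skip if less than half of checkpoint_timeout has elapsed. */
    if (difftime(now, last_restartpoint) < checkpoint_timeout_secs / 2)
        return false;

    /* Veto: any resource manager with an unresolved action blocks it. */
    for (size_t i = 0; i < nrmgrs; i++)
    {
        if (rmgrs[i] != NULL && !rmgrs[i]())
            return false;
    }
    return true;
}

/* Toy callbacks standing in for index AMs (assumptions, not real AMs). */
static bool always_safe(void) { return true; }
static bool stuck_action(void) { return false; }
```

With a callback like stuck_action that never clears, the function returns
false no matter how much time has passed, which would match the symptom of
a standby that silently stops doing restartpoints.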

Or it could be that the mdsync looping problem has been worse than we
thought and checkpoints have been avoided completely for some time.

Frank,

This is repeatable, yes?
Has anything crashed on your server?
Are you using GIN or GIST indexes?

I'll look at putting some debug information in there that logs whether
multi-WAL actions remain unresolved for any length of time.
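
A rough sketch of what such debug tracking could look like, under stated
assumptions: PendingTracker, track_pending and report_if_stale are
hypothetical names, the threshold is arbitrary, and the message goes to
stderr here rather than through the server's own logging.

```c
#include <stdbool.h>
#include <stdio.h>
#include <time.h>

/* Hypothetical tracker for one resource manager's pending state. */
typedef struct PendingTracker
{
    bool   pending;        /* was an unresolved action seen last time? */
    time_t pending_since;  /* when it was first seen */
} PendingTracker;

/*
 * Call each time a checkpoint record is seen during replay. Returns how
 * many seconds the action has been outstanding, or 0 if nothing is pending.
 */
static long
track_pending(PendingTracker *t, bool is_safe, time_t now)
{
    if (is_safe)
    {
        t->pending = false;
        return 0;
    }
    if (!t->pending)
    {
        /* First sighting of an unresolved action: start the clock. */
        t->pending = true;
        t->pending_since = now;
    }
    return (long) difftime(now, t->pending_since);
}

/* Emit a message once the action has been pending for too long. */
static bool
report_if_stale(const char *rmgr_name, long pending_secs, long threshold_secs)
{
    if (pending_secs < threshold_secs)
        return false;
    fprintf(stderr, "restartpoint skipped: %s has had an unresolved "
            "multi-WAL action for %ld s\n", rmgr_name, pending_secs);
    return true;
}
```

Feeding the return value of track_pending into report_if_stale at each
checkpoint record would show whether an action is merely in flight or has
been stuck across many checkpoint intervals.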

Continuing to think about this one....

--
  Simon Riggs
  EnterpriseDB   http://www.enterprisedb.com
