Re: time-delayed standbys - Mailing list pgsql-hackers

From Robert Haas
Subject Re: time-delayed standbys
Date
Msg-id BANLkTinfrgsVK_8o9+mw6kWRcC4BxiR4jw@mail.gmail.com
In response to Re: time-delayed standbys  (Jaime Casanova <jaime@2ndquadrant.com>)
Responses Re: time-delayed standbys  (Fujii Masao <masao.fujii@gmail.com>)
Re: time-delayed standbys  (Heikki Linnakangas <heikki.linnakangas@enterprisedb.com>)
List pgsql-hackers
On Sat, Apr 23, 2011 at 9:46 PM, Jaime Casanova <jaime@2ndquadrant.com> wrote:
> On Tue, Apr 19, 2011 at 9:47 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>>
>> That is, a standby configured such that replay lags a prescribed
>> amount of time behind the master.
>>
>> This seemed easy to implement, so I did.  Patch (for 9.2, obviously) attached.
>>
>
> This crashes when stopping recovery at a target (I tried with a named
> restore point and with a point in time) after executing
> pg_xlog_replay_resume(). Here is the backtrace. I will try to check
> later but I wanted to report it before...
>
> #0  0xb7777537 in raise () from /lib/libc.so.6
> #1  0xb777a922 in abort () from /lib/libc.so.6
> #2  0x08393a19 in errfinish (dummy=0) at elog.c:513
> #3  0x083944ba in elog_finish (elevel=22, fmt=0x83d5221 "wal receiver still active") at elog.c:1156
> #4  0x080f04cb in StartupXLOG () at xlog.c:6691
> #5  0x080f2825 in StartupProcessMain () at xlog.c:10050
> #6  0x0811468f in AuxiliaryProcessMain (argc=2, argv=0xbfa326a8) at bootstrap.c:417
> #7  0x0827c2ea in StartChildProcess (type=StartupProcess) at postmaster.c:4488
> #8  0x08280b85 in PostmasterMain (argc=3, argv=0xa4c17e8) at postmaster.c:1106
> #9  0x0821730f in main (argc=3, argv=0xa4c17e8) at main.c:199

Sorry for the slow response on this - I was on vacation for a week and
my schedule got a big hole in it.

I was able to reproduce something very like this in unpatched master,
just by letting recovery pause at a named restore point, and then
resuming it.

LOG:  recovery stopping at restore point "stop", time 2011-05-07 09:28:01.652958-04
LOG:  recovery has paused
HINT:  Execute pg_xlog_replay_resume() to continue.
(at this point I did pg_xlog_replay_resume())
LOG:  redo done at 0/5000020
PANIC:  wal receiver still active
LOG:  startup process (PID 38762) was terminated by signal 6: Abort trap
LOG:  terminating any other active server processes

I'm thinking that this code is wrong:
    if (recoveryPauseAtTarget && standbyState == STANDBY_SNAPSHOT_READY)
    {
        SetRecoveryPause(true);
        recoveryPausesHere();
    }
    reachedStopPoint = true;    /* see below */
    recoveryContinue = false;

I think the recoveryContinue = false assignment should not happen if
we decide to pause.  That is, the code should read: if
(recoveryPauseAtTarget && standbyState == STANDBY_SNAPSHOT_READY)
{ same as now } else recoveryContinue = false;
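
Spelled out as a standalone sketch, the change would look roughly like
the following.  The globals and the two stubbed functions are stand-ins
so the snippet compiles on its own (the real ones live in xlog.c, and
the real recoveryPausesHere() blocks until recovery is resumed),
handle_recovery_target() is a made-up wrapper name, and leaving
reachedStopPoint = true unconditional is an assumption on my part:

```c
#include <stdbool.h>

/* Reduced stand-in for the real hot standby state enum. */
typedef enum
{
    STANDBY_DISABLED,
    STANDBY_SNAPSHOT_READY
} HotStandbyState;

/* Stand-ins for the startup process state used by the redo loop. */
static bool recoveryPauseAtTarget = true;
static HotStandbyState standbyState = STANDBY_SNAPSHOT_READY;
static bool reachedStopPoint = false;
static bool recoveryContinue = true;
static bool pauseRequested = false;

/* Stub: the real function sets a flag in shared memory. */
static void
SetRecoveryPause(bool pause)
{
    pauseRequested = pause;
}

/* Stub: the real function waits until recovery is resumed. */
static void
recoveryPausesHere(void)
{
    pauseRequested = false;
}

/* Hypothetical wrapper around the code that runs at the recovery target. */
static void
handle_recovery_target(void)
{
    if (recoveryPauseAtTarget && standbyState == STANDBY_SNAPSHOT_READY)
    {
        /* Pause, and leave recoveryContinue alone so redo can go on. */
        SetRecoveryPause(true);
        recoveryPausesHere();
    }
    else
    {
        /* Not pausing: stop replay at the target. */
        recoveryContinue = false;
    }
    reachedStopPoint = true;    /* placement unchanged; an assumption */
}
```

The idea is that when the pause branch is taken, recoveryContinue stays
true, so the redo loop would keep going after pg_xlog_replay_resume()
rather than falling through to end-of-recovery.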

I haven't tested that, though.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

