Re: max_standby_delay considered harmful - Mailing list pgsql-hackers

From Simon Riggs
Subject Re: max_standby_delay considered harmful
Msg-id 1273692314.308.1059.camel@ebony
In response to Re: max_standby_delay considered harmful  (Stefan Kaltenbrunner <stefan@kaltenbrunner.cc>)
List pgsql-hackers
On Wed, 2010-05-12 at 21:10 +0200, Stefan Kaltenbrunner wrote:

> > There is no evidence to link this behaviour with HS, as yet, and you
> > should be considering the possibility the problem lies elsewhere,
> > especially since it could be code you committed that is at fault.
> 
> Well I'm not sure why people seem to have that hard a time reproducing 
> that issue - it seems that I can provoke it really trivially (in this 
> case no loops, no pgbench, no tricks). A few minutes ago I logged into 
> my test standby (which is idle except for the odd connect to template1 
> caused by nagios - the master is idle as well and has been for days):

Thanks, good report.

> so it restarted two times successfully - however if one looks at the 
> third time one can see that it received the smart shutdown request 
> BEFORE it reached a consistent recovery state - yet it continued to 
> enable HS and reenabled SR as well.
> 
> The database is now sitting there doing nothing and it is more or less 
> broken because you cannot connect to it in the current state:
> 
> ~$ psql
> psql: FATAL:  the database system is shutting down
> 
> the startup process has the following backtrace:
> 
> (gdb) bt
> #0  0x00007fbe24cb2c83 in select () from /lib/libc.so.6
> #1  0x00000000006e811a in pg_usleep ()
> #2  0x000000000048c333 in XLogPageRead ()
> #3  0x000000000048c967 in ReadRecord ()
> #4  0x0000000000493ab6 in StartupXLOG ()
> #5  0x0000000000495a88 in StartupProcessMain ()
> #6  0x00000000004ab25e in AuxiliaryProcessMain ()
> #7  0x00000000005d4a7d in StartChildProcess ()
> #8  0x00000000005d70c2 in PostmasterMain ()
> #9  0x000000000057d898 in main ()

Well, it's waiting for new info from the primary. Nothing to do with locking,
but that's not an indication that it's an SR issue either. ;-)

I'll put some waits into that part of the code and see if I can induce
the failure. Maybe it's just a simple lack of a CHECK_FOR_INTERRUPTS().
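
To illustrate the kind of failure being suspected here, the following is a
minimal sketch, not the PostgreSQL source. It assumes a sleep-and-poll wait
loop similar in spirit to the pg_usleep() call in the backtrace. All names
(shutdown_requested, handle_shutdown, never_ready, wait_for_data) are
hypothetical, and the max_polls bound exists only so the sketch terminates.
With check_interrupts set to false, the loop never looks at the shutdown flag
and keeps sleeping even though a shutdown was requested.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <signal.h>
#include <stdbool.h>
#include <stddef.h>
#include <time.h>

/* Hypothetical flag, set asynchronously when a shutdown is requested. */
static volatile sig_atomic_t shutdown_requested = 0;

/* Hypothetical signal handler: it only records the request. */
void handle_shutdown(int signo)
{
    (void) signo;
    shutdown_requested = 1;
}

/* Hypothetical stand-in for "has new data arrived from the primary?" */
bool never_ready(void)
{
    return false;
}

typedef enum { WAIT_GOT_DATA, WAIT_INTERRUPTED, WAIT_TIMED_OUT } wait_result;

/*
 * Poll data_ready() up to max_polls times, sleeping 1 ms between polls.
 * When check_interrupts is true, the pending shutdown flag is examined
 * on every iteration; when false, the flag is ignored, which is the
 * suspected bug pattern.
 */
wait_result wait_for_data(bool (*data_ready)(void), int max_polls,
                          bool check_interrupts)
{
    const struct timespec nap = {0, 1000000L};  /* 1 ms */

    assert(data_ready != NULL && max_polls >= 0);

    for (int i = 0; i < max_polls; i++)
    {
        if (check_interrupts && shutdown_requested)
            return WAIT_INTERRUPTED;
        if (data_ready())
            return WAIT_GOT_DATA;
        nanosleep(&nap, NULL);
    }
    return WAIT_TIMED_OUT;
}
```

In a real process, handle_shutdown would be installed with sigaction() and
the loop would have no poll limit, so the variant without the check would
wait indefinitely for data that never arrives.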

-- Simon Riggs           www.2ndQuadrant.com


