On Wed, 2010-05-12 at 21:10 +0200, Stefan Kaltenbrunner wrote:
> > There is no evidence to link this behaviour with HS, as yet, and you
> > should be considering the possibility the problem lies elsewhere,
> > especially since it could be code you committed that is at fault.
>
> Well, I'm not sure why people seem to have such a hard time reproducing
> that issue - it seems that I can provoke it really trivially (in this
> case no loops, no pgbench, no tricks). A few minutes ago I logged into
> my test standby (which is idle except for the odd connect to template1
> caused by nagios - the master is idle as well and has been for days):

Thanks, good report.

> so it restarted two times successfully - however if one looks at the
> third time one can see that it received the smart shutdown request
> BEFORE it reached a consistent recovery state - yet it continued to
> enable HS and reenabled SR as well.
>
> The database is now sitting there doing nothing and is more or less
> broken because you cannot connect to it in the current state:
>
> ~$ psql
> psql: FATAL: the database system is shutting down
>
> the startup process has the following backtrace:
>
> (gdb) bt
> #0 0x00007fbe24cb2c83 in select () from /lib/libc.so.6
> #1 0x00000000006e811a in pg_usleep ()
> #2 0x000000000048c333 in XLogPageRead ()
> #3 0x000000000048c967 in ReadRecord ()
> #4 0x0000000000493ab6 in StartupXLOG ()
> #5 0x0000000000495a88 in StartupProcessMain ()
> #6 0x00000000004ab25e in AuxiliaryProcessMain ()
> #7 0x00000000005d4a7d in StartChildProcess ()
> #8 0x00000000005d70c2 in PostmasterMain ()
> #9 0x000000000057d898 in main ()

Well, it's waiting for new info from the primary. That has nothing to do
with locking, though it's not an indication of an SR issue either. ;-)

I'll put some waits into that part of the code and see if I can induce
the failure. Maybe it's just a missing CHECK_FOR_INTERRUPTS().
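
To illustrate the kind of bug I have in mind, here is a minimal standalone
sketch of a polling loop that sleeps while waiting for data and checks for
a pending shutdown request on every pass. It is not the actual
XLogPageRead() code: shutdown_requested, handle_shutdown(), no_data() and
wait_for_data() are hypothetical names made up for this example, and the
flag check merely stands in for what CHECK_FOR_INTERRUPTS() might do there.

```c
#define _POSIX_C_SOURCE 200809L

#include <assert.h>
#include <signal.h>
#include <stdbool.h>
#include <time.h>

/* Hypothetical flag: set asynchronously when a shutdown is requested. */
static volatile sig_atomic_t shutdown_requested = 0;

/* Signal handler: only records the request; the loop acts on it later. */
static void
handle_shutdown(int signo)
{
    (void) signo;
    shutdown_requested = 1;
}

/* Hypothetical data source that never delivers, like an idle primary. */
static bool
no_data(void)
{
    return false;
}

/*
 * Poll until data arrives, a shutdown is requested, or max_polls
 * attempts have been made. Returns true only if data arrived.
 */
static bool
wait_for_data(bool (*data_available) (void), int max_polls)
{
    struct timespec nap = {0, 1000000};     /* 1 ms between polls */

    assert(max_polls > 0);

    for (int i = 0; i < max_polls; i++)
    {
        /* Without this check the loop would ignore the shutdown request. */
        if (shutdown_requested)
            return false;

        if (data_available())
            return true;

        nanosleep(&nap, NULL);
    }
    return false;
}
```

With the flag check removed, a loop like this keeps sleeping for as long as
the primary sends nothing, which would be consistent with the backtrace
above, though I have not confirmed that this is what is happening.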

--
 Simon Riggs           www.2ndQuadrant.com