Thread: SR standby hangs

SR standby hangs

From
Andrew Dunstan
Date:
[this time from the right address - apologies if we get a duplicate]

PostgreSQL Experts Inc has a client with a 9.0.2 streaming replication server that somehow becomes wedged after running
forsome time.
 

The server is running as a warm standby, and the client's application
tries to connect to both the master and the slave, accepting whichever
lets it connect (hence hot standby is not turned on).

Archive files are being shipped as well as WAL streaming.

The symptom is that the recovery process blocks forever on a semaphore.
We've crashed it and got the following backtrace:
   #0  0x0000003493ed5337 in semop () from /lib64/libc.so.6   #1  0x00000000005bd103 in PGSemaphoreLock
(sema=0x2b14986aec38,interruptOK=1   '\001') at pg_sema.c:420   #2  0x00000000005de645 in LockBufferForCleanup () at
bufmgr.c:2432  #3  0x0000000000463733 in heap_xlog_clean (lsn=<value optimized out>,   record=0x1787e1c0) at
heapam.c:4168  #4  heap2_redo (lsn=<value optimized out>, record=0x1787e1c0) at heapam.c:4858   #5  0x0000000000488780
inStartupXLOG () at xlog.c:6250   #6  0x000000000048a888 in StartupProcessMain () at xlog.c:9254   #7
0x00000000004a11efin AuxiliaryProcessMain (argc=2, argv=<value optimized   out>) at bootstrap.c:412   #8
0x00000000005c66c9in StartChildProcess (type=StartupProcess) at   postmaster.c:4427   #9  0x00000000005c8ab7 in
PostmasterMain(argc=1, argv=0x17858bb0) at   postmaster.c:1088   #10 0x00000000005725fe in main (argc=1, argv=<value
optimizedout>) at main.c:188
 


The platform is CentOS 5.5 x86-64, kernel version 2.6.18-194.11.4.el5

I'm not quite sure where to start digging. Has anyone else seen
something similar? Our consultant reports having seen a similar problem
elsewhere, at a client who was running hot standby on 9.0.1, but the
problem did not recur, as it does fairly regularly with this client.

cheers

andrew




Re: SR standby hangs

From
Tom Lane
Date:
Andrew Dunstan <andrew@dunslane.net> writes:
> The symptom is that the recovery process blocks forever on a semaphore.
> We've crashed it and got the following backtrace:

>     #0  0x0000003493ed5337 in semop () from /lib64/libc.so.6
>     #1  0x00000000005bd103 in PGSemaphoreLock (sema=0x2b14986aec38, interruptOK=1
>     '\001') at pg_sema.c:420
>     #2  0x00000000005de645 in LockBufferForCleanup () at bufmgr.c:2432
>     #3  0x0000000000463733 in heap_xlog_clean (lsn=<value optimized out>,
>     record=0x1787e1c0) at heapam.c:4168
>     #4  heap2_redo (lsn=<value optimized out>, record=0x1787e1c0) at heapam.c:4858
>     #5  0x0000000000488780 in StartupXLOG () at xlog.c:6250

So who's holding the buffer lock that it wants?  Are you sure this is an
actual hang, and not just recovery waiting for a standby query to complete?
        regards, tom lane


Re: SR standby hangs

From
Robert Haas
Date:
On Fri, Feb 18, 2011 at 2:16 PM, Andrew Dunstan <andrew@dunslane.net> wrote:
> I'm not quite sure where to start digging. Has anyone else seen
> something similar? Our consultant reports having seen a similar problem
> elsewhere, at a client who was running hot standby on 9.0.1, but the
> problem did not recur, as it does fairly regularly with this client.

I've seen a very similar backtrace that only involved one system (no
Hot Standby).  The problem in that case appears to have been an open
cursor holding a buffer pin.  LockBufferForCleanup() has logic that's
supposed to prevent that from going on too long during HS - it should
nuke the guy with the buffer in when the timeout expires - but maybe
there's a bug in that mechanism.

As a side matter, it would be good to improve this in the non-Hot
Standby case also.  An open cursor can tie down an autovacuum worker
forever, which is not a good thing, as it's easily possible for the
number of open cursors to be larger than the number of available
autovacuum workers...

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: SR standby hangs

From
Andrew Dunstan
Date:

On 02/18/2011 02:23 PM, Tom Lane wrote:
> Andrew Dunstan<andrew@dunslane.net>  writes:
>> The symptom is that the recovery process blocks forever on a semaphore.
>> We've crashed it and got the following backtrace:
>>      #0  0x0000003493ed5337 in semop () from /lib64/libc.so.6
>>      #1  0x00000000005bd103 in PGSemaphoreLock (sema=0x2b14986aec38, interruptOK=1
>>      '\001') at pg_sema.c:420
>>      #2  0x00000000005de645 in LockBufferForCleanup () at bufmgr.c:2432
>>      #3  0x0000000000463733 in heap_xlog_clean (lsn=<value optimized out>,
>>      record=0x1787e1c0) at heapam.c:4168
>>      #4  heap2_redo (lsn=<value optimized out>, record=0x1787e1c0) at heapam.c:4858
>>      #5  0x0000000000488780 in StartupXLOG () at xlog.c:6250
> So who's holding the buffer lock that it wants?  Are you sure this is an
> actual hang, and not just recovery waiting for a standby query to complete?
>
>             

It's not running HS, so there's no query to wait on.

cheers

andrew


Re: SR standby hangs

From
Robert Haas
Date:
On Fri, Feb 18, 2011 at 2:35 PM, Andrew Dunstan <andrew@dunslane.net> wrote:
> It's not running HS, so there's no query to wait on.

That seems to imply that recovery has leaked a buffer pin.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: SR standby hangs

From
Tom Lane
Date:
Robert Haas <robertmhaas@gmail.com> writes:
> On Fri, Feb 18, 2011 at 2:35 PM, Andrew Dunstan <andrew@dunslane.net> wrote:
>> It's not running HS, so there's no query to wait on.

> That seems to imply that recovery has leaked a buffer pin.

No, because then the sanity check in LockBufferForCleanup would have
fired:
/* There should be exactly one local pin */if (PrivateRefCount[buffer - 1] != 1)    elog(ERROR, "incorrect local pin
count:%d",         PrivateRefCount[buffer - 1]);
 

Some sort of deadly embrace with the bgwriter, maybe?
        regards, tom lane


Re: SR standby hangs

From
Robert Haas
Date:
On Fri, Feb 18, 2011 at 2:50 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Robert Haas <robertmhaas@gmail.com> writes:
>> On Fri, Feb 18, 2011 at 2:35 PM, Andrew Dunstan <andrew@dunslane.net> wrote:
>>> It's not running HS, so there's no query to wait on.
>
>> That seems to imply that recovery has leaked a buffer pin.
>
> No, because then the sanity check in LockBufferForCleanup would have
> fired:
>
>        /* There should be exactly one local pin */
>        if (PrivateRefCount[buffer - 1] != 1)
>                elog(ERROR, "incorrect local pin count: %d",
>                         PrivateRefCount[buffer - 1]);

Hmm, yeah.

> Some sort of deadly embrace with the bgwriter, maybe?

Maybe.

I think it'd be useful to know what the buffer header thinks the
refcount on that buffer is, and what the startup process and the
bgwriter each have for PrivateRefCount[buffer].

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: SR standby hangs

From
Andrew Dunstan
Date:

On 02/18/2011 03:42 PM, Robert Haas wrote:
> On Fri, Feb 18, 2011 at 2:50 PM, Tom Lane<tgl@sss.pgh.pa.us>  wrote:
>> Robert Haas<robertmhaas@gmail.com>  writes:
>>> On Fri, Feb 18, 2011 at 2:35 PM, Andrew Dunstan<andrew@dunslane.net>  wrote:
>>>> It's not running HS, so there's no query to wait on.
>>> That seems to imply that recovery has leaked a buffer pin.
>> No, because then the sanity check in LockBufferForCleanup would have
>> fired:
>>
>>         /* There should be exactly one local pin */
>>         if (PrivateRefCount[buffer - 1] != 1)
>>                 elog(ERROR, "incorrect local pin count: %d",
>>                          PrivateRefCount[buffer - 1]);
> Hmm, yeah.
>
>> Some sort of deadly embrace with the bgwriter, maybe?
> Maybe.
>
> I think it'd be useful to know what the buffer header thinks the
> refcount on that buffer is, and what the startup process and the
> bgwriter each have for PrivateRefCount[buffer].


I'll see what I can find out (damn I hate driving debuggers).

cheers

andrew