Warm Standby startup process unconditionally hangs

From: Bernd Helmle
Date:
PostgreSQL is 8.4.7; WAL log shipping is currently done with SkyTools 2.1.12
(both installed from PGDG RPMs), running in a CentOS 5.5 Xen VM with kernel
2.6.18-194.26.1.el5xen.

The recovery.conf on the standby, as installed by walmgr, is:

restore_command = '/usr/bin/walmgr.py
/media/pgarchive/wal-config/wal-slave.ini xrestore %f "%p" %r'
#recovery_target_time=
#recovery_target_xid=
#recovery_target_inclusive=true
#recovery_target_timeline=

The archive is located on an NFS share. We have seen the following happen
on a PostgreSQL warm standby:

Once in a while the startup process just stops recovering from the archive:
no more xlogs are replayed, and no more activity from the startup process
appears in the logs. We've seen this two or three times in the past; this
time we had the chance to investigate further what's happening.

strace shows the startup process hanging in a semop() call:

strace -p 31898
Process 31898 attached - interrupt to quit
semop(425990, 0x7fff19919060, 1

Here is a backtrace from the startup process:

(gdb) bt
#0  0x0000003e214d5337 in semop () from /lib64/libc.so.6
#1  0x00000000005a7ff3 in PGSemaphoreLock (sema=0x2b35a0140ce8,
interruptOK=1 '\001') at pg_sema.c:420
#2  0x00000000005c68a5 in LockBufferForCleanup (buffer=78428) at
bufmgr.c:2467
#3  0x0000000000485446 in RestoreBkpBlocks (lsn=<value optimized out>,
record=0xcf2b830, cleanup=1 '\001') at xlog.c:3292
#4  0x0000000000460ce0 in heap2_redo (lsn=..., record=0x7fff19919060) at
heapam.c:4805
#5  0x0000000000482eff in StartupXLOG () at xlog.c:5665
#6  0x0000000000484cfb in StartupProcessMain () at xlog.c:8102
#7  0x000000000049b5a8 in AuxiliaryProcessMain (argc=2, argv=<value
optimized out>) at bootstrap.c:419
#8  0x00000000005b0ef9 in StartChildProcess (type=StartupProcess) at
postmaster.c:4319
#9  0x00000000005b30c7 in PostmasterMain (argc=5, argv=0xcf079e0) at
postmaster.c:1036
#10 0x000000000056079e in main (argc=5, argv=<value optimized out>) at
main.c:188

Looking at the code, the startup process seems to be stuck in
ProcWaitForSignal(), waiting for UnpinBuffer() to wake it up (a rough
sketch of that handshake follows the next backtrace). The backtrace of the
bgwriter process, by contrast, suggests it is working as expected:

(gdb)
#0  0x0000003e214cd1c3 in __select_nocancel () from /lib64/libc.so.6
#1  0x00000000006b086a in pg_usleep (microsec=<value optimized out>) at
pgsleep.c:43
#2  0x00000000005ac00e in BackgroundWriterMain () at bgwriter.c:538
#3  0x000000000049b5b7 in AuxiliaryProcessMain (argc=2, argv=<value
optimized out>) at bootstrap.c:424
#4  0x00000000005b0ef9 in StartChildProcess (type=BgWriterProcess) at
postmaster.c:4319
#5  0x00000000005b4df6 in sigusr1_handler (postgres_signal_arg=<value
optimized out>) at postmaster.c:4095
#6  <signal handler called>
#7  0x0000003e214cd1c3 in __select_nocancel () from /lib64/libc.so.6
#8  0x00000000005b1b5e in ServerLoop () at postmaster.c:1347
#9  0x00000000005b30dc in PostmasterMain (argc=5, argv=0xcf079e0) at
postmaster.c:1040
#10 0x000000000056079e in main (argc=5, argv=<value optimized out>) at
main.c:188
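
For reference, here is a rough sketch of the handshake as I understand it
(paraphrased, not the literal 8.4 bufmgr.c): LockBufferForCleanup() registers
itself as the pin-count waiter and sleeps on its semaphore, and whichever
backend drops the last remaining pin is supposed to wake it up again via
ProcSendSignal().

/* Paraphrased from src/backend/storage/buffer/bufmgr.c -- simplified sketch */
void
LockBufferForCleanup(Buffer buffer)
{
    volatile BufferDesc *bufHdr = &BufferDescriptors[buffer - 1];

    for (;;)
    {
        LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE);
        if (bufHdr->refcount == 1)
            return;                     /* we hold the only pin: cleanup lock acquired */

        /* someone else still holds a pin: register as waiter, release lock, sleep */
        bufHdr->wait_backend_pid = MyProcPid;
        bufHdr->flags |= BM_PIN_COUNT_WAITER;
        LockBuffer(buffer, BUFFER_LOCK_UNLOCK);

        ProcWaitForSignal();            /* <-- the semop() the startup process hangs in */
    }
}

/* ...and in UnpinBuffer(), the backend releasing the last other pin does: */
if ((bufHdr->flags & BM_PIN_COUNT_WAITER) && bufHdr->refcount == 1)
{
    bufHdr->flags &= ~BM_PIN_COUNT_WAITER;
    ProcSendSignal(bufHdr->wait_backend_pid);   /* wake the cleanup waiter */
}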

The logfile shows the following sequence before this issue:

<@ 2011-03-21 13:56:46 CET 4d7a3ae1.7c9a-1457> LOG:  restored log file
"000000010000002000000070" from archive
<@ 2011-03-21 13:56:46 CET 4d7a3ae2.7c9e-48794> DEBUG:  updated min
recovery point to 20/704A6EC8
<@ 2011-03-21 13:56:46 CET 4d7a3ae2.7c9e-48795> CONTEXT:  writing block 157
of relation base/70631854/70632521
<@ 2011-03-21 13:56:46 CET 4d7a3ae2.7c9e-48796> DEBUG:  updated min
recovery point to 20/7054BD78
<@ 2011-03-21 13:56:46 CET 4d7a3ae2.7c9e-48797> CONTEXT:  writing block 156
of relation base/70631854/70632521
<@ 2011-03-21 13:57:03 CET 4d7a3ae2.7c9e-48798> LOG:  restartpoint
complete: wrote 4518 buffers (1.7%); write=225.415 s, sync=0.203 s,
total=225.621 s
<@ 2011-03-21 13:57:03 CET 4d7a3ae2.7c9e-48799> LOG:  recovery restart
point at 20/68244EC8
<@ 2011-03-21 13:58:18 CET 4d7a3ae2.7c9e-48800> LOG:  restartpoint
starting: time
<@ 2011-03-21 14:02:18 CET 4d7a3ae2.7c9e-48801> LOG:  restartpoint
complete: wrote 3733 buffers (1.4%); write=239.997 s, sync=0.022 s,
total=240.022 s
<@ 2011-03-21 14:02:18 CET 4d7a3ae2.7c9e-48802> LOG:  recovery restart
point at 20/6BEB2EB0
<@ 2011-03-21 14:03:18 CET 4d7a3ae2.7c9e-48803> DEBUG:  skipping
restartpoint, already performed at 20/6BEB2EB0
<@ 2011-03-21 14:03:33 CET 4d7a3ae2.7c9e-48804> DEBUG:  skipping
restartpoint, already performed at 20/6BEB2EB0
<@ 2011-03-21 14:03:48 CET 4d7a3ae2.7c9e-48805> DEBUG:  skipping
restartpoint, already performed at 20/6BEB2EB0

After this, the startup process just "hangs", with the symptoms shown
above. I need to restart the standby now because the customer wants to
have it back as soon as possible, but if you need more information, please
let me know.

--
Thanks

    Bernd

Re: Warm Standby startup process unconditionally hangs

From: Simon Riggs
Date: Wed, 23 Mar 2011 20:37:11 +0000
On Mon, Mar 21, 2011 at 4:44 PM, Bernd Helmle <mailings@oopsware.de> wrote:
> [... full original report quoted verbatim; snipped -- see above ...]

Bernd, I just realised that my email to you was private, not public.
Copying whole message again here.

The answer looks clear from here. What I'm stunned about is that we've
not had a report about this before now. (You replied there had been one).

ProcSendSignal() doesn't do anything at all when called with the
Startup proc's pid, because BackendPidGetProc() returns NULL.
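
That is, the wakeup path in 8.4 boils down to roughly this (paraphrased,
not the literal proc.c):

/* Paraphrase of the 8.4 behaviour, to show where the wakeup gets lost */
void
ProcSendSignal(int pid)
{
    PGPROC *proc = BackendPidGetProc(pid);  /* searches the ProcArray only */

    if (proc != NULL)
        PGSemaphoreUnlock(&proc->sem);

    /*
     * The startup process is an auxiliary process and is not registered in
     * the ProcArray in 8.4, so BackendPidGetProc() returns NULL here and the
     * wakeup is silently dropped -- leaving the startup process asleep in
     * ProcWaitForSignal() forever.
     */
}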

The root cause appears to be the separation of the two parts of the
Hot Standby patch that occurred late in the 8.4 cycle. It looks like the
protection intended to allow the bgwriter to coexist with the startup
process was never applied.

A similar issue also exists in 9.0 when HS is not enabled. (Fixed.)

The patches are slightly different because the infrastructure isn't
all there in 8.4.
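
The problem to solve is the same in both: ProcSendSignal() has to be able
to find the startup process even though it is not in the ProcArray. One
way that can look (a sketch of the idea only -- the attached patches are
what should actually be tested):

/* Sketch only; see the attached patches for the real thing */
void
ProcSendSignal(int pid)
{
    PGPROC *proc = NULL;

    /* the startup process publishes its pid/PGPROC in shared memory at start */
    if (pid == ProcGlobal->startupProcPid)
        proc = ProcGlobal->startupProc;

    if (proc == NULL)
        proc = BackendPidGetProc(pid);

    if (proc != NULL)
        PGSemaphoreUnlock(&proc->sem);
}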

Please can you test these before I commit them to 8.4?

--
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Attachment

Re: Warm Standby startup process unconditionally hangs

From: Alvaro Herrera
Date:
Excerpts from Simon Riggs's message of Wed Mar 23 17:37:11 -0300 2011:

> The answer looks clear from here. What I'm stunned about is that we've
> not had a report about this before now. (You replied there had been one).

I've actually seen reports that the recovery process seems stuck, but
I've never had the chance to research it.

About the patch: shouldn't the startup PID be reset eventually?

--
Álvaro Herrera <alvherre@commandprompt.com>
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support

Re: Warm Standby startup process unconditionally hangs

From: Bernd Helmle
Date:
--On 23 March 2011 20:37:11 +0000 Simon Riggs <simon@2ndQuadrant.com> wrote:

> The patches are slightly different because the infrastructure isn't
> all there in 8.4.
>
> Please can you test these before I commit to 8.4.

FYI, we have deployed patched RPMs to one of the standby servers, the one
where most of the hangs occurred in the past. We'll see whether we get
further hangs there.

        Bernd

Re: Warm Standby startup process unconditionally hangs

From: Bernd Helmle
Date:
--On 23 March 2011 20:37:11 +0000 Simon Riggs <simon@2ndQuadrant.com> wrote:

> The patches are slightly different because the infrastructure isn't
> all there in 8.4.
>
> Please can you test these before I commit to 8.4.

Hi Simon,

I've searched the committers mailing list to see whether this has already
been committed, but either I've missed it or this is still open? If not,
sorry for bothering you...

Our customer reports that with the patches applied they haven't had any
further hangs on the standby instances for the last couple of months now.


--
Thanks

    Bernd