Thread: Allocation in critical section after node exits archive recovery

Allocation in critical section after node exits archive recovery

From
Michael Paquier
Date:
Hi all,

While performing some archive recovery tests I found the following
assertion failure once standby gets out of archive recovery (HEAD at
a5eb85e):
(lldb)
* thread #1: tid = 0x0000, 0x00007fff8faaf282
libsystem_kernel.dylib`__pthread_kill + 10, stop reason = signal
SIGSTOP * frame #0: 0x00007fff8faaf282 libsystem_kernel.dylib`__pthread_kill + 10   frame #1: 0x00007fff888774c3
libsystem_pthread.dylib`pthread_kill+ 90   frame #2: 0x00007fff898acb73 libsystem_c.dylib`abort + 129   frame #3:
0x0000000107106029
postgres`ExceptionalCondition(conditionName=0x000000010725770b,
errorType=0x0000000107219b83, fileName=0x00000001072c4c84,
lineNumber=555) + 137 at assert.c:54   frame #4: 0x0000000107140ec5
postgres`MemoryContextCreate(tag=T_AllocSetContext, size=192,
methods=0x00000001072fc7b0, parent=0x00007fca6a405710,
name=0x0000000107257e1b) + 117 at mcxt.c:555   frame #5: 0x000000010713e6c7
postgres`AllocSetContextCreate(parent=0x00007fca6a405710,
name=0x0000000107257e1b, minContextSize=0, initBlockSize=8192,
maxBlockSize=8388608) + 71 at aset.c:444   frame #6: 0x0000000106c51449 postgres`InitXLogInsert + 73 at
xloginsert.c:869   frame #7: 0x0000000106c450ee postgres`InitXLOGAccess + 190 at xlog.c:7281   frame #8:
0x0000000106c43e7cpostgres`LocalSetXLogInsertAllowed +
 
76 at xlog.c:7133   frame #9: 0x0000000106c445dd postgres`CreateCheckPoint(flags=38) +
1165 at xlog.c:7693   frame #10: 0x0000000106ebf935 postgres`CheckpointerMain + 1989 at
checkpointer.c:502   frame #11: 0x0000000106c5a0c3
postgres`AuxiliaryProcessMain(argc=2, argv=0x00007fff5909bb30) + 1907
at bootstrap.c:426   frame #12: 0x0000000106ecf696
postgres`StartChildProcess(type=CheckpointerProcess) + 358 at
postmaster.c:5109   frame #13: 0x0000000106ece229
postgres`sigusr1_handler(postgres_signal_arg=30) + 265 at
postmaster.c:4795   frame #14: 0x00007fff91468f1a libsystem_platform.dylib`_sigtramp + 26   frame #15:
0x00007fff8faaf3f7libsystem_kernel.dylib`__select + 11   frame #16: 0x0000000106ecd717 postgres`PostmasterMain(argc=3,
 
argv=0x00007fca6a4056a0) + 5479 at postmaster.c:1220   frame #17: 0x0000000106e11fc5 postgres`main(argc=3,
argv=0x00007fca6a4056a0) + 773 at main.c:220

Logs:
2014-11-28 15:28:20.733 JST: LOG:  selected new timeline ID: 2
cp: /Users/ioltas/archive/5432/00000001.history: No such file or
directory
2014-11-28 15:28:20.776 JST: LOG:  archive recovery complete
TRAP: FailedAssertion("!(CritSectionCount == 0)", File: "mcxt.c", Line: 555)

This area of the code has been changed recently by the WAL API patch.
Note that I haven't looked yet at implementing a fix.
Regards,
-- 
Michael



Re: Allocation in critical section after node exits archive recovery

From
Heikki Linnakangas
Date:
On 11/28/2014 08:35 AM, Michael Paquier wrote:
> Hi all,
>
> While performing some archive recovery tests I found the following
> assertion failure once standby gets out of archive recovery (HEAD at
> a5eb85e):
> (lldb)
> * thread #1: tid = 0x0000, 0x00007fff8faaf282
> libsystem_kernel.dylib`__pthread_kill + 10, stop reason = signal
> SIGSTOP
>    * frame #0: 0x00007fff8faaf282 libsystem_kernel.dylib`__pthread_kill + 10
>      frame #1: 0x00007fff888774c3 libsystem_pthread.dylib`pthread_kill + 90
>      frame #2: 0x00007fff898acb73 libsystem_c.dylib`abort + 129
>      frame #3: 0x0000000107106029
> postgres`ExceptionalCondition(conditionName=0x000000010725770b,
> errorType=0x0000000107219b83, fileName=0x00000001072c4c84,
> lineNumber=555) + 137 at assert.c:54
>      frame #4: 0x0000000107140ec5
> postgres`MemoryContextCreate(tag=T_AllocSetContext, size=192,
> methods=0x00000001072fc7b0, parent=0x00007fca6a405710,
> name=0x0000000107257e1b) + 117 at mcxt.c:555
>      frame #5: 0x000000010713e6c7
> postgres`AllocSetContextCreate(parent=0x00007fca6a405710,
> name=0x0000000107257e1b, minContextSize=0, initBlockSize=8192,
> maxBlockSize=8388608) + 71 at aset.c:444
>      frame #6: 0x0000000106c51449 postgres`InitXLogInsert + 73 at
> xloginsert.c:869
>      frame #7: 0x0000000106c450ee postgres`InitXLOGAccess + 190 at xlog.c:7281
>      frame #8: 0x0000000106c43e7c postgres`LocalSetXLogInsertAllowed +
> 76 at xlog.c:7133
>      frame #9: 0x0000000106c445dd postgres`CreateCheckPoint(flags=38) +
> 1165 at xlog.c:7693
>      frame #10: 0x0000000106ebf935 postgres`CheckpointerMain + 1989 at
> checkpointer.c:502
>      frame #11: 0x0000000106c5a0c3
> postgres`AuxiliaryProcessMain(argc=2, argv=0x00007fff5909bb30) + 1907
> at bootstrap.c:426
>      frame #12: 0x0000000106ecf696
> postgres`StartChildProcess(type=CheckpointerProcess) + 358 at
> postmaster.c:5109
>      frame #13: 0x0000000106ece229
> postgres`sigusr1_handler(postgres_signal_arg=30) + 265 at
> postmaster.c:4795
>      frame #14: 0x00007fff91468f1a libsystem_platform.dylib`_sigtramp + 26
>      frame #15: 0x00007fff8faaf3f7 libsystem_kernel.dylib`__select + 11
>      frame #16: 0x0000000106ecd717 postgres`PostmasterMain(argc=3,
> argv=0x00007fca6a4056a0) + 5479 at postmaster.c:1220
>      frame #17: 0x0000000106e11fc5 postgres`main(argc=3,
> argv=0x00007fca6a4056a0) + 773 at main.c:220
>
> Logs:
> 2014-11-28 15:28:20.733 JST: LOG:  selected new timeline ID: 2
> cp: /Users/ioltas/archive/5432/00000001.history: No such file or
> directory
> 2014-11-28 15:28:20.776 JST: LOG:  archive recovery complete
> TRAP: FailedAssertion("!(CritSectionCount == 0)", File: "mcxt.c", Line: 555)
>
> This area of the code has been changed recently by the WAL API patch.
> Note that I haven't looked yet at implementing a fix.

Fixed, thanks. I had tested PITR, but apparently not with hot_standby=off.

It might be more robust to call InitXLogInsert() directly at process 
startup, somewhere in postinit.c. Currently, it's called by 
InitXLOGAccess(), which has the advantage that processes that have no 
intention of creating WAL records don't need to allocate the XLogInsert 
working buffers, notably hot standby backends. But it means that it's 
not safe to call RecoveryInProgress() in a critical section, when 
InitXLOGAccess() hasn't been called yet. All the critical sections that 
cover WAL-logging page modifications are safe, because you wouldn't get 
to the critical section without having called InitXLOGAccess() earlier, 
but it's a bit fiddly.

I think I'll leave it as it is for now, but if any more issues like this 
crop up, I'll move the call to InitXLogInsert to backend startup.

- Heikki