Thread: WAL recovery is broken by FSM patch

WAL recovery is broken by FSM patch

From: Tom Lane
I just managed to make a backend dump core while fooling with the CTE
patch, and found out that the system failed to recover, because the
ensuing startup process *also* dumped core.  Here's the backtrace:

Core was generated by `postgres: startup'.
Program terminated with signal 11, Segmentation fault.
#0  0x000000000048df59 in XLogInsert (rmid=2 '\002', info=32 ' ',    rdata=0x7fff41713550) at xlog.c:813
813             record->xl_prev = Insert->PrevRecord;
(gdb) bt
#0  0x000000000048df59 in XLogInsert (rmid=2 '\002', info=32 ' ',    rdata=0x7fff41713550) at xlog.c:813
#1  0x00000000005ec8d0 in smgrtruncate (reln=0x206a148, forknum=FSM_FORKNUM,    nblocks=3, isTemp=0 '\0') at smgr.c:594
#2  0x00000000005dc194 in FreeSpaceMapTruncateRel (rel=0x2072050, nblocks=15)   at freespace.c:275
#3  0x00000000005dc2ee in fsm_redo (lsn=<value optimized out>,    record=<value optimized out>) at freespace.c:779
#4  0x000000000049003f in StartupXLOG () at xlog.c:5146
#5  0x00000000004a9cd8 in AuxiliaryProcessMain (argc=2, argv=0x7fff41713790)   at bootstrap.c:420
#6  0x00000000005bd24d in StartChildProcess (type=StartupProcess)   at postmaster.c:4074
#7  0x00000000005c053f in PostmasterStateMachine () at postmaster.c:2737
#8  0x00000000005c0965 in reaper (postgres_signal_arg=<value optimized out>)   at postmaster.c:2325
#9  <signal handler called>
#10 0x0000003f71edcbb3 in __select_nocancel () from /lib64/libc.so.6
#11 0x00000000006ce41a in pg_usleep (microsec=<value optimized out>)   at pgsleep.c:43
#12 0x00000000005bed05 in ServerLoop () at postmaster.c:1232
#13 0x00000000005bf99a in PostmasterMain (argc=3, argv=0x203a890)   at postmaster.c:1031
#14 0x0000000000568fd8 in main (argc=3, argv=0x203a890) at main.c:188

We should of course not be attempting XLogInsert during WAL replay.
Now smgr_redo knows about that.  I rather wonder why fsm_redo is
attempting to call smgrtruncate at all, seeing that there's presumably
smgr's own redo record to tell it to deal with that.  I think that all
fsm_redo need do is clear out the last untruncated block of FSM.
        regards, tom lane
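
Editorial sketch of the point made above, not the actual PostgreSQL code. Every name in it (in_recovery, wal_insert, fork_truncate, fsm_redo_sketch, ForkSketch, BLOCK_SIZE, MAX_BLOCKS) is a hypothetical stand-in invented for illustration. It assumes a truncate routine that normally writes its own WAL record, shows one way such a routine can skip that write while replaying, and shows a redo routine that only clears the tail of the last kept block and leaves the truncation itself to the storage manager's own redo record.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define BLOCK_SIZE 8    /* hypothetical: slots per map block */
#define MAX_BLOCKS 16   /* hypothetical: capacity of the toy fork */

/* Hypothetical stand-in for "the startup process is replaying WAL". */
bool in_recovery = false;
int  wal_records_written = 0;

/* Stand-in for a WAL insert: it must never run during replay. */
void wal_insert(void)
{
    assert(!in_recovery);
    wal_records_written++;
}

/* Toy model of one relation fork holding free-space slots. */
typedef struct {
    uint8_t slots[MAX_BLOCKS][BLOCK_SIZE];
    int     nblocks;
} ForkSketch;

/*
 * Stand-in for a truncate routine that logs its own WAL record.
 * During replay the record already exists, so the write is skipped.
 */
void fork_truncate(ForkSketch *f, int nblocks)
{
    if (!in_recovery)
        wal_insert();
    if (nblocks < f->nblocks)
        f->nblocks = nblocks;
}

/*
 * Redo sketch: do not truncate here. Only zero the slots of the last
 * kept block from first_removed_slot onward.
 */
void fsm_redo_sketch(ForkSketch *f, int last_block, int first_removed_slot)
{
    assert(in_recovery);
    if (last_block < 0 || last_block >= f->nblocks)
        return;
    if (first_removed_slot < 0 || first_removed_slot >= BLOCK_SIZE)
        return;
    memset(&f->slots[last_block][first_removed_slot], 0,
           (size_t) (BLOCK_SIZE - first_removed_slot));
}
```

In this sketch, calling fork_truncate during normal running writes one record, while calling it with in_recovery set changes the block count and writes nothing. The backtrace above corresponds to the unguarded case, where the redo path reached the WAL insert.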


Re: WAL recovery is broken by FSM patch

From: Heikki Linnakangas
Tom Lane wrote:
> I just managed to make a backend dump core while fooling with the CTE
> patch, and found out that the system failed to recover, because the
> ensuing startup process *also* dumped core.  Here's the backtrace:
> ...
> 
> We should of course not be attempting XLogInsert during WAL replay.
> Now smgr_redo knows about that.  I rather wonder why fsm_redo is
> attempting to call smgrtruncate at all, seeing that there's presumably
> smgr's own redo record to tell it to deal with that.  I think that all
> fsm_redo need do is clear out the last untruncated block of FSM.

Agreed. Fixed, thanks.

-- 
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com