Stefan Kaltenbrunner <stefan@kaltenbrunner.cc> writes:
> FYI: we now have at least 4 machines (otter, kingfisher, lionfish, corgi) on
> the buildfarm crashing during testing of GIST-related things in contrib.
I'm seeing some problems on Mac OS X, too. The tsearch regression test
crashed ... which we may not care about much since tsearch is presumably
going away ... but after that, the postmaster failed to restart:
LOG: database system was not properly shut down; automatic recovery in progress
LOG: redo starts at 0/4980748
PANIC: gistRedoEntryUpdateRecord: uninitialized page
LOG: startup process (PID 1728) was terminated by signal 6
LOG: aborting startup due to startup process failure
The stack trace for that is:
#0 0x9004a12c in kill ()
#1 0x90120954 in abort ()
#2 0x001cfcdc in errfinish (dummy=0) at elog.c:451
#3 0x001d06bc in elog_finish (elevel=-1073761600, fmt=0x2818dc "\002") at elog.c:932
#4 0x0000c7c4 in gistRedoEntryUpdateRecord (lsn={xlogid = 0, xrecoff = 77628316}, record=0x57, isnewroot=0 '\0') at gistxlog.c:186
#5 0x0000ca0c in gist_redo (lsn={xlogid = 0, xrecoff = 77628316}, record=0xa8d000) at gistxlog.c:399
#6 0x00042b4c in StartupXLOG () at xlog.c:4509
#7 0x0004c9d8 in BootstrapMain (argc=4, argv=0xbfffde28) at bootstrap.c:413
#8 0x00126278 in StartChildProcess (xlop=2) at postmaster.c:3484
#9 0x00126858 in reaper (postgres_signal_arg=0) at postmaster.c:2165
#10 <signal handler called>
#11 0x9001efe8 in select ()
#12 0x00126e70 in ServerLoop () at postmaster.c:1168
#13 0x00128c54 in PostmasterMain (argc=3, argv=0xd00600) at postmaster.c:930
#14 0x000e3050 in main (argc=3, argv=0xbfffe57c) at main.c:268
I checked that it was processing a type-0 (XLOG_GIST_ENTRY_UPDATE)
record, but am not sure what else to look at. I do think it's
questionable that this log record type doesn't appear to reference
any buffers in the list it passes to XLogInsert. I believe the
general rule is that an xlog record that is describing a change in
a buffered page ought to tell XLogInsert so --- and the redo routine
has to check to see if the change was replaced by a whole-page record.
I'm not entirely sure that violating that rule leads directly to this
class of failure, but it definitely leads to failure to recover from
partial page writes during a hardware crash.
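To make the rule above concrete, here is a toy Python model of it. This is not PostgreSQL code: `WalRecord`, `log_change`, `redo`, and `simulate` are hypothetical names invented for this sketch, and the 8-byte page and the shape of the torn write are arbitrary assumptions. It only illustrates why a record that registers its buffer can carry a whole-page image that redo restores instead of applying the incremental change, and why skipping that step leaves a torn page unrecoverable.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class WalRecord:
    """Hypothetical log record: an incremental change plus an optional page image."""
    page_no: int
    offset: int
    data: bytes
    full_page_image: Optional[bytes] = None  # present only if the buffer was registered


def log_change(wal, pages, imaged, page_no, offset, data, register_buffer):
    """Modify the in-memory page, then append a log record describing the change."""
    page = pages[page_no]
    page[offset:offset + len(data)] = data
    image = None
    # First registered change to a page since the (assumed) checkpoint:
    # attach a copy of the whole page, which already includes this change.
    if register_buffer and page_no not in imaged:
        image = bytes(page)
        imaged.add(page_no)
    wal.append(WalRecord(page_no, offset, data, image))


def redo(wal, disk):
    """Replay the log against the on-disk pages."""
    for rec in wal:
        if rec.full_page_image is not None:
            # The change was replaced by a whole-page record: restore the
            # image and skip the incremental change it already contains.
            disk[rec.page_no] = bytearray(rec.full_page_image)
            continue
        page = disk[rec.page_no]
        page[rec.offset:rec.offset + len(rec.data)] = rec.data


def simulate(register_buffer):
    """Return True if recovery reproduces the intended page after a torn write."""
    pages = {0: bytearray(b"AAAAAAAA")}  # page contents as of the last checkpoint
    wal, imaged = [], set()
    log_change(wal, pages, imaged, 0, 0, b"XY", register_buffer)
    log_change(wal, pages, imaged, 0, 6, b"Z", register_buffer)
    expected = bytes(pages[0])
    # Assumed torn write: the first half of the page reached disk, the rest is junk.
    disk = {0: bytearray(expected[:4] + b"\x00" * 4)}
    redo(wal, disk)
    return bytes(disk[0]) == expected


if __name__ == "__main__":
    print("buffer registered:    ", simulate(True))
    print("buffer not registered:", simulate(False))
```

In this model the registered case recovers the page and the unregistered case does not, because the incremental changes are applied on top of the junk bytes left by the torn write.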
regards, tom lane