On Fri, Feb 5, 2010 at 10:46 PM, Martin Pihlak <martin.pihlak@gmail.com> wrote:
> Encountered the following FailedAssertion while testing database
> recovery (actually this would be HS) with partial WAL file:
>
> LOG: restored log file "000000010000000000000003" from archive
> LOG: consistent recovery state reached at 0/3001EEC
> LOG: record with zero length at 0/3001EEC
> LOG: redo done at 0/3001D68
> LOG: last completed transaction was at log time 2010-02-05 11:02:49.695544+02
> LOG: database system is ready to accept read only connections
> LOG: selected new timeline ID: 3
> TRAP: FailedAssertion("!(readFile >= 0)", File: "xlog.c", Line: 5117)
>
> and the assorted backtrace:
> #0 0xb7f2e410 in __kernel_vsyscall ()
> #1 0xb7dc9085 in raise () from /lib/tls/i686/cmov/libc.so.6
> #2 0xb7dcaa01 in abort () from /lib/tls/i686/cmov/libc.so.6
> #3 0x0834926e in ExceptionalCondition (conditionName=0x8389a82 "!(readFile >= 0)",
> errorType=0x837d5b4 "FailedAssertion",
> fileName=0x8390338 "xlog.c", lineNumber=5117) at assert.c:57
> #4 0x080dde34 in exitArchiveRecovery (endTLI=1, endLogId=0, endLogSeg=18) at xlog.c:5117
> #5 0x080e3eab in StartupXLOG () at xlog.c:6029
> #6 0x080e6055 in StartupProcessMain () at xlog.c:8666
> #7 0x08106a30 in AuxiliaryProcessMain (argc=2, argv=0xbfebdd64) at bootstrap.c:412
> #8 0x08253f0c in StartChildProcess (type=StartupProcess) at postmaster.c:4340
> #9 0x0825669e in PostmasterMain (argc=3, argv=0x8516e08) at postmaster.c:1078
> #10 0x081f7919 in main (argc=3, argv=0x8516e08) at main.c:188
>
> The crash happened on a HS slave which was fed a partial WAL file. The
> partial was constructed by extracting data up to pg_current_xlog_location()
> and zero padding it up to 16MB.
>
> This only seems to be happening on HEAD - quick tests indicate that both
> 8.3 and 8.4 are not affected (or maybe I didn't try hard enough).
Thanks for the report!
This assertion failure derives from the recent refactoring of ReadRecord().
Which changed the startup process so as to re-fetch the last applied WAL
record at the end of recovery from the buffer instead of the WAL file if it's
stored in the buffer. In this case, the last applied WAL file remains closed.
So readFile (which ought to have been the file descriptor of that WAL file)
might be -1 in exitArchiveRecovery().
In the now, that assertion is obsolete. So I attached the patch that removes
the assert() from exitArchiveRecovery().
Regards,
--
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center