Corner-case bugs in ThisStartUpID accounting - Mailing list pgsql-hackers
From: Tom Lane
Subject: Corner-case bugs in ThisStartUpID accounting
Msg-id: 7463.1053532206@sss.pgh.pa.us
List: pgsql-hackers
I've been studying a log file that Rao Kumar sent me, showing a 7.3.2 system having some serious difficulties :-(. I believe it exposes an error in the startup-id logic in xlog.c. The system is crashing regularly due to apparent data corruption in heavily-used tables. Now watch this:

2003-05-19 18:04:18 PANIC: open of /usr/local/pgsql/data/pg_clog/0000 failed: No such file or directory
2003-05-19 18:04:18 LOG: server process (pid 72392) was terminated by signal 6
2003-05-19 18:04:18 LOG: terminating any other active server processes
2003-05-19 18:04:18 LOG: all server processes terminated; reinitializing shared memory and semaphores
2003-05-19 18:04:18 LOG: database system was interrupted at 2003-05-19 18:04:16 EDT
2003-05-19 18:04:18 LOG: checkpoint record is at 9/54ACD800
2003-05-19 18:04:18 LOG: redo record is at 9/54ACD800; undo record is at 0/0; shutdown TRUE
2003-05-19 18:04:18 LOG: next transaction id: 2762375; next oid: 4292578
2003-05-19 18:04:18 LOG: database system was not properly shut down; automatic recovery in progress
2003-05-19 18:04:18 LOG: redo starts at 9/54ACD840
2003-05-19 18:04:18 LOG: ReadRecord: unexpected pageaddr 9/4BB2A000 in log file 9, segment 84, offset 11706368
2003-05-19 18:04:18 LOG: redo done at 9/54B27F90
2003-05-19 18:04:21 LOG: database system is ready
2003-05-19 18:04:33 PANIC: open of /usr/local/pgsql/data/pg_clog/0000 failed: No such file or directory
2003-05-19 18:04:33 LOG: server process (pid 72401) was terminated by signal 6
2003-05-19 18:04:33 LOG: terminating any other active server processes
2003-05-19 18:04:33 LOG: all server processes terminated; reinitializing shared memory and semaphores
2003-05-19 18:04:33 LOG: database system was interrupted at 2003-05-19 18:04:21 EDT
2003-05-19 18:04:33 LOG: checkpoint record is at 9/54B2A010
2003-05-19 18:04:33 LOG: redo record is at 9/54B2A010; undo record is at 0/0; shutdown TRUE
2003-05-19 18:04:33 LOG: next transaction id: 2762386; next oid: 4300770
2003-05-19 18:04:33 LOG: database system was not properly shut down; automatic recovery in progress
2003-05-19 18:04:33 LOG: redo starts at 9/54B2A050
2003-05-19 18:04:34 LOG: ReadRecord: record with zero length at 9/54CBFFC4
2003-05-19 18:04:34 LOG: redo done at 9/54CBFFA0
2003-05-19 18:04:36 LOG: database system is ready
2003-05-19 18:04:36 PANIC: open of /usr/local/pgsql/data/pg_clog/0000 failed: No such file or directory
2003-05-19 18:04:37 LOG: server process (pid 72410) was terminated by signal 6
2003-05-19 18:04:37 LOG: terminating any other active server processes
2003-05-19 18:04:37 LOG: all server processes terminated; reinitializing shared memory and semaphores
2003-05-19 18:04:37 LOG: database system was interrupted at 2003-05-19 18:04:36 EDT
2003-05-19 18:04:37 LOG: ReadRecord: out-of-sequence SUI 43 (after 44) in log file 9, segment 84, offset 13369344
2003-05-19 18:04:37 LOG: invalid primary checkpoint record
2003-05-19 18:04:37 LOG: using previous checkpoint record at 9/54B2A010
2003-05-19 18:04:37 LOG: redo record is at 9/54B2A010; undo record is at 0/0; shutdown TRUE
2003-05-19 18:04:37 LOG: next transaction id: 2762386; next oid: 4300770
2003-05-19 18:04:37 LOG: database system was not properly shut down; automatic recovery in progress
2003-05-19 18:04:37 LOG: redo starts at 9/54B2A050
2003-05-19 18:04:37 LOG: ReadRecord: out-of-sequence SUI 43 (after 44) in log file 9, segment 84, offset 13369344
2003-05-19 18:04:37 LOG: redo done at 9/54CBFFA0
2003-05-19 18:04:39 LOG: database system is ready

My interpretation of this trace is:
1. The crash restart at 18:04:18 recovered from the prior checkpoint (at 9/54ACD800), which contained SUI 43. It ran forward through additional WAL entries, redoing as needed, then wrote a fresh checkpoint record (at 9/54B2A010), incremented its SUI to 44, and opened for business.

2. In the next few seconds the DB was able to do enough work to fill one or more XLOG pages, so that there were some pages with xlp_sui = 44 in their headers. It did not live long enough to write another checkpoint though.

3. The crash at 18:04:33 occurs with WAL filled almost but not quite to a page boundary (note "redo done at 9/54CBFFA0"). The redo starts at 9/54B2A010 which contains SUI 43. It rolls forward to the end of the log, writes another checkpoint record, increments its SUI to 44, and is open for business. However, this checkpoint record crosses a WAL page boundary, and the xlp_sui recorded in the second page's header will be 43. Also, we have failed to advance StartUpID --- it is now the same as in the previous cycle of life, which I do not think is the intended behavior.

4. Now we are screwed, because any attempt to re-read this checkpoint record will detect that the first half is on an XLOG page with xlp_sui 44 while the second half is on a page with xlp_sui 43. xlog.c will quite properly suspect that the second page is stale, reject the checkpoint, and fall back to the prior checkpoint --- as indeed we see it doing in the crash at 18:04:37. The cycle will then repeat *exactly*, right down to the recreation of the broken checkpoint record. Subsequent entries in the log indeed show several cycles of reversion to the checkpoint at 9/54B2A010. The only way to get out of the loop is for the system to stay up long enough to perform a normal checkpoint that will write a new not-broken primary checkpoint record.
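To spell out the check that's firing in point 4: each XLOG page header carries the SUI that was current when the page was written, and a page stamped with an older SUI than the page read before it is treated as stale. Very roughly, the behavior is like this (a simplified sketch from memory, not the actual ReadRecord/ValidXLOGHeader code; the type and field names here are approximations):

    #include <stdbool.h>
    #include <stdint.h>

    /* Simplified stand-ins for the real xlog.c declarations */
    typedef uint32_t StartUpID;

    typedef struct SketchPageHeader
    {
        StartUpID   xlp_sui;    /* SUI in effect when this page was written */
        /* pageaddr, magic, etc. omitted */
    } SketchPageHeader;

    static StartUpID lastReadSUI = 0;   /* SUI seen on the previously read page */

    /*
     * A page stamped with an older SUI than the page before it must be left
     * over from a previous incarnation of the system, i.e. stale data past
     * the true end of WAL, so any record continuing onto it is rejected.
     */
    static bool
    sui_is_in_sequence(const SketchPageHeader *hdr)
    {
        if (hdr->xlp_sui < lastReadSUI)
            return false;       /* reported as "out-of-sequence SUI m (after n)" */
        lastReadSUI = hdr->xlp_sui;
        return true;
    }

Here the first half of the broken checkpoint record sits on a page stamped 44 and its continuation page on one stamped 43, hence the rejection.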
But it gets worse :-(. What actually happened, a few crashes later, was that the system stayed up long enough to get a pretty fair amount of work done. Not long enough for a checkpoint, but long enough to write out quite a few data pages that have WAL LSNs beyond the broken checkpoint. After the next crash and restart, we start getting XLogFlush complaints about those pages:

2003-05-19 18:06:01 PANIC: open of /usr/local/pgsql/data/pg_clog/06E7 failed: No such file or directory
2003-05-19 18:06:01 LOG: server process (pid 72485) was terminated by signal 6
2003-05-19 18:06:01 LOG: terminating any other active server processes
2003-05-19 18:06:01 LOG: all server processes terminated; reinitializing shared memory and semaphores
2003-05-19 18:06:01 LOG: database system was interrupted at 2003-05-19 18:05:41 EDT
2003-05-19 18:06:01 LOG: ReadRecord: out-of-sequence SUI 43 (after 44) in log file 9, segment 84, offset 13369344
2003-05-19 18:06:01 LOG: invalid primary checkpoint record
2003-05-19 18:06:01 LOG: using previous checkpoint record at 9/54B2A010
2003-05-19 18:06:01 LOG: redo record is at 9/54B2A010; undo record is at 0/0; shutdown TRUE
2003-05-19 18:06:01 LOG: next transaction id: 2762386; next oid: 4300770
2003-05-19 18:06:01 LOG: database system was not properly shut down; automatic recovery in progress
2003-05-19 18:06:01 LOG: redo starts at 9/54B2A050
2003-05-19 18:06:01 LOG: ReadRecord: out-of-sequence SUI 43 (after 44) in log file 9, segment 84, offset 13369344
2003-05-19 18:06:01 LOG: redo done at 9/54CBFFA0
2003-05-19 18:06:03 LOG: database system is ready
2003-05-19 18:06:07 ERROR: XLogFlush: request 9/552E847C is not satisfied --- flushed only to 9/54D18E18

This is disastrous because it means we have abandoned a bunch of WAL entries that correspond to data page updates that are already down to disk. We now have no guarantee of data consistency, since we can't be sure that all actions described by the lost WAL entries are down to disk.

To prevent this problem in future we need to take steps to ensure that ThisStartUpID can never go backwards. I think that it is correct for WAL replay to be executed with the same startupID that was current when the WAL entries were being made (Vadim, can you confirm that that is the required behavior?). I propose the following changes:

1. It is still correct to adopt ThisStartUpID from the checkpoint we are restarting from as we begin to roll forward. However, redo for a checkpoint record must adopt the ThisStartUpID seen in the checkpoint; else we fail to handle the case where we are rolling forward over a crash-restart checkpoint with a new SUI.

2. ThisStartUpID must be advanced to its new value *before* we write the end-of-recovery checkpoint, rather than after as it is now. This ensures that the new checkpoint will have the current startup ID (preventing the reversion-to-a-previously-used-SUI behavior seen above) and that if it chances to cross an XLOG page boundary we won't have an out-of-sequence SUI in the continuation page header.

3. When we select a new StartUpID after finishing rollforward, I think we ought to use one plus the value shown in pg_control's copy of the latest checkpoint, rather than relying on what we extracted from the last checkpoint we were able to read from XLOG. This should prevent any possibility of StartUpID going backwards or not incrementing because we lost a checkpoint record in XLOG. (A rough sketch of how points 1-3 would fit together follows this list.)

4. The bit in StartupXLOG where it initializes a new XLOG page header with

    if (InRecovery)
        Insert->currpage->xlp_sui = ThisStartUpID;
    else
        Insert->currpage->xlp_sui = ThisStartUpID + 1;

now scares the heck out of me, not least because it is using a startup id adopted from a checkpoint that might not be the last one.
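To make points 1 through 3 concrete, here is roughly what I have in mind. This is a sketch, not a patch: the first fragment would live in the redo routine for checkpoint records, the second at the end of StartupXLOG, and I am quoting ControlFile->checkPointCopy and CreateCheckPoint() from memory rather than out of CVS tip:

    /*
     * Point 1: when replaying a checkpoint record, adopt the SUI stored in
     * it, so that rolling forward over a crash-restart checkpoint picks up
     * that checkpoint's newer SUI.
     */
    ThisStartUpID = checkPoint.ThisStartUpID;

    /*
     * Points 2 and 3: after rollforward finishes, derive the new SUI from
     * pg_control's copy of the latest checkpoint (so it cannot go backwards
     * even if the last checkpoint record in XLOG was unreadable), and bump
     * it *before* writing the end-of-recovery checkpoint, so that the
     * checkpoint record and any continuation page header it spills onto
     * carry the new SUI.
     */
    if (InRecovery)
    {
        ThisStartUpID = ControlFile->checkPointCopy.ThisStartUpID + 1;
        CreateCheckPoint(true, true);
    }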
As for point 4, what I would suggest we do is simply dike out that whole chunk of code (lines 2676-2706 in CVS tip) and not try to be smart about almost-at-end-of-page cases. It would be better to handle the advance-to-new-page case in the normal flow of control when we are actually ready to write the first new WAL entry; by that point we should know the correct SUI to put into the page header. I'm not totally sure about this though. Is there some reason why StartupXLOG needs to handle end-of-page as a special case?

I would really like to fix this for 7.3.3, so please comment ASAP.

			regards, tom lane