Thread: WAL replay of truncate fails if the table was dropped
mdtruncate throws an error if the relation file doesn't exist. However, that's not an error condition if the relation was dropped later. Non-existent file should be treated the same as an already truncated file; we now end up with an unrecoverable database. This bug seems to be present from 8.0 onwards. Attached is a test case to reproduce it, along with a patch for CVS HEAD, and an adapted version of the patch for 8.0-8.2. Thanks to my colleague Dharmendra Goyal for finding this bug and constructing an initial test case. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com DROP TABLE test; CREATE TABLE test(i char(1800)); BEGIN; INSERT INTO test VALUES('a'); ROLLBACK; -- Crash database. This leaves behind an unitialized page at the end of the -- relation, which vacuum will then truncate. \! DATA=`psql regression -t -c 'show data_directory'`&& echo $DATA; pg_ctl restart -w -m immediate -D $DATA \c regression; VACUUM test; DROP TABLE test; -- Crash database \! DATA=`psql regression -t -c 'show data_directory'`&& echo $DATA; pg_ctl restart -w -m immediate -D $DATA \c regression; Index: src/backend/storage/smgr/md.c =================================================================== RCS file: /home/hlinnaka/pgcvsrepository/pgsql/src/backend/storage/smgr/md.c,v retrieving revision 1.128 diff -c -r1.128 md.c *** src/backend/storage/smgr/md.c 12 Apr 2007 17:10:55 -0000 1.128 --- src/backend/storage/smgr/md.c 20 Jul 2007 12:31:42 -0000 *************** *** 717,722 **** --- 717,734 ---- #endif /* + * Open the relation. We call mdopen before mdnblocks, because that will + * fail if the first segment doesn't exist. That can happen in recovery, + * if the relation was dropped after the truncate. + */ + v = mdopen(reln, InRecovery ? EXTENSION_RETURN_NULL : EXTENSION_FAIL); + if (v == NULL) + { + Assert(InRecovery); + return; + } + + /* * NOTE: mdnblocks makes sure we have opened all active segments, so * that truncation loop will get them all! */ *************** *** 736,743 **** if (nblocks == curnblk) return; /* no work */ - v = mdopen(reln, EXTENSION_FAIL); - #ifndef LET_OS_MANAGE_FILESIZE priorblocks = 0; while (v != NULL) --- 748,753 ---- Index: src/backend/storage/smgr/md.c =================================================================== RCS file: /home/hlinnaka/pgcvsrepository/pgsql/src/backend/storage/smgr/md.c,v retrieving revision 1.114.4.2 diff -c -r1.114.4.2 md.c *** src/backend/storage/smgr/md.c 26 Apr 2007 23:25:30 -0000 1.114.4.2 --- src/backend/storage/smgr/md.c 20 Jul 2007 12:37:45 -0000 *************** *** 601,606 **** --- 601,618 ---- #endif /* + * Open the relation. We call mdopen before mdnblocks, because that will + * fail if the first segment doesn't exist. That can happen in recovery, + * if the relation was dropped after the truncate. + */ + v = mdopen(reln, InRecovery); + if (v == NULL) + { + Assert(InRecovery); + return; + } + + /* * NOTE: mdnblocks makes sure we have opened all active segments, so * that truncation loop will get them all! */ *************** *** 612,619 **** if (nblocks == curnblk) return nblocks; /* no work */ - v = mdopen(reln, false); - #ifndef LET_OS_MANAGE_FILESIZE priorblocks = 0; while (v != NULL) --- 624,629 ----
Heikki Linnakangas <heikki@enterprisedb.com> writes: > mdtruncate throws an error if the relation file doesn't exist. Interesting corner case. The proposed fix seems not very consistent with the way we handle comparable cases elsewhere, though. In general, md.c will cut some slack when InRecovery if a relation is shorter than expected, but not if it's not there at all. (This is, indeed, what justifies mdtruncate's response to file-too-short...) We handle dropped files during recovery by forced smgrcreate() in places like XLogOpenRelation. I'm inclined to think smgr_redo should force smgrcreate() before trying to truncate. regards, tom lane
Tom Lane wrote: > Heikki Linnakangas <heikki@enterprisedb.com> writes: >> mdtruncate throws an error if the relation file doesn't exist. > > Interesting corner case. The proposed fix seems not very consistent > with the way we handle comparable cases elsewhere, though. In general, > md.c will cut some slack when InRecovery if a relation is shorter than > expected, but not if it's not there at all. (This is, indeed, what > justifies mdtruncate's response to file-too-short...) We handle > dropped files during recovery by forced smgrcreate() in places like > XLogOpenRelation. I'm inclined to think smgr_redo should force > smgrcreate() before trying to truncate. I followed the example of the file-too-short case. Yeah, calling smgrcreate would work and I can see the justification for that as well. Interestingly, this bug isn't triggered unless there's an already empty or uninitialized page at the end of table. If vacuum removes the last tuple from the page, that will be WAL-logged and replay of that calls smgrcreate. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
Heikki Linnakangas <heikki@enterprisedb.com> writes: > Interestingly, this bug isn't triggered unless there's an already empty > or uninitialized page at the end of table. If vacuum removes the last > tuple from the page, that will be WAL-logged and replay of that calls > smgrcreate. Yeah, I tried other ways to provoke the failure and came to the same conclusion. The reproducer really is relying on the fact that vacuum's PageInit of an uninitialized page doesn't get WAL-logged. Which is a bit nervous-making. As far as I can think at the moment, it won't provoke any problem because the first subsequent WAL-logged touch of the page would be an INSERT with the INIT bit set; but it does mean that a warm-standby slave would be out of sync with the master for an indefinitely long period with respect to the on-disk contents of such a page. Does that matter? Note that we have to fix truncate replay anyway, since you could have the same failure if a checkpoint happened just before an ordinary vacuum's truncate. This PageInit behavior merely allows a simpler reproducer script with no race condition involved. regards, tom lane
On Fri, 2007-07-20 at 11:38 -0400, Tom Lane wrote: > Heikki Linnakangas <heikki@enterprisedb.com> writes: > > Interestingly, this bug isn't triggered unless there's an already empty > > or uninitialized page at the end of table. If vacuum removes the last > > tuple from the page, that will be WAL-logged and replay of that calls > > smgrcreate. > > Yeah, I tried other ways to provoke the failure and came to the same > conclusion. The reproducer really is relying on the fact that vacuum's > PageInit of an uninitialized page doesn't get WAL-logged. Which is a > bit nervous-making. As far as I can think at the moment, it won't > provoke any problem because the first subsequent WAL-logged touch of > the page would be an INSERT with the INIT bit set; but it does mean > that a warm-standby slave would be out of sync with the master for an > indefinitely long period with respect to the on-disk contents of such a > page. Does that matter? If I understand this: the primary would be initialised yet the standby would remain uninitialised? I don't think that matters because the actual the data contents are still zero. -- Simon Riggs EnterpriseDB http://www.enterprisedb.com
"Simon Riggs" <simon@2ndquadrant.com> writes: > If I understand this: the primary would be initialised yet the standby > would remain uninitialised? I don't think that matters because the > actual the data contents are still zero. From a logical perspective the page is "empty" either way. The only behavioral difference I can think of is that the initialized page is a candidate for insertion of new tuples, whereas on the slave it would not be a candidate until after another VACUUM. So the histories would diverge faster once the slave comes alive. As long as the slave is just following WAL records and not making any decisions of its own, I can't see a failure mode; but it looks like a potential weak spot for future extensions (particularly, trying to allow slave servers to execute queries). regards, tom lane