Thread: WAL replay of truncate fails if the table was dropped

WAL replay of truncate fails if the table was dropped

From
Heikki Linnakangas
Date:
mdtruncate throws an error if the relation file doesn't exist. However,
that's not an error condition if the relation was dropped later.
Non-existent file should be treated the same as an already truncated
file; we now end up with an unrecoverable database.

This bug seems to be present from 8.0 onwards.

Attached is a test case to reproduce it, along with a patch for CVS
HEAD, and an adapted version of the patch for 8.0-8.2.

Thanks to my colleague Dharmendra Goyal for finding this bug and
constructing an initial test case.

--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com
DROP TABLE test;
CREATE TABLE test(i char(1800));

BEGIN;
INSERT INTO test VALUES('a');
ROLLBACK;
-- Crash database. This leaves behind an unitialized page at the end of the
-- relation, which vacuum will then truncate.
\! DATA=`psql regression -t -c 'show data_directory'`&& echo $DATA; pg_ctl restart  -w -m immediate  -D $DATA
\c regression;

VACUUM test;

DROP TABLE test;
-- Crash database
\! DATA=`psql regression -t -c 'show data_directory'`&& echo $DATA; pg_ctl restart  -w -m immediate  -D $DATA
\c regression;


Index: src/backend/storage/smgr/md.c
===================================================================
RCS file: /home/hlinnaka/pgcvsrepository/pgsql/src/backend/storage/smgr/md.c,v
retrieving revision 1.128
diff -c -r1.128 md.c
*** src/backend/storage/smgr/md.c    12 Apr 2007 17:10:55 -0000    1.128
--- src/backend/storage/smgr/md.c    20 Jul 2007 12:31:42 -0000
***************
*** 717,722 ****
--- 717,734 ----
  #endif

      /*
+      * Open the relation. We call mdopen before mdnblocks, because that will
+      * fail if the first segment doesn't exist. That can happen in recovery,
+      * if the relation was dropped after the truncate.
+      */
+     v = mdopen(reln, InRecovery ? EXTENSION_RETURN_NULL : EXTENSION_FAIL);
+     if (v == NULL)
+     {
+         Assert(InRecovery);
+         return;
+     }
+
+     /*
       * NOTE: mdnblocks makes sure we have opened all active segments, so
       * that truncation loop will get them all!
       */
***************
*** 736,743 ****
      if (nblocks == curnblk)
          return;                    /* no work */

-     v = mdopen(reln, EXTENSION_FAIL);
-
  #ifndef LET_OS_MANAGE_FILESIZE
      priorblocks = 0;
      while (v != NULL)
--- 748,753 ----
Index: src/backend/storage/smgr/md.c
===================================================================
RCS file: /home/hlinnaka/pgcvsrepository/pgsql/src/backend/storage/smgr/md.c,v
retrieving revision 1.114.4.2
diff -c -r1.114.4.2 md.c
*** src/backend/storage/smgr/md.c    26 Apr 2007 23:25:30 -0000    1.114.4.2
--- src/backend/storage/smgr/md.c    20 Jul 2007 12:37:45 -0000
***************
*** 601,606 ****
--- 601,618 ----
  #endif

      /*
+      * Open the relation. We call mdopen before mdnblocks, because that will
+      * fail if the first segment doesn't exist. That can happen in recovery,
+      * if the relation was dropped after the truncate.
+      */
+     v = mdopen(reln, InRecovery);
+     if (v == NULL)
+     {
+         Assert(InRecovery);
+         return;
+     }
+
+     /*
       * NOTE: mdnblocks makes sure we have opened all active segments, so
       * that truncation loop will get them all!
       */
***************
*** 612,619 ****
      if (nblocks == curnblk)
          return nblocks;            /* no work */

-     v = mdopen(reln, false);
-
  #ifndef LET_OS_MANAGE_FILESIZE
      priorblocks = 0;
      while (v != NULL)
--- 624,629 ----

Re: WAL replay of truncate fails if the table was dropped

From
Tom Lane
Date:
Heikki Linnakangas <heikki@enterprisedb.com> writes:
> mdtruncate throws an error if the relation file doesn't exist.

Interesting corner case.  The proposed fix seems not very consistent
with the way we handle comparable cases elsewhere, though.  In general,
md.c will cut some slack when InRecovery if a relation is shorter than
expected, but not if it's not there at all.  (This is, indeed, what
justifies mdtruncate's response to file-too-short...)  We handle
dropped files during recovery by forced smgrcreate() in places like
XLogOpenRelation.  I'm inclined to think smgr_redo should force
smgrcreate() before trying to truncate.

            regards, tom lane

Re: WAL replay of truncate fails if the table was dropped

From
Heikki Linnakangas
Date:
Tom Lane wrote:
> Heikki Linnakangas <heikki@enterprisedb.com> writes:
>> mdtruncate throws an error if the relation file doesn't exist.
>
> Interesting corner case.  The proposed fix seems not very consistent
> with the way we handle comparable cases elsewhere, though.  In general,
> md.c will cut some slack when InRecovery if a relation is shorter than
> expected, but not if it's not there at all.  (This is, indeed, what
> justifies mdtruncate's response to file-too-short...)  We handle
> dropped files during recovery by forced smgrcreate() in places like
> XLogOpenRelation.  I'm inclined to think smgr_redo should force
> smgrcreate() before trying to truncate.

I followed the example of the file-too-short case. Yeah, calling
smgrcreate would work and I can see the justification for that as well.

Interestingly, this bug isn't triggered unless there's an already empty
or uninitialized page at the end of table. If vacuum removes the last
tuple from the page, that will be WAL-logged and replay of that calls
smgrcreate.

--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com

Re: WAL replay of truncate fails if the table was dropped

From
Tom Lane
Date:
Heikki Linnakangas <heikki@enterprisedb.com> writes:
> Interestingly, this bug isn't triggered unless there's an already empty
> or uninitialized page at the end of table. If vacuum removes the last
> tuple from the page, that will be WAL-logged and replay of that calls
> smgrcreate.

Yeah, I tried other ways to provoke the failure and came to the same
conclusion.  The reproducer really is relying on the fact that vacuum's
PageInit of an uninitialized page doesn't get WAL-logged.  Which is a
bit nervous-making.  As far as I can think at the moment, it won't
provoke any problem because the first subsequent WAL-logged touch of
the page would be an INSERT with the INIT bit set; but it does mean
that a warm-standby slave would be out of sync with the master for an
indefinitely long period with respect to the on-disk contents of such a
page.  Does that matter?

Note that we have to fix truncate replay anyway, since you could have
the same failure if a checkpoint happened just before an ordinary
vacuum's truncate.  This PageInit behavior merely allows a simpler
reproducer script with no race condition involved.

            regards, tom lane

Re: WAL replay of truncate fails if the table was dropped

From
"Simon Riggs"
Date:
On Fri, 2007-07-20 at 11:38 -0400, Tom Lane wrote:
> Heikki Linnakangas <heikki@enterprisedb.com> writes:
> > Interestingly, this bug isn't triggered unless there's an already empty
> > or uninitialized page at the end of table. If vacuum removes the last
> > tuple from the page, that will be WAL-logged and replay of that calls
> > smgrcreate.
>
> Yeah, I tried other ways to provoke the failure and came to the same
> conclusion.  The reproducer really is relying on the fact that vacuum's
> PageInit of an uninitialized page doesn't get WAL-logged.  Which is a
> bit nervous-making.  As far as I can think at the moment, it won't
> provoke any problem because the first subsequent WAL-logged touch of
> the page would be an INSERT with the INIT bit set; but it does mean
> that a warm-standby slave would be out of sync with the master for an
> indefinitely long period with respect to the on-disk contents of such a
> page.  Does that matter?

If I understand this: the primary would be initialised yet the standby
would remain uninitialised? I don't think that matters because the
actual the data contents are still zero.

--
  Simon Riggs
  EnterpriseDB  http://www.enterprisedb.com

Re: WAL replay of truncate fails if the table was dropped

From
Tom Lane
Date:
"Simon Riggs" <simon@2ndquadrant.com> writes:
> If I understand this: the primary would be initialised yet the standby
> would remain uninitialised? I don't think that matters because the
> actual the data contents are still zero.

From a logical perspective the page is "empty" either way.  The only
behavioral difference I can think of is that the initialized page is a
candidate for insertion of new tuples, whereas on the slave it would not
be a candidate until after another VACUUM.  So the histories would
diverge faster once the slave comes alive.  As long as the slave is just
following WAL records and not making any decisions of its own, I can't
see a failure mode; but it looks like a potential weak spot for future
extensions (particularly, trying to allow slave servers to execute
queries).

            regards, tom lane