Thread: Probable bug with CreateFakeRelcacheEntry
I haven't produced a real problem in a small test case (yet), but I convinced myself that it's wrong enough to be called a bug. If you add the following assertion to BufferAlloc: Assert(relpersistence == RELPERSISTENCE_PERMANENT || relpersistence == RELPERSISTENCE_UNLOGGED || relpersistence == RELPERSISTENCE_TEMP); (which seems like a reasonable assertion to me), then do: create table foo(i int); insert into foo values(1); vacuum foo; vacuum foo; insert into foo values(2); Then do an immediate shutdown, then restart, you hit the assertion. The problem is using CreateFakeRelcacheEntry, which has the following comment: Only the fields related to physical storage, like rd_rel, are initialized, so the fake entry is only usable in low-level operations like ReadBuffer(). That doesn't seem right, because ReadBuffer needs relpersistence set. One of the consequences is that you can get buffers with the wrong flags set; in particular, missing BM_PERMANENT, which seems like it could be a serious problem. Are there other areas where we might have similar problems? Regards, Jeff Davis
Re: Probable bug with CreateFakeRelcacheEntry (now with reproducible test case)
From
Jeff Davis
Date:
Indeed, this is a nasty bug that leads to data corruption. The following sequence results in corruption of the visibility map, but I believe it can be shown to cause problems for a btree or GIN index as well. So it's recoverable if you do a VACUUM or a reindex. drop table foo; create table foo(i int); create index foo_idx on foo(i); insert into foo values(1); vacuum foo; checkpoint; insert into foo values(2); -- Send a SIGQUIT to the postmaster, and restart it. -- Now the buffer for the VM is in shared memory without BM_PERMANENT. -- VM block comes from backup block which still has VM bit set, but wal -- record for insert (2) unsets it in memory. delete from foo where i = 2; -- This checkpoint will *not* write out the buffer because it's not -- BM_PERMANENT. Therefore, it remains the same on disk, with the VM -- bit set. checkpoint; -- Send a SIGQUIT to the postmaster and restart. WAL records prior to -- that last checkpoint aren't replayed, so VM bit is still set. set enable_seqscan=false; select * from foo where i = 2; vacuum foo; WARNING: page is not marked all-visible but visibility map bit is set in relation "foo" page 0 VACUUM select * from foo; i --- 1 (1 row) This bug seems particularly troublesome because the right fix would be to include the relpersistence in the WAL records that need it. But that can't be backported (right?). The problem might not be a live problem before 7e4911b2ae33acff7b85234b91372133ec6df9d4, because that bug was hiding this one. Regards, Jeff Davis
Re: Re: Probable bug with CreateFakeRelcacheEntry (now with reproducible test case)
From
Robert Haas
Date:
On Wed, Sep 12, 2012 at 7:19 PM, Jeff Davis <pgsql@j-davis.com> wrote: > This bug seems particularly troublesome because the right fix would be > to include the relpersistence in the WAL records that need it. But that > can't be backported (right?). No, because if a WAL record was written at all, then by definition the table is RELPERSISTENCE_PERMANENT. So there's probably a localized fix. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: Re: Probable bug with CreateFakeRelcacheEntry (now with reproducible test case)
From
Jeff Davis
Date:
On Thu, 2012-09-13 at 12:39 -0400, Robert Haas wrote: > On Wed, Sep 12, 2012 at 7:19 PM, Jeff Davis <pgsql@j-davis.com> wrote: > > This bug seems particularly troublesome because the right fix would be > > to include the relpersistence in the WAL records that need it. But that > > can't be backported (right?). > > No, because if a WAL record was written at all, then by definition the > table is RELPERSISTENCE_PERMANENT. So there's probably a localized > fix. Oh, of course (I was worried there for some reason). I suppose we just need to set the relpersistence to permanent in CreateFakeRelcacheEntry, kind of like ReadBufferWithoutRelcache. And we should probably assert that both functions are only called during recovery (though perhaps redundant for CreateFakeRelcacheEntry, which is in xlogutils.c). Trivial patch attached. Regards, Jeff Davis
Attachment
Re: Re: Probable bug with CreateFakeRelcacheEntry (now with reproducible test case)
From
Robert Haas
Date:
On Thu, Sep 13, 2012 at 1:45 PM, Jeff Davis <pgsql@j-davis.com> wrote: > On Thu, 2012-09-13 at 12:39 -0400, Robert Haas wrote: >> On Wed, Sep 12, 2012 at 7:19 PM, Jeff Davis <pgsql@j-davis.com> wrote: >> > This bug seems particularly troublesome because the right fix would be >> > to include the relpersistence in the WAL records that need it. But that >> > can't be backported (right?). >> >> No, because if a WAL record was written at all, then by definition the >> table is RELPERSISTENCE_PERMANENT. So there's probably a localized >> fix. > > Oh, of course (I was worried there for some reason). I suppose we just > need to set the relpersistence to permanent in CreateFakeRelcacheEntry, > kind of like ReadBufferWithoutRelcache. > > And we should probably assert that both functions are only called during > recovery (though perhaps redundant for CreateFakeRelcacheEntry, which is > in xlogutils.c). > > Trivial patch attached. Committed and back-patched to 9.1. Thanks for the report, diagnosis, and fix. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company