Thread: Probable bug with CreateFakeRelcacheEntry

Probable bug with CreateFakeRelcacheEntry

From
Jeff Davis
Date:
I haven't produced a real problem in a small test case (yet), but I
convinced myself that it's wrong enough to be called a bug.

If you add the following assertion to BufferAlloc:

  Assert(relpersistence == RELPERSISTENCE_PERMANENT ||
         relpersistence == RELPERSISTENCE_UNLOGGED ||
         relpersistence == RELPERSISTENCE_TEMP);

(which seems like a reasonable assertion to me), then do:

  create table foo(i int);
  insert into foo values(1);
  vacuum foo;
  vacuum foo;
  insert into foo values(2);

Then do an immediate shutdown, then restart, you hit the assertion.

The problem is using CreateFakeRelcacheEntry, which has the following
comment:

  Only the fields related to physical storage, like rd_rel, are
  initialized, so the fake entry is only usable in low-level
  operations like ReadBuffer().

That doesn't seem right, because ReadBuffer needs relpersistence set.

One of the consequences is that you can get buffers with the wrong flags
set; in particular, missing BM_PERMANENT, which seems like it could be a
serious problem.

Are there other areas where we might have similar problems?

Regards,
    Jeff Davis

Re: Probable bug with CreateFakeRelcacheEntry (now with reproducible test case)

From
Jeff Davis
Date:
Indeed, this is a nasty bug that leads to data corruption. The following
sequence results in corruption of the visibility map, but I believe it
can be shown to cause problems for a btree or GIN index as well. So it's
recoverable if you do a VACUUM or a reindex.

drop table foo;
create table foo(i int);
create index foo_idx on foo(i);
insert into foo values(1);
vacuum foo;
checkpoint;
insert into foo values(2);

-- Send a SIGQUIT to the postmaster, and restart it.
-- Now the buffer for the VM is in shared memory without BM_PERMANENT.
-- VM block comes from backup block which still has VM bit set, but wal
-- record for insert (2) unsets it in memory.

delete from foo where i = 2;
-- This checkpoint will *not* write out the buffer because it's not
-- BM_PERMANENT. Therefore, it remains the same on disk, with the VM
-- bit set.
checkpoint;

-- Send a SIGQUIT to the postmaster and restart. WAL records prior to
-- that last checkpoint aren't replayed, so VM bit is still set.

set enable_seqscan=false;
select * from foo where i = 2;
vacuum foo;
  WARNING:  page is not marked all-visible but visibility map bit is
  set in relation "foo" page 0
  VACUUM
select * from foo;
   i
  ---
   1
  (1 row)

This bug seems particularly troublesome because the right fix would be
to include the relpersistence in the WAL records that need it. But that
can't be backported (right?).

The problem might not be a live problem before
7e4911b2ae33acff7b85234b91372133ec6df9d4, because that bug was hiding
this one.

Regards,
    Jeff Davis
On Wed, Sep 12, 2012 at 7:19 PM, Jeff Davis <pgsql@j-davis.com> wrote:
> This bug seems particularly troublesome because the right fix would be
> to include the relpersistence in the WAL records that need it. But that
> can't be backported (right?).

No, because if a WAL record was written at all, then by definition the
table is RELPERSISTENCE_PERMANENT.  So there's probably a localized
fix.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Thu, 2012-09-13 at 12:39 -0400, Robert Haas wrote:
> On Wed, Sep 12, 2012 at 7:19 PM, Jeff Davis <pgsql@j-davis.com> wrote:
> > This bug seems particularly troublesome because the right fix would be
> > to include the relpersistence in the WAL records that need it. But that
> > can't be backported (right?).
>
> No, because if a WAL record was written at all, then by definition the
> table is RELPERSISTENCE_PERMANENT.  So there's probably a localized
> fix.

Oh, of course (I was worried there for some reason). I suppose we just
need to set the relpersistence to permanent in CreateFakeRelcacheEntry,
kind of like ReadBufferWithoutRelcache.

And we should probably assert that both functions are only called during
recovery (though perhaps redundant for CreateFakeRelcacheEntry, which is
in xlogutils.c).

Trivial patch attached.

Regards,
    Jeff Davis

Attachment
On Thu, Sep 13, 2012 at 1:45 PM, Jeff Davis <pgsql@j-davis.com> wrote:
> On Thu, 2012-09-13 at 12:39 -0400, Robert Haas wrote:
>> On Wed, Sep 12, 2012 at 7:19 PM, Jeff Davis <pgsql@j-davis.com> wrote:
>> > This bug seems particularly troublesome because the right fix would be
>> > to include the relpersistence in the WAL records that need it. But that
>> > can't be backported (right?).
>>
>> No, because if a WAL record was written at all, then by definition the
>> table is RELPERSISTENCE_PERMANENT.  So there's probably a localized
>> fix.
>
> Oh, of course (I was worried there for some reason). I suppose we just
> need to set the relpersistence to permanent in CreateFakeRelcacheEntry,
> kind of like ReadBufferWithoutRelcache.
>
> And we should probably assert that both functions are only called during
> recovery (though perhaps redundant for CreateFakeRelcacheEntry, which is
> in xlogutils.c).
>
> Trivial patch attached.

Committed and back-patched to 9.1.  Thanks for the report, diagnosis, and fix.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company