Concurrent-update problem in bufmgr.c - Mailing list pgsql-hackers

From Tom Lane
Subject Concurrent-update problem in bufmgr.c
Date
Msg-id 15427.969839290@sss.pgh.pa.us
Whole thread Raw
List pgsql-hackers
The comments for bufmgr.c's BufferSync routine point out that it's a
bad thing for some other backend to be modifying a page while it is
written out.  The following text has gone unchanged since Postgres95:
*    Also, we need to be sure that no other transaction is*    modifying the page as we flush it.  This is only a
problemfor objects*    that use a non-two-phase locking protocol, like btree indices.  For*    those objects, we would
liketo set a write lock for the duration of*    our IO.  Another possibility is to code updates to btree pages*
carefully,so that writing them out out of order cannot cause*    any unrecoverable errors.**    I don't want to think
hardabout this right now, so I will try*    to come back to it later.
 

Unfortunately, the comment is wrong about this being a problem only for
indexes.  It's possible for an invalid state of a heap page to be
written out, as well.  PageAddItem() sets the item pointer for an added
tuple before it copies the tuple onto the page, so if it is recycling an
existing item pointer slot, there is a state where a valid item pointer
is pointing at a tuple that's wholly or partly not valid.  This doesn't
matter as far as active backends are concerned because we should be
holding BUFFER_LOCK_EXCLUSIVE on the page while modifying it.  But some
other backend could be in process of writing out the page (if it had
previously dirtied the page), and so it's possible for this invalid
state to reach disk.  If the database is shut down before the new
update of the page can be written out, then we have a problem.

Normally, the new page state will be written out at transaction commit,
but what happens if the current transaction aborts?  In that case, the
dirty page just sits in shared memory.  It will get written the next
time a transaction modifies the page (and commits), or when some backend
decides to recycle the buffer to hold another page.  But if the
postmaster gets shut down before that happens, we lose; the dirty page
is never written at all, and when it's re-read after database restart,
the corrupted page state becomes visible.

The window of vulnerability is considerably wider in 7.0 than in prior
releases, because in prior releases *any* transaction commit will write
all dirty pages.  In 7.0 the dirtied page will not get written out until
we commit a transaction that modified that particular page (or decide to
recycle the buffer).  The odds of seeing a problem are still pretty
small, but the risk is definitely there.

I believe the correct fix for this problem is for bufmgr.c to grab
a read lock (BUFFER_LOCK_SHARED) on any page that it is writing out.
A read lock is sufficient since there's no need to prevent other
backends from reading the page, we just need to prevent them from
changing it during the I/O.

Comments anyone?
        regards, tom lane


pgsql-hackers by date:

Previous
From: Tom Lane
Date:
Subject: Re: RFC - change of behaviour of pg_get_userbyid & pg_get_viewdef?
Next
From: "Dominic J. Eidson"
Date:
Subject: Re: RFC - change of behaviour of pg_get_userbyid & pg_get_viewdef?