Thread: Concurrent-update problem in bufmgr.c
The comments for bufmgr.c's BufferSync routine point out that it's a bad thing for some other backend to be modifying a page while it is written out. The following text has gone unchanged since Postgres95: * Also, we need to be sure that no other transaction is* modifying the page as we flush it. This is only a problemfor objects* that use a non-two-phase locking protocol, like btree indices. For* those objects, we would liketo set a write lock for the duration of* our IO. Another possibility is to code updates to btree pages* carefully,so that writing them out out of order cannot cause* any unrecoverable errors.** I don't want to think hardabout this right now, so I will try* to come back to it later. Unfortunately, the comment is wrong about this being a problem only for indexes. It's possible for an invalid state of a heap page to be written out, as well. PageAddItem() sets the item pointer for an added tuple before it copies the tuple onto the page, so if it is recycling an existing item pointer slot, there is a state where a valid item pointer is pointing at a tuple that's wholly or partly not valid. This doesn't matter as far as active backends are concerned because we should be holding BUFFER_LOCK_EXCLUSIVE on the page while modifying it. But some other backend could be in process of writing out the page (if it had previously dirtied the page), and so it's possible for this invalid state to reach disk. If the database is shut down before the new update of the page can be written out, then we have a problem. Normally, the new page state will be written out at transaction commit, but what happens if the current transaction aborts? In that case, the dirty page just sits in shared memory. It will get written the next time a transaction modifies the page (and commits), or when some backend decides to recycle the buffer to hold another page. But if the postmaster gets shut down before that happens, we lose; the dirty page is never written at all, and when it's re-read after database restart, the corrupted page state becomes visible. The window of vulnerability is considerably wider in 7.0 than in prior releases, because in prior releases *any* transaction commit will write all dirty pages. In 7.0 the dirtied page will not get written out until we commit a transaction that modified that particular page (or decide to recycle the buffer). The odds of seeing a problem are still pretty small, but the risk is definitely there. I believe the correct fix for this problem is for bufmgr.c to grab a read lock (BUFFER_LOCK_SHARED) on any page that it is writing out. A read lock is sufficient since there's no need to prevent other backends from reading the page, we just need to prevent them from changing it during the I/O. Comments anyone? regards, tom lane
Tom Lane wrote: [snip] > > The window of vulnerability is considerably wider in 7.0 than in prior > releases, because in prior releases *any* transaction commit will write > all dirty pages. In 7.0 the dirtied page will not get written out until > we commit a transaction that modified that particular page (or decide to > recycle the buffer). The odds of seeing a problem are still pretty > small, but the risk is definitely there. > > I believe the correct fix for this problem is for bufmgr.c to grab > a read lock (BUFFER_LOCK_SHARED) on any page that it is writing out. > A read lock is sufficient since there's no need to prevent other > backends from reading the page, we just need to prevent them from > changing it during the I/O. > > Comments anyone? This seems to be almost same as I posted 4 months ago(smgrwrite() without LockBuffer(was RE: ...). Maybe Vadim would take care of it in the inplementation of WAL. The following was Vadim's reply to you and me. > > > "Hiroshi Inoue" <Inoue@tpf.co.jp> writes: > > > As far as I see,PostgreSQL doesn't call LockBuffer() before > > > calling smgrwrite(). This seems to mean that smgrwrite() > > > could write buffers to disk which are being changed by > > > another backend. If the(another) backend was aborted by > > > some reason the buffer page would remain half-changed. > > > > Hmm ... looks fishy to me too. Seems like we ought to hold > > BUFFER_LOCK_SHARE on the buffer while dumping it out. It > > wouldn't matter under normal circumstances, but as you say > > there could be trouble if the other backend crashed before > > it could mark the buffer dirty again, or if we had a system > > crash before the dirtied page got written again. > > Well, known issue. Buffer latches were implemented in 6.5 only > and there was no time to think about subj hard -:) > Yes, we have to shlock buffer before writing and this is what > bufmgr will must do for WAL anyway (to ensure that last buffer > changes already logged)... but please check that buffer is not > exc-locked by backend itself somewhere before smgrwrite()... > > Vadim > > Regards. Hiroshi Inoue
Hiroshi Inoue <Inoue@tpf.co.jp> writes: > This seems to be almost same as I posted 4 months ago(smgrwrite() > without LockBuffer(was RE: ...). You are right, this was already a known issue (and I had it buried in my to-do list, in fact). I rediscovered it while puzzling over some of the corrupted-data reports we've seen lately. I'll go ahead and make the change in current sources. Does anyone have a strong feeling about whether or not to back-patch it into REL7_0 branch for the upcoming 7.0.3 release? regards, tom lane
> I believe the correct fix for this problem is for bufmgr.c to grab > a read lock (BUFFER_LOCK_SHARED) on any page that it is writing out. > A read lock is sufficient since there's no need to prevent other > backends from reading the page, we just need to prevent them from > changing it during the I/O. > > Comments anyone? Do it. Vadim