Thread: Shouldn't flush dirty buffers at shutdown ?
Hi all,

While testing vacuum, I noticed that dirty buffers aren't necessarily written to disk in case of abort. In addition, there is no guarantee that they are flushed before the postmaster shuts down. The recent bufmgr changes seem to have made this much more likely than before.

This is harmless unless there are indexes. However, if heap data wasn't flushed while the corresponding indices were written to disk, the indices would point to a non-existent heap block. That would cause an inconsistency after the postmaster restarts.

Shouldn't there be a mechanism to flush dirty buffers at (or before) the shutdown of postmaster?

Comments?

Regards.

Hiroshi Inoue
Inoue@tpf.co.jp

"Hiroshi Inoue" <Inoue@tpf.co.jp> writes:
> if heap data wasn't flushed while corresponding indices
> are written to disk,the indices would point to non-existence
> heap block. It would be the cause of inconsistency after the
> restart of postmaster. Shouldn't there be a mechanism to
> flush dirty buffers at(or before) the shutdown of postmaster ?

Hmm, good point, but that doesn't seem like the right answer. Suppose the system crashes before we are able to flush the dirty buffers? I think you have identified a problem that needs a more general solution: we need to be robust in the case that an index entry is on disk that points to a tuple that never made it to disk. We can legitimately assume that the tuple is uncommitted and ignore the index entry --- we just have to not fail ;-)

Not sure how to do it offhand; maybe some additional checking in the index fetch code will do?

regards, tom lane
> -----Original Message-----
> From: Tom Lane [mailto:tgl@sss.pgh.pa.us]
> Sent: Tuesday, May 09, 2000 11:50 PM
>
> "Hiroshi Inoue" <Inoue@tpf.co.jp> writes:
> > if heap data wasn't flushed while corresponding indices
> > are written to disk,the indices would point to non-existence
> > heap block. It would be the cause of inconsistency after the
> > restart of postmaster. Shouldn't there be a mechanism to
> > flush dirty buffers at(or before) the shutdown of postmaster ?
>
> Hmm, good point, but that doesn't seem like the right answer.
> Suppose the system crashes before we are able to flush the
> dirty buffers?

You are right, but we can hardly expect indexes to be complete after a system crash, because indexes are currently outside transactional control. What surprised me was that such problems can easily occur even with a graceful shutdown of the postmaster.

> I think you have identified a problem that needs
> a more general solution: we need to be robust in the case that
> an index entry is on disk that points to a tuple that never made
> it to disk.

An index entry that points to a non-existent heap block probably doesn't cause a problem immediately, because heap_fetch() ignores UNUSED heap blocks. However, it will not be long before the heap block is filled with another tuple. heap_insert/update() can hardly check whether the heap block being filled is already pointed to by some index entry.

There is another case, which I came across while testing vacuum abort: the index_delete() wasn't flushed, while the corresponding heap block was cleaned (set to UNUSED) and flushed.

Regards.

Hiroshi Inoue
Inoue@tpf.co.jp

"Hiroshi Inoue" <Inoue@tpf.co.jp> writes:
>> Hmm, good point, but that doesn't seem like the right answer.
>> Suppose the system crashes before we are able to flush the
>> dirty buffers?

> You are right but we could hardly expect the completeness
> of indexes in case of system crash because indexes are out
> of transactional control currently.

What? We do flush the index blocks before committing, just the same as heap blocks, so I do not see the risk there. If we have committed a tuple then its index entries should be on disk too. The risk I think you have shown is that there may be extra index entries that correspond to no tuple because we wrote the index blocks first and then crashed before writing the tuple.

> Probably an index entry that points to a non-existent heap
> block doesn't cause a problem immediately because
> heap_fetch() ignores UNUSED heap blocks.
> However it will not be long before the heap block is filled with
> another tuple. heap_insert/update() could hardly check that
> the inserting heap block is already pointed from some index
> entry.

Yes. We need some way of cross-checking that an index entry actually does match the tuple it thinks it's pointing at. The first thought that comes to mind is that the index tuple should contain the OID of the tuple it is for, and then we can check that against the pointed-to tuple. If they don't match, the index entry is bogus and can be discarded at the next VACUUM.

regards, tom lane
> > I think you have identified a problem that needs
> > a more general solution: we need to be robust in the case that
> > an index entry is on disk that points to a tuple that never made
> > it to disk.

And this general solution is WAL.

Vadim
> -----Original Message-----
> From: Mikheev, Vadim [mailto:vmikheev@SECTORBASE.COM]
>
> > > I think you have identified a problem that needs
> > > a more general solution: we need to be robust in the case that
> > > an index entry is on disk that points to a tuple that never made
> > > it to disk.
>
> And this general solution is WAL.
>

Yes, exactly. But I had thought it was mainly for aborts in the middle of btree page splitting, or for system crashes, in which we can't expect synchronous flushing of dirty buffers.

Now I feel I can't stop the postmaster easily. For example, in my test case I have to read a sufficiently large table to flush the dirty buffers before shutting down the postmaster. Must I recommend that everyone do so?

Regards.

Hiroshi Inoue
Inoue@tpf.co.jp
> > > > I think you have identified a problem that needs
> > > > a more general solution: we need to be robust in the case that
> > > > an index entry is on disk that points to a tuple that never made
> > > > it to disk.
> >
> > And this general solution is WAL.
> >
> Yes exactly.
> But I've thought it's mainly for aborts in the middle of btree page
> splitting or for system crash in which we couldn't expect synchronous
> flushing of dirty buffers.

The central idea of WAL: write (and flush) to the log all changes made in data buffers _before_ the data files are changed. The buffer manager will be responsible for this. Changes made in table buffers will be logged before changes made in index buffers, so redo will insert the un-inserted table rows, and index rows will not point to nonexistent tuples in the table. Undo will erase all uncommitted changes (but will not shrink tables/indices).

Vadim
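The log-before-data rule described above can be sketched as a toy model. Everything here is an assumption made for illustration: `Log`, `Page`, `log_change()`, `log_flush()`, and `write_page()` are hypothetical names, and the "log" and "disk" are just fields in memory, not the design that was eventually implemented.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef uint64_t Lsn;  /* log sequence number (hypothetical model) */

typedef struct {
    Lsn next_lsn;     /* LSN to assign to the next log record */
    Lsn flushed_lsn;  /* every record up to this LSN is durable */
} Log;

typedef struct {
    Lsn  page_lsn;  /* LSN of the last log record that changed this page */
    bool dirty;     /* page has changes not yet written to the data file */
    bool on_disk;   /* models whether the latest contents reached disk */
} Page;

/* Record a change in the log first, then mark the page as changed. */
Lsn log_change(Log *log, Page *page)
{
    Lsn lsn = log->next_lsn++;
    page->page_lsn = lsn;
    page->dirty = true;
    page->on_disk = false;
    return lsn;
}

/* Make the log durable up to and including 'upto'. */
void log_flush(Log *log, Lsn upto)
{
    if (upto > log->flushed_lsn)
        log->flushed_lsn = upto;
}

/* Write-ahead rule: flush the log up to the page's LSN before the page. */
void write_page(Log *log, Page *page)
{
    if (!page->dirty)
        return;
    log_flush(log, page->page_lsn);
    assert(log->flushed_lsn >= page->page_lsn);
    page->on_disk = true;
    page->dirty = false;
}
```

If the table change is logged before the index change, then by the time the index page is written, the log record for the table row is already durable, so redo can re-insert the row even though the table page itself was never written.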
> -----Original Message-----
> From: Mikheev, Vadim [mailto:vmikheev@SECTORBASE.COM]
>
> > > > > I think you have identified a problem that needs
> > > > > a more general solution: we need to be robust in the case that
> > > > > an index entry is on disk that points to a tuple that never made
> > > > > it to disk.
> > >
> > > And this general solution is WAL.
> > >
> > Yes exactly.
> > But I've thought it's mainly for aborts in the middle of btree page
> > splitting or for system crash in which we couldn't expect synchronous
> > flushing of dirty buffers.
>
> Central idea of WAL - write (and flush) to log all changes made in data
> buffers _before_ data files will be changed. Buffer mgmr will be
> responsible for this. Changes made in table buffers will be logged before
> changes made in index ones, redo will insert un-inserted table rows and
> index rows will not point to unexistent tuples in table. Undo will erase
> all uncommitted changes (but will not shrink tables/indices).
>

Yes, WAL would naturally solve the current flaw in index updates as part of its effect. What I had not understood until recently is that even normal aborts (not in the middle of b-tree splitting) and a normal shutdown can cause an inconsistency between heap and indices.

Regards.

Hiroshi Inoue
Inoue@tpf.co.jp

"Hiroshi Inoue" <Inoue@tpf.co.jp> writes:
> What I've never understood until recently is that even normal aborts(not
> in the middle of b-tree splitting) and normal shutdown could cause an
> inconsistency between heap and indices.

Yes. Since WAL will provide the real solution in 7.1, I think we need only look for a simple stopgap answer for 7.0.x. Perhaps we could just tweak bufmgr.c so that dirty buffers are flushed out on both transaction commit and abort. That doesn't solve the consistency-after-crash issue, but at least you can do an orderly shutdown of a postmaster without fear.

Is it worth trying to do more now, rather than working on WAL?

regards, tom lane
> -----Original Message-----
> From: Tom Lane [mailto:tgl@sss.pgh.pa.us]
>
> "Hiroshi Inoue" <Inoue@tpf.co.jp> writes:
> > What I've never understood until recently is that even normal aborts(not
> > in the middle of b-tree splitting) and normal shutdown could cause an
> > inconsistency between heap and indices.
>
> Yes. Since WAL will provide the real solution in 7.1, I think we need
> only look for a simple stopgap answer for 7.0.x. Perhaps we could just
> tweak bufmgr.c so that dirty buffers are flushed out on both transaction
> commit and abort. That doesn't solve the consistency-after-crash issue,
> but at least you can do an orderly shutdown of a postmaster without
> fear. Is it worth trying to do more now, rather than working on WAL?
>

Hmm, performance vs. consistency. I vote for consistency this time. However, other people may prefer performance, because the consistency isn't complete in any case and 7.1 will provide a real solution.

Regards.

Hiroshi Inoue
Inoue@tpf.co.jp
smgrwrite() without LockBuffer(was RE: Shouldn't flush dirty buffers at shutdown ?)
From: "Hiroshi Inoue"
> -----Original Message-----
> From: Tom Lane [mailto:tgl@sss.pgh.pa.us]
>
> "Hiroshi Inoue" <Inoue@tpf.co.jp> writes:
> > What I've never understood until recently is that even normal aborts(not
> > in the middle of b-tree splitting) and normal shutdown could cause an
> > inconsistency between heap and indices.
>
> Yes. Since WAL will provide the real solution in 7.1, I think we need
> only look for a simple stopgap answer for 7.0.x. Perhaps we could just
> tweak bufmgr.c so that dirty buffers are flushed out on both transaction
> commit and abort. That doesn't solve the consistency-after-crash issue,
> but at least you can do an orderly shutdown of a postmaster without
> fear. Is it worth trying to do more now, rather than working on WAL?
>

I have another worry now. As far as I can see, PostgreSQL doesn't call LockBuffer() before calling smgrwrite(). This seems to mean that smgrwrite() could write buffers to disk while they are being changed by another backend. If that other backend aborted for some reason, the buffer page would remain half-changed on disk.

Is this well known? Please correct me if I'm wrong.

Regards.

Hiroshi Inoue
Inoue@tpf.co.jp
Re: smgrwrite() without LockBuffer(was RE: Shouldn't flush dirty buffers at shutdown ?)
From: Tom Lane

"Hiroshi Inoue" <Inoue@tpf.co.jp> writes:
> As far as I see,PostgreSQL doesn't call LockBuffer() before
> calling smgrwrite(). This seems to mean that smgrwrite()
> could write buffers to disk which are being changed by
> another backend. If the(another) backend was aborted by
> some reason the buffer page would remain half-changed.

Hmm ... looks fishy to me too. Seems like we ought to hold BUFFER_LOCK_SHARE on the buffer while dumping it out. It wouldn't matter under normal circumstances, but as you say there could be trouble if the other backend crashed before it could mark the buffer dirty again, or if we had a system crash before the dirtied page got written again.

Vadim, what do you think?

regards, tom lane