Thread: Shouldn't flush dirty buffers at shutdown ?

Shouldn't flush dirty buffers at shutdown ?

From
"Hiroshi Inoue"
Date:
Hi all,

While testing vacuum, I noticed that dirty buffers aren't
necessarily written to disk in case of abort.
In addition, there is no guarantee that they are flushed before
the postmaster shuts down. The recent bufmgr changes
seem to have made this much more likely than
before.
Certainly it's harmless unless there are indexes. However,
if heap data isn't flushed while the corresponding indices
are written to disk, the indices would point to a non-existent
heap block. That would cause inconsistency after the
postmaster restarts. Shouldn't there be a mechanism to
flush dirty buffers at (or before) postmaster shutdown?

Comments ?

Regards. 

Hiroshi Inoue
Inoue@tpf.co.jp


Re: Shouldn't flush dirty buffers at shutdown ?

From
Tom Lane
Date:
"Hiroshi Inoue" <Inoue@tpf.co.jp> writes:
> if heap data isn't flushed while the corresponding indices
> are written to disk, the indices would point to a non-existent
> heap block. That would cause inconsistency after the
> postmaster restarts. Shouldn't there be a mechanism to
> flush dirty buffers at (or before) postmaster shutdown?

Hmm, good point, but that doesn't seem like the right answer.
Suppose the system crashes before we are able to flush the
dirty buffers?  I think you have identified a problem that needs
a more general solution: we need to be robust in the case that
an index entry is on disk that points to a tuple that never made
it to disk.  We can legitimately assume that the tuple is uncommitted
and ignore the index entry --- we just have to not fail ;-)

Not sure how to do it offhand; maybe some additional checking in the
index fetch code will do?
        regards, tom lane


RE: Shouldn't flush dirty buffers at shutdown ?

From
"Hiroshi Inoue"
Date:
> -----Original Message-----
> From: Tom Lane [mailto:tgl@sss.pgh.pa.us]
> Sent: Tuesday, May 09, 2000 11:50 PM
> 
> "Hiroshi Inoue" <Inoue@tpf.co.jp> writes:
> > if heap data isn't flushed while the corresponding indices
> > are written to disk, the indices would point to a non-existent
> > heap block. That would cause inconsistency after the
> > postmaster restarts. Shouldn't there be a mechanism to
> > flush dirty buffers at (or before) postmaster shutdown?
> 
> Hmm, good point, but that doesn't seem like the right answer.
> Suppose the system crashes before we are able to flush the
> dirty buffers?

You are right, but we can hardly expect indexes to be complete
after a system crash, because indexes are currently outside
transactional control. What surprised me was that
such problems can easily occur even after a graceful
shutdown of the postmaster.

> I think you have identified a problem that needs
> a more general solution: we need to be robust in the case that
> an index entry is on disk that points to a tuple that never made
> it to disk.

Probably an index entry that points to a non-existent heap
block doesn't cause a problem immediately, because
heap_fetch() ignores UNUSED heap blocks.
However, it will not be long before the heap block is filled with
another tuple. heap_insert/update() can hardly check whether
the heap block being inserted into is already pointed to by some
index entry.

There is another case, which I came across while
testing vacuum abort: the result of index_delete() wasn't flushed,
while the corresponding heap block was cleaned (set to UNUSED)
and flushed.

Regards.

Hiroshi Inoue
Inoue@tpf.co.jp 


Re: Shouldn't flush dirty buffers at shutdown ?

From
Tom Lane
Date:
"Hiroshi Inoue" <Inoue@tpf.co.jp> writes:
>> Hmm, good point, but that doesn't seem like the right answer.
>> Suppose the system crashes before we are able to flush the
>> dirty buffers?

> You are right but we could hardly expect the completeness
> of indexes in case of system crash because indexes are out
> of transactional control currently.

What?  We do flush the index blocks before committing, just the
same as heap blocks, so I do not see the risk there.  If we have
committed a tuple then its index entries should be on disk too.
The risk I think you have shown is that there may be extra index
entries that correspond to no tuple because we wrote the index
blocks first and then crashed before writing the tuple.

> Probably an index entry that points to a non-existent heap
> block doesn't cause a problem immediately, because
> heap_fetch() ignores UNUSED heap blocks.
> However, it will not be long before the heap block is filled with
> another tuple. heap_insert/update() can hardly check whether
> the heap block being inserted into is already pointed to by some
> index entry.

Yes.  We need some way of cross-checking that an index entry actually
does match the tuple it thinks it's pointing at.  The first thought that
comes to mind is that the index tuple should contain the OID of the
tuple it is for, and then we can check that against the pointed-to
tuple.  If they don't match, the index entry is bogus and can be
discarded at the next VACUUM.
        regards, tom lane
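
The cross-check proposed above can be sketched as a toy model. This is Python rather than backend C, and every name in it (`HeapTuple`, `IndexEntry`, `heap_oid`, `index_fetch`) is hypothetical; it only assumes, as suggested, that an index entry carries the OID of the tuple it was created for.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Tuple

TID = Tuple[int, int]  # (block number, slot number) in this toy model


@dataclass
class HeapTuple:
    oid: int
    data: str


@dataclass
class IndexEntry:
    key: str
    tid: TID
    heap_oid: int  # assumption: the index entry stores its heap tuple's OID


def index_fetch(heap: Dict[TID, HeapTuple], entry: IndexEntry) -> Optional[HeapTuple]:
    """Return the heap tuple only if it is the one the entry was made for."""
    tup = heap.get(entry.tid)
    if tup is None:
        return None  # slot is unused: the tuple never reached disk
    if tup.oid != entry.heap_oid:
        return None  # slot was reused by another tuple: the entry is bogus
    return tup


# Scenario: an index page was written but the heap page was not, and after
# restart the same heap slot was filled by an unrelated tuple.
heap = {(0, 1): HeapTuple(oid=2002, data="new row")}
stale = IndexEntry(key="old", tid=(0, 1), heap_oid=1001)
valid = IndexEntry(key="new", tid=(0, 1), heap_oid=2002)
```

Here `index_fetch(heap, stale)` returns `None` instead of handing back the unrelated tuple, while `index_fetch(heap, valid)` returns the new row. Removing the bogus entry would be left to a later cleanup pass, which the sketch does not model.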


RE: Shouldn't flush dirty buffers at shutdown ?

From
"Mikheev, Vadim"
Date:
> > I think you have identified a problem that needs
> > a more general solution: we need to be robust in the case that
> > an index entry is on disk that points to a tuple that never made
> > it to disk.

And this general solution is WAL.

Vadim


RE: Shouldn't flush dirty buffers at shutdown ?

From
"Hiroshi Inoue"
Date:
> -----Original Message-----
> From: Mikheev, Vadim [mailto:vmikheev@SECTORBASE.COM]
> 
> > > I think you have identified a problem that needs
> > > a more general solution: we need to be robust in the case that
> > > an index entry is on disk that points to a tuple that never made
> > > it to disk.
> 
> And this general solution is WAL.
>

Yes, exactly.
But I had thought it was mainly for aborts in the middle of a btree page
split, or for system crashes, where we can't expect dirty buffers to be
flushed synchronously.
Now I feel I can't safely stop the postmaster.
For example, in my test case I have to read a sufficiently large
table to flush dirty buffers before shutting down the postmaster.
Must I recommend that everyone do so?

Regards.

Hiroshi Inoue
Inoue@tpf.co.jp 


RE: Shouldn't flush dirty buffers at shutdown ?

From
"Mikheev, Vadim"
Date:
> > > > I think you have identified a problem that needs
> > > > a more general solution: we need to be robust in the case that
> > > > an index entry is on disk that points to a tuple that never made
> > > > it to disk.
> > 
> > And this general solution is WAL.
> >
> Yes exactly.
> But I've thought it's mainly for aborts in the middle of btree page
> splitting or for system crash in which we couldn't expect synchronous
> flushing of dirty buffers.

The central idea of WAL: write (and flush) to the log all changes made in
data buffers _before_ the data files are changed. The buffer manager will be
responsible for this. Changes made in table buffers will be logged before
changes made in index buffers, so redo will insert any missing table rows and
index rows will not point to non-existent tuples in the table. Undo will erase
all uncommitted changes (but will not shrink tables/indices).

Vadim
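
The write-ahead rule described above can be illustrated with a toy model. This is only a sketch in Python, not the planned implementation: the class `ToyWAL`, its methods, and the page names are all hypothetical, and undo is not modeled at all.

```python
from typing import Dict, List, Tuple

LogRecord = Tuple[int, str, str]  # (lsn, page name, new page contents)


class ToyWAL:
    """Toy model: a data page reaches 'disk' only after its log record is durable."""

    def __init__(self) -> None:
        self.log_buffer: List[LogRecord] = []          # lost on crash
        self.durable_log: List[LogRecord] = []         # survives a crash
        self.buffers: Dict[str, Tuple[int, str]] = {}  # page -> (lsn, value), lost on crash
        self.disk: Dict[str, str] = {}                 # data files, survive a crash
        self.next_lsn = 1

    def change(self, page: str, value: str) -> int:
        """Log the change first, then apply it to the in-memory buffer."""
        lsn = self.next_lsn
        self.next_lsn += 1
        self.log_buffer.append((lsn, page, value))
        self.buffers[page] = (lsn, value)
        return lsn

    def flush_log(self, upto_lsn: int) -> None:
        """Make all log records up to and including upto_lsn durable."""
        remaining: List[LogRecord] = []
        for rec in self.log_buffer:
            if rec[0] <= upto_lsn:
                self.durable_log.append(rec)
            else:
                remaining.append(rec)
        self.log_buffer = remaining

    def write_page(self, page: str) -> None:
        """Write one buffer to its data file, obeying the write-ahead rule."""
        lsn, value = self.buffers[page]
        self.flush_log(lsn)  # the log must be on disk before the data page
        self.disk[page] = value

    def crash_and_redo(self) -> None:
        """Lose everything in memory, then replay the durable log."""
        self.buffers.clear()
        self.log_buffer.clear()
        for _lsn, page, value in self.durable_log:
            self.disk[page] = value


wal = ToyWAL()
wal.change("heap:0", "row A")            # the table change is logged first
wal.change("index:0", "A -> heap:0")     # the index change is logged second
wal.write_page("index:0")                # only the index page reaches disk
disk_before_crash = dict(wal.disk)
wal.crash_and_redo()
```

Before the crash only the index page is on disk, which is exactly the dangling-entry situation discussed in this thread. Because flushing the log up to the index record also makes the earlier heap record durable, redo restores the heap page, and the index entry points at an existing row again.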



RE: Shouldn't flush dirty buffers at shutdown ?

From
"Hiroshi Inoue"
Date:
> -----Original Message-----
> From: Mikheev, Vadim [mailto:vmikheev@SECTORBASE.COM]
> 
> > > > > I think you have identified a problem that needs
> > > > > a more general solution: we need to be robust in the case that
> > > > > an index entry is on disk that points to a tuple that never made
> > > > > it to disk.
> > > 
> > > And this general solution is WAL.
> > >
> > Yes exactly.
> > But I've thought it's mainly for aborts in the middle of btree page
> > splitting or for system crash in which we couldn't expect synchronous
> > flushing of dirty buffers.
> 
> The central idea of WAL: write (and flush) to the log all changes made in
> data buffers _before_ the data files are changed. The buffer manager will be
> responsible for this. Changes made in table buffers will be logged before
> changes made in index buffers, so redo will insert any missing table rows and
> index rows will not point to non-existent tuples in the table. Undo will erase
> all uncommitted changes (but will not shrink tables/indices).
>

Yes, WAL would naturally solve the current flaw in index updates as one
of its effects.
What I hadn't understood until recently is that even normal aborts (not
in the middle of a b-tree split) and a normal shutdown can cause an
inconsistency between heap and indices.

Regards.

Hiroshi Inoue
Inoue@tpf.co.jp


Re: Shouldn't flush dirty buffers at shutdown ?

From
Tom Lane
Date:
"Hiroshi Inoue" <Inoue@tpf.co.jp> writes:
> What I hadn't understood until recently is that even normal aborts (not
> in the middle of a b-tree split) and a normal shutdown can cause an
> inconsistency between heap and indices.

Yes.  Since WAL will provide the real solution in 7.1, I think we need
only look for a simple stopgap answer for 7.0.x.  Perhaps we could just
tweak bufmgr.c so that dirty buffers are flushed out on both transaction
commit and abort.  That doesn't solve the consistency-after-crash issue,
but at least you can do an orderly shutdown of a postmaster without
fear.  Is it worth trying to do more now, rather than working on WAL?
        regards, tom lane
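
The stopgap suggested above, flushing dirty buffers at both commit and abort, can be sketched as a toy model. This is Python, not bufmgr.c, and the class `ToyBufferPool` and its methods are hypothetical names. Tuple visibility, which depends on transaction status, is not modeled.

```python
from typing import Dict, Set


class ToyBufferPool:
    """Toy buffer pool that writes out every dirty page when a transaction ends."""

    def __init__(self) -> None:
        self.pages: Dict[str, str] = {}  # in-memory buffers
        self.dirty: Set[str] = set()     # pages changed since the last flush
        self.disk: Dict[str, str] = {}   # what would survive a shutdown

    def write(self, page: str, value: str) -> None:
        """Change a page in memory and mark it dirty."""
        self.pages[page] = value
        self.dirty.add(page)

    def flush_dirty(self) -> None:
        """Write all dirty pages to disk and clear the dirty set."""
        for page in sorted(self.dirty):
            self.disk[page] = self.pages[page]
        self.dirty.clear()

    def commit(self) -> None:
        self.flush_dirty()

    def abort(self) -> None:
        # The stopgap: flush on abort too, so heap and index pages on disk
        # never get out of step with each other after an orderly shutdown.
        self.flush_dirty()


pool = ToyBufferPool()
pool.write("heap:0", "row A")
pool.write("index:0", "A -> heap:0")
pool.abort()
```

After `abort()` both pages are on disk and nothing is left dirty, so an orderly shutdown at this point cannot leave an index page without its heap page. As noted above, this does nothing for a crash that happens in the middle of the flush.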


RE: Shouldn't flush dirty buffers at shutdown ?

From
"Hiroshi Inoue"
Date:
> -----Original Message-----
> From: Tom Lane [mailto:tgl@sss.pgh.pa.us]
>
> "Hiroshi Inoue" <Inoue@tpf.co.jp> writes:
> > What I hadn't understood until recently is that even normal aborts (not
> > in the middle of a b-tree split) and a normal shutdown can cause an
> > inconsistency between heap and indices.
>
> Yes.  Since WAL will provide the real solution in 7.1, I think we need
> only look for a simple stopgap answer for 7.0.x.  Perhaps we could just
> tweak bufmgr.c so that dirty buffers are flushed out on both transaction
> commit and abort.  That doesn't solve the consistency-after-crash issue,
> but at least you can do an orderly shutdown of a postmaster without
> fear.  Is it worth trying to do more now, rather than working on WAL?
>

Hmm, performance vs. consistency.
I vote for consistency this time.
However, other people may prefer performance, because
consistency isn't complete in any case and 7.1 will provide
the real solution.

Regards.

Hiroshi Inoue
Inoue@tpf.co.jp



> -----Original Message-----
> From: Tom Lane [mailto:tgl@sss.pgh.pa.us]
>
> "Hiroshi Inoue" <Inoue@tpf.co.jp> writes:
> > What I hadn't understood until recently is that even normal aborts (not
> > in the middle of a b-tree split) and a normal shutdown can cause an
> > inconsistency between heap and indices.
>
> Yes.  Since WAL will provide the real solution in 7.1, I think we need
> only look for a simple stopgap answer for 7.0.x.  Perhaps we could just
> tweak bufmgr.c so that dirty buffers are flushed out on both transaction
> commit and abort.  That doesn't solve the consistency-after-crash issue,
> but at least you can do an orderly shutdown of a postmaster without
> fear.  Is it worth trying to do more now, rather than working on WAL?
>

I have another concern now.
As far as I can see, PostgreSQL doesn't call LockBuffer() before
calling smgrwrite(). This seems to mean that smgrwrite()
could write to disk buffers that are being changed by
another backend. If that other backend aborted for
some reason, the buffer page on disk would remain half-changed.

Is this well known? Please correct me if I'm wrong.

Regards.

Hiroshi Inoue
Inoue@tpf.co.jp



"Hiroshi Inoue" <Inoue@tpf.co.jp> writes:
> As far as I can see, PostgreSQL doesn't call LockBuffer() before
> calling smgrwrite(). This seems to mean that smgrwrite()
> could write to disk buffers that are being changed by
> another backend. If that other backend aborted for
> some reason, the buffer page on disk would remain half-changed.

Hmm ... looks fishy to me too.  Seems like we ought to hold
BUFFER_LOCK_SHARE on the buffer while dumping it out.  It
wouldn't matter under normal circumstances, but as you say
there could be trouble if the other backend crashed before
it could mark the buffer dirty again, or if we had a system
crash before the dirtied page got written again.

Vadim, what do you think?
        regards, tom lane
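
The idea of holding a lock on the buffer while dumping it out can be sketched with threads standing in for backends. This is a toy Python model under loud assumptions: `ToyBuffer` and its methods are hypothetical, and a plain mutex stands in for both the exclusive lock taken by a modifier and the BUFFER_LOCK_SHARE that the writer would take.

```python
import threading
from typing import List, Tuple


class ToyBuffer:
    """Toy page whose two halves must always agree when seen on disk."""

    def __init__(self) -> None:
        self.page = [0, 0]
        self.lock = threading.Lock()  # stand-in for the buffer content lock

    def modify(self, value: int) -> None:
        """A backend changes the page in two steps while holding the lock."""
        with self.lock:
            self.page[0] = value
            self.page[1] = value

    def write_out(self, disk: List[Tuple[int, int]]) -> None:
        """The writer copies the page out only while holding the lock."""
        with self.lock:
            disk.append((self.page[0], self.page[1]))


buf = ToyBuffer()
disk: List[Tuple[int, int]] = []


def modifier() -> None:
    for value in range(1, 2001):
        buf.modify(value)


def writer() -> None:
    for _ in range(2000):
        buf.write_out(disk)


threads = [threading.Thread(target=modifier), threading.Thread(target=writer)]
for t in threads:
    t.start()
for t in threads:
    t.join()

# Any written image whose halves differ would be a half-changed page.
torn = [image for image in disk if image[0] != image[1]]
```

With the lock taken in `write_out`, the `torn` list stays empty because the writer can never observe the page between the two assignments. Dropping the lock from `write_out` removes that guarantee, which is the hazard raised in this message.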