"Tom Lane" <tgl@sss.pgh.pa.us> wrote
>
> This is pretty much what heapam and btree currently do, but on looking
> at it I think it's got a problem: we really ought to mark the buffer
> dirty before releasing the critical section. Otherwise, if there's an
> elog(ERROR) before the WriteBuffer call is reached, the backend would go
> on about its business, and we'd have changes in a disk buffer that isn't
> marked dirty. The changes would be uncommitted, presumably, because of
> the error --- but nonetheless this could result in inconsistency down
> the road. One example scenario is:
> 1. We insert a tuple at, say, index 3 on a page.
> 2. elog after making the XLOG entry, but before WriteBuffer.
> 3. page is later discarded from shared buffers; since it's not
> marked dirty, it'll just be dropped without writing it.
> 4. Later we need to insert another tuple in same table, and
> we again choose index 3 on this page as the place to put it.
> 5. system crash leads to replay from WAL.
> Now we'll have two different WAL records trying to insert tuple 3.
> Not good.
>
It may not be good, but it is not harmful either. At step 2, the transaction
will abort and leave a page that has been changed but not marked dirty. Two
situations can follow. One is step 3; the other is that the page stays in the
buffer pool and another transaction writes to it (no problem: the tuple slot
is already marked used). For step 3, yes, we will see two WAL records trying
to insert into the same tuple slot, but the second one will overwrite the
first -- no problem. If the second one does not overwrite the first (say that
WAL record is broken), there is still no problem, since the tuple header
guarantees the tuple is invisible. Can you give an example where this leads
to data corruption?
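
For reference, a rough sketch of the ordering under discussion (illustrative
only -- the real heapam code differs in detail, and the arguments to
PageAddItem/XLogInsert are elided here):

    START_CRIT_SECTION();
    /* modify the page in shared buffers, e.g. put the tuple at slot 3 */
    PageAddItem(page, ...);
    /* emit the WAL record describing the change */
    XLogInsert(...);
    END_CRIT_SECTION();
    /*
     * An elog(ERROR) here aborts the transaction with the buffer modified
     * but not yet marked dirty, because we never reach the call below.
     * That is the window Tom is worried about, and the scenario my reply
     * argues is harmless.
     */
    WriteBuffer(buffer);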
Regards,
Qingqing