Thread: AW: WAL-based allocation of XIDs is insecure

AW: WAL-based allocation of XIDs is insecure

From: Zeugswetter Andreas SB

> 1. A new transaction inserts a tuple.  The tuple is entered into its
> heap file with the new transaction's XID, and an associated WAL log
> entry is made.  Neither of these is on disk yet --- the heap tuple
> is in a shmem disk buffer, and the WAL entry is in the shmem 
> WAL buffer.
> 
> 2. Now do a lot of read-only operations, in the same or another backend.
> The WAL log stays where it is, but eventually the shmem disk buffer will
> get flushed to disk so that the buffer can be re-used for some other
> disk page.
> 
> 3. Assume we now crash.  Now, we have a heap tuple on disk with an XID
> that does not correspond to any XID visible in the on-disk WAL log.
> 
> 4. Upon restart, WAL will initialize the XID counter to the first XID
> not seen in the WAL log.  Guess which one that is.
> 
> 5. We will now run a new transaction with the same XID that was in use
> before the crash.  If that transaction commits, then we have a tuple on
> disk that will be considered valid --- and should not be.

I do not think this is true. Before any modification to a page, the original
page is written to the log (the physical log). On startup rollforward, this
original page, which does not contain the inserted tuple with the stale XID,
is rewritten over the modified page.

Andreas

PS: I thus object to your proposed XID allocation change.


Re: AW: WAL-based allocation of XIDs is insecure

From: Hiroshi Inoue

Zeugswetter Andreas SB wrote:
> 
> > 1. A new transaction inserts a tuple.  The tuple is entered into its
> > heap file with the new transaction's XID, and an associated WAL log
> > entry is made.  Neither of these is on disk yet --- the heap tuple
> > is in a shmem disk buffer, and the WAL entry is in the shmem
> > WAL buffer.
> >
> > 2. Now do a lot of read-only operations, in the same or another backend.
> > The WAL log stays where it is, but eventually the shmem disk buffer will
> > get flushed to disk so that the buffer can be re-used for some other
> > disk page.
> >
> > 3. Assume we now crash.  Now, we have a heap tuple on disk with an XID
> > that does not correspond to any XID visible in the on-disk WAL log.
> >
> > 4. Upon restart, WAL will initialize the XID counter to the first XID
> > not seen in the WAL log.  Guess which one that is.
> >
> > 5. We will now run a new transaction with the same XID that was in use
> > before the crash.  If that transaction commits, then we have a tuple on
> > disk that will be considered valid --- and should not be.
> 
> I do not think this is true. Before any modification to a page the original page will be
> written to the log (aka physical log).

Yes, there must be an XLogFlush() before writing buffers.
BTW, how do we get the next XID if the WAL files are corrupted?

Regards,
Hiroshi Inoue


Re: AW: WAL-based allocation of XIDs is insecure

From: Tom Lane

Zeugswetter Andreas SB  <ZeugswetterA@wien.spardat.at> writes:
>> 5. We will now run a new transaction with the same XID that was in use
>> before the crash.  If that transaction commits, then we have a tuple on
>> disk that will be considered valid --- and should not be.

> I do not think this is true. Before any modification to a page the
> original page will be written to the log (aka physical log).

Hmm.  Actually, what is written to the log is the *modified* page not
its original contents.  However, on studying the buffer manager I see
that it tries to fsync the log entry describing the last mod to a data
page before it writes out the page itself.  So perhaps that can be
relied on to ensure all XIDs known in the heap are known in the log.

However, I'd just as soon have the NEXTXID log records too to be doubly
sure.  I do now agree that we needn't fsync the NEXTXID records,
however.
        regards, tom lane
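
One way the NEXTXID records mentioned above could work is sketched
below. The thread does not spell out the mechanism, so this is an
assumption: XIDs are handed out in batches, and a NEXTXID record naming
the end of the batch is appended to the log before the first XID of the
batch is used. XID_BATCH, XidAllocator, log_nextxid, alloc_xid and
restart_xid are all hypothetical names. Since the log is written
sequentially, flushing any later record presumably carries the earlier
NEXTXID record to disk with it, which would be why no separate fsync is
needed.

```c
#include <assert.h>
#include <stdint.h>

typedef uint32_t Xid;

#define XID_BATCH 1024 /* hypothetical batch size, not taken from the thread */

typedef struct {
    Xid next_xid;             /* next XID to hand out */
    Xid logged_limit;         /* XIDs below this are covered by a NEXTXID record */
    unsigned records_written; /* number of NEXTXID records appended so far */
} XidAllocator;

/* Stand-in for appending a NEXTXID record to the WAL buffer (no fsync). */
static void log_nextxid(XidAllocator *a, Xid limit)
{
    a->logged_limit = limit;
    a->records_written++;
}

/* Hand out one XID, logging a new limit whenever the batch is used up. */
Xid alloc_xid(XidAllocator *a)
{
    /* Invariant: no XID at or above the logged limit has been handed out. */
    assert(a->next_xid <= a->logged_limit);
    if (a->next_xid >= a->logged_limit)
        log_nextxid(a, a->next_xid + XID_BATCH);
    return a->next_xid++;
}

/* After a crash, restart from the limit in the last NEXTXID record seen.
 * Every XID handed out before the crash is below this value. */
Xid restart_xid(Xid last_logged_limit)
{
    return last_logged_limit;
}
```

The cost is that up to XID_BATCH - 1 XIDs are skipped after a crash; in
exchange, only one log record is written per batch.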


Re: AW: WAL-based allocation of XIDs is insecure

From: Tom Lane

Hiroshi Inoue <Inoue@tpf.co.jp> writes:
> Yes, there must be an XLogFlush() before writing buffers.
> BTW, how do we get the next XID if the WAL files are corrupted?

My not-yet-committed changes include storing the latest CheckPoint
record in pg_control (as well as in the WAL files).  Recovery from
XLOG disaster will consist of generating a new XLOG that's empty
except for a CheckPoint record based on the one cached in pg_control.
In particular we can extract the nextOid and nextXid fields.

It might be that writing NEXTXID or NEXTOID log records should update
pg_control too with new nextXid/nextOid values --- what do you think?
Otherwise there's a possibility that the stored checkpoint is too far
back to cover all the values used since then.  OTOH, we are not going
to be able to guarantee absolute consistency in this disaster recovery
scenario anyway; duplicate XIDs may be the least of one's worries.
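
A rough sketch of the disaster-recovery path and of the staleness
concern described above. Only the field names nextXid and nextOid come
from this thread; the struct layouts and the functions rebuild_xlog and
xids_at_risk are hypothetical and do not reflect the real pg_control
format.

```c
#include <assert.h>
#include <stdint.h>

typedef uint32_t Xid;
typedef uint32_t Oid;

/* Hypothetical copy of the latest CheckPoint record. */
typedef struct {
    Xid nextXid;
    Oid nextOid;
} CheckPointCopy;

/* Hypothetical pg_control contents: just the cached checkpoint. */
typedef struct {
    CheckPointCopy checkPoint;
} ControlFile;

/* Hypothetical freshly generated XLOG holding a single record. */
typedef struct {
    CheckPointCopy checkPoint;
    int nrecords;
} FreshXlog;

/* Build a new XLOG that is empty except for one CheckPoint record
 * based on the copy cached in the control file. */
FreshXlog rebuild_xlog(const ControlFile *ctl)
{
    FreshXlog x;

    x.checkPoint = ctl->checkPoint;
    x.nrecords = 1;
    return x;
}

/* Number of XIDs handed out after the cached checkpoint. If the WAL is
 * lost, these will be reissued unless pg_control was updated since. */
uint32_t xids_at_risk(const ControlFile *ctl, Xid true_next_xid)
{
    assert(true_next_xid >= ctl->checkPoint.nextXid);
    return true_next_xid - ctl->checkPoint.nextXid;
}
```

Updating the cached nextXid whenever a NEXTXID record is written would
shrink the value returned by xids_at_risk(), at the price of more
writes to the control file.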

Of course, if you lose both XLOG and pg_control, you're still in big
trouble.  So it seems we should minimize the number of writes to
pg_control, which is an argument not to update it more than we must.
        regards, tom lane