Thread: WAL-based allocation of XIDs is insecure

WAL-based allocation of XIDs is insecure

From: Tom Lane
Consider the following scenario:

1. A new transaction inserts a tuple.  The tuple is entered into its
heap file with the new transaction's XID, and an associated WAL log
entry is made.  Neither one of these is on disk yet --- the heap tuple
is in a shmem disk buffer, and the WAL entry is in the shmem WAL buffer.

2. Now do a lot of read-only operations, in the same or another backend.
The WAL log stays where it is, but eventually the shmem disk buffer will
get flushed to disk so that the buffer can be re-used for some other
disk page.

3. Assume we now crash.  Now, we have a heap tuple on disk with an XID
that does not correspond to any XID visible in the on-disk WAL log.

4. Upon restart, WAL will initialize the XID counter to the first XID
not seen in the WAL log.  Guess which one that is.

5. We will now run a new transaction with the same XID that was in use
before the crash.  If that transaction commits, then we have a tuple on
disk that will be considered valid --- and should not be.


After thinking about this for a little, it seems to me that XID
assignment should be handled more like OID assignment: rather than
handing out XIDs one-at-a-time, varsup.c should allocate them in blocks,
and should write an XLOG record to reflect the allocation of each block
of XIDs.  Furthermore, the above example demonstrates that *we must
flush that XLOG entry to disk* before we can start to actually hand out
the XIDs.  This ensures that the next system cycle won't re-use any XIDs
that may have been in use at the time of a crash.

OID assignment is not quite so critical.  Consider again the scenario
above: we don't really care if after restart we reuse the OID that was
assigned to the crashed transaction's inserted tuple.  As long as the
tuple itself is not considered committed, it doesn't matter what OID it
contains.  So, it's not necessary to force XLOG flush for OID-assignment
XLOG entries.

In short then: make the XID allocation machinery just like the OID
allocation machinery presently is, plus an XLogFlush() after writing
the NEXTXID XLOG record.
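
Roughly, the idea is something like this (only a sketch; XID_PREFETCH and
XLogPutNextXid are invented names, and locking and shared-memory placement
are omitted):

    /* Sketch only: XID_PREFETCH and XLogPutNextXid are invented names. */
    #define XID_PREFETCH    1024

    static TransactionId nextXid;       /* next XID to hand out */
    static TransactionId xidLogLimit;   /* first XID not covered by a
                                         * NEXTXID record already on disk */

    TransactionId
    GetNewTransactionId(void)
    {
        if (nextXid >= xidLogLimit)
        {
            XLogRecPtr  recptr;

            /* log that XIDs up to nextXid + XID_PREFETCH may get used */
            recptr = XLogPutNextXid(nextXid + XID_PREFETCH);

            /*
             * Flush that record before handing out any of the XIDs it
             * covers, so that a post-crash restart cannot re-issue an
             * XID that may already be sitting in a data page on disk.
             */
            XLogFlush(recptr);

            xidLogLimit = nextXid + XID_PREFETCH;
        }
        return nextXid++;
    }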

Comments?
        regards, tom lane

PS: oh, another thing: redo of a checkpoint record ought to advance the
XID and OID counters to be at least what the checkpoint record shows.
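
Something along these lines in the checkpoint redo routine, say (a sketch
only; the field names are approximate):

    /* Sketch; field names approximate.  Never move the counters backward. */
    if (ShmemVariableCache->nextXid < checkPoint.nextXid)
        ShmemVariableCache->nextXid = checkPoint.nextXid;
    if (ShmemVariableCache->nextOid < checkPoint.nextOid)
        ShmemVariableCache->nextOid = checkPoint.nextOid;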


Re: WAL-based allocation of XIDs is insecure

From: Ian Lance Taylor
Tom Lane <tgl@sss.pgh.pa.us> writes:

> After thinking about this for a little, it seems to me that XID
> assignment should be handled more like OID assignment: rather than
> handing out XIDs one-at-a-time, varsup.c should allocate them in blocks,
> and should write an XLOG record to reflect the allocation of each block
> of XIDs.  Furthermore, the above example demonstrates that *we must
> flush that XLOG entry to disk* before we can start to actually hand out
> the XIDs.  This ensures that the next system cycle won't re-use any XIDs
> that may have been in use at the time of a crash.

I think your example demonstrates something slightly different.  I
think it demonstrates that Postgres must flush the XLOG entry to disk
before it flushes any buffer to disk which uses an XID which was just
allocated.

For each buffer, heap_update could record the last XID stored into
that buffer.  When a buffer is forced out to disk, Postgres could make
sure that the XLOG entry which uses the XID is previously forced out
to disk.
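
Conceptually, something like this (the names here are invented for
illustration, not actual bufmgr code):

    /* Invented names; just to illustrate the bookkeeping. */
    typedef struct BufferBookkeeping
    {
        /* ... existing buffer header fields ... */
        TransactionId   lastXidStored;  /* highest XID written into page */
    } BufferBookkeeping;

    /* called before a dirty page is forced out to disk */
    void
    FlushDirtyBuffer(BufferBookkeeping *buf)
    {
        /*
         * Make sure the XLOG record that allocated buf->lastXidStored is
         * already on disk, flushing XLOG that far if necessary.
         */
        EnsureNextXidRecordFlushed(buf->lastXidStored);

        /* only now is it safe to write the data page itself */
        WriteBufferToDisk(buf);
    }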

A simpler and less accurate approach: when any dirty buffer is forced
to disk in order to allocate a buffer, make sure that any XLOG entry
which allocates new XIDs is flushed to disk first.

I don't know if these are better.  I raise them because you are
suggesting putting an occasional fsync at transaction start to avoid
an unlikely scenario.  A bit of bookkeeping can be used instead to
notice the unlikely scenario when it occurs.

Ian


Re: WAL-based allocation of XIDs is insecure

From: Tom Lane
Ian Lance Taylor <ian@airs.com> writes:
> I think your example demonstrates something slightly different.  I
> think it demonstrates that Postgres must flush the XLOG entry to disk
> before it flushes any buffer to disk which uses an XID which was just
> allocated.

That would be an alternative solution, but it's considerably more
complex to implement and I'm not convinced it is more efficient.

The above could result, worst case, in double the normal number of
fsyncs --- each new transaction might need an fsync to dump its first
few XLOG records (in addition to the fsync for its commit), if the
shmem disk buffer traffic is not in your favor.  This worst case is
not even difficult to produce: consider a series of standalone
transactions that each touch more than -B pages (-B = # of buffers).

In contrast, syncing NEXTXID records will require exactly one extra
fsync every few thousand transactions.  That seems quite acceptable
to me, and better than an fsync load that we can't predict.  Perhaps
the average case of fsync-on-buffer-flush would be better than that,
or perhaps not, but the worst case is definitely far worse.
        regards, tom lane


Re: WAL-based allocation of XIDs is insecure

From: Ian Lance Taylor
Tom Lane <tgl@sss.pgh.pa.us> writes:

> Ian Lance Taylor <ian@airs.com> writes:
> > I think your example demonstrates something slightly different.  I
> > think it demonstrates that Postgres must flush the XLOG entry to disk
> > before it flushes any buffer to disk which uses an XID which was just
> > allocated.
> 
> That would be an alternative solution, but it's considerably more
> complex to implement and I'm not convinced it is more efficient.
> 
> The above could result, worst case, in double the normal number of
> fsyncs --- each new transaction might need an fsync to dump its first
> few XLOG records (in addition to the fsync for its commit), if the
> shmem disk buffer traffic is not in your favor.  This worst case is
> not even difficult to produce: consider a series of standalone
> transactions that each touch more than -B pages (-B = # of buffers).
> 
> In contrast, syncing NEXTXID records will require exactly one extra
> fsync every few thousand transactions.  That seems quite acceptable
> to me, and better than an fsync load that we can't predict.  Perhaps
> the average case of fsync-on-buffer-flush would be better than that,
> or perhaps not, but the worst case is definitely far worse.

I described myself unclearly.  I was suggesting an addition to what
you are suggesting.  The worst case can not be worse.

If you are going to allocate a few thousand XIDs each time, then I
agree that my suggested addition is not worth it.  But how do you deal
with XID wraparound on an unstable system?

Ian


Re: WAL-based allocation of XIDs is insecure

From: Ian Lance Taylor
Tom Lane <tgl@sss.pgh.pa.us> writes:

> Ian Lance Taylor <ian@airs.com> writes:
> > I described myself unclearly.  I was suggesting an addition to what
> > you are suggesting.  The worst case can not be worse.
> 
> Then I didn't (and still don't) understand your suggestion.  Want to
> try again?

Your suggestion requires an obligatory fsync at an occasional
transaction start.

I was observing that in most cases, that fsync is not needed.  It can
be avoided with a bit of additional bookkeeping.

I was assuming, incorrectly, that you would not want to allocate that
many XIDs at once.  If you allocate 1000s of XIDs at once, the
obligatory fsync is not that bad, and my suggestion should be ignored.

> > If you are going to allocate a few thousand XIDs each time, then I
> > agree that my suggested addition is not worth it.  But how do you deal
> > with XID wraparound on an unstable system?
> 
> About the same as we do now: not very well.  But if your system is that
> unstable, XID wrap is the least of your worries, I think.
> 
> Up through 7.0, Postgres allocated XIDs a thousand at a time, and not
> only did the not-yet-used XIDs get lost in a crash, they'd get lost in
> a normal shutdown too.  What I propose will waste XIDs in a crash but
> not in a normal shutdown, so it's still an improvement over prior
> versions as far as XID consumption goes.

I find this somewhat troubling, since I like to think in terms of
long-running systems--like, decades.  But I guess it's OK (for me) if
it is fixed in the next couple of years.

Ian


Re: WAL-based allocation of XIDs is insecure

From: Tom Lane
Ian Lance Taylor <ian@airs.com> writes:
> Tom Lane <tgl@sss.pgh.pa.us> writes:
>> Up through 7.0, Postgres allocated XIDs a thousand at a time, and not
>> only did the not-yet-used XIDs get lost in a crash, they'd get lost in
>> a normal shutdown too.  What I propose will waste XIDs in a crash but
>> not in a normal shutdown, so it's still an improvement over prior
>> versions as far as XID consumption goes.

> I find this somewhat troubling, since I like to think in terms of
> long-running systems--like, decades.  But I guess it's OK (for me) if
> it is fixed in the next couple of years.

Agreed, we need to do something about the XID-wrap problem pretty soon.
But we're not solving it for 7.1, and in the meantime I don't think
these changes make much difference either way.
        regards, tom lane


Re: WAL-based allocation of XIDs is insecure

From: Tom Lane
Ian Lance Taylor <ian@airs.com> writes:
> I described myself unclearly.  I was suggesting an addition to what
> you are suggesting.  The worst case can not be worse.

Then I didn't (and still don't) understand your suggestion.  Want to
try again?

> If you are going to allocate a few thousand XIDs each time, then I
> agree that my suggested addition is not worth it.  But how do you deal
> with XID wraparound on an unstable system?

About the same as we do now: not very well.  But if your system is that
unstable, XID wrap is the least of your worries, I think.

Up through 7.0, Postgres allocated XIDs a thousand at a time, and not
only did the not-yet-used XIDs get lost in a crash, they'd get lost in
a normal shutdown too.  What I propose will waste XIDs in a crash but
not in a normal shutdown, so it's still an improvement over prior
versions as far as XID consumption goes.
        regards, tom lane


Re: WAL-based allocation of XIDs is insecure

From: "Vadim Mikheev"
> Consider the following scenario:
> 
> 1. A new transaction inserts a tuple.  The tuple is entered into its
> heap file with the new transaction's XID, and an associated WAL log
> entry is made.  Neither one of these is on disk yet --- the heap tuple
> is in a shmem disk buffer, and the WAL entry is in the shmem WAL buffer.
> 
> 2. Now do a lot of read-only operations, in the same or another backend.
> The WAL log stays where it is, but eventually the shmem disk buffer will
> get flushed to disk so that the buffer can be re-used for some other
> disk page.
> 
> 3. Assume we now crash.  Now, we have a heap tuple on disk with an XID
> that does not correspond to any XID visible in the on-disk WAL log.

Impossible (with fsync ON -:)).

Seems my description of the core WAL rule was bad, I'm sorry -:(
WAL = Write-*Ahead*-Log = data pages are written *only after* the log
records reflecting their modifications have been *flushed* to disk.
If a modification was not logged, then it is not in the data pages either.

No matter when bufmgr writes a data buffer (at commit time or to re-use
it), bufmgr first ensures that the buffer's modifications are logged.
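
I.e., roughly (just a sketch, not the exact bufmgr code; write_data_page
stands in for the real smgr call):

    /* Sketch of the write-ahead rule; write_data_page is a stand-in. */
    void
    FlushDirtyPage(Relation rel, BlockNumber blkno, Page page)
    {
        /* the page header holds the LSN of the last XLOG record
         * that modified this page */
        XLogRecPtr  recptr = PageGetLSN(page);

        /* the log must reach disk first ... */
        XLogFlush(recptr);

        /* ... and only then may the data page itself be written */
        write_data_page(rel, blkno, page);
    }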

Vadim




Re: WAL-based allocation of XIDs is insecure

From: "Vadim Mikheev"
> The point is to make the allocation of XIDs and OIDs work the same way.
> In particular, if we are forced to reset the XLOG using what's stored in
> pg_control, it would be good if what's stored in pg_control is a value
> beyond the last-used XID/OID, not a value less than the last-used ones.

If we're forced to reset the log (i.e. it's corrupted/lost) then we're
forced to dump, and only dump, the data, *because they are not consistent*.
So, I wouldn't worry about XID/OID/anything - we can only provide the user
with a way to restore the data ... *manually*.

If the user really cares about his data he must

U1. Buy good disks for WAL (the data may be on not-so-good disks).
U2. Set up a distributed DB if U1 is not enough.

To help the user with the above we must

D1. Avoid bugs in WAL.
D2. Implement WAL-based BAR (so that U1 will make sense).
D3. Implement a distributed DB.

There will be no D2 & D3 in 7.1, and who knows about D1.
So, manually restoring the data is the best we can do for 7.1.
And actually, "manual restoring" is what we had before anyway.

Vadim




Re: WAL-based allocation of XIDs is insecure

From: "Vadim Mikheev"
Zeugswetter Andreas SB <ZeugswetterA@wien.spardat.at> wrote:
>
> In short I do not think that the current implementation of
> "physical log" does what it was intended to do :-(

Hm, wasn't it meant to handle non-atomic disk writes, Andreas?
And what else could the "physical log" be used for?

The point was to copy the entire page content on its first modification
after a checkpoint, so that on recovery we first restore the page to a
consistent state, and all subsequent logged modifications can then be
applied without fear of page inconsistency.
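
I.e., on each page modification the decision is just (a sketch; the helper
names are made up):

    /* RedoRecPtr = position of the last checkpoint's redo record */
    if (PageGetLSN(page) < RedoRecPtr)
    {
        /*
         * First modification of this page since the last checkpoint:
         * log its entire content, so that redo can restore the page to a
         * known-consistent state before applying later records.
         * (log_whole_page is a made-up name.)
         */
        log_whole_page(page);
    }

    /* the modification itself is logged in any case (log_modification
     * is also a made-up name) */
    log_modification(page);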

Now, why should we log the page as it was *before* the modification?
We would log the modification anyway (yet another log record!) and
would apply it to the page, so the result would be the same as now,
when we log the page after the modification: a consistent *modified* page.

?

Vadim




Re: WAL-based allocation of XIDs is insecure

From: "Vadim Mikheev"
> On third thought --- we could still log the original page contents and
> the modification log record atomically, if what were logged in the xlog
> record were (essentially) the parameters to the operation being logged,
> not its results. That is, make the log entry before you start doing the
> mod work, not after. This might also simplify redo, since redo would be
> no different from the normal case. I'm not sure why Vadim didn't choose
> to do it that way; maybe there's some other fine point I'm missing.

There is one: indices over user-defined data types.  The catalog is not
available at the time of recovery, so, e.g., we can't know how to order
the keys of "non-standard" types.  (This is also why we have to recover
aborted index split ops at runtime, when the catalog is already available.)

Also, there is no reason why we should log the original page content and
the next modification record separately.

Vadim