Thread: Low hanging fruit in lazy-XID-assignment patch?

Low hanging fruit in lazy-XID-assignment patch?

From
Tom Lane
Date:
Simon was complaining a bit ago that we still have problems with
excessive contention for the ProcArrayLock, and that much of this stems
from the need for transaction exit to take that lock exclusively.
The lazy-XID patch, as committed, doesn't help that situation at all,
saying
       /*
        * Lock ProcArrayLock because that's what GetSnapshotData uses.
        * You might assume that we can skip this step if we had no
        * transaction id assigned, because the failure case outlined
        * in GetSnapshotData cannot happen in that case. This is true,
        * but we *still* need the lock guarantee that two concurrent
        * computations of the *oldest* xmin will get the same result.
        */

On reflection though this seems wrong: I believe that we could skip
taking the lock when exiting a transaction with no XID.  The actions
being guarded with the lock are
       MyProc->xid = InvalidTransactionId;
       MyProc->lxid = InvalidLocalTransactionId;
       MyProc->xmin = InvalidTransactionId;
       MyProc->inVacuum = false;		/* must be cleared with xid/xmin */

       /* Clear the subtransaction-XID cache too while holding the lock */
       MyProc->subxids.nxids = 0;
       MyProc->subxids.overflowed = false;

Clearing xid is obviously a no-op if we had no xid, and if we had no xid
we have no subxids either, so the last 2 lines are also no-ops.  I
cannot see any reason why we need a guard on clearing lxid, either.
inVacuum is only interesting if xmin is, since if there's no xid
assigned then it's effectively just a filter on whether other backends
pay attention to this one's xmin.  That leaves xmin, which AFAICS is
only interesting for the computations of GetOldestXmin() and
RecentGlobalXmin.  And I assert it doesn't matter if those numbers
advance asynchronously, so long as they never go backward.
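For concreteness, a rough sketch of what the exit path could look like with
that fast path added (this merely restates the block quoted above inside a
conditional; it is not committed code):

    /* Rough sketch; field handling as in the block quoted above */
    if (TransactionIdIsValid(MyProc->xid))
    {
        /*
         * We had an XID: removing it from the set of running transactions
         * must be serialized against concurrent GetSnapshotData calls.
         */
        LWLockAcquire(ProcArrayLock, LW_EXCLUSIVE);
        MyProc->xid = InvalidTransactionId;
        MyProc->lxid = InvalidLocalTransactionId;
        MyProc->xmin = InvalidTransactionId;
        MyProc->inVacuum = false;
        MyProc->subxids.nxids = 0;
        MyProc->subxids.overflowed = false;
        LWLockRelease(ProcArrayLock);
    }
    else
    {
        /*
         * No XID was assigned: clearing xid and the subxid cache would be
         * no-ops, and per the argument above, clearing lxid, xmin and
         * inVacuum without the lock can only let others' oldest-xmin
         * computations advance, never go backward.
         */
        MyProc->lxid = InvalidLocalTransactionId;
        MyProc->xmin = InvalidTransactionId;
        MyProc->inVacuum = false;
    }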

Comments?
        regards, tom lane


Re: Low hanging fruit in lazy-XID-assignment patch?

From
"Florian G. Pflug"
Date:
Tom Lane wrote:
> Simon was complaining a bit ago that we still have problems with
> excessive contention for the ProcArrayLock, and that much of this stems
> from the need for transaction exit to take that lock exclusively.
> The lazy-XID patch, as committed, doesn't help that situation at all,
> saying
> 
>         /*
>          * Lock ProcArrayLock because that's what GetSnapshotData uses.
>          * You might assume that we can skip this step if we had no
>          * transaction id assigned, because the failure case outlined
>          * in GetSnapshotData cannot happen in that case. This is true,
>          * but we *still* need the lock guarantee that two concurrent
>          * computations of the *oldest* xmin will get the same result.
>          */

I think the comment is correct in principle - if we remove the oldest
xmin without locking, then two concurrent OldestXmin calculations
will get two different results. The question is whether that has any
negative effects, though.

> That leaves xmin, which AFAICS is
> only interesting for the computations of GetOldestXmin() and
> RecentGlobalXmin.  And I assert it doesn't matter if those numbers
> advance asynchronously, so long as they never go backward.
Yes, the xmin is surely the only field that might need the locking.

It was this comment in GetSnapshotData that made me keep the locking
in the first place:
 * It is sufficient to get shared lock on ProcArrayLock, even if we are
 * computing a serializable snapshot and therefore will be setting
 * MyProc->xmin. This is because any two backends that have overlapping
 * shared holds on ProcArrayLock will certainly compute the same xmin
 * (since no xact, in particular not the oldest, can exit the set of
 * running transactions while we hold ProcArrayLock --- see further
 * discussion just below). So it doesn't matter whether another backend
 * concurrently doing GetSnapshotData or GetOldestXmin sees our xmin as
 * set or not; he'd compute the same xmin for himself either way.
 * (We are assuming here that xmin can be set and read atomically,
 * just like xid.)

But now that I read this again, I think that comment is just misleading -
especially the part "So it doesn't matter whether another backend concurrently
doing GetSnapshotData or GetOldestXmin sees our xmin as set or not; he'd compute
the same xmin for himself either way."
This sounds as if the Proc->xmin that *one* backend announces had
influence over the Proc->xmin that *another* backend might compute.
Which isn't true - it only influences the GlobalXmin that another backend might
compute.

So I believe you're right, and we can skip taking the lock in the no xid case -
I actually think with just a little bit of more work, we can go even further,
and get rid of the ReadNewTransactionId() call completely during snapshotting.

There are two things we must ensure when it comes to snapshots, commits and
xid assignment.

1) A transaction must either be not in progress, be in our snapshot, or
   have an xid >= xmax.
2) If transaction A sees B as committed, and B sees C as committed, then
   A must see C as committed.

ad 1): We guarantee that by storing the xid in the proc array before releasing
       the XidGenLock. Therefore, when we later obtain our xmax value,
       we can be sure that we see all xacts in the proc array that have an
       xid < xmax and are in progress.

ad 2): We guarantee that by serializing snapshotting against committing. Since
       we use ReadNewTransactionId() as the snapshot's xmax, this implies that
       we take the ProcArrayLock *before* reading our xmax value.

Now, ReadNewTransactionId() is actually larger than necessary as an xmax.
The minimal xmax that we can set is "largest committed xid"+1. We can
easily track that value during commit when we hold the ProcArrayLock
(if we have no xid, and therefore don't have to hold the lock, we also
don't need to update that value).

If we used this "LatestCommittedXid" as xmax, we'd still guarantee (2),
but without having to hold the XidGenLock during GetSnapshotData().
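A minimal sketch of the bookkeeping I have in mind (the variable name and its
exact home are hypothetical; this is not a patch):

    /* Hypothetical shared variable, protected by ProcArrayLock */
    TransactionId latestCommittedXid;

    /* At commit, while we already hold ProcArrayLock exclusively anyway: */
    if (TransactionIdFollows(MyProc->xid, latestCommittedXid))
        latestCommittedXid = MyProc->xid;

    /* In GetSnapshotData, under shared ProcArrayLock, replacing the
     * ReadNewTransactionId() call (and with it the XidGenLock traffic): */
    xmax = latestCommittedXid;
    TransactionIdAdvance(xmax);     /* xmax = "largest committed xid" + 1 */

Since commit already holds ProcArrayLock exclusively, the extra store costs
essentially nothing; the win is that snapshotting no longer needs to touch
XidGenLock at all.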

I wouldn't have dared to suggest this for 8.3, but since you came up with
locking improvements in the first place... ;-)

greetings, Florian Pflug



Re: Low hanging fruit in lazy-XID-assignment patch?

From
Tom Lane
Date:
"Florian G. Pflug" <fgp@phlo.org> writes:
> Tom Lane wrote:
>> The lazy-XID patch, as committed, doesn't help that situation at all,

> I think the comment is correct in principle - If we remove the oldest
> xmin without locking, then two concurrent OldestXmin calculations
> will get two different results. The question is if that has any
> negative effects, though.

My point is that there's a difference between what you compute (and
publish) as your own xmin, and what you compute as the RecentGlobalXmin.
I don't think there's any need for a guarantee that two concurrent
processes get the same estimate of RecentGlobalXmin, as long as
they do not get an estimate less than reality, ie, that someone cannot
later compute and publish a smaller xmin.

There are reasons why we want two concurrent GetSnapshotDatas to compute
the same xmin, but I think in the end it just comes down to being a
prerequisite for the above constraint --- without that you're not sure
that someone might not be about to publish an xmin less than what you
obtained as RecentGlobalXmin.

Dropping a live xid is a whole different issue.  There, you have the
problem that you need everyone to see a consistent commit order, which
is what the example in GetSnapshotData is about.  But I don't think that
xmin enters into that.  xmin is only about "is it safe to drop this
tuple because no one can see it?".  There, we don't have to be exactly
correct, we only have to err in the conservative direction.

> It was this comment in GetSnapshotData that made me keep the locking
> in the first place:

>   * It is sufficient to get shared lock on ProcArrayLock, even if we are
>   * computing a serializable snapshot and therefore will be setting
>   * MyProc->xmin. This is because any two backends that have overlapping
>   * shared holds on ProcArrayLock will certainly compute the same xmin

If I recall correctly, that text was written to justify downgrading
GetSnapshotData's hold on ProcArrayLock from exclusive to shared --- it
was merely arguing that the results wouldn't change if we did that.
I don't see an argument there that this condition is really *necessary*.
We do have to think carefully about whether GetOldestXmin can compute a
value that's too large, but right at the moment I see no problem there.

> So I believe you're right, and we can skip taking the lock in the no
> xid case - I actually think with just a little bit of more work, we
> can go even further, and get rid of the ReadNewTransactionId() call
> completely during snapshotting.

[ squint... ]  This goes a bit far for me.  In particular, I think this
will fail in the edge case when there are no live XIDs visible in
ProcArray.  You cannot go back and do ReadNewTransactionId afterward,
at least not without re-scanning the ProcArray a second time, which
makes it at best a questionable win.
        regards, tom lane


Re: Low hanging fruit in lazy-XID-assignment patch?

From
"Florian G. Pflug"
Date:
Tom Lane wrote:
> "Florian G. Pflug" <fgp@phlo.org> writes:
>> So I believe you're right, and we can skip taking the lock in the no
>> xid case - I actually think with just a little bit of more work, we
>> can go even further, and get rid of the ReadNewTransactionId() call
>> completely during snapshotting.
> 
> [ squint... ]  This goes a bit far for me.  In particular, I think this
> will fail in the edge case when there are no live XIDs visible in
> ProcArray.  You cannot go back and do ReadNewTransactionId afterward,
> at least not without re-scanning the ProcArray a second time, which
> makes it at best a questionable win.

Why would it? The idea was to remember the largest committed xid, and that
won't go away just because the proc array is rather empty xid-wise. Actually,
in that case the "largest committed xid"+1 will (nearly) be what
ReadNewTransactionId() returns. (Nearly because the transaction with the xid
ReadNewTransactionId()-1 might have aborted, so largestCommittedXid might be
a bit further behind ReadNewTransactionId().)

(That slight lagging of largestCommittedXid might cause some tuples not to
be VACUUMED though, so we might want to update largestCommittedXid for
ABORTS too, and probably rename it to largestNonRunningXid or whatever ;-) ).

I would go as far as saying that largestCommittedXid+1 is the natural choice
for xmax - after all, xmax is the cutoff point after which an xid *cannot*
be seen as committed, and largestCommittedXid+1 is the smallest xmax that
guarantees that we see xacts committed before the snapshot as committed.

The xmin computation won't change - apart from using some other initial value.

This would rid us of the rather complicated entanglement of XidGenLock and
the ProcArrayLock, lessen the lock contention, and reduce the average snapshot
size a bit.

greetings, Florian Pflug


Re: Low hanging fruit in lazy-XID-assignment patch?

From
Simon Riggs
Date:
On Fri, 2007-09-07 at 06:36 +0200, Florian G. Pflug wrote:
> Tom Lane wrote:
> > "Florian G. Pflug" <fgp@phlo.org> writes:
> >> So I believe you're right, and we can skip taking the lock in the no
> >> xid case 

Sounds good.

> - I actually think with just a little bit of more work, we
> >> can go even further, and get rid of the ReadNewTransactionId() call
> >> completely during snapshotting.
> > 
> > [ squint... ]  This goes a bit far for me.  In particular, I think this
> > will fail in the edge case when there are no live XIDs visible in
> > ProcArray.  You cannot go back and do ReadNewTransactionId afterward,
> > at least not without re-scanning the ProcArray a second time, which
> > makes it at best a questionable win.
> 
> Why would it?

I think the additional suggestion goes a bit too far. You may be right,
but I don't want to change the transaction system in advanced ways this
close to the next release. We may have difficulty spotting bugs in that
thinking during beta.

lazy XID assignment will reduce the number of times
GetNewTransactionId() is called and if we also avoid taking the
ProcArrayLock for CommitTransaction() then we will have significantly
reduced contention for mixed workloads (i.e. most real ones).

-- 
  Simon Riggs
  2ndQuadrant  http://www.2ndQuadrant.com



Re: Low hanging fruit in lazy-XID-assignment patch?

From
"Florian G. Pflug"
Date:
Simon Riggs wrote:
> On Fri, 2007-09-07 at 06:36 +0200, Florian G. Pflug wrote:
>> Tom Lane wrote:
>>> "Florian G. Pflug" <fgp@phlo.org> writes:
>>> - I actually think with just a little bit of more work, we
>>>> can go even further, and get rid of the ReadNewTransactionId() call
>>>> completely during snapshotting.
>>> [ squint... ]  This goes a bit far for me.  In particular, I think this
>>> will fail in the edge case when there are no live XIDs visible in
>>> ProcArray.  You cannot go back and do ReadNewTransactionId afterward,
>>> at least not without re-scanning the ProcArray a second time, which
>>> makes it at best a questionable win.
>> Why would it?
> 
> I think the additional suggestion goes a bit too far. You may be right,
> but I don't want to change the transaction system in advanced ways this
> close to the next release. We may have difficulty spotting bugs in that
> thinking during beta.

Ok, those were two clear votes against doing this, so I'll stop
arguing ;-). I do think that we should have another look at this when
8.4 opens, though.

greetings, Florian Pflug



Re: Low hanging fruit in lazy-XID-assignment patch?

From
Tom Lane
Date:
"Florian G. Pflug" <fgp@phlo.org> writes:
> Why would it? The idea was to remember the largest committed xid, and that
> won't go away just because the proc array is rather empty xid-wise.

I hadn't fully absorbed this idea last night, but now that I have, I'm
starting to think it's a good one.

> (That slight lagging of largestCommittedXid might cause some tuples not to
> be VACUUMED though, so we might want to update largestCommittedXid for
> ABORTS too, and probably rename it to largestNonRunningXid or whatever ;-) ).

LargestCompletedXid, perhaps?

> This would rid us of the rather complicated entanglement of XidGenLock and
> the ProcArrayLock, lessen the lock contention, and reduce the average snapshot
> size a bit.

In view of Simon's earlier comments at
http://archives.postgresql.org/pgsql-hackers/2007-07/msg00948.php
it seems like not having to take the XidGenLock during GetSnapshotData
ought to be a pretty serious win performance-wise, and it might open up
some other possibilities that are now foreclosed by the risk of deadlock
between the two locks.

I've spent the past hour or so trying to consolidate the comments in
GetSnapshotData and related places into a single chunk of text to be
added to src/backend/access/transam/README.  Attached is what I have so
far --- this incorporates the idea of not taking ProcArrayLock to exit
an XID-less transaction, but not yet Florian's idea.  I think it'd get
simpler if we changed to that, but am posting this for comments.
        regards, tom lane


Interlocking transaction begin, transaction end, and snapshots
--------------------------------------------------------------

We try hard to minimize the amount of overhead and lock contention involved
in the frequent activities of beginning/ending a transaction and taking a
snapshot.  Unfortunately, we must have some interlocking for this, because
it is critical that all backends agree on the commit order of transactions.
For example, suppose an UPDATE in xact A is blocked by xact B's prior
update of the same row, and xact B is doing commit while xact C gets a
snapshot.  Xact A can complete and commit as soon as B releases its locks.
If xact C's GetSnapshotData sees xact B as still running, then it had
better see xact A as still running as well, or it will be able to see two
tuple versions - one deleted by xact B and one inserted by xact A.  That
is, if A thinks it committed after B, C had better think the same.  We
enforce this by not allowing any transaction to exit the set of running
transactions while a snapshot is being taken.  (This rule is probably
stronger than necessary, but see the next problem.)  The implementation
of this is that GetSnapshotData takes the ProcArrayLock in shared mode
(thereby allowing multiple backends to take snapshots in parallel), but
xact.c must take the ProcArrayLock in exclusive mode while clearing
MyProc->xid at transaction end (either commit or abort).

Here is another variant of the risk scenario:

1. Xact A is running (in Read Committed mode).
2. Xact C's GetSnapshotData reads next transaction ID into xmax, then is
   swapped out before it can acquire ProcArrayLock.
3. Xact B gets new XID (>= C's xmax), makes changes and commits.
4. Xact A changes some row R changed by xact B and commits.
5. Xact C finishes getting its snapshot data.  It sees xact A as done,
   but sees xact B as still running (since B >= xmax).

Now C will see R changed by xact B and then xact A, *but* does not see
other changes made by xact B.  If C is supposed to be in Serializable mode,
this is wrong.

To prevent this it is necessary that GetSnapshotData acquire ProcArrayLock
before it calls ReadNewTransactionId.  This prevents xact A from exiting
the set of running transactions seen by xact C.  Therefore both A and B
will be seen as still running => no inconsistency.

In short, then, the rule is that no transactions may exit the set of
currently-running transactions between the time we fetch xmax and the time
we finish building our snapshot.  However, this restriction only applies
to transactions that have an XID --- read-only transactions can end without
acquiring ProcArrayLock, since they don't affect anyone else's snapshot.

We must also require GetNewTransactionId to store the new XID into the
shared ProcArray before releasing XidGenLock.  This ensures that when
GetSnapshotData calls ReadNewTransactionId (which also takes XidGenLock),
all active XIDs before the returned value of nextXid are already present in
the ProcArray and can't be missed by GetSnapshotData.  Unfortunately, we
can't have GetNewTransactionId take ProcArrayLock to do this, else it could
deadlock against GetSnapshotData.  Therefore, we simply let
GetNewTransactionId store into MyProc->xid without any lock.  We are
thereby relying on fetch/store of an XID to be atomic, else other backends
might see a partially-set XID.  (NOTE: for multiprocessors that need
explicit memory access fence instructions, this means that
acquiring/releasing XidGenLock is just as necessary as acquiring/releasing
ProcArrayLock for GetSnapshotData to ensure it sees up-to-date xid fields.)
This also means that readers of the ProcArray xid fields must be careful to
fetch a value only once, rather than assume they can read it multiple times
and get the same answer each time.
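As a simplified illustration (not the verbatim source; the surrounding loop
and variable names are abbreviated), a reader should do something like:

    /*
     * Copy the xid once into a local variable and use only the local copy
     * for all subsequent tests.
     */
    volatile PGPROC *proc = arrayP->procs[index];
    TransactionId xid = proc->xid;      /* single fetch, assumed atomic */

    if (TransactionIdIsValid(xid) &&
        TransactionIdPrecedes(xid, xmax))
        snapshot->xip[count++] = xid;   /* never re-read proc->xid here */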

Another important activity that uses the shared ProcArray is GetOldestXmin,
which must determine a lower bound for the oldest xmin of any active MVCC
snapshot, system-wide.  Each individual backend advertises the smallest
xmin of its own snapshots in MyProc->xmin, or zero if it currently has no
live snapshots (eg, if it's between transactions or hasn't yet set a
snapshot for a new transaction).  GetOldestXmin takes the MIN() of the
valid xmin fields.  It does this with only shared lock on ProcArrayLock,
which means there is a potential race condition against other backends
doing GetSnapshotData concurrently: we must be certain that a concurrent
backend that is about to set its xmin does not compute an xmin less than
what GetOldestXmin returns.  We ensure that by including all the active
XIDs into the MIN() calculation, along with the valid xmins.  The rule that
transactions can't exit without taking exclusive ProcArrayLock ensures that
concurrent holders of shared ProcArrayLock will compute the same minimum of
currently-active XIDs: no xact, in particular not the oldest, can exit
while we hold shared ProcArrayLock.  So GetOldestXmin's view of the minimum
active XID will be the same as that of any concurrent GetSnapshotData, and
so it can't produce an overestimate.  If there is no active transaction at
all, GetOldestXmin returns the result of ReadNewTransactionId.  Note that
two concurrent executions of GetOldestXmin might not see the same result
from ReadNewTransactionId --- but if there is a difference, the intervening
execution(s) of GetNewTransactionId must have stored their XIDs into the
ProcArray, so the later execution of GetOldestXmin will see them and
compute the same global xmin anyway.
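Schematically (a simplified paraphrase that omits the per-database and
in-vacuum filters of the real function), the computation described here is:

    TransactionId result = ReadNewTransactionId(); /* fallback if array empty */
    int         index;

    LWLockAcquire(ProcArrayLock, LW_SHARED);
    for (index = 0; index < arrayP->numProcs; index++)
    {
        volatile PGPROC *proc = arrayP->procs[index];
        TransactionId xid = proc->xid;      /* fetch each field only once */
        TransactionId xmin = proc->xmin;

        if (TransactionIdIsValid(xid) && TransactionIdPrecedes(xid, result))
            result = xid;                   /* active XIDs join the MIN() */
        if (TransactionIdIsValid(xmin) && TransactionIdPrecedes(xmin, result))
            result = xmin;                  /* as do advertised xmins */
    }
    LWLockRelease(ProcArrayLock);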

GetSnapshotData also performs an oldest-xmin calculation (which had better
match GetOldestXmin's) and stores that into RecentGlobalXmin, which is used
for some tuple age cutoff checks where a fresh call of GetOldestXmin seems
too expensive.  Note that while it is certain that two concurrent
executions of GetSnapshotData will compute the same xmin for their own
snapshots, as argued above, it is not certain that they will arrive at the
same estimate of RecentGlobalXmin.  This is because we allow XID-less
transactions to clear their MyProc->xmin asynchronously (without taking
ProcArrayLock), so one execution might see what had been the oldest xmin,
and another not.  This is OK since RecentGlobalXmin need only be a valid
lower bound.  As noted above, we are already assuming that fetch/store
of the xid fields is atomic, so assuming it for xmin as well is no extra
risk.


Re: Low hanging fruit in lazy-XID-assignment patch?

From
"Florian G. Pflug"
Date:
Tom Lane wrote:
> I've spent the past hour or so trying to consolidate the comments in
> GetSnapshotData and related places into a single chunk of text to be
> added to src/backend/access/transam/README.  Attached is what I have so
> far --- this incorporates the idea of not taking ProcArrayLock to exit
> an XID-less transaction, but not yet Florian's idea.  I think it'd get
> simpler if we changed to that, but am posting this for comments.

> Interlocking transaction begin, transaction end, and snapshots
> --------------------------------------------------------------
> 
> We try hard to minimize the amount of overhead and lock contention involved
> in the frequent activities of beginning/ending a transaction and taking a
> snapshot.  Unfortunately, we must have some interlocking for this, because
> it is critical that all backends agree on the commit order of transactions.
This is actually still a slightly stronger requirement than what we really
need, I think.

To simplify the following discussion, "A -> B" shall mean that transaction A saw
B as committed. Conversely, "A !> B" shall mean that A treats B as in-progress.
If A was in read-committed mode, the visibility refers to the latest snapshot
that A used.

Now assume A and B commit at nearly the same time, and for two other
transactions C and D the following holds:
C -> A, C !> B but
D !> A, D -> B.

This would violate the requirement that the commit order is globally
agreed upon, yet as long as both A !> B and B !> A holds, there is no
conflict. (Note that if A and B are serializable, A !> B && B !> A
implies that A and B cannot have touched the same record and have
both committed - one would have been aborted due to a
SerializationError).

I must admit, though, that this is a quite academic case, since the
prerequisite A !> B and B !> A is something we have no control over
for read-committed transactions - who knows when they might have taken
their last snapshot...

Still, I wanted to mention this because I believe that the minimal
requirement that we actually *need* to enforce is
A -> B and B -> C imply A -> C.  (T1)

The actual implementation will probably always have to enforce
something slightly stronger, but it's still nice to know the
minimal guarantee needed to be able to judge correctness.

> For example, suppose an UPDATE in xact A is blocked by xact B's prior
> update of the same row, and xact B is doing commit while xact C gets a
> snapshot.  Xact A can complete and commit as soon as B releases its locks.
> If xact C's GetSnapshotData sees xact B as still running, then it had
> better see xact A as still running as well, or it will be able to see two
> tuple versions - one deleted by xact B and one inserted by xact A. 
In my notation this becomes: A -> B and C !> B implies C !> A.

This then follows from (T1) - assume that A -> B, C !> B, but C -> A;
then by (T1), C -> B follows from C -> A and A -> B, which contradicts
C !> B.

> We
> enforce this by not allowing any transaction to exit the set of running
> transactions while a snapshot is being taken.  (This rule is probably
> stronger than necessary, but see the next problem.)  The implementation
> of this is that GetSnapshotData takes the ProcArrayLock in shared mode
> (thereby allowing multiple backends to take snapshots in parallel), but
> xact.c must take the ProcArrayLock in exclusive mode while clearing
> MyProc->xid at transaction end (either commit or abort).
Agreed. We *do* enforce a strict global ordering of commits and snapshots.
This then guarantees (T1), because if A -> B and B -> C, then A *must*
have taken its snapshot after B committed, and B in turn *must* have
taken its snapshot after C committed, so surely A -> C will hold too.

> Here is another variant of the risk scenario:
> 
> 1. Xact A is running (in Read Committed mode).
> 2. Xact C's GetSnapshotData reads next transaction ID into xmax, then is
>    swapped out before it can acquire ProcArrayLock.
> 3. Xact B gets new XID (>= C's xmax), makes changes and commits.
> 4. Xact A changes some row R changed by xact B and commits.
> 5. Xact C finishes getting its snapshot data.  It sees xact A as done,
>    but sees xact B as still running (since B >= xmax).
> 
> Now C will see R changed by xact B and then xact A, *but* does not see
> other changes made by xact B.  If C is supposed to be in Serializable mode,
> this is wrong.
I never really grasped why we need to assume serializable here - this seems
wrong if C is read-committed too. Seeing only half of a transaction's changes
can never be right, can it?

> To prevent this it is necessary that GetSnapshotData acquire ProcArrayLock
> before it calls ReadNewTransactionId.  This prevents xact A from exiting
> the set of running transactions seen by xact C.  Therefore both A and B
> will be seen as still running => no inconsistency.
Another point of view is that determining the xmax of a snapshot really *is*
part of taking the snapshot. Since we have already established that we need to
serialize snapshotting and committing, it follows that we must not allow
commits to happen between the time we take the xmax and the time we
complete the snapshot.

> In short, then, the rule is that no transactions may exit the set of
> currently-running transactions between the time we fetch xmax and the time
> we finish building our snapshot.  However, this restriction only applies
> to transactions that have an XID --- read-only transactions can end without
> acquiring ProcArrayLock, since they don't affect anyone else's snapshot.
Yes, they completely fall out of the picture, because nobody ever cares if
they even committed.

> We must also require GetNewTransactionId to store the new XID into the
> shared ProcArray before releasing XidGenLock.  This ensures that when
> GetSnapshotData calls ReadNewTransactionId (which also takes XidGenLock),
> all active XIDs before the returned value of nextXid are already present in
> the ProcArray and can't be missed by GetSnapshotData.  Unfortunately, we
> can't have GetNewTransactionId take ProcArrayLock to do this, else it could
> deadlock against GetSnapshotData.  Therefore, we simply let
> GetNewTransactionId store into MyProc->xid without any lock.  We are
> thereby relying on fetch/store of an XID to be atomic, else other backends
> might see a partially-set XID.  (NOTE: for multiprocessors that need
> explicit memory access fence instructions, this means that
> acquiring/releasing XidGenLock is just as necessary as acquiring/releasing
> ProcArrayLock for GetSnapshotData to ensure it sees up-to-date xid fields.)
> This also means that readers of the ProcArray xid fields must be careful to
> fetch a value only once, rather than assume they can read it multiple times
> and get the same answer each time.
In other words: It won't do for a transaction to acquire an xid, then go into
hiding, and later suddenly reappear and start storing its rather old xid
somewhere.
Or, again, differently put: When we take a snapshot we must be sure that any
xid < our xmax is either not in use anymore, or shows up in the proc array.

By storing the xid in the proc array while holding the XidGenLock, we
ensure that there is always at most one xid that is assigned but not yet
showing up in the proc array. This xid is always equal to either nextXid or
nextXid-1. It is the nextXid-1 case that forces us to take the XidGenLock -
it ensures that *no* xid is assigned but not showing up in the proc array.
If we used any older value for the xmax, the nextXid-1 case wouldn't hurt us -
the "hidden" xid would still be >= our xmax.

The last paragraph is in essence what guarantees the correctness of using
LargestCompletedXid instead of ReadNewTransactionId() as the xmax.

> Another important activity that uses the shared ProcArray is GetOldestXmin,
> which must determine a lower bound for the oldest xmin of any active MVCC
> snapshot, system-wide.  Each individual backend advertises the smallest
> xmin of its own snapshots in MyProc->xmin, or zero if it currently has no
> live snapshots (eg, if it's between transactions or hasn't yet set a
> snapshot for a new transaction).  GetOldestXmin takes the MIN() of the
> valid xmin fields.  It does this with only shared lock on ProcArrayLock,
> which means there is a potential race condition against other backends
> doing GetSnapshotData concurrently: we must be certain that a concurrent
> backend that is about to set its xmin does not compute an xmin less than
> what GetOldestXmin returns.  We ensure that by including all the active
> XIDs into the MIN() calculation, along with the valid xmins.  The rule that
> transactions can't exit without taking exclusive ProcArrayLock ensures that
> concurrent holders of shared ProcArrayLock will compute the same minimum of
> currently-active XIDs: no xact, in particular not the oldest, can exit
> while we hold shared ProcArrayLock.  So GetOldestXmin's view of the minimum
> active XID will be the same as that of any concurrent GetSnapshotData, and
> so it can't produce an overestimate.  If there is no active transaction at
> all, GetOldestXmin returns the result of ReadNewTransactionId.  Note that
> two concurrent executions of GetOldestXmin might not see the same result
> from ReadNewTransactionId --- but if there is a difference, the intervening
> execution(s) of GetNewTransactionId must have stored their XIDs into the
> ProcArray, so the later execution of GetOldestXmin will see them and
> compute the same global xmin anyway.
Agreed. A shorter reasoning would maybe be:

We need to ensure that no snapshot that is still in use has an xmin smaller
than what GetOldestXmin() returns. We have to check two cases: snapshot
creation and transaction exit.

During snapshot creation, we compute the xmin as MIN() of all xids in our
snapshot. A concurrent GetOldestXmin() or GlobalXmin computation cannot derive a
larger value, since it takes the published xids into account, as well as the
published xmins. This, together with the rule that no transaction can remove
its xid from the set of running transactions while we either take a snapshot or
do GetOldestXmin(), guarantees correctness.

During transaction exit, we don't need to take any lock to clear our xmin.
The xmin won't influence the xmin calculations of other transactions, only
GlobalXmin and GetOldestXmin() computations. Since we won't use our snapshot
anymore, we don't care if they take our xmin into account for that or not.

> GetSnapshotData also performs an oldest-xmin calculation (which had better
> match GetOldestXmin's) and stores that into RecentGlobalXmin, which is used
> for some tuple age cutoff checks where a fresh call of GetOldestXmin seems
> too expensive.  Note that while it is certain that two concurrent
> executions of GetSnapshotData will compute the same xmin for their own
> snapshots, as argued above, it is not certain that they will arrive at the
> same estimate of RecentGlobalXmin.  This is because we allow XID-less
> transactions to clear their MyProc->xmin asynchronously (without taking
> ProcArrayLock), so one execution might see what had been the oldest xmin,
> and another not.  This is OK since RecentGlobalXmin need only be a valid
> lower bound.  As noted above, we are already assuming that fetch/store
> of the xid fields is atomic, so assuming it for xmin as well is no extra
> risk.
Agreed - see above for an equivalent description.

I had a rather hard time understanding these things initially - at least
for me, the enlightenment came when I realized that most of the locking
in there actually ensures that (T1) holds - or in other words, that
the relation "A sees B as committed" *has* to be a transitive one.

So I'd like to see this reasoning in that README - though maybe I'm the
only one to whom this is the clearest wording...

greetings, Florian Pflug



Re: Low hanging fruit in lazy-XID-assignment patch?

From
Tom Lane
Date:
"Florian G. Pflug" <fgp@phlo.org> writes:
> Tom Lane wrote:
>> Here is another variant of the risk scenario:
>> 
>> 1. Xact A is running (in Read Committed mode).
>> 2. Xact C's GetSnapshotData reads next transaction ID into xmax, then is
>> swapped out before it can acquire ProcArrayLock.
>> 3. Xact B gets new XID (>= C's xmax), makes changes and commits.
>> 4. Xact A changes some row R changed by xact B and commits.
>> 5. Xact C finishes getting its snapshot data.  It sees xact A as done,
>> but sees xact B as still running (since B >= xmax).
>> 
>> Now C will see R changed by xact B and then xact A, *but* does not see
>> other changes made by xact B.  If C is supposed to be in Serializable mode,
>> this is wrong.

> I never really grasped why we need to assume serializable here - this seems
> wrong if C is read-committed too. Seeing only half of a transaction's changes
> can never be right, can it?

You'd be surprised ;-) ... Read Committed is not totally consistent,
even considering a single statement using a single snapshot, because of
the rules about UPDATEs starting from the latest committed version.
But let's not get into that today.

Anyway, one thing I realized while making this text is that the example
cases mentioned in CommitTransaction and GetSnapshotData really were
basically the same case.  The difference is that in this case, the
failure to see B as committed is closely related to the way that xmax is
computed.  We could get rid of this example if we switched over to your
LatestCompletedXid proposal.

> I had a rather hard time understanding these things initially - at least
> for me, the enlightenment came when I realized that most of the locking
> in there actually ensures that (T1) holds - or in other words, that
> the relation "A sees B as committed" *has* to be a transitive one.
> So I'd like to see this reasoning in that README - though maybe I'm the
> only one to whom this is the clearest wording...

I'll put some of this into the README.
        regards, tom lane


Re: Low hanging fruit in lazy-XID-assignment patch?

From
Tom Lane
Date:
Here's some revised text for the README file, based on using Florian's
idea of a global latestCompletedXid variable.  As I worked through it
I realized that in this design, XidGenLock gates entry of new XIDs into
the ProcArray while ProcArrayLock gates their removal.  Which is an
interesting sort of symmetry property.  It also turns out that the
reason we need to gate entry with XidGenLock is to keep from breaking
GetOldestXmin, rather than to ensure correctness of snapshots per se.

(Note: I refer in the text to ProcArrayEndTransaction(), which is a
function I'm thinking of putting into procarray.c to replace the current
inline-in-xact.c code that clears xid and related fields.)

Comments?
        regards, tom lane


Interlocking transaction begin, transaction end, and snapshots
--------------------------------------------------------------

We try hard to minimize the amount of overhead and lock contention involved
in the frequent activities of beginning/ending a transaction and taking a
snapshot.  Unfortunately, we must have some interlocking for this, because
we must ensure consistency about the commit order of transactions.
For example, suppose an UPDATE in xact A is blocked by xact B's prior
update of the same row, and xact B is doing commit while xact C gets a
snapshot.  Xact A can complete and commit as soon as B releases its locks.
If xact C's GetSnapshotData sees xact B as still running, then it had
better see xact A as still running as well, or it will be able to see two
tuple versions - one deleted by xact B and one inserted by xact A.  Another
reason why this would be bad is that C would see (in the row inserted by A)
earlier changes by B, and it would be inconsistent for C not to see any
of B's changes elsewhere in the database.

Formally, the correctness requirement is "if a snapshot A considers
transaction X as committed, and any of transaction X's snapshots considered
transaction Y as committed, then snapshot A must consider transaction Y as
committed".

What we actually enforce is strict serialization of commits and rollbacks
with snapshot-taking: we do not allow any transaction to exit the set of
running transactions while a snapshot is being taken.  (This rule is
stronger than necessary for consistency, but is relatively simple to
enforce, and it assists with some other issues as explained below.)  The
implementation of this is that GetSnapshotData takes the ProcArrayLock in
shared mode (so that multiple backends can take snapshots in parallel),
but ProcArrayEndTransaction must take the ProcArrayLock in exclusive mode
while clearing MyProc->xid at transaction end (either commit or abort).

ProcArrayEndTransaction also holds the lock while advancing the shared
latestCompletedXid variable.  This allows GetSnapshotData to use
latestCompletedXid + 1 as xmax for its snapshot: there can be no
transaction >= this xid value that the snapshot needs to consider as
completed.

In short, then, the rule is that no transaction may exit the set of
currently-running transactions between the time we fetch latestCompletedXid
and the time we finish building our snapshot.  However, this restriction
only applies to transactions that have an XID --- read-only transactions
can end without acquiring ProcArrayLock, since they don't affect anyone
else's snapshot nor latestCompletedXid.

Transaction start, per se, doesn't have any interlocking with these
considerations, since we no longer assign an XID immediately at transaction
start.  But when we do decide to allocate an XID, GetNewTransactionId must
store the new XID into the shared ProcArray before releasing XidGenLock.
This ensures that all top-level XIDs <= latestCompletedXid are either
present in the ProcArray, or not running anymore.  (This guarantee doesn't
apply to subtransaction XIDs, because of the possibility that there's not
room for them in the subxid array; instead we guarantee that they are
present or the overflow flag is set.)  If a backend released XidGenLock
before storing its XID into MyProc, then it would be possible for another
backend to allocate and commit a later XID, causing latestCompletedXid to
pass the first backend's XID, before that value became visible in the
ProcArray.  That would break GetOldestXmin, as discussed below.
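Schematically (a simplified paraphrase; CLOG/SUBTRANS extension and error
handling are omitted), the required ordering is:

    TransactionId xid;

    LWLockAcquire(XidGenLock, LW_EXCLUSIVE);

    xid = ShmemVariableCache->nextXid;
    TransactionIdAdvance(ShmemVariableCache->nextXid);

    /* ... CLOG/SUBTRANS extension etc. omitted ... */

    MyProc->xid = xid;              /* publish in the ProcArray ... */

    LWLockRelease(XidGenLock);      /* ... strictly before releasing XidGenLock */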

We allow GetNewTransactionId to store the XID into MyProc->xid (or the
subxid array) without taking ProcArrayLock.  This was once necessary to
avoid deadlock; while that is no longer the case, it's still beneficial for
performance.  We are thereby relying on fetch/store of an XID to be atomic,
else other backends might see a partially-set XID.  This also means that
readers of the ProcArray xid fields must be careful to fetch a value only
once, rather than assume they can read it multiple times and get the same
answer each time.

Another important activity that uses the shared ProcArray is GetOldestXmin,
which must determine a lower bound for the oldest xmin of any active MVCC
snapshot, system-wide.  Each individual backend advertises the smallest
xmin of its own snapshots in MyProc->xmin, or zero if it currently has no
live snapshots (eg, if it's between transactions or hasn't yet set a
snapshot for a new transaction).  GetOldestXmin takes the MIN() of the
valid xmin fields.  It does this with only shared lock on ProcArrayLock,
which means there is a potential race condition against other backends
doing GetSnapshotData concurrently: we must be certain that a concurrent
backend that is about to set its xmin does not compute an xmin less than
what GetOldestXmin returns.  We ensure that by including all the active
XIDs into the MIN() calculation, along with the valid xmins.  The rule that
transactions can't exit without taking exclusive ProcArrayLock ensures that
concurrent holders of shared ProcArrayLock will compute the same minimum of
currently-active XIDs: no xact, in particular not the oldest, can exit
while we hold shared ProcArrayLock.  So GetOldestXmin's view of the minimum
active XID will be the same as that of any concurrent GetSnapshotData, and
so it can't produce an overestimate.  If there is no active transaction at
all, GetOldestXmin returns latestCompletedXid + 1, which is a lower bound
for the xmin that might be computed by concurrent or later GetSnapshotData
calls.  (We know that no XID less than this could be about to appear in
the ProcArray, because of the XidGenLock interlock discussed above.)

GetSnapshotData also performs an oldest-xmin calculation (which had better
match GetOldestXmin's) and stores that into RecentGlobalXmin, which is used
for some tuple age cutoff checks where a fresh call of GetOldestXmin seems
too expensive.  Note that while it is certain that two concurrent
executions of GetSnapshotData will compute the same xmin for their own
snapshots, as argued above, it is not certain that they will arrive at the
same estimate of RecentGlobalXmin.  This is because we allow XID-less
transactions to clear their MyProc->xmin asynchronously (without taking
ProcArrayLock), so one execution might see what had been the oldest xmin,
and another not.  This is OK since RecentGlobalXmin need only be a valid
lower bound.  As noted above, we are already assuming that fetch/store
of the xid fields is atomic, so assuming it for xmin as well is no extra
risk.


Re: Low hanging fruit in lazy-XID-assignment patch?

From
"Florian G. Pflug"
Date:
Tom Lane wrote:
> Here's some revised text for the README file, based on using Florian's idea 
> of a global latestCompletedXid variable.  As I worked through it I realized 
> that in this design, XidGenLock gates entry of new XIDs into the ProcArray 
> while ProcArrayLock gates their removal.  Which is an interesting sort of 
> symmetry property.  It also turns out that the reason we need to gate entry 
> with XidGenLock is to keep from breaking GetOldestXmin, rather than to ensure
> correctness of snapshots per se.
I believe it would break both, no? If an xid <= latestCompletedXid is
not included in the snapshot, but later used for updates, the snapshot
will see those changes as committed when they really are not.

But other than that, it really sounds fine. It certainly explains things much
better than the comments in the existing code.

I noticed two rather cosmetic issues
.) latestCompletedXid sounds as if it might refer to the *last* completed xid,
but it actually refers to the largest / highest completed xid. So maybe we
should call it highestCompletedXid or largestCompletedXid.

.) Since you mention that we assume reading and writing int4s are atomic
operations, maybe we should mention that for safety's sake we mark the
corresponding pointers with volatile?
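For instance (just illustrating the convention, not quoting the patch):

    /*
     * Accessing the field through a volatile-qualified pointer keeps the
     * compiler from caching the value in a register or eliding/reordering
     * the access; atomicity still relies on the field being a single
     * aligned word.
     */
    volatile PGPROC *myproc = MyProc;

    myproc->xmin = InvalidTransactionId;    /* single word-sized store */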

greetings, Florian Pflug




Re: Low hanging fruit in lazy-XID-assignment patch?

From
Tom Lane
Date:
"Florian G. Pflug" <fgp@phlo.org> writes:
> I noticed two rather cosmetic issues
> .) latestCompletedXid sounds as it might refer to the *last* completed xid,
> but it actually refers to the largest / highest completed xid. So maybe we
> should call it highestCompletedXid or largestCompletedXid.

Actually that was an intentional choice: because of the wraparound
behavior of XIDs, the "latest" value is not necessarily numerically
largest.  I'm not wedded to it though.

> .) Since you mention that we assume reading and writing int4s are atomic
> operations, maybe we should mention that for safety's sake we mark the
> corresponding pointers with volatile?

Couldn't hurt.

I have a draft patch that I'll post shortly.
        regards, tom lane


Re: Low hanging fruit in lazy-XID-assignment patch?

From
Alvaro Herrera
Date:
As a fallout of this work that I haven't seen made explicit, a session
opening a transaction and then sitting around doing nothing will not
cause as many problems as it used to -- for example it won't cause
VACUUM to be unable to clean up dead rows.  Is this correct?

Nowadays this is no longer much of a problem, but it used to be, back
when drivers were more broken than they are now (when they had the
commit method send "COMMIT; BEGIN"); however, it still seems to me like
a big step forwards.

-- 
Alvaro Herrera                  http://www.amazon.com/gp/registry/5ZYLFMCVHXC
"Siempre hay que alimentar a los dioses, aunque la tierra esté seca" (Orual)


Re: Low hanging fruit in lazy-XID-assignment patch?

From
Tom Lane
Date:
Alvaro Herrera <alvherre@commandprompt.com> writes:
> As a fallout of this work that I haven't seen made explicit, a session
> opening a transaction and then sitting around doing nothing will not
> cause as many problems as it used to -- for example it won't cause
> VACUUM to be unable to clean up dead rows.  Is this correct?

Yeah, if you just issue BEGIN and then sit, you won't have acquired
either an xid or an xmin, so you don't create a VACUUM problem anymore.

If you issue BEGIN, then SELECT, then sit, you'll be publishing an xmin
but not an xid, so at that point you become a problem for VACUUM.
However, internally you don't have any live snapshots (if you're in READ
COMMITTED mode), so eventually we could have you stop publishing an xmin
too.  That's something for 8.4 though.
        regards, tom lane


Re: Low hanging fruit in lazy-XID-assignment patch?

From
Gregory Stark
Date:
"Tom Lane" <tgl@sss.pgh.pa.us> writes:

> If you issue BEGIN, then SELECT, then sit, you'll be publishing an xmin
> but not an xid, so at that point you become a problem for VACUUM.
> However, internally you don't have any live snapshots (if you're in READ
> COMMITTED mode), so eventually we could have you stop publishing an xmin
> too.  That's something for 8.4 though.

Aren't there some things that depend on the idea that even READ COMMITTED
transactions still have a serializable snapshot lying around for them to use?

This is actually one of the rough spots in HOT that I'm afraid you'll have
problems with when you review it. If there were any HOT chains which are
broken according to a new index definition then any transaction considering
using that index needs to know whether there's any possibility the plan will
be used with a snapshot which can see those old tuples. It currently does this
by checking if the transaction which created the index is in its serializable
snapshot.

It seems weird to have the planner using snapshots in any way, but I don't see
any direct problems which arise. Essentially it's using the serializable
snapshot as a proxy for better live snapshot bookkeeping. If there's no
serializable snapshot then any future snapshot taken will surely not be able
to see the old tuples, and if there is then it should be fine (though not as
good as possible if we did more bookkeeping) to use it to determine whether
the old tuples could possibly be visible when the plan is executed.

I'm a bit afraid the plan will stay cached longer than would be ideal. If the
plan doesn't use the new index then ideally we would want to invalidate it
from the cache at the end of this transaction I suppose.

-- 
  Gregory Stark
  EnterpriseDB          http://www.enterprisedb.com


Re: Low hanging fruit in lazy-XID-assignment patch?

From
Tom Lane
Date:
Gregory Stark <stark@enterprisedb.com> writes:
> "Tom Lane" <tgl@sss.pgh.pa.us> writes:
>> If you issue BEGIN, then SELECT, then sit, you'll be publishing an xmin
>> but not an xid, so at that point you become a problem for VACUUM.
>> However, internally you don't have any live snapshots (if you're in READ
>> COMMITTED mode), so eventually we could have you stop publishing an xmin
>> too.  That's something for 8.4 though.

> Aren't there some things that depend on the idea that even READ COMMITTED
> transactions still have a serializable snapshot lying around for them to use?

I don't see why.  A READ COMMITTED transaction that is between
statements does still have a serializable snapshot sitting around, but
it won't ever reference it again.  (AFAIR you can't switch an active
transaction into serializable mode.)  So with a bit of work on snapshot
management we could recognize that we have no live snapshots and clear
the published xmin.  This has been discussed before, eg this thread:
http://archives.postgresql.org/pgsql-patches/2007-03/msg00381.php

> This is actually one of the rough spots in HOT that I'm afraid you'll have
> problems with when you review it. If there were any HOT chains which are
> broken according to a new index definition then any transaction considering
> using that index needs to know whether there's any possibility the plan will
> be used with a snapshot which can see those old tuples. It currently does this
> by checking if the transaction which created the index is in its serializable
> snapshot.

This seems to be a proxy for "what's the oldest live snapshot in this
backend", which is exactly what we'd have to track to adjust xmin
anyway.
        regards, tom lane


Re: Low hanging fruit in lazy-XID-assignment patch?

From
Josh Berkus
Date:
Greg,

> Aren't there some things that depend on the idea that even READ COMMITTED
> transactions still have a serializable snapshot lying around for them to
> use?

No, that would be REPEATABLE READ.  That's one of the areas where we need to
test HOT; do RR and SERIALIZABLE still work correctly with HOT?

-- 
Josh Berkus
PostgreSQL @ Sun
San Francisco


Re: Low hanging fruit in lazy-XID-assignment patch?

From
Gregory Stark
Date:
"Josh Berkus" <josh@agliodbs.com> writes:

> No, that would be REPEATABLE READ.  That's one of the areas where we need to 
> test HOT; does RR and SERIALIZABLE still work correctly with HOT?

I'm a little confused by your question. REPEATABLE READ is an isolation level
we don't directly support in Postgres. If you set the isolation level to
REPEATABLE READ you get SERIALIZABLE which is a higher level.

HOT works fine with SERIALIZABLE; it uses the same criteria as vacuum for
determining when a tuple is dead, namely comparing against RecentGlobalXmin.

The check for whether a transaction can see any old tuples in "broken" chains
uses the serializable snapshot as a conservative proxy for the oldest snapshot
which might be in use. That will work for both serializable and read
committed.

-- 
  Gregory Stark
  EnterpriseDB          http://www.enterprisedb.com


Re: Low hanging fruit in lazy-XID-assignment patch?

From
Bruce Momjian
Date:
Added to TODO:

* Expire published xmin for read-only and idle transactions
 http://archives.postgresql.org/pgsql-hackers/2007-09/msg00343.php


---------------------------------------------------------------------------

Tom Lane wrote:
> Alvaro Herrera <alvherre@commandprompt.com> writes:
> > As a fallout of this work that I haven't seen made explicit, a session
> > opening a transaction and then sitting around doing nothing will not
> > cause as many problems as it used to -- for example it won't cause
> > VACUUM to be unable to clean up dead rows.  Is this correct?
> 
> Yeah, if you just issue BEGIN and then sit, you won't have acquired
> either an xid or an xmin, so you don't create a VACUUM problem anymore.
> 
> If you issue BEGIN, then SELECT, then sit, you'll be publishing an xmin
> but not an xid, so at that point you become a problem for VACUUM.
> However, internally you don't have any live snapshots (if you're in READ
> COMMITTED mode), so eventually we could have you stop publishing an xmin
> too.  That's something for 8.4 though.
> 
>             regards, tom lane
> 

-- 
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://postgres.enterprisedb.com

  + If your life is a hard drive, Christ can be your backup. +