Thread: nested transactions

nested transactions

From
Bruce Momjian
Date:
I am going to work on nested transactions for 7.4.

My goal is to first implement nested transactions:
BEGIN;SELECT ...BEGIN;UPDATE;COMMIT;DELETE;COMMIT;

and later savepoints (Oracle):

BEGIN;SELECT ...SAVEPOINT t1;UPDATE;SAVEPOINT t2;DELETE;ROLLBACK TO SAVEPOINT t2;COMMIT;

I assume people want both.

As an implementation, I will hack xact.c to create a transaction status
stack so when you do a BEGIN inside a transaction, it saves the
transaction status, the transaction block status, and perhaps the
command counter.  A COMMIT restores these values.

I also plan to modify the on commit/abort actions.  On a subtransaction
commit, little has to be done, but on an ABORT, you must execute any
abort actions required by that subtransaction _and_ remove any on commit
actions for the subtransaction.  There will need to be some code
reorganization because some on commit/abort activity assumes only one
transaction can be in process.  A stack will need to be added in those
cases.


And finally, I must abort tuple changes made by the aborted
subtransaction.  One way of doing that is to keep all relation id's
modified by the transaction, and do a sequential scan of the tables on
abort, changing the transaction id's to a fixed aborted transaction id. 
However, this could be slow.  (We could store tids if only a few rows
are updated by a subtransaction.  That would speed it up considerably.)

Another idea is to use new transaction id's for the subtransactions, and
update the transaction id status in pg_clog for the subtransactions, so
that there is no transaction id renumbering required.  One problem with
this is the requirement of updating all the clog transaction id statuses
atomically.  One way to do that would be to do parent/child dependency
in clog so that if a child is looked up and it is marked as "in
process", a check could be done against the parent.  Once the outer
transaction is committed/aborted, those subtransactions could be updated
so there would be no need to reference the parent any longer.  This
would increase the clog size per transaction from 2 bits to 4 bytes 
(two bits for status, 30 bits for offset to parent).

--  Bruce Momjian                        |  http://candle.pha.pa.us pgman@candle.pha.pa.us               |  (610)
359-1001+  If your life is a hard drive,     |  13 Roberts Road +  Christ can be your backup.        |  Newtown Square,
Pennsylvania19073
 


Re: nested transactions

From
Tom Lane
Date:
Bruce Momjian <pgman@candle.pha.pa.us> writes:
> I am going to work on nested transactions for 7.4.
> [some details]

This is, of course, barely scratching the surface of what will need to
be done.

I assume you've abandoned the notion of a fast release cycle for 7.4?
'Cause if you start on this, we ain't releasing any time soon ...
        regards, tom lane


Re: nested transactions

From
Bruce Momjian
Date:
Tom Lane wrote:
> Bruce Momjian <pgman@candle.pha.pa.us> writes:
> > I am going to work on nested transactions for 7.4.
> > [some details]
> 
> This is, of course, barely scratching the surface of what will need to
> be done.
> 
> I assume you've abandoned the notion of a fast release cycle for 7.4?
> 'Cause if you start on this, we ain't releasing any time soon ...

Abandoned because of the delay in Win32 (end of Dec), PITR (not being
worked on), and mostly because very few wanted a short release cycle.

I will keep the transaction changes private to my tree, so if I can't
get it done, I will just keep it for the next release.

--  Bruce Momjian                        |  http://candle.pha.pa.us pgman@candle.pha.pa.us               |  (610)
359-1001+  If your life is a hard drive,     |  13 Roberts Road +  Christ can be your backup.        |  Newtown Square,
Pennsylvania19073
 


Re: nested transactions

From
Scott Lamb
Date:
Bruce Momjian wrote:
> I am going to work on nested transactions for 7.4.

If you're going to do a lot of reworking of how transactions are 
handled, maybe this is a good time to beg for cursors that stay open 
across commits. It looks like the JDBC driver is moving to using cursors 
with ResultSet.CLOSE_CURSORS_AT_COMMIT, for the advantage of not having 
to fetch the entire result immediately and hold it in memory. If this 
were implemented, the same could be done for 
ResultSet.HOLD_CURSORS_OVER_COMMIT, which I think a lot of JDBC code needs.

Thanks,
Scott



Re: nested transactions

From
snpe
Date:
On Friday 22 November 2002 04:36 pm, Scott Lamb wrote:
> Bruce Momjian wrote:
> > I am going to work on nested transactions for 7.4.
>
> If you're going to do a lot of reworking of how transactions are
> handled, maybe this is a good time to beg for cursors that stay open
> across commits. It looks like the JDBC driver is moving to using cursors
> with ResultSet.CLOSE_CURSORS_AT_COMMIT, for the advantage of not having
> to fetch the entire result immediately and hold it in memory. If this
> were implemented, the same could be done for
> ResultSet.HOLD_CURSORS_OVER_COMMIT, which I think a lot of JDBC code needs.
>

I agree.It is my favorite features - and if you set savepoint I think that stay first solution
(begin; ... ; begin; ...; begin; ...;comit; ...;commit;...; commit;

Thanks 
Haris Peco


Re: nested transactions

From
Manfred Koizar
Date:
On Fri, 22 Nov 2002 00:32:46 -0500 (EST), Bruce Momjian
<pgman@candle.pha.pa.us> wrote:
>I am going to work on nested transactions for 7.4.
> [...]
>And finally, I must abort tuple changes made by the aborted
>subtransaction.  One way of doing that is to keep all relation id's
>modified by the transaction, and do a sequential scan of the tables on
>abort, changing the transaction id's to a fixed aborted transaction id. 
>However, this could be slow.  (We could store tids if only a few rows
>are updated by a subtransaction.  That would speed it up considerably.)

Depends on your definition of "few".  I don't expect problems for up
to several thousand tids.  If there are more modified tuples, we could
first reduce the list to page numbers, before finally falling back to
table scans.

>Another idea is to use new transaction id's for the subtransactions, and
>[...]
>would increase the clog size per transaction from 2 bits to 4 bytes 
>(two bits for status, 30 bits for offset to parent).

Nice idea, this 30 bit offset.  But one could argue that increased
clog size even hurts users who don't use nested transactions at all.
If parent/child dependency is kept separate from status bits (in
pg_subtransxxxx files), additional I/O cost is only paid if
subtransactions are actually used.  New status bits (XMIN_IS_SUB,
XMAX_IS_SUB) in tuple headers can avoid unnecessary parent xid
lookups.

I also thought of subtransaction xids in tuple headers as short lived
information.  Under certain conditions they can be replaced with the
parent xid as soon as the parent transaction has finished.  I proposed
this to be done on the next tuple access just like we set
committed/aborted flags now, though I'm not sure anymore that it is
safe to do this.

Old pg_subtrans files can be removed by VACUUM.

One more difference between the two proposals:  The former (locally
remember modified tuples) can be used for recovery after a failed
command.  The latter (subtrans tree) can only help, if we give a new
xid to each command, which I'm sure we don't want to do.

ServusManfred


Re: nested transactions

From
"Ken Hirsch"
Date:
From: "Bruce Momjian" <pgman@candle.pha.pa.us>
> And finally, I must abort tuple changes made by the aborted
> subtransaction.  One way of doing that is to keep all relation id's
> modified by the transaction, and do a sequential scan of the tables on
> abort, changing the transaction id's to a fixed aborted transaction id.
> However, this could be slow.  (We could store tids if only a few rows
> are updated by a subtransaction.  That would speed it up considerably.)

Are you sure you don't want to use the log for this?  It does mean that the
log can grow without bound for long-lived transactions, but it's very
straightforward and fast.

Ken Hirsch



Re: nested transactions

From
Bruce Momjian
Date:
Manfred Koizar wrote:
> On Fri, 22 Nov 2002 00:32:46 -0500 (EST), Bruce Momjian
> <pgman@candle.pha.pa.us> wrote:
> >I am going to work on nested transactions for 7.4.
> > [...]
> >And finally, I must abort tuple changes made by the aborted
> >subtransaction.  One way of doing that is to keep all relation id's
> >modified by the transaction, and do a sequential scan of the tables on
> >abort, changing the transaction id's to a fixed aborted transaction id. 
> >However, this could be slow.  (We could store tids if only a few rows
> >are updated by a subtransaction.  That would speed it up considerably.)
> 
> Depends on your definition of "few".  I don't expect problems for up
> to several thousand tids.  If there are more modified tuples, we could
> first reduce the list to page numbers, before finally falling back to
> table scans.

Yes, and the key point is that those are kept only in the backend local
memory, so clearly thousands are possible.  The outer transaction takes
care of all the ACID issues.

> >Another idea is to use new transaction id's for the subtransactions, and
> >[...]
> >would increase the clog size per transaction from 2 bits to 4 bytes 
> >(two bits for status, 30 bits for offset to parent).
> 
> Nice idea, this 30 bit offset.  But one could argue that increased
> clog size even hurts users who don't use nested transactions at all.
> If parent/child dependency is kept separate from status bits (in
> pg_subtransxxxx files), additional I/O cost is only paid if
> subtransactions are actually used.  New status bits (XMIN_IS_SUB,
> XMAX_IS_SUB) in tuple headers can avoid unnecessary parent xid
> lookups.
> 
> I also thought of subtransaction xids in tuple headers as short lived
> information.  Under certain conditions they can be replaced with the
> parent xid as soon as the parent transaction has finished.  I proposed
> this to be done on the next tuple access just like we set
> committed/aborted flags now, though I'm not sure anymore that it is
> safe to do this.
> 
> Old pg_subtrans files can be removed by VACUUM.
> 
> One more difference between the two proposals:  The former (locally
> remember modified tuples) can be used for recovery after a failed
> command.  The latter (subtrans tree) can only help, if we give a new
> xid to each command, which I'm sure we don't want to do.

The interesting issue is that if we could set the commit/abort bits all
at the same time, we could have the parent/child dependency local to the
backend --- other backends don't need to know the parent, only the
status of the (subtransaction's) xid, and they need to see all those
xid's committed at the same time.

You could store the backend slot id in pg_clog rather than the parent
xid and look up the status of the outer xid for that backend slot.  That
would allow you to use 2 bytes, with a max of 16k backends.  The problem
is that on a crash, the pg_clog points to invalid slots --- it would
probably have to be cleaned up on startup.

But still, you have an interesting idea of just setting the bit to be "I
am a child".  The trick is allowing backends to figure out who's child
you are.  We could store this somehow in shared memory, but that is
finite and there can be lots of xid's for a backend using
subtransactions.

I still think there must be a clean way, but I haven't figured it out yet.

--  Bruce Momjian                        |  http://candle.pha.pa.us pgman@candle.pha.pa.us               |  (610)
359-1001+  If your life is a hard drive,     |  13 Roberts Road +  Christ can be your backup.        |  Newtown Square,
Pennsylvania19073
 


Re: nested transactions

From
Bruce Momjian
Date:
Ken Hirsch wrote:
> From: "Bruce Momjian" <pgman@candle.pha.pa.us>
> > And finally, I must abort tuple changes made by the aborted
> > subtransaction.  One way of doing that is to keep all relation id's
> > modified by the transaction, and do a sequential scan of the tables on
> > abort, changing the transaction id's to a fixed aborted transaction id.
> > However, this could be slow.  (We could store tids if only a few rows
> > are updated by a subtransaction.  That would speed it up considerably.)
> 
> Are you sure you don't want to use the log for this?  It does mean that the
> log can grow without bound for long-lived transactions, but it's very
> straightforward and fast.

I don't think we want to have unlimited log file growth for long running
transactions/subtransactions.

--  Bruce Momjian                        |  http://candle.pha.pa.us pgman@candle.pha.pa.us               |  (610)
359-1001+  If your life is a hard drive,     |  13 Roberts Road +  Christ can be your backup.        |  Newtown Square,
Pennsylvania19073
 


Re: nested transactions

From
Hans-Jürgen Schönig
Date:
Is there going to be a way to use transactions inside transactions of 
transactions?
In other words:
   BEGIN;   BEGIN;   BEGIN;   BEGIN;
   COMMIT;   COMMIT;   COMMIT;   COMMIT;

Is there a way to have some sort of recursive solution with every 
transaction but the first one being a child transaction?
Is there a way to implement that without too much extra effort?
I just curious how that could be done.
   Hans





Re: nested transactions

From
Manfred Koizar
Date:
On Wed, 27 Nov 2002 22:47:33 -0500 (EST), Bruce Momjian
<pgman@candle.pha.pa.us> wrote:
>The interesting issue is that if we could set the commit/abort bits all
>at the same time, we could have the parent/child dependency local to the
>backend --- other backends don't need to know the parent, only the
>status of the (subtransaction's) xid, and they need to see all those
>xid's committed at the same time.

You mean the commit/abort bit in the tuple headers?  Yes, this would
be interesting, but I see no way how this could be done.  If it could,
there would be no need for pg_clog.

Reading your paragraph above one more time I think you mean the bits
in pg_clog:  Each subtransaction gets its own xid.  On ROLLBACK the
abort bits of the aborted (sub)transaction and all its children are
set in pg_clog immediately.  This operation does not have to be
atomic.  On subtransaction COMMIT nothing happens to pg_clog, the
status is only changed locally, the subtransaction still looks "in
progress" to other backends.  Only when the main transaction commits,
we set the commit bits of the main transaction and all its non-aborted
children in pg_clog.  This action has to be atomic.  Right?

AFAICS the problem lies in updating several pg_clog bits at once.  How
can this be done without holding a potentially long lasting lock?

>You could store the backend slot id in pg_clog rather than the parent
>xid and look up the status of the outer xid for that backend slot.  That
>would allow you to use 2 bytes, with a max of 16k backends.  The problem
>is that on a crash, the pg_clog points to invalid slots --- it would
>probably have to be cleaned up on startup.

Again I would try to keep pg_clog compact and store the backend slots
in another file, thus not slowing down instances where subtransactions
are nor used.  Apart from this minor detail I don't see, how this is
supposed to work.  Could you elaborate?

>But still, you have an interesting idea of just setting the bit to be "I
>am a child".

The idea was to set subtransaction bits in the tuple header.  Here is
yet another different idea:  Let the currently unused fourth state in
pg_clog indicate a committed subtransaction.  There are two bits per
transaction, commit and abort, with the following meaning:
a  c0  0  transaction in progress, the owning backend knows whether it is      a main- or a sub-transaction, other
backendsdon't care1  0  aborted, nobody cares whether main- or sub-transaction0  1  committed main-transaction (*)1  1
committedsub-transaction, have to look for parent in      pg_subtrans
 

If we allow the 1/1 state to be replaced with 0/1 or 1/0 (on the fly
as a side effect of a visibility check, or by vacuum, or by
COMMIT/ROLLBACK), this could save a lot of parent lookups without
having to touch the xids in the tuple headers.

So (*) should read: committed main-transaction or committed
sub-transaction having a committed parent.

>The trick is allowing backends to figure out who's child
>you are.  We could store this somehow in shared memory, but that is
>finite and there can be lots of xid's for a backend using
>subtransactions.

The subtrans dependencies have to be visible to all backends.  Store
them to disk just like pg_clog.  In older proposals I spoke of a
pg_subtrans "table" containing (parent, child) pairs.  This was only
meant as a concept, not as a real SQL table subject to MVCC.  An
efficient(?) implementation could be an array of parent xids, indexed
by child xid.  Most of it can be stolen from the clog code.

One more argument for pg_subtrans being visible to all backends:  If
an UPDATE is about to change a tuple touched by another active
transaction, it waits for the other transaction to commit or abort.
We must always wait for the main transaction, not the subtrans.

>I still think there must be a clean way,
I hope so ...

> but I haven't figured it out yet.
Are we getting nearer?

ServusManfred


Re: nested transactions

From
Bruce Momjian
Date:
Hans-J�rgen Sch�nig wrote:
> Is there going to be a way to use transactions inside transactions of 
> transactions?
> In other words:
> 
>     BEGIN;
>     BEGIN;
>     BEGIN;
>     BEGIN;
> 
>     COMMIT;
>     COMMIT;
>     COMMIT;
>     COMMIT;
> 
> Is there a way to have some sort of recursive solution with every 
> transaction but the first one being a child transaction?
> Is there a way to implement that without too much extra effort?
> I just curious how that could be done.

Sure, nesting will be unlimited.

--  Bruce Momjian                        |  http://candle.pha.pa.us pgman@candle.pha.pa.us               |  (610)
359-1001+  If your life is a hard drive,     |  13 Roberts Road +  Christ can be your backup.        |  Newtown Square,
Pennsylvania19073
 


Re: nested transactions

From
Bruce Momjian
Date:
Manfred Koizar wrote:
> On Wed, 27 Nov 2002 22:47:33 -0500 (EST), Bruce Momjian
> <pgman@candle.pha.pa.us> wrote:
> >The interesting issue is that if we could set the commit/abort bits all
> >at the same time, we could have the parent/child dependency local to the
> >backend --- other backends don't need to know the parent, only the
> >status of the (subtransaction's) xid, and they need to see all those
> >xid's committed at the same time.
> 
> You mean the commit/abort bit in the tuple headers?  Yes, this would
> be interesting, but I see no way how this could be done.  If it could,
> there would be no need for pg_clog.
> 
> Reading your paragraph above one more time I think you mean the bits
> in pg_clog:  Each subtransaction gets its own xid.  On ROLLBACK the

Right.

> abort bits of the aborted (sub)transaction and all its children are
> set in pg_clog immediately.  This operation does not have to be
> atomic.  On subtransaction COMMIT nothing happens to pg_clog, the

Right, going from RUNNING to ABORTED doesn't have to be atomic because
both tuples are invisible.

> status is only changed locally, the subtransaction still looks "in
> progress" to other backends.  Only when the main transaction commits,
> we set the commit bits of the main transaction and all its non-aborted
> children in pg_clog.  This action has to be atomic.  Right?

Right.  We can't have some backends looking at part of the transaction
as committed while at the same time other backends see the transaction
as in process.

> AFAICS the problem lies in updating several pg_clog bits at once.  How
> can this be done without holding a potentially long lasting lock?

Yes, locking is one possible solution, but no one likes that.  One hack
lock idea would be to create a subtransaction-only lock, so if you see
the special 4-th xact state (about to be committed as part of a
subtransaction) you have to wait on that lock (held by the backend
twiddling the xact bits), then look again.  That basically would
serialize all the bit-twiddling for subtransactions.  I am sure I am
going to get a "yuck" from the audience on that one, but I am not sure
how long that bit twiddling could take.  Does xact twiddle every cause
I/O?  I think it could, which would be a pretty big performance problem.
It would serialize the subtransaction commits _and_ block anyone trying
to get the status of those subtransactions.  We would not use the the
4th xid status during the transaction, only while we were twiddling the
bits on commit.

> >You could store the backend slot id in pg_clog rather than the parent
> >xid and look up the status of the outer xid for that backend slot.  That
> >would allow you to use 2 bytes, with a max of 16k backends.  The problem
> >is that on a crash, the pg_clog points to invalid slots --- it would
> >probably have to be cleaned up on startup.
> 
> Again I would try to keep pg_clog compact and store the backend slots
> in another file, thus not slowing down instances where subtransactions
> are nor used.  Apart from this minor detail I don't see, how this is
> supposed to work.  Could you elaborate?

The trick is that when that 4th status is set, backends looking up the
status all need to point to a central location that can be set for all
of them at once, hence the original idea of putting the parent xid in
the clog file.  We don't _need_ to do that, but we do need a way to
_point_ to a central location where the status can be looked up.

> >But still, you have an interesting idea of just setting the bit to be "I
> >am a child".
> 
> The idea was to set subtransaction bits in the tuple header.  Here is
> yet another different idea:  Let the currently unused fourth state in
> pg_clog indicate a committed subtransaction.  There are two bits per
> transaction, commit and abort, with the following meaning:
> 
>  a  c
>  0  0  transaction in progress, the owning backend knows whether it is
>        a main- or a sub-transaction, other backends don't care
>  1  0  aborted, nobody cares whether main- or sub-transaction
>  0  1  committed main-transaction (*)
>  1  1  committed sub-transaction, have to look for parent in
>        pg_subtrans
> 
> If we allow the 1/1 state to be replaced with 0/1 or 1/0 (on the fly
> as a side effect of a visibility check, or by vacuum, or by
> COMMIT/ROLLBACK), this could save a lot of parent lookups without
> having to touch the xids in the tuple headers.

Yes, you could do that, but we can easily just set the clog bits
atomically, and it will not be needed --- the tuple bits really don't
help us, I think.

> So (*) should read: committed main-transaction or committed
> sub-transaction having a committed parent.
> 
> >The trick is allowing backends to figure out who's child
> >you are.  We could store this somehow in shared memory, but that is
> >finite and there can be lots of xid's for a backend using
> >subtransactions.
> 
> The subtrans dependencies have to be visible to all backends.  Store
> them to disk just like pg_clog.  In older proposals I spoke of a
> pg_subtrans "table" containing (parent, child) pairs.  This was only
> meant as a concept, not as a real SQL table subject to MVCC.  An
> efficient(?) implementation could be an array of parent xids, indexed
> by child xid.  Most of it can be stolen from the clog code.

OK, we put it in a file.  And how do we efficiently clean it up?
Remember, it is only to be used for a _brief_ period of time.  I think a
file system solution is doable if we can figure out a way not to create
a file for every xid.  Do we spin through the files (one per outer
transaction) looking for a matching xid when we see that 4th xact state?

Maybe we write the xid's to a file in a special directory in sorted
order, and backends can do a btree search of each file in that directory
looking for the xid, and then knowing the master xid, look up that
status, and once all the children xid's are updated, you delete the
file.

> One more argument for pg_subtrans being visible to all backends:  If
> an UPDATE is about to change a tuple touched by another active
> transaction, it waits for the other transaction to commit or abort.
> We must always wait for the main transaction, not the subtrans.

Yes, but again, the xid status of subtransactions is only update just
before commit of the main transaction, so there is little value to
having those visible.

Let's keep going!

--  Bruce Momjian                        |  http://candle.pha.pa.us pgman@candle.pha.pa.us               |  (610)
359-1001+  If your life is a hard drive,     |  13 Roberts Road +  Christ can be your backup.        |  Newtown Square,
Pennsylvania19073
 


Re: nested transactions

From
Bruce Momjian
Date:
pgman wrote:
> > AFAICS the problem lies in updating several pg_clog bits at once.  How
> > can this be done without holding a potentially long lasting lock?
> 
> Yes, locking is one possible solution, but no one likes that.  One hack
> lock idea would be to create a subtransaction-only lock, so if you see
> the special 4-th xact state (about to be committed as part of a
> subtransaction) you have to wait on that lock (held by the backend
> twiddling the xact bits), then look again.  That basically would
> serialize all the bit-twiddling for subtransactions.  I am sure I am
> going to get a "yuck" from the audience on that one, but I am not sure
> how long that bit twiddling could take.  Does xact twiddle every cause
> I/O?  I think it could, which would be a pretty big performance problem.
> It would serialize the subtransaction commits _and_ block anyone trying
> to get the status of those subtransactions.  We would not use the the
> 4th xid status during the transaction, only while we were twiddling the
> bits on commit.

Let me correct this.  Transaction state readers _don't_ have to block
while the subtransaction is twiddling bits.  Logic would be:
Set all aborted subtransaction status bitsGrab subxact lockSet subxact global status bit to in progressSet all
subtransactionxids to SUBXACT_TO_COMMITSet subxact global status bit to committed (commit happens here)Set all
SUBXACT_TO_COMMITxids to COMMITTEDRelease subxact lock
 

Any transaction looking up a subtransaction that has an
SUBXACT_TO_COMMIT state has to consult the global subxact status bit,
which is a global variable in shared memory.

What this basically does is to funnel all subxid lookups into a single
global subxid status bit.  In fact, even the outer transaction has to be
set to SUBXACT_TO_COMMIT so it commits at the same time as the
subtransactions.

On crash recovery, all SUBXACT_TO_COMMIT have to be cleaned up, somehow,
perhaps using WAL.
The only downside to this approach is that subtransaction bit twiddling
is serialized.

This is an interesting idea becuase it has overhead only to backends
using subtransactions.  It does kill our "multiple commits at the same
time" that we can do with normal transactions.

--  Bruce Momjian                        |  http://candle.pha.pa.us pgman@candle.pha.pa.us               |  (610)
359-1001+  If your life is a hard drive,     |  13 Roberts Road +  Christ can be your backup.        |  Newtown Square,
Pennsylvania19073
 


Re: nested transactions

From
Tom Lane
Date:
Bruce Momjian <pgman@candle.pha.pa.us> writes:
> Yes, locking is one possible solution, but no one likes that.  One hack
> lock idea would be to create a subtransaction-only lock, so if you see
> the special 4-th xact state (about to be committed as part of a
> subtransaction) you have to wait on that lock (held by the backend
> twiddling the xact bits), then look again.  That basically would
> serialize all the bit-twiddling for subtransactions.  I am sure I am
> going to get a "yuck" from the audience on that one,

You sure are.

> but I am not sure
> how long that bit twiddling could take.  Does xact twiddle every cause
> I/O?

Yes, if the page of pg_clog you need to touch is not currently in a
buffer.  With a large transaction you might have hundreds of
subtransactions, which could take an unpleasantly long time to mark
all committed.

What's worse, I think the above proposal requires a *single* lock for
this purpose (if there's more than one, how shall the requestor know
which one to block on?) --- so you are serializing all transaction
commits that have subtransactions, with only one able to go through at
a time.  That will really, really not do; the performance will be way
worse than the chaining idea we discussed before.

> You could store the backend slot id in pg_clog rather than the parent
> xid and look up the status of the outer xid for that backend slot.  That
> would allow you to use 2 bytes, with a max of 16k backends.

This is also a bad idea, because backend slot ids are not stable (by the
time you look in PG_PROC, the slot may be occupied by a new, unrelated
backend process).

> But still, you have an interesting idea of just setting the bit to be "I
> am a child".

That bit alone doesn't help; you need to know *whose* child.

AFAICS, the objection to putting parent xact IDs into pg_clog is
basically a performance issue: bigger clog means more I/O.  This is
surely true; but the alternatives proposed so far are worse.
        regards, tom lane


Re: nested transactions

From
Bruce Momjian
Date:
I should add that I am not prepared to overhaul the pg_clog file format
as part of adding subtransactions for 7.4.  I can do the tid/sequential scan
method for abort, or the single-lock method described.

---------------------------------------------------------------------------

Tom Lane wrote:
> Bruce Momjian <pgman@candle.pha.pa.us> writes:
> > Yes, locking is one possible solution, but no one likes that.  One hack
> > lock idea would be to create a subtransaction-only lock, so if you see
> > the special 4-th xact state (about to be committed as part of a
> > subtransaction) you have to wait on that lock (held by the backend
> > twiddling the xact bits), then look again.  That basically would
> > serialize all the bit-twiddling for subtransactions.  I am sure I am
> > going to get a "yuck" from the audience on that one,
> 
> You sure are.
> 
> > but I am not sure
> > how long that bit twiddling could take.  Does xact twiddle every cause
> > I/O?
> 
> Yes, if the page of pg_clog you need to touch is not currently in a
> buffer.  With a large transaction you might have hundreds of
> subtransactions, which could take an unpleasantly long time to mark
> all committed.
> 
> What's worse, I think the above proposal requires a *single* lock for
> this purpose (if there's more than one, how shall the requestor know
> which one to block on?) --- so you are serializing all transaction
> commits that have subtransactions, with only one able to go through at
> a time.  That will really, really not do; the performance will be way
> worse than the chaining idea we discussed before.
> 
> > You could store the backend slot id in pg_clog rather than the parent
> > xid and look up the status of the outer xid for that backend slot.  That
> > would allow you to use 2 bytes, with a max of 16k backends.
> 
> This is also a bad idea, because backend slot ids are not stable (by the
> time you look in PG_PROC, the slot may be occupied by a new, unrelated
> backend process).
> 
> > But still, you have an interesting idea of just setting the bit to be "I
> > am a child".
> 
> That bit alone doesn't help; you need to know *whose* child.
> 
> AFAICS, the objection to putting parent xact IDs into pg_clog is
> basically a performance issue: bigger clog means more I/O.  This is
> surely true; but the alternatives proposed so far are worse.
> 
>             regards, tom lane
> 

--  Bruce Momjian                        |  http://candle.pha.pa.us pgman@candle.pha.pa.us               |  (610)
359-1001+  If your life is a hard drive,     |  13 Roberts Road +  Christ can be your backup.        |  Newtown Square,
Pennsylvania19073
 


Re: nested transactions

From
Tom Lane
Date:
Bruce Momjian <pgman@candle.pha.pa.us> writes:
> I should add that I am not prepared to overhaul the pg_clog file format
> as part of adding subtransactions for 7.4.  I can do the tid/sequential scan
> method for abort, or the single-lock method described.

If you think that changing the pg_clog file format would be harder than
either of those other ideas, I think you're very badly mistaken.
pg_clog is touched only by one rather simple module.

I think the other methods will be completely unacceptable from a
performance point of view.  They could maybe work if subtransactions
were a seldom-used feature; but the people who want to use 'em are
mostly talking about a subtransaction for *every* command.  If you
design your implementation on the assumption that subtransactions are
infrequent, it will be unusably slow.
        regards, tom lane


Re: nested transactions

From
Bruce Momjian
Date:
Tom Lane wrote:
> Bruce Momjian <pgman@candle.pha.pa.us> writes:
> > I should add that I am not prepared to overhaul the pg_clog file format
> > as part of adding subtransactions for 7.4.  I can do the tid/sequential scan
> > method for abort, or the single-lock method described.
> 
> If you think that changing the pg_clog file format would be harder than
> either of those other ideas, I think you're very badly mistaken.
> pg_clog is touched only by one rather simple module.

Agreed, the clog changes would be the simple solution.  However, I am
not sure I can make that level of changes.  In fact, the complexity of
having multiple transactions per backend is going to be tough for me
too.

Also, I should point out that balooning pg_clog by 16x is going to mean
we are perhaps 4-8x more likely to need extra pages to mark all
subtransactions.

Isn't there some other way we can link these subtransactions together
rather than mucking with pg_clog, as we only need the linkage while we
mark them all committed?

--  Bruce Momjian                        |  http://candle.pha.pa.us pgman@candle.pha.pa.us               |  (610)
359-1001+  If your life is a hard drive,     |  13 Roberts Road +  Christ can be your backup.        |  Newtown Square,
Pennsylvania19073
 


Re: nested transactions

From
Tom Lane
Date:
Bruce Momjian <pgman@candle.pha.pa.us> writes:
> Also, I should point out that balooning pg_clog by 16x is going to mean
> we are perhaps 4-8x more likely to need extra pages to mark all
> subtransactions.

So?  The critical point is that we don't need to serialize the pg_clog
operations if we do it that way.  Also, we can certainly expand the
number of pg_clog pages held in memory by some amount.  Right now it's
only 4, IIRC.  We could make it 64 and probably no one would even
notice.

> Isn't there some other way we can link these subtransactions together
> rather than mucking with pg_clog, as we only need the linkage while we
> mark them all committed?

You *cannot* expect to do it all in shared memory; you will be blown out
of the water by the first long transaction that comes along, if you try.
So the question is not whether we put the status into a file, it is only
what representation we choose.

Manfred suggested a separate log file ("pg_subclog" or some such) but
I really don't see any operational advantage to that.  You still end up
with 4 bytes per transaction, you're just assuming that putting them
in a different file makes it better.  I don't see how.
        regards, tom lane


Re: nested transactions

From
Bruce Momjian
Date:
Tom Lane wrote:
> > Isn't there some other way we can link these subtransactions together
> > rather than mucking with pg_clog, as we only need the linkage while we
> > mark them all committed?
> 
> You *cannot* expect to do it all in shared memory; you will be blown out
> of the water by the first long transaction that comes along, if you try.
> So the question is not whether we put the status into a file, it is only
> what representation we choose.
> 
> Manfred suggested a separate log file ("pg_subclog" or some such) but
> I really don't see any operational advantage to that.  You still end up
> with 4 bytes per transaction, you're just assuming that putting them
> in a different file makes it better.  I don't see how.

It only becomes better if we can throw away that file (or contents) when
the transaction completes and we have marked all the subtransactions as
completed.  We can't compress pg_clog if we store the parent info in
there.

--  Bruce Momjian                        |  http://candle.pha.pa.us pgman@candle.pha.pa.us               |  (610)
359-1001+  If your life is a hard drive,     |  13 Roberts Road +  Christ can be your backup.        |  Newtown Square,
Pennsylvania19073
 


Re: nested transactions

From
Tom Lane
Date:
Bruce Momjian <pgman@candle.pha.pa.us> writes:
> It only becomes better if we can throw away that file (or contents) when
> the transaction completes and we have marked all the subtransactions as
> completed.  We can't compress pg_clog if we store the parent info in
> there.

But we already have a recycling mechanism for pg_clog.  AFAICS,
creating a parallel log file with a separate recycling mechanism is
a study in wasted effort.
        regards, tom lane


Re: nested transactions

From
Bruce Momjian
Date:
Tom Lane wrote:
> Bruce Momjian <pgman@candle.pha.pa.us> writes:
> > It only becomes better if we can throw away that file (or contents) when
> > the transaction completes and we have marked all the subtransactions as
> > completed.  We can't compress pg_clog if we store the parent info in
> > there.
> 
> But we already have a recycling mechanism for pg_clog.  AFAICS,
> creating a parallel log file with a separate recycling mechanism is
> a study in wasted effort.

But that recycling requires the vacuum of every database in the system. 
Do people do that frequently enough?

--  Bruce Momjian                        |  http://candle.pha.pa.us pgman@candle.pha.pa.us               |  (610)
359-1001+  If your life is a hard drive,     |  13 Roberts Road +  Christ can be your backup.        |  Newtown Square,
Pennsylvania19073
 


Re: nested transactions

From
Tom Lane
Date:
Bruce Momjian <pgman@candle.pha.pa.us> writes:
> Tom Lane wrote:
>> But we already have a recycling mechanism for pg_clog.  AFAICS,
>> creating a parallel log file with a separate recycling mechanism is
>> a study in wasted effort.

> But that recycling requires the vacuum of every database in the system. 
> Do people do that frequently enough?

Once the auto vacuum code is in there, they won't have any choice ;-)

In any case, I saw no part of your proposal that removed the need for
vacuum, so what's your point?
        regards, tom lane


Re: nested transactions

From
Manfred Koizar
Date:
On Thu, 28 Nov 2002 21:46:09 -0500, Tom Lane <tgl@sss.pgh.pa.us>
wrote:
>Manfred suggested a separate log file ("pg_subclog" or some such) but
>I really don't see any operational advantage to that.  You still end up
>with 4 bytes per transaction, you're just assuming that putting them
>in a different file makes it better.  I don't see how.

There are two points:

1) If your site/instance/application/whatever... does not use nested
transactions or does use them only occasionally, you don't have to pay
the additional I/O cost.

2) If we update a subtransaction's pg_clog bits as soon as the status
of the main transaction is known, pg_subtrans is only visited once per
subtransaction, while pg_clog has to be looked up once per tuple.

Things might look different however, if we wrap every command into a
subtransaction...

ServusManfred


Re: nested transactions

From
"Matthew T. O'Connor"
Date:
On Friday 29 November 2002 00:56, Tom Lane wrote:
> Bruce Momjian <pgman@candle.pha.pa.us> writes:
> > Tom Lane wrote:
> >> But we already have a recycling mechanism for pg_clog.  AFAICS,
> >> creating a parallel log file with a separate recycling mechanism is
> >> a study in wasted effort.
> >
> > But that recycling requires the vacuum of every database in the system.
> > Do people do that frequently enough?
>
> Once the auto vacuum code is in there, they won't have any choice ;-)

OK, I know postgres needs to be vacuumed every so often (I think its to
guarantee safe XID wraparound?)  I think the AVD should do something to
guarnatee this is hapening.  Since I am working on AVD, what are the criterea
for this?  From the above I assume it also pertains to pg_clog recycling
(which is related to XID wraparound?), but I know nothing about that.

Right now AVD only performs vacuum analyze on specific tables as it deems they
need it, it does not perform vacuum on entire databases at any point yet.



Re: nested transactions

From
Tom Lane
Date:
Manfred Koizar <mkoi-pg@aon.at> writes:
> 1) If your site/instance/application/whatever... does not use nested
> transactions or does use them only occasionally, you don't have to pay
> the additional I/O cost.

As I already said to Bruce, designing this facility on the assumption
that it will be seldom-used is a recipe for failure.  Everybody and
his brother wants commands that don't abort the whole transaction.
As soon as this facility exists, you can bet that the standard mode
of operation will become "one subtransaction per interactive command".
If you don't design it to support that load, you may as well not bother
to build it at all.

> 2) If we update a subtransaction's pg_clog bits as soon as the status
> of the main transaction is known, pg_subtrans is only visited once per
> subtransaction, while pg_clog has to be looked up once per tuple.

How you figure that?  It seems to me the visit rate is exactly the same,
you've just divided it into two files.  Having to touch two files
instead of one seems if anything worse.
        regards, tom lane


Re: nested transactions

From
Tom Lane
Date:
"Matthew T. O'Connor" <matthew@zeut.net> writes:
> Right now AVD only performs vacuum analyze on specific tables as it deems they 
> need it, it does not perform vacuum on entire databases at any point yet.

See
http://www.ca.postgresql.org/users-lounge/docs/7.2/postgres/routine-vacuuming.html

However I think that only talks about XID wraparound and datfrozenxid.
pg_clog recycling is driven off the oldest datvacuumxid in pg_database;
the AVD should think about launching a database-wide vacuum whenever
age(datvacuumxid) exceeds a million or two.
        regards, tom lane


Re: nested transactions

From
Bruce Momjian
Date:
Manfred Koizar wrote:
> One more argument for pg_subtrans being visible to all backends:  If
> an UPDATE is about to change a tuple touched by another active
> transaction, it waits for the other transaction to commit or abort.
> We must always wait for the main transaction, not the subtrans.

This issue kills the idea that we can get away with providing lookup to
the other backends _only_ while we are twiddling the clog bits.  Other
transactions are going to need to know if the XID they see on the tuple
is owned by an active backend.  This means we have to provide
child/master xid lookup during the transaction, meaning we may as well
use pg_clog or separate file, especially if we can get autovacuum for
7.4.  It kills the idea that somehow locking would work.

--  Bruce Momjian                        |  http://candle.pha.pa.us pgman@candle.pha.pa.us               |  (610)
359-1001+  If your life is a hard drive,     |  13 Roberts Road +  Christ can be your backup.        |  Newtown Square,
Pennsylvania19073
 


Re: nested transactions

From
Manfred Koizar
Date:
On Thu, 28 Nov 2002 12:59:21 -0500 (EST), Bruce Momjian
<pgman@candle.pha.pa.us> wrote:
>Yes, locking is one possible solution, but no one likes that.  One hack
>lock idea would be to create a subtransaction-only lock, [...]
>
>> [...] without
>> having to touch the xids in the tuple headers.
>
>Yes, you could do that, but we can easily just set the clog bits
>atomically,

From what I read above I don't think we can *easily* set more than one
transaction's bits atomically.

> and it will not be needed --- the tuple bits really don't
>help us, I think.

Yes, this is what I said, or at least tried to say.  I just wanted to
make clear how this new approach (use the fourth status) differs from
older proposals (replace subtransaction ids in tuple headers).

>OK, we put it in a file.  And how do we efficiently clean it up?
>Remember, it is only to be used for a _brief_ period of time.  I think a
>file system solution is doable if we can figure out a way not to create
>a file for every xid.

I don't want to create one file for every transaction, but rather a
huge (sparse) array of parent xids.  This array is divided into
manageable chunks, represented by files, "pg_subtrans_NNNN".  These
files are only created when necessary.  At any time only a tiny part
of the whole array is kept in shared buffers.  This concept is similar
or almost equal to pg_clog, which is an array of doublebits.

>Maybe we write the xid's to a file in a special directory in sorted
>order, and backends can do a btree search of each file in that directory
>looking for the xid, and then knowing the master xid, look up that
>status, and once all the children xid's are updated, you delete the
>file.

Yes, dense arrays or btrees are other possible implementations.  But
for simplicity I'd do it pg_clog style.

>Yes, but again, the xid status of subtransactions is only update just
>before commit of the main transaction, so there is little value to
>having those visible.

Having them visible solves the atomicity problem without requiring
long locks.  Updating the status of a single (main or sub) transaction
is atomic, just like it is now.

Here is what is to be done for some operations:

BEGIN main transaction:Get a new xid (no change to current behaviour).pg_clog[xid] is still 00, meaning
active.pg_subtrans[xid]is still 0, meaning no parent.
 

BEGIN subtransaction:Push current transaction info onto local stack.Get a new xid.Record parent xid in
pg_subtrans[xid].pg_clog[xid]is still 00.
 

ROLLBACK subtransaction:Set pg_clog[xid] to 10 (aborted).Optionally set clog bits for subsubtransactions to 10.Pop
transactioninfo from stack.
 

COMMIT subtransaction:Set pg_clog[xid] to 11 (committed subtrans).Don't touch clog bits for subsubtransactions!Pop
transactioninfo from stack.
 

ROLLBACK main transaction:Set pg_clog[xid] to 10 (aborted).Optionally set clog bits for subtransactions to 10.
COMMIT main transaction:Set pg_clog[xid] to 01 (committed).Optionally set clog bits for subtransactions from 11 to
01.Don'ttouch clog bits for aborted subtransactions!
 

Visibility check by other transactions:  If a tuple is visited and its
XMIN/XMAX_IS_COMMITTED/ABORTED flags are not yet set, pg_clog has to
be consulted to find out the status of the inserting/deleting
transaction xid.  If pg_clog[xid] is ...
00:  transaction still active
10:  aborted
01:  committed
11:  committed subtransaction, have to check parent

Only in this last case do we have to get parentxid from pg_subtrans.
Now we look at pg_clog[parentxid].  If we find ...
00:  parent still active, so xid is considered active, too
10:  parent aborted, so xid is considered aborted,     optionally set pg_clog[xid] = 10
01:  parent committed, so xid is considered committed,     optionally set pg_clog[xid] = 01
11:  recursively check grandparent(s) ...

For brevity the following operations are not covered in detail:
. Visibility checks for tuples inserted/deleted by a (sub)transaction
belonging to the current transaction tree (have to check local
transaction stack whenever we look at a xid or switch to a parent xid)
. HeapTupleSatisfiesUpdate (sometimes has to wait for parent
transaction)

The trick here is, that subtransaction status is immediately updated
in pg_clog on commit/abort.  Main transaction commit is atomic (just
set its commit bit).  Status 11 is short-lived, it is replaced with
the final status by one or more of
- COMMIT/ROLLBACK of the main transaction- a later visibility check (as a side effect)- VACUUM

pg_subtrans cleanup:  A pg_subtrans_NNNN file covers a known range of
transaction ids.  As soon as none of these transactions has a pg_clog
status of 11, the pg_subtrans_NNNN file can be removed.  VACUUM can do
this, and it won't even have to check the heap.

ServusManfred


Re: nested transactions

From
Tom Lane
Date:
Manfred Koizar <mkoi-pg@aon.at> writes:
> Visibility check by other transactions:  If a tuple is visited and its
> XMIN/XMAX_IS_COMMITTED/ABORTED flags are not yet set, pg_clog has to
> be consulted to find out the status of the inserting/deleting
> transaction xid.  If pg_clog[xid] is ...

>     00:  transaction still active

>     10:  aborted

>     01:  committed

>     11:  committed subtransaction, have to check parent

> Only in this last case do we have to get parentxid from pg_subtrans.

Unfortunately this discussion is wrong.  User-level visibility checks
will usually have to fetch the parentxid in case 01 as well, because
even if the parent is committed, it might not be visible in our
snapshot.  Snapshots will record only topmost-parent XIDs (because
that's what we can find in the PG_PROC array, and anything else would
create atomicity problems anyway).  So we must chase to the topmost
parent before testing visibility.

This means that the parentxid will need to be fetched in enough cases
that it's quite dubious that pushing it to a different file saves I/O.

Also, using a 11 state doubles the amount of pg_clog I/O needed to
commit a collection of subtransactions.  You have to write 11 as the
state of each commitable subtransaction, then commit the parent (write
01 as its state), then go back and change the state of each
subtransaction to 01.  (Whether this last bit is done as part of parent
transaction commit, or during later inspections of the state of the
subtransaction, doesn't change the argument.)

I think it would be preferable to use only three states: active,
aborted, committed.  The parent commit protocol is (1) write 10 as state
of each aborted subtransaction (this should be done as soon as the
subtransaction is known aborted, rather than delaying to parent commit);
(2) write 01 as state of parent (this is the atomic commit); (3) write
01 as state of each committed subtransaction.  Readers who see 00 must
check the parent state; if the parent is committed then they have to go
back and recheck the child state (to see if it became "aborted" after
they looked).  This halves the write traffic during a commit, at the
cost of additional read traffic when subtransaction state is checked in
a narrow window after the time of parent transaction commit.  I believe
it nets out to be faster.
        regards, tom lane


Re: nested transactions

From
Manfred Koizar
Date:
On Fri, 29 Nov 2002 13:33:28 -0500, Tom Lane <tgl@sss.pgh.pa.us>
wrote:
>Unfortunately this discussion is wrong.  User-level visibility checks
>will usually have to fetch the parentxid in case 01 as well, because
>even if the parent is committed, it might not be visible in our
>snapshot.

Or we don't allow a subtransaction's status to be updated from 11 to
01 until we know, that the main transaction is visible to all active
transactions.  Didn't check whether this is expensive to find out.  At
least it should be doable by VACCUM.

>Snapshots will record only topmost-parent XIDs (because
>that's what we can find in the PG_PROC array, and anything else would
>create atomicity problems anyway).  So we must chase to the topmost
>parent before testing visibility.

BTW, I think this *forces* us to replace the sub xid with the
respective main xid in a tuple header, when we set
XMIN/MAX_IS_COMMITTED.  Otherwise we'd have to look for the main xid,
whenever a tuple is touched.

>Also, using a 11 state doubles the amount of pg_clog I/O needed to
>commit a collection of subtransactions.

Is a pg_clog page written out to disk each time a bit is changed?  I'd
expect some locality.

>I think it would be preferable to use only three states: active,
>aborted, committed.  The parent commit protocol is (1) write 10 as state
>of each aborted subtransaction (this should be done as soon as the
>subtransaction is known aborted, rather than delaying to parent commit);
>(2) write 01 as state of parent (this is the atomic commit); (3) write
>01 as state of each committed subtransaction.  Readers who see 00 must
>check the parent state; if the parent is committed then they have to go
>back and recheck the child state (to see if it became "aborted" after
>they looked).

Nice idea!  This saves the fourth status for future uses (for example,
Firebird uses it for two phase commit).  OTOH for reasons you
mentioned above there's no chance to save parent xid lookups, if we go
this way.

>This halves the write traffic during a commit, at the
>cost of additional read traffic when subtransaction state is checked in
>a narrow window after the time of parent transaction commit.  I believe
>it nets out to be faster.

Maybe.  The whole point of my approach is:  If we can limit the active
range of transactions requiring parent xid lookups to a small fraction
of the range needing pg_clog lookups, then it makes sense to store
status bits and parent xids in different files.  Otherwise keeping
them together in one file clearly is faster.

ServusManfred


Re: nested transactions

From
Bruce Momjian
Date:
Manfred Koizar wrote:
> On Fri, 29 Nov 2002 13:33:28 -0500, Tom Lane <tgl@sss.pgh.pa.us>
> wrote:
> >Unfortunately this discussion is wrong.  User-level visibility checks
> >will usually have to fetch the parentxid in case 01 as well, because
> >even if the parent is committed, it might not be visible in our
> >snapshot.
> 
> Or we don't allow a subtransaction's status to be updated from 11 to
> 01 until we know, that the main transaction is visible to all active
> transactions.  Didn't check whether this is expensive to find out.  At
> least it should be doable by VACCUM.
> 
> >Snapshots will record only topmost-parent XIDs (because
> >that's what we can find in the PG_PROC array, and anything else would
> >create atomicity problems anyway).  So we must chase to the topmost
> >parent before testing visibility.
> 
> BTW, I think this *forces* us to replace the sub xid with the
> respective main xid in a tuple header, when we set
> XMIN/MAX_IS_COMMITTED.  Otherwise we'd have to look for the main xid,
> whenever a tuple is touched.

Sorry, I don't follow this.  As far as I know, we will set the subxid on
the tuple so we can independently mark the xact as aborted without
revisiting all the tuples.  Once it is committed/rolled back, I see no
need to lookup the parent, and in fact we could clear the clog parent
xid offset so there is no way to access the parent anymore.

--  Bruce Momjian                        |  http://candle.pha.pa.us pgman@candle.pha.pa.us               |  (610)
359-1001+  If your life is a hard drive,     |  13 Roberts Road +  Christ can be your backup.        |  Newtown Square,
Pennsylvania19073
 


Re: nested transactions

From
Tom Lane
Date:
Manfred Koizar <mkoi-pg@aon.at> writes:
> Maybe.  The whole point of my approach is:  If we can limit the active
> range of transactions requiring parent xid lookups to a small fraction
> of the range needing pg_clog lookups, then it makes sense to store
> status bits and parent xids in different files.  Otherwise keeping
> them together in one file clearly is faster.

Hmm ... I'm not sure that that's possible.

But wait a moment.  The child xid is by definition always greater than
(newer than) its parent.  So if we consult pg_clog and find the
transaction marked committed, *and* the xid is before the window of XIDs
in our snapshot, then even if it's not a top-level xid, the parent must
be before our window too.  Therefore we can conclude the transaction is
visible in our snapshot.  So indeed there is a good-size range of xids
for which we'll never need to chase the parent link: everything before
the RecentGlobalXmin computed by GetSnapshotData.  (We do have to set
subtransactions to committed during parent commit to make this true;
we can't update them lazily.  But I think that's okay.)

Maybe you're right --- we could probably truncate pg_subtrans faster
than pg_clog, and we could definitely expect to keep less of it in
memory than pg_clog.
        regards, tom lane


Re: nested transactions

From
Bruce Momjian
Date:
I am concerned this is getting beyond my capabilities for 7.4 --- anyone
want to help?

---------------------------------------------------------------------------

Tom Lane wrote:
> Manfred Koizar <mkoi-pg@aon.at> writes:
> > Maybe.  The whole point of my approach is:  If we can limit the active
> > range of transactions requiring parent xid lookups to a small fraction
> > of the range needing pg_clog lookups, then it makes sense to store
> > status bits and parent xids in different files.  Otherwise keeping
> > them together in one file clearly is faster.
> 
> Hmm ... I'm not sure that that's possible.
> 
> But wait a moment.  The child xid is by definition always greater than
> (newer than) its parent.  So if we consult pg_clog and find the
> transaction marked committed, *and* the xid is before the window of XIDs
> in our snapshot, then even if it's not a top-level xid, the parent must
> be before our window too.  Therefore we can conclude the transaction is
> visible in our snapshot.  So indeed there is a good-size range of xids
> for which we'll never need to chase the parent link: everything before
> the RecentGlobalXmin computed by GetSnapshotData.  (We do have to set
> subtransactions to committed during parent commit to make this true;
> we can't update them lazily.  But I think that's okay.)
> 
> Maybe you're right --- we could probably truncate pg_subtrans faster
> than pg_clog, and we could definitely expect to keep less of it in
> memory than pg_clog.
> 
>             regards, tom lane
> 
> ---------------------------(end of broadcast)---------------------------
> TIP 3: if posting/reading through Usenet, please send an appropriate
> subscribe-nomail command to majordomo@postgresql.org so that your
> message can get through to the mailing list cleanly
> 

--  Bruce Momjian                        |  http://candle.pha.pa.us pgman@candle.pha.pa.us               |  (610)
359-1001+  If your life is a hard drive,     |  13 Roberts Road +  Christ can be your backup.        |  Newtown Square,
Pennsylvania19073
 


Re: nested transactions

From
Manfred Koizar
Date:
[Sorry for the delay.  I'm a bit busy these days.]

On Fri, 29 Nov 2002 15:57:17 -0500 (EST), Bruce Momjian
<pgman@candle.pha.pa.us> wrote:
>> BTW, I think this *forces* us to replace the sub xid with the
>> respective main xid in a tuple header, when we set
>> XMIN/MAX_IS_COMMITTED.  Otherwise we'd have to look for the main xid,
>> whenever a tuple is touched.
>
>Sorry, I don't follow this.

Probably because we've mixed up several proposals.  I'll try to pick
them apart below.

>As far as I know, we will set the subxid on
>the tuple so we can independently mark the xact as aborted without
>revisiting all the tuples.

Yes.

>Once it is committed/rolled back,

These cases are completely different.  If a (main or sub-) transaction
is rolled back, its effects are invisible to all transactions; this
status is immediately effective and final.  OTOH a subtransaction
commit is only tentative.  It becomes effective when the main
transaction commits.  (And the subtransaction's status turns to
aborted, when the main transaction aborts.)

>I see no
>need to lookup the parent, and in fact we could clear the clog parent
>xid offset so there is no way to access the parent anymore.

While a subtransaction is seen as "tentatively committed" other
transactions have to look up its parent to find out its effective
status.

Proposal A was:  Never show "tentatively committed" to outside
transactions.  This would require neither any new flags in tuple
headers or in pg_clog nor a globally visible pg_subtrans structure.
But it only works, if we can commit a main transaction and all its
subtransactions atomically, which is only possible if we hold a long
lasting lock.  Did we agree that we don't want this?

All other solutions require a parent xid lookup at least during the
time span while a subtransaction is marked "tentatively committed" and
not yet known to be "finally committed".  IIRC we have three proposals
how the "tentatively committed" status can be shown to outside
transactions:

(B) Two flags in the tuple header (one for xmin, one for xmax) telling
us "the xid is a subtransaction".  I don't like this very much,
because it's not in Normal Form: "is a subtransaction" is NOT a
property of a tuple.  OTOH we can declare it a denormalization for
performance reasons (we don't have to look up the parend xid, if the
flag is not set.)

(C) Explicitly use the fourth possible status in pg_clog for
"tentatively committed".  (Performance hack:  replace with "finally
committed" as soon as the xid is visible to all active transactions.)

(D) Only one kind of "committed" in pg_clog; always look for a parent
in pg_subtrans; for performance reasons integrate pg_subtrans into
pc_clog.

Tom brought up the snapshot visibility problem which applies to B, C,
and D.

While each of these proposals can be implemented (relatively) straight
forward, the Black Art is:  When and how can we modify the stored
state to avoid repeated parent xid lookups?  We'll find out ...

ServusManfred


Re: nested transactions

From
Bruce Momjian
Date:
Manfred Koizar wrote:
> [Sorry for the delay.  I'm a bit busy these days.]
> 
> On Fri, 29 Nov 2002 15:57:17 -0500 (EST), Bruce Momjian
> <pgman@candle.pha.pa.us> wrote:
> >> BTW, I think this *forces* us to replace the sub xid with the
> >> respective main xid in a tuple header, when we set
> >> XMIN/MAX_IS_COMMITTED.  Otherwise we'd have to look for the main xid,
> >> whenever a tuple is touched.
> >
> >Sorry, I don't follow this.
> 
> Probably because we've mixed up several proposals.  I'll try to pick
> them apart below.

OK.

> These cases are completely different.  If a (main or sub-) transaction
> is rolled back, its effects are invisible to all transactions; this
> status is immediately effective and final.  OTOH a subtransaction
> commit is only tentative.  It becomes effective when the main
> transaction commits.  (And the subtransaction's status turns to
> aborted, when the main transaction aborts.)

Right.

> >I see no
> >need to lookup the parent, and in fact we could clear the clog parent
> >xid offset so there is no way to access the parent anymore.
> 
> While a subtransaction is seen as "tentatively committed" other
> transactions have to look up its parent to find out its effective
> status.

Right.  And we need those lookups to parent from the start of the
subtransaction until the commit/abort of the main transaction.  If it
aborts, we can shorten that, but if they are all commit, we have to
wait, and they have to be visible because other backends have to know if
the "Running" status of the transaction is still associated with an
active transaction, and we can only stamp one xid on a backend because
shared memory is limited.

> Proposal A was:  Never show "tentatively committed" to outside
> transactions.  This would require neither any new flags in tuple
> headers or in pg_clog nor a globally visible pg_subtrans structure.
> But it only works, if we can commit a main transaction and all its
> subtransactions atomically, which is only possible if we hold a long
> lasting lock.  Did we agree that we don't want this?

Again, we still need the lookup to main transaction for other backend
lookups, so this idea is dead, and we don't want locking.

> All other solutions require a parent xid lookup at least during the
> time span while a subtransaction is marked "tentatively committed" and
> not yet known to be "finally committed".  IIRC we have three proposals
> how the "tentatively committed" status can be shown to outside
> transactions:

Yes.

> (B) Two flags in the tuple header (one for xmin, one for xmax) telling
> us "the xid is a subtransaction".  I don't like this very much,
> because it's not in Normal Form: "is a subtransaction" is NOT a
> property of a tuple.  OTOH we can declare it a denormalization for
> performance reasons (we don't have to look up the parend xid, if the
> flag is not set.)

I see no reason to do that when we have that 4th state available in
pg_clog.  They are going to lookup the xid status anyway, so why not
check that "is subtransaction" status at that point too.  Of course, we
can't mark "IS COMMITTED" on the tuple until the main transaction
commits, but that is simple logic.


> (C) Explicitly use the fourth possible status in pg_clog for
> "tentatively committed".  (Performance hack:  replace with "finally
> committed" as soon as the xid is visible to all active transactions.)

Yes, I think this is the only way to go.  If we need that 4th state
later, we can refactor the code, but for our purposes now, it is useful.

> (D) Only one kind of "committed" in pg_clog; always look for a parent
> in pg_subtrans; for performance reasons integrate pg_subtrans into
> pc_clog.

Seems that 4th state makes this an easy optimization, causing zero
overhead for backends that _don't_ use subtransactions, except for
backends looking up the status of other backends with subtransactions.

> Tom brought up the snapshot visibility problem which applies to B, C,
> and D.
> 
> While each of these proposals can be implemented (relatively) straight
> forward, the Black Art is:  When and how can we modify the stored
> state to avoid repeated parent xid lookups?  We'll find out ...

I think there is now general agreement that we want a separate table to
store parent xids for subtransactions that is only looked up when that
4th clog state is set, and once the main transaction commits, all those
4th state clog entries can be cleaned up to simple commit.  We can also
expire the pg_subtrans table for any xids less than the lowest running
backend xid, which is pretty significant optimization.

--  Bruce Momjian                        |  http://candle.pha.pa.us pgman@candle.pha.pa.us               |  (610)
359-1001+  If your life is a hard drive,     |  13 Roberts Road +  Christ can be your backup.        |  Newtown Square,
Pennsylvania19073
 


Re: nested transactions

From
Kevin Brown
Date:
Bruce Momjian wrote:
> I am going to work on nested transactions for 7.4.
> 
> My goal is to first implement nested transactions:
> 
>     BEGIN;
>     SELECT ...
>     BEGIN;
>     UPDATE;
>     COMMIT;
>     DELETE;
>     COMMIT;
> 
> and later savepoints (Oracle):
> 
> 
>     BEGIN;
>     SELECT ...
>     SAVEPOINT t1;
>     UPDATE;
>     SAVEPOINT t2;
>     DELETE;
>     ROLLBACK TO SAVEPOINT t2;
>     COMMIT;
> 
> I assume people want both.

Yep.

My question is: how do you see cursors working with nested
transactions?

Right now you can't do cursors outside of transactions.
Subtransactions would complicate things a bit:

BEGIN;
DECLARE CURSOR x ...
BEGIN
(is cursor x visible here?  What are the implications of using it if
it is?)
...
COMMIT;
...
COMMIT;


Would we only allow cursors within the innermost transactions?  If we
allow them anywhere else, why retain the requirement that they be used
within transactions at all?


-- 
Kevin Brown                          kevin@sysexperts.com


Re: nested transactions

From
Bruce Momjian
Date:
Kevin Brown wrote:
> My question is: how do you see cursors working with nested
> transactions?
> 
> Right now you can't do cursors outside of transactions.
> Subtransactions would complicate things a bit:
> 
> BEGIN;
> DECLARE CURSOR x ...
> BEGIN
> (is cursor x visible here?  What are the implications of using it if
> it is?)
> ...
> COMMIT;
> ...
> COMMIT;
> 
> 
> Would we only allow cursors within the innermost transactions?  If we
> allow them anywhere else, why retain the requirement that they be used
> within transactions at all?

I talked to Tom and he feels it will be too hard to rollback a
subtransaction that affects cursors so we will disable use of cursors in
subtransactions, at least in the first implementation.

--  Bruce Momjian                        |  http://candle.pha.pa.us pgman@candle.pha.pa.us               |  (610)
359-1001+  If your life is a hard drive,     |  13 Roberts Road +  Christ can be your backup.        |  Newtown Square,
Pennsylvania19073