Thread: Deriving Recovery Snapshots

Deriving Recovery Snapshots

From
Simon Riggs
Date:
I've worked out what I think is a workable, efficient process for
deriving snapshots during recovery. I will be posting a patch to show
how this works tomorrow [Wed 15 Oct], just doing cleanup now.

Recovery Snapshots are snapshots taken during recovery. They are valid
snapshots in all ways for testing visibility. Assembling the information
to allow snapshots to be taken operates differently in recovery than it
does in normal processing.

The first thing to realise is that in recovery the only knowledge of
what occurs is through WAL records. If it isn't in the WAL, we don't
know about it. Most of the time that also means we can ignore events
that we know occurred, for example data reads.

In order to build the recovery snapshot data we need to re-create events
from WAL data. In some cases we have to add new WAL records to ensure
that all possible information is present.

Each backend that existed on the master is represented by a PROC
structure in the ProcArray. These are known as "recovery procs" and are
similar to the dummy procs used for prepared transactions. All recovery
procs are "owned by" the Startup process. So there is no process for
which MyProc points to one of the recovery procs. This structure allows
us to record the top-level transactions and then put subtransactions in
the proc's subtransaction cache. A fixed one-to-one correspondence
allows efficient maintenance of the structures. We emulate all
transactional backend, including autovac.

So in Hot Standby mode we have one set of Recovery Procs emulating what
happened on the master, and another set running read only work.

We maintain information according to certain events on the master.
1. xid assignment (top-level xids)
2. xid assignment (subtransactions)
3. xid commit
4. xid subcommit
5. xid abort/subabort
6. backends which have FATAL errors but write no abort record.

(3) and (5) are easy because we already have WAL records for them.
For (3) we already updated clog from the WAL record, so we just need to
identify the proc and then set the xid.

(4) is completely removed by re-arranging subcommit so it is covered by
commit. (Atomic subxid patch)

(6) is a problem since this event can make transactions disappear
completely from the WAL record. If they crash immediately after Xid
assignment then they may crash without ever writing a WAL record at all.
We handle this in two ways. 
* First, we keep a PROC for each backendid. Notice that means we keep a
PROC for each slot in the master's procarray, not for each pid. So if a
backend explodes and then someone reconnects using that same procarray
slot we will know that the previous transaction on that slot has
completed. This is a subtle but important point: without the ability to
infer certain transactions are complete we would need to keep track of a
potentially unlimited number of xids. Tying transactions to proc slots
means we will never have more than a fixed number of missing xids to
deal with. 
* The backend slot may not be reused for some time, so we should take
additional actions to keep state current and true. So we choose to log a
snapshot from the master into WAL after each checkpoint. This can then
be used to cleanup any unobserved xids. It also provides us with our
initial state data, see later.

(1), (2) xid assignment doesn't appear in WAL. Writing WAL records for
each xid assigned would have a catastrophic effect on performance,
especially if we realise that we would have to do that while holding
XidGenLock. So we have to do lots of pressups to avoid it, in the
following ways: 

We put a flag on the first WAL record written by a new transaction.
(Actually we mark the first WAL record containing the new xid, which
isn't always the first WAL record in the transaction. Weird, huh? Think
Oids, Storage files etc). We add an extra xid onto the WAL record to
hold the parent xid, and use that to maintain subtrans.

This works partially but not completely. It is possible for a
transaction to start a very large number of subtransactions before any
part of the transaction writes WAL. We only have space on the WAL record
for one additional xid. Each subxid must record its immediate parent's
xid in subtrans, so if we assign more than one *subtransaction* at a
time we *must* then write a WAL record for all the xids assigned apart
from the last one.

So that only affects transactions which use two or more subtransactions
in a transaction *and* who insist of starting subtransactions before
anything has been written, so not very common. So AssignTransactionId()
sometimes needs to write WAL records.

Another problem is that xids flagged on WAL records don't arrive in WAL
in the order they were assigned. So we must cope with out-of-order or
"unobserved xids". When we replay WAL, we keep track of UnobservedXids
in a shared memory array. These UnobservedXids are added onto any
recovery Snapshot taken iff they are earlier than latestCompletedXid. So
in the typical case, no xids will be added to the snapshots. For now, I
do all this work holding ProcArrayLock, but there seems scope to
optimise that also. Later.

UnobservedXids is maintained as a sorted array. This comes for free
since xids are always added in xid assignment order. This allows xids to
be removed via bsearch when WAL records arrive for the missing xids. It
also allows us to stop searching for xids once we reach
latestCompletedXid.

As a result of the AssignTransaction WAL records we know that each
backend will only ever allocate at most 2 xids before notifying WAL in
some way, either by flagging a WAL entry it makes or by making an entry
when assigning the new xids. As a result the UnobservedXids array will
never overflow if it has 2* MaxBackends entries. (I've added code, but
#ifdef'd it out).

Since UnobservedXids can never overflow, we also see that the Snapshot
can never overflow *because* of UnobservedXids. Each unobserved
top-level xid leaves space for 65 xids, yet we need only 2 to add the
unobserved xids.

I've had to change the way XidInMVCCSnapshot() works. We search the
snapshot even if it has overflowed. This is actually a performance win
in cases where only a few xids have overflowed but most haven't. This is
essential because if we were forced to check in subtrans *and*
unobservedxids existed then the snapshot would be invalid. (I could have
made it this way *just* in recovery, but the change seems better both
ways).

So that's how we maintain info required for Snapshots, but the next part
of the puzzle is how we handle the initial state. Again, subtransactions
are a pain because there can be an extremely large number of them. So
taking a snapshot and copying it to WAL is insufficient. We handle this
by taking a snapshot when we have performed pg_start_backup() (in the
checkpoint we already take) and then taking another snapshot after each
checkpoint. Doing it that way means wherever we restart from we always
have an initial state record close to hand. On the standby, if the first
snapshot we see has overflowed then we either wait for a snapshot to
arrive which has not overflowed. (We could also wait for a snapshot
whose xmin is later than the xmax of our first snapshot).

This means that there could be a delay in starting Hot Standby mode *if*
we are heavily using subtransactions at the time we take backup.

So overheads of the patch are:
* WAL record extended to completely fill 8-byte alignment; extra 4 bytes
per record on 4-byte alignment. Additional items are:uint16        xl_info2;TransactionId    xl_xid2;
This takes no additional space on 64-bit servers because of previous
wastage.

* AssignTransactionId must WAL log xid assignment when making multiple
assignments.

* bgwriter writes Snapshot data to WAL after each checkpoint.

* additional shared memory: 
2 * MaxBackends * sizeof(TransactionId) for UnobservedXids
1 * MaxBackends * sizeof(PGPROC) for RecoveryProcs

* additional processing time during recovery to maintain snapshot info

In current patch I've put the slotid and flag bits into uint16. That
means we can manage up to 4096 connections without writing any
additional WAL data. Beyond that we need to write a WAL record for each
AssignTransactionId(), plus add 4 bytes onto each Commit/Abort record.

Note that because of the atomic subxids changes we actually write fewer
WAL records in most cases than we did before and they occupy the same
space they did before.

I'll post patch tomorrow and at least weekly after this.

Patch footprint looks like this prior to cleanup.
backend/access/transam/varsup.c |   52 -!backend/access/transam/xact.c   |  559
++++++++++++++++++++++!!!!!backend/access/transam/xlog.c  |   49 +-backend/postmaster/bgwriter.c   |   11
backend/storage/ipc/procarray.c|  721 +++++++++++++++++++++++++++++!!!backend/storage/lmgr/proc.c     |  107
+++++backend/utils/time/tqual.c     |   27 !include/access/heapam.h         |    2 include/access/htup.h           |
2include/access/transam.h        |    2 include/access/xact.h           |   23 +include/access/xlog.h           |   44
+!include/access/xlog_internal.h |    2 include/catalog/pg_control.h    |    3 include/storage/proc.h          |    4
include/storage/procarray.h    |   14 include/utils/snapshot.h        |   65 +++17 files changed, 1432 insertions(+),
44deletions(-), 211 mods(!)
 

Your comments are welcome, especially questions and thoughts around the
correctness of the approach. Lots more comments in patch.

-- Simon Riggs           www.2ndQuadrant.comPostgreSQL Training, Services and Support



Re: Deriving Recovery Snapshots

From
Jeff Davis
Date:
On Tue, 2008-10-14 at 18:50 +0100, Simon Riggs wrote:
> I've worked out what I think is a workable, efficient process for
> deriving snapshots during recovery. I will be posting a patch to show
> how this works tomorrow [Wed 15 Oct], just doing cleanup now.

How will this interact with an idea like this?:
http://archives.postgresql.org/pgsql-hackers/2008-01/msg00400.php

> I've had to change the way XidInMVCCSnapshot() works. We search the
> snapshot even if it has overflowed. This is actually a performance win
> in cases where only a few xids have overflowed but most haven't. This is
> essential because if we were forced to check in subtrans *and*
> unobservedxids existed then the snapshot would be invalid. (I could have
> made it this way *just* in recovery, but the change seems better both
> ways).

I don't entirely understand this. Can you explain the situation that
would result in an invalid snapshot?

Regards,Jeff Davis



Re: Deriving Recovery Snapshots

From
Simon Riggs
Date:
On Wed, 2008-10-15 at 12:58 -0700, Jeff Davis wrote:
> On Tue, 2008-10-14 at 18:50 +0100, Simon Riggs wrote:
> > I've worked out what I think is a workable, efficient process for
> > deriving snapshots during recovery. I will be posting a patch to show
> > how this works tomorrow [Wed 15 Oct], just doing cleanup now.
> 
> How will this interact with an idea like this?:
> http://archives.postgresql.org/pgsql-hackers/2008-01/msg00400.php

pg_snapclone should work fine, since it is orthogonal to this work.

> > I've had to change the way XidInMVCCSnapshot() works. We search the
> > snapshot even if it has overflowed. This is actually a performance win
> > in cases where only a few xids have overflowed but most haven't. This is
> > essential because if we were forced to check in subtrans *and*
> > unobservedxids existed then the snapshot would be invalid. (I could have
> > made it this way *just* in recovery, but the change seems better both
> > ways).
> 
> I don't entirely understand this. Can you explain the situation that
> would result in an invalid snapshot?

In recovery the snapshot consists of two sets of xids:
* ones we have seen as running e.g. xid=43
* ones we know exist, but haven't seen yet (e.g. xid=42)
(I call this latter kind Unobserved Transactions).

Both kinds of xids *must* be in the snapshot for MVCC to work.

The current way of checking snapshots is to say "if *any* of the running
transactions has overflowed, check subtrans".

Unobserved transactions are not in subtrans, so if you checked for them
there you would fail to find them. Currently we assume that means it is
a top-level transaction and then check the top-level xids.

Why are unobserved transactions not in subtrans? Because they are
unobserved, so we can't assign their parent xid. (By definition, because
they are unobserved).

There isn't always enough space in the snapshot to allow all the
unobserved xids to be added as if they were top-level transactions, so
we put them into the subtrans cache as a secondary location and then
change the algorithm in XidInMVCCSnapshot(). We don't want to increase
the size of the snapshot because it already contains wasted space in
subtrans cache, nor do we wish to throw errors when people try to take
snapshots. 

The XidInMVCCSnapshot() changes make sense of themselves for most cases,
since we don't want one transaction to cause us to thrash subtrans, as
happened in 8.1.

This took me some time to think through...

-- Simon Riggs           www.2ndQuadrant.comPostgreSQL Training, Services and Support



Re: Deriving Recovery Snapshots

From
Simon Riggs
Date:
On Tue, 2008-10-14 at 18:50 +0100, Simon Riggs wrote:

> I've worked out what I think is a workable, efficient process for
> deriving snapshots during recovery. I will be posting a patch to show
> how this works tomorrow [Wed 15 Oct], just doing cleanup now.

OK, here's the latest patch. Found a bug late last night, fixed now.

This patch derives snapshot data during recovery. It contains all code
to write and read back WAL changes etc.. It is the "main patch" for Hot
Standby. Everything else hangs around this, enhances it or puts in
additional checks.

Having said that, this patch doesn't let you connect and run queries.
I've written this part as a standalone patch, for easier reviewing. So
it is still a "WIP" patch.

Patch can be tested by running a workload and then stop server, -m
immediate and then watching the replay log (or gdb). Successful tested
with some custom subtransaction scripts and lots of tests using make
installcheck-parallel, then crash recovery.

diffstat
 backend/access/transam/slru.c     |   16
 backend/access/transam/twophase.c |    2
 backend/access/transam/xact.c     |  664 +++++++++++++++++++!!!!!!!!
 backend/access/transam/xlog.c     |   58 +-
 backend/storage/ipc/procarray.c   |  781 +++++++++++++++++++++++++++!!!
 backend/storage/lmgr/proc.c       |  107 +++++
 backend/utils/time/tqual.c        |   27 !
 include/access/xact.h             |   26 +
 include/access/xlog.h             |   44 +!
 include/access/xlog_internal.h    |    2
 include/catalog/pg_control.h      |    3
 include/storage/proc.h            |    4
 include/storage/procarray.h       |   17
 include/utils/snapshot.h          |   65 +++
 14 files changed, 1541 insertions(+), 19 deletions(-), 256 mods(!)

Prepared transactions do not yet work correctly, but the principles are
the same so I expect this to be done in a similar way. Any input
welcome.

Other related patches are
* recovery_infrastruc.v9.patch
* atomic_subxids.v7.patch
They don't all apply cleanly together, but the changes are unrelated, so
those patches can still be reviewed without wasting energy.

Next phase is connecting and running queries, next few days. That will
probably shake out a few more bugs from this code.

Comments welcome.

--
 Simon Riggs           www.2ndQuadrant.com
 PostgreSQL Training, Services and Support

Attachment

Re: Deriving Recovery Snapshots

From
Simon Riggs
Date:
On Thu, 2008-10-16 at 13:55 +0100, Simon Riggs wrote:

> Other related patches are
> * recovery_infrastruc.v9.patch
> * atomic_subxids.v7.patch
> They don't all apply cleanly together, but the changes are unrelated, so
> those patches can still be reviewed without wasting energy.
> 
> Next phase is connecting and running queries, next few days. That will
> probably shake out a few more bugs from this code.

I've integrated my five patches together into one now:
* recovery_infrastruc.v9.patch
* atomic_subxids.v7.patch
* hs_connect
* hs_checks
* hs_snapshot

Seems positive that it all integrated so quickly and tests OK.
More later.

-- Simon Riggs           www.2ndQuadrant.comPostgreSQL Training, Services and Support



Re: Deriving Recovery Snapshots

From
Simon Riggs
Date:
On Thu, 2008-10-16 at 15:20 +0100, Simon Riggs wrote:

> I've integrated my five patches together into one now:
> * recovery_infrastruc.v9.patch
> * atomic_subxids.v7.patch
> * hs_connect
> * hs_checks
> * hs_snapshot
> 
> Seems positive that it all integrated so quickly and tests OK.
> More later.

Wahoo! Worked first time.

postgres=# select count(*) from branches;count 
-------    1
(1 row)

That's with archive recovery waiting for next WAL file, so not applying
new changes or holding locks.

So that's the good news. Still have much more work to make everything
work safely while WAL apply happens. And the shooting hasn't started yet
either. So there's light, but nowhere near the end of the tunnel.

So looks like I'll be shipping first combined patch tomorrow.

-- Simon Riggs           www.2ndQuadrant.comPostgreSQL Training, Services and Support



Re: Deriving Recovery Snapshots

From
Heikki Linnakangas
Date:
Simon Riggs wrote:
> Each backend that existed on the master is represented by a PROC
> structure in the ProcArray. These are known as "recovery procs" and are
> similar to the dummy procs used for prepared transactions. All recovery
> procs are "owned by" the Startup process. So there is no process for
> which MyProc points to one of the recovery procs. This structure allows
> us to record the top-level transactions and then put subtransactions in
> the proc's subtransaction cache. A fixed one-to-one correspondence
> allows efficient maintenance of the structures. We emulate all
> transactional backend, including autovac.

We'll need to know the max_connections setting in the master, in order 
to size the array correctly. Not a show-stopper, but would be nicer if 
we didn't need to.

> * The backend slot may not be reused for some time, so we should take
> additional actions to keep state current and true. So we choose to log a
> snapshot from the master into WAL after each checkpoint. This can then
> be used to cleanup any unobserved xids. It also provides us with our
> initial state data, see later.

We don't need to log a complete snapshot, do we? Just oldestxmin should 
be enough.

> UnobservedXids is maintained as a sorted array. This comes for free
> since xids are always added in xid assignment order. This allows xids to
> be removed via bsearch when WAL records arrive for the missing xids. It
> also allows us to stop searching for xids once we reach
> latestCompletedXid.

If we're going to have an UnobservedXids array, why don't we just treat 
all in-progress transactions as Unobserved, and forget about the dummy 
PROC entries?

Also, I can't help thinking that this would be a lot simpler if we just 
treated all subtransactions the same as top-level transactions. The only 
problem with that is that there can be a lot of subtransactions, which 
means that we'd need a large UnobservedXids array to handle the worst 
case, but maybe it would still be acceptable?

--   Heikki Linnakangas  EnterpriseDB   http://www.enterprisedb.com


Re: Deriving Recovery Snapshots

From
Simon Riggs
Date:
On Thu, 2008-10-16 at 18:52 +0300, Heikki Linnakangas wrote:
> Simon Riggs wrote:
> > Each backend that existed on the master is represented by a PROC
> > structure in the ProcArray. These are known as "recovery procs" and are
> > similar to the dummy procs used for prepared transactions. All recovery
> > procs are "owned by" the Startup process. So there is no process for
> > which MyProc points to one of the recovery procs. This structure allows
> > us to record the top-level transactions and then put subtransactions in
> > the proc's subtransaction cache. A fixed one-to-one correspondence
> > allows efficient maintenance of the structures. We emulate all
> > transactional backend, including autovac.
> 
> We'll need to know the max_connections setting in the master, in order 
> to size the array correctly. Not a show-stopper, but would be nicer if 
> we didn't need to.

Yes. We'll probably need to add checks/configurability later. Unless you
have a way...

> > * The backend slot may not be reused for some time, so we should take
> > additional actions to keep state current and true. So we choose to log a
> > snapshot from the master into WAL after each checkpoint. This can then
> > be used to cleanup any unobserved xids. It also provides us with our
> > initial state data, see later.
> 
> We don't need to log a complete snapshot, do we? Just oldestxmin should 
> be enough.

Possibly, but you're thinking that once we're up and running we can use
less info.

Trouble is, you don't know when/if the standby will crash/be shutdown.
So we need regular full snapshots to allow it to re-establish full
information at regular points. So we may as well drop the whole snapshot
to WAL every checkpoint. To do otherwise would mean more code and less
flexibility.

With default settings that is at most 25600 bytes for subxid cache, plus
a maybe 2000 bytes for other info. For most cases, we will use less than
1 wal buffer.

> > UnobservedXids is maintained as a sorted array. This comes for free
> > since xids are always added in xid assignment order. This allows xids to
> > be removed via bsearch when WAL records arrive for the missing xids. It
> > also allows us to stop searching for xids once we reach
> > latestCompletedXid.
> 
> If we're going to have an UnobservedXids array, why don't we just treat 
> all in-progress transactions as Unobserved, and forget about the dummy 
> PROC entries?

That's a good question and I expected some debate on that.

The main problem is fatal errors that don't write abort records. By
reusing the PROC entries we can keep those to a manageable limit. If we
don't have that, the number of fatal errors could cause that list to
grow uncontrollably and we might overflow any setting, causing snapshots
to stall and new queries to hang. We really must have a way to place an
upper bound on the number of unobserved xacts. So we really need the
proc approach. But we also need the UnobservedXids array.

It's definitely more code to have both, so I would not have chosen that
route if there was another way. The simple approach just doesn't cover
all possible cases, and we need to cover them all.

Having only an UnobservedXid array was my first thought and I said
earlier I would do it without using procs. Bad idea. Using the
UnobservedXids array means every xact removal requires a bsearch,
whereas with procs we can do a direct lookup, removing all xids in one
stroke. Much better for typical cases. Also, if we have procs we can use
the "no locks" approach in some cases, as per current practice on new
xid insertions.

> Also, I can't help thinking that this would be a lot simpler if we just 
> treated all subtransactions the same as top-level transactions. The only 
> problem with that is that there can be a lot of subtransactions, which 
> means that we'd need a large UnobservedXids array to handle the worst 
> case, but maybe it would still be acceptable?

Yes, you see the problem. Without subtransactions, this would be a
simple issue to solve.

In one sense, I do as you say. When we make a snapshot we stuff the
UnobservedXids into the snapshot *somewhere*. We don't know whether they
are top level or subxacts. But we need a solution for when we run out of
top-level xid places in the snapshot. Which has now been provided,
luckily.

If we have no upper bound on snapshot size then *all* backends would
need a variable size snapshot. We must solve that problem or accept
having people wait maybe minutes for a snapshot in worst case. I've
found one way of placing a bound on the number of xids we need to keep
in the snapshot. If there is another, better way of keeping it bounded I
will happily adopt it. I spent about 2 weeks sweating this issue...

I'm available tomorrow to talk in real time if there's people in the Dev
room at PGday want to discuss this, or have me explain the patch(es). 

-- Simon Riggs           www.2ndQuadrant.comPostgreSQL Training, Services and Support



Re: Deriving Recovery Snapshots

From
Heikki Linnakangas
Date:
Simon Riggs wrote:
> On Thu, 2008-10-16 at 18:52 +0300, Heikki Linnakangas wrote:
>> Simon Riggs wrote:
>>> * The backend slot may not be reused for some time, so we should take
>>> additional actions to keep state current and true. So we choose to log a
>>> snapshot from the master into WAL after each checkpoint. This can then
>>> be used to cleanup any unobserved xids. It also provides us with our
>>> initial state data, see later.
>> We don't need to log a complete snapshot, do we? Just oldestxmin should 
>> be enough.
> 
> Possibly, but you're thinking that once we're up and running we can use
> less info.
> 
> Trouble is, you don't know when/if the standby will crash/be shutdown.
> So we need regular full snapshots to allow it to re-establish full
> information at regular points. So we may as well drop the whole snapshot
> to WAL every checkpoint. To do otherwise would mean more code and less
> flexibility.

Surely it's less code to write the OldestXmin to the checkpoint record, 
rather than a full snapshot, no? And to read it off the checkpoint record.

>>> UnobservedXids is maintained as a sorted array. This comes for free
>>> since xids are always added in xid assignment order. This allows xids to
>>> be removed via bsearch when WAL records arrive for the missing xids. It
>>> also allows us to stop searching for xids once we reach
>>> latestCompletedXid.
>> If we're going to have an UnobservedXids array, why don't we just treat 
>> all in-progress transactions as Unobserved, and forget about the dummy 
>> PROC entries?
> 
> That's a good question and I expected some debate on that.
> 
> The main problem is fatal errors that don't write abort records. By
> reusing the PROC entries we can keep those to a manageable limit. If we
> don't have that, the number of fatal errors could cause that list to
> grow uncontrollably and we might overflow any setting, causing snapshots
> to stall and new queries to hang. We really must have a way to place an
> upper bound on the number of unobserved xacts. So we really need the
> proc approach. But we also need the UnobservedXids array.

If you write the oldestxmin (or a full snapshot, including the 
oldestxmin) to each checkpoint record, you can crop out any unobserved 
xids older than that, when you replay the checkpoint record.

> Having only an UnobservedXid array was my first thought and I said
> earlier I would do it without using procs. Bad idea. Using the
> UnobservedXids array means every xact removal requires a bsearch,
> whereas with procs we can do a direct lookup, removing all xids in one
> stroke. Much better for typical cases.

How much does that really matter? Under normal circumstances, the array 
would be quite small anyway. A bsearch of a relatively small array isn't 
that expensive. Or a hash table, so that removing/inserting items 
doesn't need to shift all the following entries.

>> Also, I can't help thinking that this would be a lot simpler if we just 
>> treated all subtransactions the same as top-level transactions. The only 
>> problem with that is that there can be a lot of subtransactions, which 
>> means that we'd need a large UnobservedXids array to handle the worst 
>> case, but maybe it would still be acceptable?
> 
> Yes, you see the problem. Without subtransactions, this would be a
> simple issue to solve.
> 
> In one sense, I do as you say. When we make a snapshot we stuff the
> UnobservedXids into the snapshot *somewhere*. We don't know whether they
> are top level or subxacts. But we need a solution for when we run out of
> top-level xid places in the snapshot. Which has now been provided,
> luckily.
> 
> If we have no upper bound on snapshot size then *all* backends would
> need a variable size snapshot. We must solve that problem or accept
> having people wait maybe minutes for a snapshot in worst case. I've
> found one way of placing a bound on the number of xids we need to keep
> in the snapshot. If there is another, better way of keeping it bounded I
> will happily adopt it. I spent about 2 weeks sweating this issue...

How about:

1. Keep all transactions and subtransactions in UnobservedXids.
2. If it fills up, remove all subtransactions from it, that the startup 
process knows to be subtransactions and knows the parents, and update 
subtrans. Mark the array as overflowed.

To take a snapshot, a backend simply copies UnobservedXids array and the 
flag. If it hasn't overflowed, a transaction is considered to be in 
progress if it's in the array. If it has overflowed, and the xid is not 
in the array, check subtrans

Note that the startup process sees all WAL records, so it can do 
arbitrarily complex bookkeeping in backend-private memory, and only 
expose the necessary parts in shared mem. For example, it can keep track 
of the parent-child relationships of the xids in UnobservedXids, but the 
backends taking snapshots don't need to know about that. For step 2 to 
work, that's exactly what the startup process needs to keep track of.

For the startup process to know about the parent-child relationships, 
we'll need something like WAL changes you suggested. I'm not too 
thrilled about adding a new field to all WAL records. Seems simpler to 
just rely on the new WAL records on AssignTransactionId(), and we can 
only do it, say, every 100 subtransactions, if we make the 
UnobservedXids array big enough (100*max_connections).

This isn't actually that different from your proposal. The big 
difference is that instead of PROC entries and UnobservedXids, all 
transactions are tracked in UnobservedXids, and instead of caching 
subtransactions in the subxids array in PROC entries, they're cached in 
UnobservedXids as well.


Aanother, completely different approach, would be to forget about xid 
arrays altogether, and change the way snapshots are taken: just do a 
full memcpy of the clog between xmin and xmax. That might be pretty slow 
if xmax-xmin is big, though.

--   Heikki Linnakangas  EnterpriseDB   http://www.enterprisedb.com


Re: Deriving Recovery Snapshots

From
Hannu Krosing
Date:
On Thu, 2008-10-16 at 18:52 +0300, Heikki Linnakangas wrote:
>
> Also, I can't help thinking that this would be a lot simpler if we just 
> treated all subtransactions the same as top-level transactions. The only 
> problem with that is that there can be a lot of subtransactions, which 
> means that we'd need a large UnobservedXids array to handle the worst 
> case, but maybe it would still be acceptable?

I remember cases on this list where long transactions did run out of
subtransaction ids. To accommodate something approacing that we need an
array for storing (max_connections * 4G ) UnobservedXids instead of just
max_connections.

-----------------
Hannu



Re: Deriving Recovery Snapshots

From
Simon Riggs
Date:
On Wed, 2008-10-22 at 12:29 +0300, Heikki Linnakangas wrote:
> Simon Riggs wrote:
> > On Thu, 2008-10-16 at 18:52 +0300, Heikki Linnakangas wrote:
> >> Simon Riggs wrote:
> >>> * The backend slot may not be reused for some time, so we should take
> >>> additional actions to keep state current and true. So we choose to log a
> >>> snapshot from the master into WAL after each checkpoint. This can then
> >>> be used to cleanup any unobserved xids. It also provides us with our
> >>> initial state data, see later.
> >> We don't need to log a complete snapshot, do we? Just oldestxmin should 
> >> be enough.
> > 
> > Possibly, but you're thinking that once we're up and running we can use
> > less info.
> > 
> > Trouble is, you don't know when/if the standby will crash/be shutdown.
> > So we need regular full snapshots to allow it to re-establish full
> > information at regular points. So we may as well drop the whole snapshot
> > to WAL every checkpoint. To do otherwise would mean more code and less
> > flexibility.
> 
> Surely it's less code to write the OldestXmin to the checkpoint record, 
> rather than a full snapshot, no? And to read it off the checkpoint record.

You may be missing my point.

We need an initial state to work from.

I am proposing we write a full snapshot after each checkpoint because it
allows us to start recovery again from that point. If we wrote only
OldestXmin as you suggest it would optimise the size of the WAL record
but it would prevent us from restarting at that point.

Also, passing OldestXmin only would not work in the presence of long
running statements. Passing the snapshot allows us to see that FATAL
errors have occurred much sooner.

BTW, the way I have coded it means that if we skip writing a checkpoint
on a quiet system then we would also skip writing the snapshot.

-- Simon Riggs           www.2ndQuadrant.comPostgreSQL Training, Services and Support



Re: Deriving Recovery Snapshots

From
Simon Riggs
Date:
On Wed, 2008-10-22 at 12:29 +0300, Heikki Linnakangas wrote:

> How about:
> 
> 1. Keep all transactions and subtransactions in UnobservedXids.
> 2. If it fills up, remove all subtransactions from it, that the startup 
> process knows to be subtransactions and knows the parents, and update 
> subtrans. Mark the array as overflowed.
> 
> To take a snapshot, a backend simply copies UnobservedXids array and the 
> flag. If it hasn't overflowed, a transaction is considered to be in 
> progress if it's in the array. If it has overflowed, and the xid is not 
> in the array, check subtrans

We can't check subtrans. We do not have any record of what the parent is
for an unobserved transaction id. So the complete list of unobserved
xids *must* be added to the snapshot. If that makes snapshot overflow,
we have a big problem: we would be forced to say "sorry snapshot cannot
be issued at this time, please wait". Ugh!

> Note that the startup process sees all WAL records, so it can do 
> arbitrarily complex bookkeeping in backend-private memory, and only 
> expose the necessary parts in shared mem. For example, it can keep track 
> of the parent-child relationships of the xids in UnobservedXids, but the 
> backends taking snapshots don't need to know about that. For step 2 to 
> work, that's exactly what the startup process needs to keep track of.

> For the startup process to know about the parent-child relationships, 
> we'll need something like WAL changes you suggested. I'm not too 
> thrilled about adding a new field to all WAL records. Seems simpler to 
> just rely on the new WAL records on AssignTransactionId(), and we can 
> only do it, say, every 100 subtransactions, if we make the 
> UnobservedXids array big enough (100*max_connections).

Yes, we can make the UnobservedXids array bigger, but only to the point
where it will all fit within a snapshot.

The WAL changes proposed use space that was previously wasted, so there
is no increase in amount of data going to disk. The additional time to
derive that data is very quick when those fields are unused and that
logic is executed before we take WALInsertLock. So overall, very low
overhead.

Every new subxid needs to specify its parent's xid. We must supply that
information somehow: either via an XLOG_XACT_ASSIGNMENT, or as I have
done in most cases, tuck that into the wasted space on the xlrec.
Writing a WAL record every 100 subtransactions will not work: we need to
write to subtrans *before* that xid appears anywhere on disk, so that
visibility tests can determine the status of the transaction.

The approach I have come up with is very finely balanced. It's the
*only* approach that I've come up with that covers all requirements;
there were very few technical choices to make. If it wasn't for
subtransactions, disappearing transactions because of FATAL errors and
unobserved xids it would be much simpler. But having said that, the code
isn't excessively complex, I wrote it in about 3 days.

> This isn't actually that different from your proposal. The big 
> difference is that instead of PROC entries and UnobservedXids, all 
> transactions are tracked in UnobservedXids, and instead of caching 
> subtransactions in the subxids array in PROC entries, they're cached in 
> UnobservedXids as well.

> Aanother, completely different approach, would be to forget about xid 
> arrays altogether, and change the way snapshots are taken: just do a 
> full memcpy of the clog between xmin and xmax. That might be pretty slow 
> if xmax-xmin is big, though.

-- Simon Riggs           www.2ndQuadrant.comPostgreSQL Training, Services and Support



Re: Deriving Recovery Snapshots

From
Heikki Linnakangas
Date:
Simon Riggs wrote:
> On Wed, 2008-10-22 at 12:29 +0300, Heikki Linnakangas wrote:
> 
>> How about:
>>
>> 1. Keep all transactions and subtransactions in UnobservedXids.
>> 2. If it fills up, remove all subtransactions from it, that the startup 
>> process knows to be subtransactions and knows the parents, and update 
>> subtrans. Mark the array as overflowed.
>>
>> To take a snapshot, a backend simply copies UnobservedXids array and the 
>> flag. If it hasn't overflowed, a transaction is considered to be in 
>> progress if it's in the array. If it has overflowed, and the xid is not 
>> in the array, check subtrans
> 
> We can't check subtrans. We do not have any record of what the parent is
> for an unobserved transaction id. So the complete list of unobserved
> xids *must* be added to the snapshot. If that makes snapshot overflow,
> we have a big problem: we would be forced to say "sorry snapshot cannot
> be issued at this time, please wait". Ugh!

That's why we still need the occasional WAL logging in 
AssignTransactionId(). To log the parent-child relationships of the 
subtransactions.

>> For the startup process to know about the parent-child relationships, 
>> we'll need something like WAL changes you suggested. I'm not too 
>> thrilled about adding a new field to all WAL records. Seems simpler to 
>> just rely on the new WAL records on AssignTransactionId(), and we can 
>> only do it, say, every 100 subtransactions, if we make the 
>> UnobservedXids array big enough (100*max_connections).
> 
> Yes, we can make the UnobservedXids array bigger, but only to the point
> where it will all fit within a snapshot.

The list of xids in a snapshot is just a palloc'd array, in 
backend-local memory, so we can easily make it as large as we need to.

> Every new subxid needs to specify its parent's xid. We must supply that
> information somehow: either via an XLOG_XACT_ASSIGNMENT, or as I have
> done in most cases, tuck that into the wasted space on the xlrec.
> Writing a WAL record every 100 subtransactions will not work: we need to
> write to subtrans *before* that xid appears anywhere on disk, so that
> visibility tests can determine the status of the transaction.

I don't follow. It doesn't need to be in subtrans before it appears on 
disk, AFAICS. It can be stored in UnobservedXids at first, and when it 
overflows, we can update subtrans and remove the entries from 
UnobservedXids. A snapshot taken before the overflow will have the 
subxid in its copy of UnobservedXids, and one taken after overflow will 
find it in subtrans.

If UnobservedXids is large enough to hold, say 100 * max_connections 
xids, by writing a WAL record containing the parent-child relationships 
every 100 assigned subtransactions within a top-level transaction, the 
top-level transactions and those subtransactions that we don't know the 
parent of will always fit into UnobservedXids.

--   Heikki Linnakangas  EnterpriseDB   http://www.enterprisedb.com


Re: Deriving Recovery Snapshots

From
Heikki Linnakangas
Date:
Simon Riggs wrote:
> On Wed, 2008-10-22 at 12:29 +0300, Heikki Linnakangas wrote:
>> Simon Riggs wrote:
>>> On Thu, 2008-10-16 at 18:52 +0300, Heikki Linnakangas wrote:
>>>> Simon Riggs wrote:
>>>>> * The backend slot may not be reused for some time, so we should take
>>>>> additional actions to keep state current and true. So we choose to log a
>>>>> snapshot from the master into WAL after each checkpoint. This can then
>>>>> be used to cleanup any unobserved xids. It also provides us with our
>>>>> initial state data, see later.
>>>> We don't need to log a complete snapshot, do we? Just oldestxmin should 
>>>> be enough.
>>> Possibly, but you're thinking that once we're up and running we can use
>>> less info.
>>>
>>> Trouble is, you don't know when/if the standby will crash/be shutdown.
>>> So we need regular full snapshots to allow it to re-establish full
>>> information at regular points. So we may as well drop the whole snapshot
>>> to WAL every checkpoint. To do otherwise would mean more code and less
>>> flexibility.
>> Surely it's less code to write the OldestXmin to the checkpoint record, 
>> rather than a full snapshot, no? And to read it off the checkpoint record.
> 
> You may be missing my point.
> 
> We need an initial state to work from.
> 
> I am proposing we write a full snapshot after each checkpoint because it
> allows us to start recovery again from that point. If we wrote only
> OldestXmin as you suggest it would optimise the size of the WAL record
> but it would prevent us from restarting at that point.

Well, you'd just need to treat anything > oldestxmin, and not marked as 
finished in clog, as unobserved. Which doesn't seem too bad. Not that 
storing the full list of in-progress xids is that bad either, though.

Hmm. What about in-progress subtransactions that have overflowed the 
shared mem cache? Can we rely that subtrans is up-to-date, up to the 
checkpoint record that recovery starts from?

--   Heikki Linnakangas  EnterpriseDB   http://www.enterprisedb.com


Re: Deriving Recovery Snapshots

From
Simon Riggs
Date:
On Wed, 2008-10-22 at 17:18 +0300, Heikki Linnakangas wrote:

> It doesn't need to be in subtrans before it appears on 
> disk, AFAICS.

I don't make these rules. Read comments in AssignTransactionId().

-- Simon Riggs           www.2ndQuadrant.comPostgreSQL Training, Services and Support



Re: Deriving Recovery Snapshots

From
Heikki Linnakangas
Date:
Hannu Krosing wrote:
> On Thu, 2008-10-16 at 18:52 +0300, Heikki Linnakangas wrote:
>> Also, I can't help thinking that this would be a lot simpler if we just 
>> treated all subtransactions the same as top-level transactions. The only 
>> problem with that is that there can be a lot of subtransactions, which 
>> means that we'd need a large UnobservedXids array to handle the worst 
>> case, but maybe it would still be acceptable?
> 
> I remember cases on this list where long transactions did run out of
> subtransaction ids. To accommodate something approacing that we need an
> array for storing (max_connections * 4G ) UnobservedXids instead of just
> max_connections.

You can't have more than 4G (or 2G?) active subtransactions running in a 
system, because you will simply run out of transaction ids and hit xid 
wrap-around after that. So in the worst-case, you don't need space for 
(max_connections * 4G) xids, just 4G. That's still a large number, of 
course.

In situations like that, a bitmap, like clog, instead of an array, would 
be more space efficient. But that's less efficient in the more common 
case that there's few in-progress transactions, but some of them are 
very old.

--   Heikki Linnakangas  EnterpriseDB   http://www.enterprisedb.com


Re: Deriving Recovery Snapshots

From
Simon Riggs
Date:
On Wed, 2008-10-22 at 17:18 +0300, Heikki Linnakangas wrote:
> The list of xids in a snapshot is just a palloc'd array, in 
> backend-local memory, so we can easily make it as large as we need to.

It's malloc'd before we hold the lock. Yes, we can make it as large as
we need to, but that would change the way it is allocated significantly
and that will require either extended lock hold durations or multiple
lock acquisitions. That doesn't sounds good.

-- Simon Riggs           www.2ndQuadrant.comPostgreSQL Training, Services and Support



Re: Deriving Recovery Snapshots

From
Heikki Linnakangas
Date:
Simon Riggs wrote:
> If that makes snapshot overflow,
> we have a big problem: we would be forced to say "sorry snapshot cannot
> be issued at this time, please wait". Ugh!

BTW, we'll need to do that anyway, if we guess the max_connections 
setting in the master incorrectly. We should avoid having to do that 
under other circumstances, of course.

--   Heikki Linnakangas  EnterpriseDB   http://www.enterprisedb.com


Re: Deriving Recovery Snapshots

From
Simon Riggs
Date:
On Wed, 2008-10-22 at 17:29 +0300, Heikki Linnakangas wrote:
> Simon Riggs wrote:

> > You may be missing my point.
> > 
> > We need an initial state to work from.
> > 
> > I am proposing we write a full snapshot after each checkpoint because it
> > allows us to start recovery again from that point. If we wrote only
> > OldestXmin as you suggest it would optimise the size of the WAL record
> > but it would prevent us from restarting at that point.
> 
> Well, you'd just need to treat anything > oldestxmin, and not marked as 
> finished in clog, as unobserved. Which doesn't seem too bad. Not that 
> storing the full list of in-progress xids is that bad either, though.

As I said, I'm storing the whole list to give me an initial state.

> Hmm. What about in-progress subtransactions that have overflowed the 
> shared mem cache? Can we rely that subtrans is up-to-date, up to the 
> checkpoint record that recovery starts from?

Yes, we can't use snapshots as an initial state if they have overflowed.
In that case we could avoid dropping full snapshot to WAL, yes. That's
probably trivial to implement.

Currently, the patch can start from a non-overflowed snapshot, but it
could also start from an overflowed snapshot and wait until we receive a
snapshot with an xmin > the xmax of the first snapshot. That seemed too
much icing for the first patch. I think we need to all agree on the
correctness aspects before we attempt too much tuning and tweaking.

-- Simon Riggs           www.2ndQuadrant.comPostgreSQL Training, Services and Support



Re: Deriving Recovery Snapshots

From
Simon Riggs
Date:
On Wed, 2008-10-22 at 17:40 +0300, Heikki Linnakangas wrote:
> Simon Riggs wrote:
> > If that makes snapshot overflow,
> > we have a big problem: we would be forced to say "sorry snapshot cannot
> > be issued at this time, please wait". Ugh!
> 
> BTW, we'll need to do that anyway, if we guess the max_connections 
> setting in the master incorrectly. We should avoid having to do that 
> under other circumstances, of course.

I'm sure it's easier to handle that in easier ways. I imagined we would
just issue a message saying "must increase max_connections" and then
shutdown. It's easy enough to document that if you change the master you
must also change the slave - it's not a very common change after all.

-- Simon Riggs           www.2ndQuadrant.comPostgreSQL Training, Services and Support



Re: Deriving Recovery Snapshots

From
Heikki Linnakangas
Date:
Simon Riggs wrote:
> On Wed, 2008-10-22 at 17:18 +0300, Heikki Linnakangas wrote:
> 
>> It doesn't need to be in subtrans before it appears on 
>> disk, AFAICS.
> 
> I don't make these rules. Read comments in AssignTransactionId().

You mean this one:

>     /*
>      * Generate a new Xid and record it in PG_PROC and pg_subtrans.
>      *
>      * NB: we must make the subtrans entry BEFORE the Xid appears anywhere in
>      * shared storage other than PG_PROC; because if there's no room for it in
>      * PG_PROC, the subtrans entry is needed to ensure that other backends see
>      * the Xid as "running".  See GetNewTransactionId.
>      */

? The way I read that is that as long as there's an entry in 
UnobservedXids, we don't need a subtrans entry.

--   Heikki Linnakangas  EnterpriseDB   http://www.enterprisedb.com


Re: Deriving Recovery Snapshots

From
Simon Riggs
Date:
On Wed, 2008-10-22 at 20:42 +0300, Heikki Linnakangas wrote:
> Simon Riggs wrote:
> > On Wed, 2008-10-22 at 17:18 +0300, Heikki Linnakangas wrote:
> > 
> >> It doesn't need to be in subtrans before it appears on 
> >> disk, AFAICS.
> > 
> > I don't make these rules. Read comments in AssignTransactionId().
> 
> You mean this one:
> 
> >     /*
> >      * Generate a new Xid and record it in PG_PROC and pg_subtrans.
> >      *
> >      * NB: we must make the subtrans entry BEFORE the Xid appears anywhere in
> >      * shared storage other than PG_PROC; because if there's no room for it in
> >      * PG_PROC, the subtrans entry is needed to ensure that other backends see
> >      * the Xid as "running".  See GetNewTransactionId.
> >      */
> 
> ? The way I read that is that as long as there's an entry in 
> UnobservedXids, we don't need a subtrans entry.

If you don't mark subtrans, you will need to keep things in shared
memory. I agree you can defer marking subtrans until your shared memory
fills, but you will still need to keep track of the parent for each
subtransaction until you do. If you don't *know* the parent, how will
you do this? You haven't explained how you will write the relationship
between a subtransaction and its parent into WAL, for each
subtransaction and for an arbitrarily large number of subtransactions.

That suggests a nice optimisation we can use in normal running and with
the Hot Standby patch, but it doesn't provide an alternate approach. 

We seem to be scrabbling around looking for a second solution. Even if
we find one you'll need to show that its substantially better than the
one that's already on the table. But right now it doesn't seem like
we're on the verge of a radically better approach, for this part.

(The optimisation is for AssignTransactionId() to not update subtrans
for the first PGPROC_MAX_CACHED_SUBXIDS subtransactions, either in
AssignTransactionId() or in xact_redo_commit(). That would mean in most
normal cases we would never update subtrans at all. It requires a slight
change to the way XidInMVCCSnapshot() works, but I've already included
that change in the current patch. I'll do that as a standalone patch.)

-- Simon Riggs           www.2ndQuadrant.comPostgreSQL Training, Services and Support



Re: Deriving Recovery Snapshots

From
Heikki Linnakangas
Date:
Simon Riggs wrote:
> If you don't mark subtrans, you will need to keep things in shared
> memory. I agree you can defer marking subtrans until your shared memory
> fills, but you will still need to keep track of the parent for each
> subtransaction until you do. If you don't *know* the parent, how will
> you do this? You haven't explained how you will write the relationship
> between a subtransaction and its parent into WAL, for each
> subtransaction and for an arbitrarily large number of subtransactions.

I was thinking that it's done in AssignTransactionId(). Similarly to 
what you're doing in your patch, but instead of just one xid with it's 
parent, a list of (xid, parent) pairs would be stored in the WAL record.

Upon replaying that WAL record, those parent-child relationships are 
recorded in subtrans, and the child xids are removed from UnobservedXids.

> We seem to be scrabbling around looking for a second solution. Even if
> we find one you'll need to show that its substantially better than the
> one that's already on the table. But right now it doesn't seem like
> we're on the verge of a radically better approach, for this part.

I'm trying to tease apart the important aspects of the design, and 
trying to reduce it into the simplest thing that can work. For example, 
since it seems like we need UnobservedXids array anyway, can we use just 
that, and not need the PROC entries? And if we need to sometimes WAL log 
in AssignTransactionId(), could we use just that one mechanism for 
transferring parent-child relationships, and not need the extra field in 
all WAL records? The less mechanisms there is to do stuff, the simpler 
the patch, the less codepaths to test, and so forth.

--   Heikki Linnakangas  EnterpriseDB   http://www.enterprisedb.com


Re: Deriving Recovery Snapshots

From
Simon Riggs
Date:
On Wed, 2008-10-22 at 22:38 +0300, Heikki Linnakangas wrote:
> Simon Riggs wrote:
> > If you don't mark subtrans, you will need to keep things in shared
> > memory. I agree you can defer marking subtrans until your shared memory
> > fills, but you will still need to keep track of the parent for each
> > subtransaction until you do. If you don't *know* the parent, how will
> > you do this? You haven't explained how you will write the relationship
> > between a subtransaction and its parent into WAL, for each
> > subtransaction and for an arbitrarily large number of subtransactions.
> 
> I was thinking that it's done in AssignTransactionId(). Similarly to 
> what you're doing in your patch, but instead of just one xid with it's 
> parent, a list of (xid, parent) pairs would be stored in the WAL record.

> Upon replaying that WAL record, those parent-child relationships are 
> recorded in subtrans, and the child xids are removed from UnobservedXids.

But once you reach 64 transactions, you'll need to write an extra WAL
record for every subtransaction, which currently I've managed to avoid.

I understand what you hope to achieve, but right now my focus is on
achieving correctness within next few days, not on reworking parts of
the patch that seem to work correctly and are reasonably well optimised.
We can discuss rework in more detail in a couple of weeks. It's been a
worthwhile discussion, but lets look somewhere else now. I could really
use your help in thinking about prepared transactions.

Forgive me please, I don't wish to be rude, though I probably am.

-- Simon Riggs           www.2ndQuadrant.comPostgreSQL Training, Services and Support



Re: Deriving Recovery Snapshots

From
Simon Riggs
Date:
On Wed, 2008-10-22 at 21:47 +0100, Simon Riggs wrote:

> But once you reach 64 transactions, you'll need to write an extra WAL
> record for every subtransaction, which currently I've managed to avoid.

Some further notes/tests on the optimisation you discovered.

Because of the way I have changed tqual.c, we only need to write to
subtrans if a proc's subxid cache overflows. (I think we would need to
change tqual.c in a similar way in any of the ways so far discussed).

Anyway, quick test with a representative test case shows 3-5%
performance gain from skipping the subtrans updates. I only the test
case files here and the test patch. The test is a simple PL/pgSQL
function with one EXCEPTION clause, so fairly real world.

The patch isn't ready to apply standalone because we need to include the
changes to XidInMVCCSnapshot() also, which would take a little while to
extract. Let me know if that is worth producing a standalone patch for.

--
 Simon Riggs           www.2ndQuadrant.com
 PostgreSQL Training, Services and Support

Attachment

Re: Deriving Recovery Snapshots

From
Simon Riggs
Date:
On Wed, 2008-10-22 at 21:47 +0100, Simon Riggs wrote:

> But once you reach 64 transactions, you'll need to write an extra WAL
> record for every subtransaction, which currently I've managed to avoid.

Yes, I've managed to avoid it, but it will simplify the patch if you
think its not worth bothering with. This won't really effect anybody
I've met running straight Postgres, but it may effect EDB. It's not a
problem for me, but I was second guessing objections.

If I do that then I can just pass the slotId in full on every WAL
record, which simplifies a couple of other things also.

So, does everybody accept that we will write a WAL record for every
subtransaction assigned, once we hit the size limit of the subxid cache?
i.e. currently 65th subxid  and beyond.

-- Simon Riggs           www.2ndQuadrant.comPostgreSQL Training, Services and Support



Re: Deriving Recovery Snapshots

From
Heikki Linnakangas
Date:
Simon Riggs wrote:
> On Wed, 2008-10-22 at 21:47 +0100, Simon Riggs wrote:
> 
>> But once you reach 64 transactions, you'll need to write an extra WAL
>> record for every subtransaction, which currently I've managed to avoid.
> 
> Yes, I've managed to avoid it, but it will simplify the patch if you
> think its not worth bothering with. This won't really effect anybody
> I've met running straight Postgres, but it may effect EDB. It's not a
> problem for me, but I was second guessing objections.
> 
> If I do that then I can just pass the slotId in full on every WAL
> record, which simplifies a couple of other things also.
> 
> So, does everybody accept that we will write a WAL record for every
> subtransaction assigned, once we hit the size limit of the subxid cache?
> i.e. currently 65th subxid  and beyond.

Would have to see the patch to understand what the code simplicity vs. 
extra WAL logging tradeoff really is.

--   Heikki Linnakangas  EnterpriseDB   http://www.enterprisedb.com


Re: Deriving Recovery Snapshots

From
Simon Riggs
Date:
On Thu, 2008-10-23 at 08:40 +0300, Heikki Linnakangas wrote:
> Simon Riggs wrote:
> > On Wed, 2008-10-22 at 21:47 +0100, Simon Riggs wrote:
> > 
> >> But once you reach 64 transactions, you'll need to write an extra WAL
> >> record for every subtransaction, which currently I've managed to avoid.
> > 
> > Yes, I've managed to avoid it, but it will simplify the patch if you
> > think its not worth bothering with. This won't really effect anybody
> > I've met running straight Postgres, but it may effect EDB. It's not a
> > problem for me, but I was second guessing objections.
> > 
> > If I do that then I can just pass the slotId in full on every WAL
> > record, which simplifies a couple of other things also.
> > 
> > So, does everybody accept that we will write a WAL record for every
> > subtransaction assigned, once we hit the size limit of the subxid cache?
> > i.e. currently 65th subxid  and beyond.
> 
> Would have to see the patch to understand what the code simplicity vs. 
> extra WAL logging tradeoff really is.

Well, if your not certain now, then my initial feeling was correct. I
don't think everybody would agree to that. The code simplification would
be real, but I don't think it's that hard now.

-- Simon Riggs           www.2ndQuadrant.comPostgreSQL Training, Services and Support



Re: Deriving Recovery Snapshots

From
Heikki Linnakangas
Date:
Simon Riggs wrote:
> The patch isn't ready to apply standalone because we need to include the
> changes to XidInMVCCSnapshot() also, which would take a little while to
> extract. Let me know if that is worth producing a standalone patch for.

FWIW, this patch becomes a lot simpler if you don't change the function
signature, and don't move the SubtransSetParent() call.

--
   Heikki Linnakangas
   EnterpriseDB   http://www.enterprisedb.com
*** src/backend/access/transam/xact.c
--- src/backend/access/transam/xact.c
***************
*** 404,419 **** AssignTransactionId(TransactionState s)
          AssignTransactionId(s->parent);

      /*
!      * Generate a new Xid and record it in PG_PROC and pg_subtrans.
!      *
!      * NB: we must make the subtrans entry BEFORE the Xid appears anywhere in
!      * shared storage other than PG_PROC; because if there's no room for it in
!      * PG_PROC, the subtrans entry is needed to ensure that other backends see
!      * the Xid as "running".  See GetNewTransactionId.
       */
      s->transactionId = GetNewTransactionId(isSubXact);

!     if (isSubXact)
          SubTransSetParent(s->transactionId, s->parent->transactionId);

      /*
--- 404,418 ----
          AssignTransactionId(s->parent);

      /*
!      * Generate a new Xid and record it in PG_PROC. If there's no room
!      * in MyProc-> in MyProc->subxids, we must make the pg_subtrans
!      * entry BEFORE the Xid appears anywhere in shared storage other than
!      * PG_PROC, because the subtrans entry is needed to ensure that other
!      * backends see the Xid as "running".
       */
      s->transactionId = GetNewTransactionId(isSubXact);

!     if (isSubXact && MyProc->subxids.overflowed)
          SubTransSetParent(s->transactionId, s->parent->transactionId);

      /*

Re: Deriving Recovery Snapshots

From
Simon Riggs
Date:
On Thu, 2008-10-23 at 13:40 +0300, Heikki Linnakangas wrote:
> Simon Riggs wrote:
> > The patch isn't ready to apply standalone because we need to include the
> > changes to XidInMVCCSnapshot() also, which would take a little while to
> > extract. Let me know if that is worth producing a standalone patch for.
> 
> FWIW, this patch becomes a lot simpler if you don't change the function 
> signature, and don't move the SubtransSetParent() call.

Yeh, I'm trying to judge between trying to be simple and trying to be
neat. Whichever one I pick, I seem to be wrong. :-) But I begin, perhaps
too slowly, to understand that this is a natural part of review itself,
not really a problem.

Thanks for your help.

-- Simon Riggs           www.2ndQuadrant.comPostgreSQL Training, Services and Support