Thread: Deriving Recovery Snapshots
I've worked out what I think is a workable, efficient process for deriving snapshots during recovery. I will be posting a patch to show how this works tomorrow [Wed 15 Oct], just doing cleanup now.

Recovery Snapshots are snapshots taken during recovery. They are valid snapshots in all ways for testing visibility. Assembling the information to allow snapshots to be taken operates differently in recovery than it does in normal processing.

The first thing to realise is that in recovery the only knowledge of what occurs is through WAL records. If it isn't in the WAL, we don't know about it. Most of the time that also means we can ignore events that we know occurred, for example data reads. In order to build the recovery snapshot data we need to re-create events from WAL data. In some cases we have to add new WAL records to ensure that all possible information is present.

Each backend that existed on the master is represented by a PROC structure in the ProcArray. These are known as "recovery procs" and are similar to the dummy procs used for prepared transactions. All recovery procs are "owned by" the Startup process, so there is no process for which MyProc points to one of the recovery procs. This structure allows us to record the top-level transactions and then put subtransactions in the proc's subtransaction cache. A fixed one-to-one correspondence allows efficient maintenance of the structures. We emulate all transactional backends, including autovac.

So in Hot Standby mode we have one set of recovery procs emulating what happened on the master, and another set running read-only work.

We maintain information according to certain events on the master:

1. xid assignment (top-level xids)
2. xid assignment (subtransactions)
3. xid commit
4. xid subcommit
5. xid abort/subabort
6. backends which have FATAL errors but write no abort record

(3) and (5) are easy because we already have WAL records for them.
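The proc-per-slot idea above can be sketched as follows. This is an illustrative Python model, not the actual PostgreSQL C code; the names (RecoveryProc, observe_top_xid) are invented for the example. The key point it shows is the fixed one-to-one correspondence: seeing a *new* top-level xid on a slot tells us the previous transaction on that slot must have completed.

```python
class RecoveryProc:
    """Stand-in for one recovery PROC entry, keyed by master backend slot."""
    def __init__(self):
        self.top_xid = None   # current top-level xid for this slot
        self.subxids = []     # subtransaction cache, as on the master

def observe_top_xid(procs, slotid, xid):
    """Record a top-level xid seen in WAL for a given master backend slot.
    Returns the xid of any previous transaction on that slot, which we
    can now infer is complete even if it never wrote an abort record."""
    proc = procs[slotid]
    completed = None
    if proc.top_xid is not None and proc.top_xid != xid:
        completed = proc.top_xid   # slot reused: old xact is over
    proc.top_xid = xid
    proc.subxids = []
    return completed

procs = {s: RecoveryProc() for s in range(4)}   # pretend MaxBackends = 4
observe_top_xid(procs, 0, 100)                  # slot 0 starts xact 100
```

Because the number of slots is fixed, the number of "missing" xids we can ever be tracking through this structure is bounded, which is the property the text relies on.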
For (3) we already updated clog from the WAL record, so we just need to identify the proc and then set the xid.

(4) is completely removed by re-arranging subcommit so it is covered by commit. (Atomic subxid patch)

(6) is a problem since this event can make transactions disappear completely from the WAL record. If they crash immediately after xid assignment then they may crash without ever writing a WAL record at all. We handle this in two ways.

* First, we keep a PROC for each backendid. Notice that means we keep a PROC for each slot in the master's procarray, not for each pid. So if a backend explodes and then someone reconnects using that same procarray slot we will know that the previous transaction on that slot has completed. This is a subtle but important point: without the ability to infer that certain transactions are complete we would need to keep track of a potentially unlimited number of xids. Tying transactions to proc slots means we will never have more than a fixed number of missing xids to deal with.

* The backend slot may not be reused for some time, so we take additional actions to keep state current and true. We choose to log a snapshot from the master into WAL after each checkpoint. This can then be used to clean up any unobserved xids. It also provides us with our initial state data, see later.

(1), (2): xid assignment doesn't appear in WAL. Writing WAL records for each xid assigned would have a catastrophic effect on performance, especially if we realise that we would have to do that while holding XidGenLock. So we have to do lots of pressups to avoid it, in the following ways:

We put a flag on the first WAL record written by a new transaction. (Actually we mark the first WAL record containing the new xid, which isn't always the first WAL record in the transaction. Weird, huh? Think Oids, storage files etc.) We add an extra xid onto the WAL record to hold the parent xid, and use that to maintain subtrans.
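The flag-plus-extra-xid scheme can be sketched like this. Again a hypothetical Python model (the function name and the dict standing in for pg_subtrans are assumptions, not the patch's API): replay treats the flagged xid as now observed, and if the record also carries a parent xid (the proposed xl_xid2 field), that parent relationship goes into subtrans.

```python
def replay_record_xids(subtrans, xid, parent_xid=None):
    """Replay the xid info carried on one WAL record: the flagged xid
    tells us the transaction exists; the extra xid field gives a
    subxid's immediate parent, stored in our stand-in for pg_subtrans."""
    if parent_xid is not None:
        subtrans[xid] = parent_xid   # subtrans must know the parent
    return xid                       # xid is now 'observed'

subtrans = {}
replay_record_xids(subtrans, 101, parent_xid=100)  # subxact of xact 100
replay_record_xids(subtrans, 100)                  # top-level: no parent
```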
This works partially but not completely. It is possible for a transaction to start a very large number of subtransactions before any part of the transaction writes WAL. We only have space on the WAL record for one additional xid. Each subxid must record its immediate parent's xid in subtrans, so if we assign more than one *subtransaction* at a time we *must* then write a WAL record for all the xids assigned apart from the last one. So that only affects transactions which use two or more subtransactions *and* which insist on starting subtransactions before anything has been written, so not very common. So AssignTransactionId() sometimes needs to write WAL records.

Another problem is that xids flagged on WAL records don't arrive in WAL in the order they were assigned. So we must cope with out-of-order or "unobserved" xids. When we replay WAL, we keep track of UnobservedXids in a shared memory array. These UnobservedXids are added onto any recovery snapshot taken iff they are earlier than latestCompletedXid. So in the typical case, no xids will be added to the snapshots. For now, I do all this work holding ProcArrayLock, but there seems scope to optimise that also. Later.

UnobservedXids is maintained as a sorted array. This comes for free since xids are always added in xid assignment order. This allows xids to be removed via bsearch when WAL records arrive for the missing xids. It also allows us to stop searching for xids once we reach latestCompletedXid.

As a result of the AssignTransactionId WAL records we know that each backend will only ever allocate at most 2 xids before notifying WAL in some way, either by flagging a WAL entry it makes or by making an entry when assigning the new xids. As a result the UnobservedXids array will never overflow if it has 2 * MaxBackends entries. (I've added code, but #ifdef'd it out.) Since UnobservedXids can never overflow, we also see that the snapshot can never overflow *because* of UnobservedXids.
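The sorted-array behaviour described above can be sketched as a minimal model (illustrative only; the class name and bound are taken from the description, not from the patch): appends arrive in assignment order, so the array stays sorted for free, and removal when the xid's WAL record finally shows up is a binary search.

```python
import bisect

class UnobservedXids:
    """Sketch of the sorted UnobservedXids shared array."""
    def __init__(self, max_backends):
        self.xids = []
        self.capacity = 2 * max_backends  # claimed bound: never overflows

    def add(self, xid):
        # xids are always added in assignment order, so append keeps
        # the array sorted with no extra work
        assert not self.xids or xid > self.xids[-1]
        assert len(self.xids) < self.capacity
        self.xids.append(xid)

    def remove(self, xid):
        # a WAL record arrived for this xid: it is no longer unobserved
        i = bisect.bisect_left(self.xids, xid)
        if i < len(self.xids) and self.xids[i] == xid:
            del self.xids[i]

u = UnobservedXids(max_backends=4)
for x in (100, 103, 107):
    u.add(x)
u.remove(103)   # xid 103's WAL record has now been observed
```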
Each unobserved top-level xid leaves space for 65 xids, yet we need only 2 to add the unobserved xids.

I've had to change the way XidInMVCCSnapshot() works. We search the snapshot even if it has overflowed. This is actually a performance win in cases where only a few xids have overflowed but most haven't. This is essential because if we were forced to check in subtrans *and* unobserved xids existed then the snapshot would be invalid. (I could have made it this way *just* in recovery, but the change seems better both ways.)

So that's how we maintain the info required for snapshots, but the next part of the puzzle is how we handle the initial state. Again, subtransactions are a pain because there can be an extremely large number of them, so taking a snapshot and copying it to WAL is insufficient. We handle this by taking a snapshot when we have performed pg_start_backup() (in the checkpoint we already take) and then taking another snapshot after each checkpoint. Doing it that way means wherever we restart from we always have an initial state record close to hand.

On the standby, if the first snapshot we see has overflowed then we wait for a snapshot to arrive which has not overflowed. (We could also wait for a snapshot whose xmin is later than the xmax of our first snapshot.) This means that there could be a delay in starting Hot Standby mode *if* we are heavily using subtransactions at the time we take the backup.

So the overheads of the patch are:

* WAL record header extended to completely fill 8-byte alignment; extra 4 bytes per record on 4-byte alignment. Additional items are: uint16 xl_info2; TransactionId xl_xid2. This takes no additional space on 64-bit servers because of previous wastage.
* AssignTransactionId() must WAL-log xid assignment when making multiple assignments.
* bgwriter writes snapshot data to WAL after each checkpoint.
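The changed XidInMVCCSnapshot() logic can be sketched roughly as below. This is a simplified Python model of the visibility test only (the real C function also handles the caller's own xid, among other things); the snapshot dict fields and the subtrans_lookup callback are illustrative stand-ins. The point is the reordering: the snapshot's subxid cache is searched *before* falling back to subtrans, even when the cache has overflowed, so unobserved xids stuffed into the cache are still seen as in-progress.

```python
def xid_in_mvcc_snapshot(xid, snap, subtrans_lookup):
    """Return True if xid counts as in-progress for this snapshot."""
    if xid < snap['xmin']:
        return False                 # completed before the snapshot
    if xid >= snap['xmax']:
        return True                  # too new: treated as in progress
    if xid in snap['subxids']:       # searched even when overflowed --
        return True                  # unobserved xids live here too
    if snap['suboverflowed']:
        xid = subtrans_lookup(xid)   # map subxid to top-level parent
    return xid in snap['xips']

snap = {'xmin': 90, 'xmax': 200, 'xips': {100},
        'subxids': {150}, 'suboverflowed': True}
lookup = lambda x: {160: 100}.get(x, x)   # stand-in for pg_subtrans
```

An xid present in the cache (150 here) is found without touching subtrans, which is both the performance win and the correctness requirement for unobserved xids that subtrans knows nothing about.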
* additional shared memory:
  2 * MaxBackends * sizeof(TransactionId) for UnobservedXids
  1 * MaxBackends * sizeof(PGPROC) for RecoveryProcs
* additional processing time during recovery to maintain snapshot info

In the current patch I've put the slotid and flag bits into a uint16. That means we can manage up to 4096 connections without writing any additional WAL data. Beyond that we need to write a WAL record for each AssignTransactionId(), plus add 4 bytes onto each Commit/Abort record. Note that because of the atomic subxids changes we actually write fewer WAL records in most cases than we did before, and they occupy the same space they did before.

I'll post the patch tomorrow and at least weekly after this. Patch footprint looks like this prior to cleanup.

 backend/access/transam/varsup.c |   52 -!
 backend/access/transam/xact.c   |  559 ++++++++++++++++++++++!!!!!
 backend/access/transam/xlog.c   |   49 +-
 backend/postmaster/bgwriter.c   |   11
 backend/storage/ipc/procarray.c |  721 +++++++++++++++++++++++++++++!!!
 backend/storage/lmgr/proc.c     |  107 +++++
 backend/utils/time/tqual.c      |   27 !
 include/access/heapam.h         |    2
 include/access/htup.h           |    2
 include/access/transam.h        |    2
 include/access/xact.h           |   23 +
 include/access/xlog.h           |   44 +!
 include/access/xlog_internal.h  |    2
 include/catalog/pg_control.h    |    3
 include/storage/proc.h          |    4
 include/storage/procarray.h     |   14
 include/utils/snapshot.h        |   65 +++
 17 files changed, 1432 insertions(+), 44 deletions(-), 211 mods(!)

Your comments are welcome, especially questions and thoughts around the correctness of the approach. Lots more comments in the patch.

--
Simon Riggs
www.2ndQuadrant.com
PostgreSQL Training, Services and Support
On Tue, 2008-10-14 at 18:50 +0100, Simon Riggs wrote:
> I've worked out what I think is a workable, efficient process for
> deriving snapshots during recovery. I will be posting a patch to show
> how this works tomorrow [Wed 15 Oct], just doing cleanup now.

How will this interact with an idea like this?:
http://archives.postgresql.org/pgsql-hackers/2008-01/msg00400.php

> I've had to change the way XidInMVCCSnapshot() works. We search the
> snapshot even if it has overflowed. This is actually a performance win
> in cases where only a few xids have overflowed but most haven't. This is
> essential because if we were forced to check in subtrans *and*
> unobservedxids existed then the snapshot would be invalid. (I could have
> made it this way *just* in recovery, but the change seems better both
> ways).

I don't entirely understand this. Can you explain the situation that would result in an invalid snapshot?

Regards,
Jeff Davis
On Wed, 2008-10-15 at 12:58 -0700, Jeff Davis wrote:
> On Tue, 2008-10-14 at 18:50 +0100, Simon Riggs wrote:
> > I've worked out what I think is a workable, efficient process for
> > deriving snapshots during recovery. I will be posting a patch to show
> > how this works tomorrow [Wed 15 Oct], just doing cleanup now.
>
> How will this interact with an idea like this?:
> http://archives.postgresql.org/pgsql-hackers/2008-01/msg00400.php

pg_snapclone should work fine, since it is orthogonal to this work.

> > I've had to change the way XidInMVCCSnapshot() works. We search the
> > snapshot even if it has overflowed. This is actually a performance win
> > in cases where only a few xids have overflowed but most haven't. This is
> > essential because if we were forced to check in subtrans *and*
> > unobservedxids existed then the snapshot would be invalid. (I could have
> > made it this way *just* in recovery, but the change seems better both
> > ways).
>
> I don't entirely understand this. Can you explain the situation that
> would result in an invalid snapshot?

In recovery the snapshot consists of two sets of xids:

* ones we have seen as running, e.g. xid = 43
* ones we know exist, but haven't seen yet, e.g. xid = 42 (I call this latter kind Unobserved Transactions)

Both kinds of xids *must* be in the snapshot for MVCC to work. The current way of checking snapshots is to say "if *any* of the running transactions has overflowed, check subtrans". Unobserved transactions are not in subtrans, so if you checked for them there you would fail to find them. Currently we would then assume that means it is a top-level transaction and check the top-level xids.

Why are unobserved transactions not in subtrans? Because they are unobserved, so we can't assign their parent xid. (By definition, because they are unobserved.)
There isn't always enough space in the snapshot to allow all the unobserved xids to be added as if they were top-level transactions, so we put them into the subtrans cache as a secondary location and then change the algorithm in XidInMVCCSnapshot(). We don't want to increase the size of the snapshot because it already contains wasted space in the subtrans cache, nor do we wish to throw errors when people try to take snapshots.

The XidInMVCCSnapshot() changes make sense by themselves in most cases, since we don't want one transaction to cause us to thrash subtrans, as happened in 8.1.

This took me some time to think through...

--
Simon Riggs
www.2ndQuadrant.com
PostgreSQL Training, Services and Support
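The "secondary location" idea above can be sketched as follows. This is an illustrative Python model with invented names (build_recovery_snapshot, the dict fields); it is not the patch's code. Unobserved xids fill the snapshot's top-level slots first and then spill into the subxid cache space, which would otherwise be partly wasted, since each proc reserves room for many subxids.

```python
def build_recovery_snapshot(running_top, unobserved, max_xips):
    """Place unobserved xids into a bounded snapshot: top-level slots
    first, then the subxid cache as a secondary location."""
    snap = {'xips': list(running_top), 'subxids': []}
    for xid in unobserved:
        if len(snap['xips']) < max_xips:
            snap['xips'].append(xid)
        else:
            snap['subxids'].append(xid)   # spill into subxid cache space
    return snap

snap = build_recovery_snapshot([10, 11], [12, 13, 14], max_xips=3)
```

With the changed XidInMVCCSnapshot() searching the subxid cache unconditionally, both locations are consulted, so an xid is found no matter which slot it landed in.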
On Tue, 2008-10-14 at 18:50 +0100, Simon Riggs wrote:
> I've worked out what I think is a workable, efficient process for
> deriving snapshots during recovery. I will be posting a patch to show
> how this works tomorrow [Wed 15 Oct], just doing cleanup now.

OK, here's the latest patch. Found a bug late last night, fixed now.

This patch derives snapshot data during recovery. It contains all code to write and read back WAL changes etc. It is the "main patch" for Hot Standby. Everything else hangs around this, enhances it or puts in additional checks. Having said that, this patch doesn't let you connect and run queries. I've written this part as a standalone patch, for easier reviewing. So it is still a "WIP" patch.

The patch can be tested by running a workload, then stopping the server with -m immediate and watching the replay log (or gdb). Successfully tested with some custom subtransaction scripts and lots of tests using make installcheck-parallel, then crash recovery.

diffstat:

 backend/access/transam/slru.c     |   16
 backend/access/transam/twophase.c |    2
 backend/access/transam/xact.c     |  664 +++++++++++++++++++!!!!!!!!
 backend/access/transam/xlog.c     |   58 +-
 backend/storage/ipc/procarray.c   |  781 +++++++++++++++++++++++++++!!!
 backend/storage/lmgr/proc.c       |  107 +++++
 backend/utils/time/tqual.c        |   27 !
 include/access/xact.h             |   26 +
 include/access/xlog.h             |   44 +!
 include/access/xlog_internal.h    |    2
 include/catalog/pg_control.h      |    3
 include/storage/proc.h            |    4
 include/storage/procarray.h       |   17
 include/utils/snapshot.h          |   65 +++
 14 files changed, 1541 insertions(+), 19 deletions(-), 256 mods(!)

Prepared transactions do not yet work correctly, but the principles are the same, so I expect this to be done in a similar way. Any input welcome.

Other related patches are:
* recovery_infrastruc.v9.patch
* atomic_subxids.v7.patch

They don't all apply cleanly together, but the changes are unrelated, so those patches can still be reviewed without wasting energy.
Next phase is connecting and running queries, next few days. That will probably shake out a few more bugs from this code. Comments welcome. -- Simon Riggs www.2ndQuadrant.com PostgreSQL Training, Services and Support
On Thu, 2008-10-16 at 13:55 +0100, Simon Riggs wrote:
> Other related patches are
> * recovery_infrastruc.v9.patch
> * atomic_subxids.v7.patch
> They don't all apply cleanly together, but the changes are unrelated, so
> those patches can still be reviewed without wasting energy.
>
> Next phase is connecting and running queries, next few days. That will
> probably shake out a few more bugs from this code.

I've integrated my five patches together into one now:
* recovery_infrastruc.v9.patch
* atomic_subxids.v7.patch
* hs_connect
* hs_checks
* hs_snapshot

Seems positive that it all integrated so quickly and tests OK. More later.

--
Simon Riggs
www.2ndQuadrant.com
PostgreSQL Training, Services and Support
On Thu, 2008-10-16 at 15:20 +0100, Simon Riggs wrote:
> I've integrated my five patches together into one now:
> * recovery_infrastruc.v9.patch
> * atomic_subxids.v7.patch
> * hs_connect
> * hs_checks
> * hs_snapshot
>
> Seems positive that it all integrated so quickly and tests OK.
> More later.

Wahoo! Worked first time.

postgres=# select count(*) from branches;
 count
-------
     1
(1 row)

That's with archive recovery waiting for the next WAL file, so not applying new changes or holding locks. So that's the good news. Still have much more work to make everything work safely while WAL apply happens. And the shooting hasn't started yet either. So there's light, but nowhere near the end of the tunnel.

So it looks like I'll be shipping the first combined patch tomorrow.

--
Simon Riggs
www.2ndQuadrant.com
PostgreSQL Training, Services and Support
Simon Riggs wrote:
> Each backend that existed on the master is represented by a PROC
> structure in the ProcArray. These are known as "recovery procs" and are
> similar to the dummy procs used for prepared transactions. All recovery
> procs are "owned by" the Startup process. So there is no process for
> which MyProc points to one of the recovery procs. This structure allows
> us to record the top-level transactions and then put subtransactions in
> the proc's subtransaction cache. A fixed one-to-one correspondence
> allows efficient maintenance of the structures. We emulate all
> transactional backends, including autovac.

We'll need to know the max_connections setting in the master, in order to size the array correctly. Not a show-stopper, but would be nicer if we didn't need to.

> * The backend slot may not be reused for some time, so we should take
> additional actions to keep state current and true. So we choose to log a
> snapshot from the master into WAL after each checkpoint. This can then
> be used to cleanup any unobserved xids. It also provides us with our
> initial state data, see later.

We don't need to log a complete snapshot, do we? Just oldestxmin should be enough.

> UnobservedXids is maintained as a sorted array. This comes for free
> since xids are always added in xid assignment order. This allows xids to
> be removed via bsearch when WAL records arrive for the missing xids. It
> also allows us to stop searching for xids once we reach
> latestCompletedXid.

If we're going to have an UnobservedXids array, why don't we just treat all in-progress transactions as Unobserved, and forget about the dummy PROC entries?

Also, I can't help thinking that this would be a lot simpler if we just treated all subtransactions the same as top-level transactions. The only problem with that is that there can be a lot of subtransactions, which means that we'd need a large UnobservedXids array to handle the worst case, but maybe it would still be acceptable?
--
Heikki Linnakangas
EnterpriseDB   http://www.enterprisedb.com
On Thu, 2008-10-16 at 18:52 +0300, Heikki Linnakangas wrote:
> Simon Riggs wrote:
> > Each backend that existed on the master is represented by a PROC
> > structure in the ProcArray. These are known as "recovery procs" and are
> > similar to the dummy procs used for prepared transactions. All recovery
> > procs are "owned by" the Startup process. So there is no process for
> > which MyProc points to one of the recovery procs. This structure allows
> > us to record the top-level transactions and then put subtransactions in
> > the proc's subtransaction cache. A fixed one-to-one correspondence
> > allows efficient maintenance of the structures. We emulate all
> > transactional backends, including autovac.
>
> We'll need to know the max_connections setting in the master, in order
> to size the array correctly. Not a show-stopper, but would be nicer if
> we didn't need to.

Yes. We'll probably need to add checks/configurability later. Unless you have a way...

> > * The backend slot may not be reused for some time, so we should take
> > additional actions to keep state current and true. So we choose to log a
> > snapshot from the master into WAL after each checkpoint. This can then
> > be used to cleanup any unobserved xids. It also provides us with our
> > initial state data, see later.
>
> We don't need to log a complete snapshot, do we? Just oldestxmin should
> be enough.

Possibly, but you're thinking that once we're up and running we can use less info.

Trouble is, you don't know when/if the standby will crash/be shutdown. So we need regular full snapshots to allow it to re-establish full information at regular points. So we may as well drop the whole snapshot to WAL every checkpoint. To do otherwise would mean more code and less flexibility.

With default settings that is at most 25600 bytes for the subxid cache, plus maybe 2000 bytes for other info. For most cases, we will use less than 1 WAL buffer.

> > UnobservedXids is maintained as a sorted array.
> > This comes for free
> > since xids are always added in xid assignment order. This allows xids to
> > be removed via bsearch when WAL records arrive for the missing xids. It
> > also allows us to stop searching for xids once we reach
> > latestCompletedXid.
>
> If we're going to have an UnobservedXids array, why don't we just treat
> all in-progress transactions as Unobserved, and forget about the dummy
> PROC entries?

That's a good question and I expected some debate on that.

The main problem is fatal errors that don't write abort records. By reusing the PROC entries we can keep those to a manageable limit. If we don't have that, the number of fatal errors could cause that list to grow uncontrollably and we might overflow any setting, causing snapshots to stall and new queries to hang. We really must have a way to place an upper bound on the number of unobserved xacts. So we really need the proc approach. But we also need the UnobservedXids array. It's definitely more code to have both, so I would not have chosen that route if there was another way. The simple approach just doesn't cover all possible cases, and we need to cover them all.

Having only an UnobservedXids array was my first thought and I said earlier I would do it without using procs. Bad idea. Using the UnobservedXids array means every xact removal requires a bsearch, whereas with procs we can do a direct lookup, removing all xids in one stroke. Much better for typical cases. Also, if we have procs we can use the "no locks" approach in some cases, as per current practice on new xid insertions.

> Also, I can't help thinking that this would be a lot simpler if we just
> treated all subtransactions the same as top-level transactions. The only
> problem with that is that there can be a lot of subtransactions, which
> means that we'd need a large UnobservedXids array to handle the worst
> case, but maybe it would still be acceptable?

Yes, you see the problem.
Without subtransactions, this would be a simple issue to solve.

In one sense, I do as you say. When we make a snapshot we stuff the UnobservedXids into the snapshot *somewhere*. We don't know whether they are top level or subxacts. But we need a solution for when we run out of top-level xid places in the snapshot. Which has now been provided, luckily.

If we have no upper bound on snapshot size then *all* backends would need a variable-size snapshot. We must solve that problem or accept having people wait maybe minutes for a snapshot in the worst case. I've found one way of placing a bound on the number of xids we need to keep in the snapshot. If there is another, better way of keeping it bounded I will happily adopt it. I spent about 2 weeks sweating this issue...

I'm available tomorrow to talk in real time if there are people in the Dev room at PGDay who want to discuss this, or have me explain the patch(es).

--
Simon Riggs
www.2ndQuadrant.com
PostgreSQL Training, Services and Support
Simon Riggs wrote:
> On Thu, 2008-10-16 at 18:52 +0300, Heikki Linnakangas wrote:
>> Simon Riggs wrote:
>>> * The backend slot may not be reused for some time, so we should take
>>> additional actions to keep state current and true. So we choose to log a
>>> snapshot from the master into WAL after each checkpoint. This can then
>>> be used to cleanup any unobserved xids. It also provides us with our
>>> initial state data, see later.
>> We don't need to log a complete snapshot, do we? Just oldestxmin should
>> be enough.
>
> Possibly, but you're thinking that once we're up and running we can use
> less info.
>
> Trouble is, you don't know when/if the standby will crash/be shutdown.
> So we need regular full snapshots to allow it to re-establish full
> information at regular points. So we may as well drop the whole snapshot
> to WAL every checkpoint. To do otherwise would mean more code and less
> flexibility.

Surely it's less code to write the OldestXmin to the checkpoint record, rather than a full snapshot, no? And to read it off the checkpoint record.

>>> UnobservedXids is maintained as a sorted array. This comes for free
>>> since xids are always added in xid assignment order. This allows xids to
>>> be removed via bsearch when WAL records arrive for the missing xids. It
>>> also allows us to stop searching for xids once we reach
>>> latestCompletedXid.
>> If we're going to have an UnobservedXids array, why don't we just treat
>> all in-progress transactions as Unobserved, and forget about the dummy
>> PROC entries?
>
> That's a good question and I expected some debate on that.
>
> The main problem is fatal errors that don't write abort records. By
> reusing the PROC entries we can keep those to a manageable limit. If we
> don't have that, the number of fatal errors could cause that list to
> grow uncontrollably and we might overflow any setting, causing snapshots
> to stall and new queries to hang.
> We really must have a way to place an
> upper bound on the number of unobserved xacts. So we really need the
> proc approach. But we also need the UnobservedXids array.

If you write the oldestxmin (or a full snapshot, including the oldestxmin) to each checkpoint record, you can crop out any unobserved xids older than that when you replay the checkpoint record.

> Having only an UnobservedXid array was my first thought and I said
> earlier I would do it without using procs. Bad idea. Using the
> UnobservedXids array means every xact removal requires a bsearch,
> whereas with procs we can do a direct lookup, removing all xids in one
> stroke. Much better for typical cases.

How much does that really matter? Under normal circumstances, the array would be quite small anyway. A bsearch of a relatively small array isn't that expensive. Or a hash table, so that removing/inserting items doesn't need to shift all the following entries.

>> Also, I can't help thinking that this would be a lot simpler if we just
>> treated all subtransactions the same as top-level transactions. The only
>> problem with that is that there can be a lot of subtransactions, which
>> means that we'd need a large UnobservedXids array to handle the worst
>> case, but maybe it would still be acceptable?
>
> Yes, you see the problem. Without subtransactions, this would be a
> simple issue to solve.
>
> In one sense, I do as you say. When we make a snapshot we stuff the
> UnobservedXids into the snapshot *somewhere*. We don't know whether they
> are top level or subxacts. But we need a solution for when we run out of
> top-level xid places in the snapshot. Which has now been provided,
> luckily.
>
> If we have no upper bound on snapshot size then *all* backends would
> need a variable size snapshot. We must solve that problem or accept
> having people wait maybe minutes for a snapshot in worst case. I've
> found one way of placing a bound on the number of xids we need to keep
> in the snapshot.
> If there is another, better way of keeping it bounded I
> will happily adopt it. I spent about 2 weeks sweating this issue...

How about:

1. Keep all transactions and subtransactions in UnobservedXids.
2. If it fills up, remove all subtransactions from it that the startup process knows to be subtransactions and knows the parents of, and update subtrans. Mark the array as overflowed.

To take a snapshot, a backend simply copies the UnobservedXids array and the flag. If it hasn't overflowed, a transaction is considered to be in progress if it's in the array. If it has overflowed and the xid is not in the array, check subtrans.

Note that the startup process sees all WAL records, so it can do arbitrarily complex bookkeeping in backend-private memory, and only expose the necessary parts in shared mem. For example, it can keep track of the parent-child relationships of the xids in UnobservedXids, but the backends taking snapshots don't need to know about that. For step 2 to work, that's exactly what the startup process needs to keep track of.

For the startup process to know about the parent-child relationships, we'll need something like the WAL changes you suggested. I'm not too thrilled about adding a new field to all WAL records. Seems simpler to just rely on the new WAL records on AssignTransactionId(), and we can only do it, say, every 100 subtransactions, if we make the UnobservedXids array big enough (100 * max_connections).

This isn't actually that different from your proposal. The big difference is that instead of PROC entries and UnobservedXids, all transactions are tracked in UnobservedXids, and instead of caching subtransactions in the subxids array in PROC entries, they're cached in UnobservedXids as well.

Another, completely different approach would be to forget about xid arrays altogether and change the way snapshots are taken: just do a full memcpy of the clog between xmin and xmax. That might be pretty slow if xmax - xmin is big, though.
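Heikki's step 2 above can be sketched as follows. This is a hypothetical Python model of the *suggestion*, not code from any patch; the function name and the dicts standing in for startup-private bookkeeping and pg_subtrans are invented. When the shared array fills, every xid whose parent the startup process knows moves into subtrans, and the array is flagged as overflowed.

```python
def compress_unobserved(xids, known_parent, subtrans):
    """When the UnobservedXids array fills: push every xid whose parent
    we know into subtrans, keep only the rest, and report overflow.
    known_parent maps subxid -> parent (startup-process-private info)."""
    kept = []
    for xid in xids:
        if xid in known_parent:
            subtrans[xid] = known_parent[xid]   # resolvable via subtrans now
        else:
            kept.append(xid)                    # parent unknown: must stay
    return kept, True   # True = the 'overflowed' flag

subtrans = {}
kept, overflowed = compress_unobserved(
    [100, 101, 102], {101: 100, 102: 100}, subtrans)
```

Simon's objection in the reply that follows is precisely about the xids that land in `kept`: for a truly unobserved xid the parent is unknown, so it can never be moved to subtrans and must stay in the snapshot itself.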
--
Heikki Linnakangas
EnterpriseDB   http://www.enterprisedb.com
On Thu, 2008-10-16 at 18:52 +0300, Heikki Linnakangas wrote:
> Also, I can't help thinking that this would be a lot simpler if we just
> treated all subtransactions the same as top-level transactions. The only
> problem with that is that there can be a lot of subtransactions, which
> means that we'd need a large UnobservedXids array to handle the worst
> case, but maybe it would still be acceptable?

I remember cases on this list where long transactions did run out of subtransaction ids. To accommodate something approaching that we would need an array for storing (max_connections * 4G) UnobservedXids instead of just max_connections.

-----------------
Hannu
On Wed, 2008-10-22 at 12:29 +0300, Heikki Linnakangas wrote:
> Simon Riggs wrote:
>> On Thu, 2008-10-16 at 18:52 +0300, Heikki Linnakangas wrote:
>>> Simon Riggs wrote:
>>>> * The backend slot may not be reused for some time, so we should take
>>>> additional actions to keep state current and true. So we choose to log a
>>>> snapshot from the master into WAL after each checkpoint. This can then
>>>> be used to cleanup any unobserved xids. It also provides us with our
>>>> initial state data, see later.
>>> We don't need to log a complete snapshot, do we? Just oldestxmin should
>>> be enough.
>>
>> Possibly, but you're thinking that once we're up and running we can use
>> less info.
>>
>> Trouble is, you don't know when/if the standby will crash/be shutdown.
>> So we need regular full snapshots to allow it to re-establish full
>> information at regular points. So we may as well drop the whole snapshot
>> to WAL every checkpoint. To do otherwise would mean more code and less
>> flexibility.
>
> Surely it's less code to write the OldestXmin to the checkpoint record,
> rather than a full snapshot, no? And to read it off the checkpoint record.

You may be missing my point. We need an initial state to work from. I am proposing we write a full snapshot after each checkpoint because it allows us to start recovery again from that point. If we wrote only OldestXmin as you suggest it would optimise the size of the WAL record but it would prevent us from restarting at that point.

Also, passing OldestXmin only would not work in the presence of long running statements. Passing the snapshot allows us to see that FATAL errors have occurred much sooner.

BTW, the way I have coded it means that if we skip writing a checkpoint on a quiet system then we would also skip writing the snapshot.

--
Simon Riggs
www.2ndQuadrant.com
PostgreSQL Training, Services and Support
On Wed, 2008-10-22 at 12:29 +0300, Heikki Linnakangas wrote: > How about: > > 1. Keep all transactions and subtransactions in UnobservedXids. > 2. If it fills up, remove all subtransactions from it, that the startup > process knows to be subtransactions and knows the parents, and update > subtrans. Mark the array as overflowed. > > To take a snapshot, a backend simply copies UnobservedXids array and the > flag. If it hasn't overflowed, a transaction is considered to be in > progress if it's in the array. If it has overflowed, and the xid is not > in the array, check subtrans. We can't check subtrans. We do not have any record of what the parent is for an unobserved transaction id. So the complete list of unobserved xids *must* be added to the snapshot. If that makes snapshot overflow, we have a big problem: we would be forced to say "sorry snapshot cannot be issued at this time, please wait". Ugh! > Note that the startup process sees all WAL records, so it can do > arbitrarily complex bookkeeping in backend-private memory, and only > expose the necessary parts in shared mem. For example, it can keep track > of the parent-child relationships of the xids in UnobservedXids, but the > backends taking snapshots don't need to know about that. For step 2 to > work, that's exactly what the startup process needs to keep track of. > For the startup process to know about the parent-child relationships, > we'll need something like WAL changes you suggested. I'm not too > thrilled about adding a new field to all WAL records. Seems simpler to > just rely on the new WAL records on AssignTransactionId(), and we can > only do it, say, every 100 subtransactions, if we make the > UnobservedXids array big enough (100*max_connections). Yes, we can make the UnobservedXids array bigger, but only to the point where it will all fit within a snapshot. The WAL changes proposed use space that was previously wasted, so there is no increase in the amount of data going to disk.
The additional time to derive that data is very small when those fields are unused, and that logic is executed before we take WALInsertLock. So overall, very low overhead. Every new subxid needs to specify its parent's xid. We must supply that information somehow: either via an XLOG_XACT_ASSIGNMENT, or as I have done in most cases, tuck that into the wasted space on the xlrec. Writing a WAL record every 100 subtransactions will not work: we need to write to subtrans *before* that xid appears anywhere on disk, so that visibility tests can determine the status of the transaction. The approach I have come up with is very finely balanced. It's the *only* approach that I've come up with that covers all requirements; there were very few technical choices to make. If it wasn't for subtransactions, disappearing transactions because of FATAL errors and unobserved xids, it would be much simpler. But having said that, the code isn't excessively complex; I wrote it in about 3 days. > This isn't actually that different from your proposal. The big > difference is that instead of PROC entries and UnobservedXids, all > transactions are tracked in UnobservedXids, and instead of caching > subtransactions in the subxids array in PROC entries, they're cached in > UnobservedXids as well. > Another, completely different approach, would be to forget about xid > arrays altogether, and change the way snapshots are taken: just do a > full memcpy of the clog between xmin and xmax. That might be pretty slow > if xmax-xmin is big, though. -- Simon Riggs www.2ndQuadrant.com PostgreSQL Training, Services and Support
Simon Riggs wrote: > On Wed, 2008-10-22 at 12:29 +0300, Heikki Linnakangas wrote: > >> How about: >> >> 1. Keep all transactions and subtransactions in UnobservedXids. >> 2. If it fills up, remove all subtransactions from it, that the startup >> process knows to be subtransactions and knows the parents, and update >> subtrans. Mark the array as overflowed. >> >> To take a snapshot, a backend simply copies UnobservedXids array and the >> flag. If it hasn't overflowed, a transaction is considered to be in >> progress if it's in the array. If it has overflowed, and the xid is not >> in the array, check subtrans > > We can't check subtrans. We do not have any record of what the parent is > for an unobserved transaction id. So the complete list of unobserved > xids *must* be added to the snapshot. If that makes snapshot overflow, > we have a big problem: we would be forced to say "sorry snapshot cannot > be issued at this time, please wait". Ugh! That's why we still need the occasional WAL logging in AssignTransactionId(). To log the parent-child relationships of the subtransactions. >> For the startup process to know about the parent-child relationships, >> we'll need something like WAL changes you suggested. I'm not too >> thrilled about adding a new field to all WAL records. Seems simpler to >> just rely on the new WAL records on AssignTransactionId(), and we can >> only do it, say, every 100 subtransactions, if we make the >> UnobservedXids array big enough (100*max_connections). > > Yes, we can make the UnobservedXids array bigger, but only to the point > where it will all fit within a snapshot. The list of xids in a snapshot is just a palloc'd array, in backend-local memory, so we can easily make it as large as we need to. > Every new subxid needs to specify its parent's xid. We must supply that > information somehow: either via an XLOG_XACT_ASSIGNMENT, or as I have > done in most cases, tuck that into the wasted space on the xlrec. 
> Writing a WAL record every 100 subtransactions will not work: we need to > write to subtrans *before* that xid appears anywhere on disk, so that > visibility tests can determine the status of the transaction. I don't follow. It doesn't need to be in subtrans before it appears on disk, AFAICS. It can be stored in UnobservedXids at first, and when it overflows, we can update subtrans and remove the entries from UnobservedXids. A snapshot taken before the overflow will have the subxid in its copy of UnobservedXids, and one taken after overflow will find it in subtrans. If UnobservedXids is large enough to hold, say 100 * max_connections xids, by writing a WAL record containing the parent-child relationships every 100 assigned subtransactions within a top-level transaction, the top-level transactions and those subtransactions that we don't know the parent of will always fit into UnobservedXids. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
Simon Riggs wrote: > On Wed, 2008-10-22 at 12:29 +0300, Heikki Linnakangas wrote: >> Simon Riggs wrote: >>> On Thu, 2008-10-16 at 18:52 +0300, Heikki Linnakangas wrote: >>>> Simon Riggs wrote: >>>>> * The backend slot may not be reused for some time, so we should take >>>>> additional actions to keep state current and true. So we choose to log a >>>>> snapshot from the master into WAL after each checkpoint. This can then >>>>> be used to cleanup any unobserved xids. It also provides us with our >>>>> initial state data, see later. >>>> We don't need to log a complete snapshot, do we? Just oldestxmin should >>>> be enough. >>> Possibly, but you're thinking that once we're up and running we can use >>> less info. >>> >>> Trouble is, you don't know when/if the standby will crash/be shutdown. >>> So we need regular full snapshots to allow it to re-establish full >>> information at regular points. So we may as well drop the whole snapshot >>> to WAL every checkpoint. To do otherwise would mean more code and less >>> flexibility. >> Surely it's less code to write the OldestXmin to the checkpoint record, >> rather than a full snapshot, no? And to read it off the checkpoint record. > > You may be missing my point. > > We need an initial state to work from. > > I am proposing we write a full snapshot after each checkpoint because it > allows us to start recovery again from that point. If we wrote only > OldestXmin as you suggest it would optimise the size of the WAL record > but it would prevent us from restarting at that point. Well, you'd just need to treat anything > oldestxmin, and not marked as finished in clog, as unobserved. Which doesn't seem too bad. Not that storing the full list of in-progress xids is that bad either, though. Hmm. What about in-progress subtransactions that have overflowed the shared mem cache? Can we rely on subtrans being up-to-date, up to the checkpoint record that recovery starts from?
-- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
On Wed, 2008-10-22 at 17:18 +0300, Heikki Linnakangas wrote: > It doesn't need to be in subtrans before it appears on > disk, AFAICS. I don't make these rules. Read comments in AssignTransactionId(). -- Simon Riggs www.2ndQuadrant.com PostgreSQL Training, Services and Support
Hannu Krosing wrote: > On Thu, 2008-10-16 at 18:52 +0300, Heikki Linnakangas wrote: >> Also, I can't help thinking that this would be a lot simpler if we just >> treated all subtransactions the same as top-level transactions. The only >> problem with that is that there can be a lot of subtransactions, which >> means that we'd need a large UnobservedXids array to handle the worst >> case, but maybe it would still be acceptable? > > I remember cases on this list where long transactions did run out of > subtransaction ids. To accommodate something approaching that we would need an > array for storing (max_connections * 4G) UnobservedXids instead of just > max_connections. You can't have more than 4G (or 2G?) active subtransactions running in a system, because you will simply run out of transaction ids and hit xid wrap-around after that. So in the worst case, you don't need space for (max_connections * 4G) xids, just 4G. That's still a large number, of course. In situations like that, a bitmap, like clog, instead of an array, would be more space efficient. But that's less efficient in the more common case that there are few in-progress transactions, but some of them are very old. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
On Wed, 2008-10-22 at 17:18 +0300, Heikki Linnakangas wrote: > The list of xids in a snapshot is just a palloc'd array, in > backend-local memory, so we can easily make it as large as we need to. It's malloc'd before we hold the lock. Yes, we can make it as large as we need to, but that would change the way it is allocated significantly, and that will require either extended lock hold durations or multiple lock acquisitions. That doesn't sound good. -- Simon Riggs www.2ndQuadrant.com PostgreSQL Training, Services and Support
Simon Riggs wrote: > If that makes snapshot overflow, > we have a big problem: we would be forced to say "sorry snapshot cannot > be issued at this time, please wait". Ugh! BTW, we'll need to do that anyway, if we guess the max_connections setting in the master incorrectly. We should avoid having to do that under other circumstances, of course. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
On Wed, 2008-10-22 at 17:29 +0300, Heikki Linnakangas wrote: > Simon Riggs wrote: > > You may be missing my point. > > > > We need an initial state to work from. > > > > I am proposing we write a full snapshot after each checkpoint because it > > allows us to start recovery again from that point. If we wrote only > > OldestXmin as you suggest it would optimise the size of the WAL record > > but it would prevent us from restarting at that point. > > Well, you'd just need to treat anything > oldestxmin, and not marked as > finished in clog, as unobserved. Which doesn't seem too bad. Not that > storing the full list of in-progress xids is that bad either, though. As I said, I'm storing the whole list to give me an initial state. > Hmm. What about in-progress subtransactions that have overflowed the > shared mem cache? Can we rely on subtrans being up-to-date, up to the > checkpoint record that recovery starts from? Yes, we can't use snapshots as an initial state if they have overflowed. In that case we could avoid dropping a full snapshot to WAL, yes. That's probably trivial to implement. Currently, the patch can start from a non-overflowed snapshot, but it could also start from an overflowed snapshot and wait until we receive a snapshot with an xmin > the xmax of the first snapshot. That seemed too much icing for the first patch. I think we all need to agree on the correctness aspects before we attempt too much tuning and tweaking. -- Simon Riggs www.2ndQuadrant.com PostgreSQL Training, Services and Support
On Wed, 2008-10-22 at 17:40 +0300, Heikki Linnakangas wrote: > Simon Riggs wrote: > > If that makes snapshot overflow, > > we have a big problem: we would be forced to say "sorry snapshot cannot > > be issued at this time, please wait". Ugh! > > BTW, we'll need to do that anyway, if we guess the max_connections > setting in the master incorrectly. We should avoid having to do that > under other circumstances, of course. I'm sure we can handle that in easier ways. I imagined we would just issue a message saying "must increase max_connections" and then shutdown. It's easy enough to document that if you change the master you must also change the slave - it's not a very common change after all. -- Simon Riggs www.2ndQuadrant.com PostgreSQL Training, Services and Support
Simon Riggs wrote: > On Wed, 2008-10-22 at 17:18 +0300, Heikki Linnakangas wrote: > >> It doesn't need to be in subtrans before it appears on >> disk, AFAICS. > > I don't make these rules. Read comments in AssignTransactionId(). You mean this one: > /* > * Generate a new Xid and record it in PG_PROC and pg_subtrans. > * > * NB: we must make the subtrans entry BEFORE the Xid appears anywhere in > * shared storage other than PG_PROC; because if there's no room for it in > * PG_PROC, the subtrans entry is needed to ensure that other backends see > * the Xid as "running". See GetNewTransactionId. > */ ? The way I read that is that as long as there's an entry in UnobservedXids, we don't need a subtrans entry. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
On Wed, 2008-10-22 at 20:42 +0300, Heikki Linnakangas wrote: > Simon Riggs wrote: > > On Wed, 2008-10-22 at 17:18 +0300, Heikki Linnakangas wrote: > > > >> It doesn't need to be in subtrans before it appears on > >> disk, AFAICS. > > > > I don't make these rules. Read comments in AssignTransactionId(). > > You mean this one: > > > /* > > * Generate a new Xid and record it in PG_PROC and pg_subtrans. > > * > > * NB: we must make the subtrans entry BEFORE the Xid appears anywhere in > > * shared storage other than PG_PROC; because if there's no room for it in > > * PG_PROC, the subtrans entry is needed to ensure that other backends see > > * the Xid as "running". See GetNewTransactionId. > > */ > > ? The way I read that is that as long as there's an entry in > UnobservedXids, we don't need a subtrans entry. If you don't mark subtrans, you will need to keep things in shared memory. I agree you can defer marking subtrans until your shared memory fills, but you will still need to keep track of the parent for each subtransaction until you do. If you don't *know* the parent, how will you do this? You haven't explained how you will write the relationship between a subtransaction and its parent into WAL, for each subtransaction and for an arbitrarily large number of subtransactions. That suggests a nice optimisation we can use in normal running and with the Hot Standby patch, but it doesn't provide an alternate approach. We seem to be scrabbling around looking for a second solution. Even if we find one you'll need to show that it's substantially better than the one that's already on the table. But right now it doesn't seem like we're on the verge of a radically better approach, for this part. (The optimisation is for AssignTransactionId() to not update subtrans for the first PGPROC_MAX_CACHED_SUBXIDS subtransactions, either in AssignTransactionId() or in xact_redo_commit(). That would mean in most normal cases we would never update subtrans at all.
It requires a slight change to the way XidInMVCCSnapshot() works, but I've already included that change in the current patch. I'll do that as a standalone patch.) -- Simon Riggs www.2ndQuadrant.com PostgreSQL Training, Services and Support
Simon Riggs wrote: > If you don't mark subtrans, you will need to keep things in shared > memory. I agree you can defer marking subtrans until your shared memory > fills, but you will still need to keep track of the parent for each > subtransaction until you do. If you don't *know* the parent, how will > you do this? You haven't explained how you will write the relationship > between a subtransaction and its parent into WAL, for each > subtransaction and for an arbitrarily large number of subtransactions. I was thinking that it's done in AssignTransactionId(). Similarly to what you're doing in your patch, but instead of just one xid with its parent, a list of (xid, parent) pairs would be stored in the WAL record. Upon replaying that WAL record, those parent-child relationships are recorded in subtrans, and the child xids are removed from UnobservedXids. > We seem to be scrabbling around looking for a second solution. Even if > we find one you'll need to show that it's substantially better than the > one that's already on the table. But right now it doesn't seem like > we're on the verge of a radically better approach, for this part. I'm trying to tease apart the important aspects of the design, and trying to reduce it into the simplest thing that can work. For example, since it seems like we need the UnobservedXids array anyway, can we use just that, and not need the PROC entries? And if we need to sometimes WAL log in AssignTransactionId(), could we use just that one mechanism for transferring parent-child relationships, and not need the extra field in all WAL records? The fewer mechanisms there are to do stuff, the simpler the patch, and the fewer codepaths to test, and so forth. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
On Wed, 2008-10-22 at 22:38 +0300, Heikki Linnakangas wrote: > Simon Riggs wrote: > > If you don't mark subtrans, you will need to keep things in shared > > memory. I agree you can defer marking subtrans until your shared memory > > fills, but you will still need to keep track of the parent for each > > subtransaction until you do. If you don't *know* the parent, how will > > you do this? You haven't explained how you will write the relationship > > between a subtransaction and its parent into WAL, for each > > subtransaction and for an arbitrarily large number of subtransactions. > > I was thinking that it's done in AssignTransactionId(). Similarly to > what you're doing in your patch, but instead of just one xid with its > parent, a list of (xid, parent) pairs would be stored in the WAL record. > Upon replaying that WAL record, those parent-child relationships are > recorded in subtrans, and the child xids are removed from UnobservedXids. But once you reach 64 transactions, you'll need to write an extra WAL record for every subtransaction, which currently I've managed to avoid. I understand what you hope to achieve, but right now my focus is on achieving correctness within the next few days, not on reworking parts of the patch that seem to work correctly and are reasonably well optimised. We can discuss rework in more detail in a couple of weeks. It's been a worthwhile discussion, but let's look somewhere else now. I could really use your help in thinking about prepared transactions. Forgive me please, I don't wish to be rude, though I probably am. -- Simon Riggs www.2ndQuadrant.com PostgreSQL Training, Services and Support
On Wed, 2008-10-22 at 21:47 +0100, Simon Riggs wrote: > But once you reach 64 transactions, you'll need to write an extra WAL > record for every subtransaction, which currently I've managed to avoid. Some further notes/tests on the optimisation you discovered. Because of the way I have changed tqual.c, we only need to write to subtrans if a proc's subxid cache overflows. (I think we would need to change tqual.c in a similar way in any of the ways so far discussed). Anyway, a quick test with a representative test case shows a 3-5% performance gain from skipping the subtrans updates. I include only the test case files here and the test patch. The test is a simple PL/pgSQL function with one EXCEPTION clause, so fairly real-world. The patch isn't ready to apply standalone because we need to include the changes to XidInMVCCSnapshot() also, which would take a little while to extract. Let me know if that is worth producing a standalone patch for. -- Simon Riggs www.2ndQuadrant.com PostgreSQL Training, Services and Support
On Wed, 2008-10-22 at 21:47 +0100, Simon Riggs wrote: > But once you reach 64 transactions, you'll need to write an extra WAL > record for every subtransaction, which currently I've managed to avoid. Yes, I've managed to avoid it, but it will simplify the patch if you think it's not worth bothering with. This won't really affect anybody I've met running straight Postgres, but it may affect EDB. It's not a problem for me, but I was second-guessing objections. If I do that then I can just pass the slotId in full on every WAL record, which simplifies a couple of other things also. So, does everybody accept that we will write a WAL record for every subtransaction assigned, once we hit the size limit of the subxid cache? i.e. currently the 65th subxid and beyond. -- Simon Riggs www.2ndQuadrant.com PostgreSQL Training, Services and Support
Simon Riggs wrote: > On Wed, 2008-10-22 at 21:47 +0100, Simon Riggs wrote: > >> But once you reach 64 transactions, you'll need to write an extra WAL >> record for every subtransaction, which currently I've managed to avoid. > > Yes, I've managed to avoid it, but it will simplify the patch if you > think it's not worth bothering with. This won't really affect anybody > I've met running straight Postgres, but it may affect EDB. It's not a > problem for me, but I was second-guessing objections. > > If I do that then I can just pass the slotId in full on every WAL > record, which simplifies a couple of other things also. > > So, does everybody accept that we will write a WAL record for every > subtransaction assigned, once we hit the size limit of the subxid cache? > i.e. currently the 65th subxid and beyond. Would have to see the patch to understand what the code simplicity vs. extra WAL logging tradeoff really is. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
On Thu, 2008-10-23 at 08:40 +0300, Heikki Linnakangas wrote: > Simon Riggs wrote: > > On Wed, 2008-10-22 at 21:47 +0100, Simon Riggs wrote: > > > >> But once you reach 64 transactions, you'll need to write an extra WAL > >> record for every subtransaction, which currently I've managed to avoid. > > > > Yes, I've managed to avoid it, but it will simplify the patch if you > > think it's not worth bothering with. This won't really affect anybody > > I've met running straight Postgres, but it may affect EDB. It's not a > > problem for me, but I was second-guessing objections. > > > > If I do that then I can just pass the slotId in full on every WAL > > record, which simplifies a couple of other things also. > > > > So, does everybody accept that we will write a WAL record for every > > subtransaction assigned, once we hit the size limit of the subxid cache? > > i.e. currently the 65th subxid and beyond. > > Would have to see the patch to understand what the code simplicity vs. > extra WAL logging tradeoff really is. Well, if you're not certain now, then my initial feeling was correct. I don't think everybody would agree to that. The code simplification would be real, but I don't think it's that hard now. -- Simon Riggs www.2ndQuadrant.com PostgreSQL Training, Services and Support
Simon Riggs wrote: > The patch isn't ready to apply standalone because we need to include the > changes to XidInMVCCSnapshot() also, which would take a little while to > extract. Let me know if that is worth producing a standalone patch for. FWIW, this patch becomes a lot simpler if you don't change the function signature, and don't move the SubtransSetParent() call. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
*** src/backend/access/transam/xact.c
--- src/backend/access/transam/xact.c
*************** AssignTransactionId(TransactionState s)
*** 404,419 ****
  	AssignTransactionId(s->parent);
  
  	/*
! 	 * Generate a new Xid and record it in PG_PROC and pg_subtrans.
! 	 *
! 	 * NB: we must make the subtrans entry BEFORE the Xid appears anywhere in
! 	 * shared storage other than PG_PROC; because if there's no room for it in
! 	 * PG_PROC, the subtrans entry is needed to ensure that other backends see
! 	 * the Xid as "running". See GetNewTransactionId.
  	 */
  	s->transactionId = GetNewTransactionId(isSubXact);
  
! 	if (isSubXact)
  		SubTransSetParent(s->transactionId, s->parent->transactionId);
  
  	/*
--- 404,418 ----
  	AssignTransactionId(s->parent);
  
  	/*
! 	 * Generate a new Xid and record it in PG_PROC. If there's no room
! 	 * in MyProc->subxids, we must make the pg_subtrans
! 	 * entry BEFORE the Xid appears anywhere in shared storage other than
! 	 * PG_PROC, because the subtrans entry is needed to ensure that other
! 	 * backends see the Xid as "running".
  	 */
  	s->transactionId = GetNewTransactionId(isSubXact);
  
! 	if (isSubXact && MyProc->subxids.overflowed)
  		SubTransSetParent(s->transactionId, s->parent->transactionId);
  
  	/*
On Thu, 2008-10-23 at 13:40 +0300, Heikki Linnakangas wrote: > Simon Riggs wrote: > > The patch isn't ready to apply standalone because we need to include the > > changes to XidInMVCCSnapshot() also, which would take a little while to > > extract. Let me know if that is worth producing a standalone patch for. > > FWIW, this patch becomes a lot simpler if you don't change the function > signature, and don't move the SubtransSetParent() call. Yeh, I'm trying to judge between trying to be simple and trying to be neat. Whichever one I pick, I seem to be wrong. :-) But I begin, perhaps too slowly, to understand that this is a natural part of review itself, not really a problem. Thanks for your help. -- Simon Riggs www.2ndQuadrant.com PostgreSQL Training, Services and Support