Deriving Recovery Snapshots - Mailing list pgsql-hackers

From Simon Riggs
Subject Deriving Recovery Snapshots
Msg-id 1224006635.3808.45.camel@ebony.2ndQuadrant
List pgsql-hackers
I've worked out what I think is a workable, efficient process for
deriving snapshots during recovery. I will be posting a patch to show
how this works tomorrow [Wed 15 Oct], just doing cleanup now.

Recovery Snapshots are snapshots taken during recovery. They are valid
snapshots in all ways for testing visibility. Assembling the information
to allow snapshots to be taken operates differently in recovery than it
does in normal processing.

The first thing to realise is that in recovery the only knowledge of
what occurs is through WAL records. If it isn't in the WAL, we don't
know about it. Most of the time that also means we can ignore events
that we know occurred, for example data reads.

In order to build the recovery snapshot data we need to re-create events
from WAL data. In some cases we have to add new WAL records to ensure
that all possible information is present.

Each backend that existed on the master is represented by a PROC
structure in the ProcArray. These are known as "recovery procs" and are
similar to the dummy procs used for prepared transactions. All recovery
procs are "owned by" the Startup process. So there is no process for
which MyProc points to one of the recovery procs. This structure allows
us to record the top-level transactions and then put subtransactions in
the proc's subtransaction cache. A fixed one-to-one correspondence
allows efficient maintenance of the structures. We emulate all
transactional backends, including autovacuum.

So in Hot Standby mode we have one set of Recovery Procs emulating what
happened on the master, and another set running read only work.

We maintain information according to certain events on the master.
1. xid assignment (top-level xids)
2. xid assignment (subtransactions)
3. xid commit
4. xid subcommit
5. xid abort/subabort
6. backends which have FATAL errors but write no abort record.

(3) and (5) are easy because we already have WAL records for them.
For (3) we already updated clog from the WAL record, so we just need to
identify the proc and then set the xid.

(4) is completely removed by re-arranging subcommit so it is covered by
commit. (Atomic subxid patch)

(6) is a problem since this event can make transactions disappear
completely from the WAL record. If they crash immediately after Xid
assignment then they may crash without ever writing a WAL record at all.
We handle this in two ways. 
* First, we keep a PROC for each backendid. Notice that means we keep a
PROC for each slot in the master's procarray, not for each pid. So if a
backend explodes and then someone reconnects using that same procarray
slot we will know that the previous transaction on that slot has
completed. This is a subtle but important point: without the ability to
infer certain transactions are complete we would need to keep track of a
potentially unlimited number of xids. Tying transactions to proc slots
means we will never have more than a fixed number of missing xids to
deal with. 
* The backend slot may not be reused for some time, so we should take
additional actions to keep state current and true. So we choose to log a
snapshot from the master into WAL after each checkpoint. This can then
be used to cleanup any unobserved xids. It also provides us with our
initial state data, see later.
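
To illustrate the slot-reuse inference in the first point above, here is a minimal sketch. It is not code from the patch: RecoveryProcSketch, SKETCH_MAX_BACKENDS and both function names are hypothetical, and each recovery proc is reduced to a single top-level xid.

```c
#include <assert.h>
#include <stdint.h>

typedef uint32_t TransactionId;
#define InvalidTransactionId ((TransactionId) 0)
#define SKETCH_MAX_BACKENDS 8	/* hypothetical stand-in for MaxBackends */

/* One entry per procarray slot on the master, not per pid. */
typedef struct RecoveryProcSketch
{
	TransactionId xid;		/* top-level xid running on this slot, if any */
} RecoveryProcSketch;

static RecoveryProcSketch recovery_procs[SKETCH_MAX_BACKENDS];

/*
 * Replay saw a WAL record carrying a new top-level xid for this slot.
 * Whatever xid the slot held before must have completed, even if no
 * commit or abort record was ever written for it. Returns that xid,
 * or InvalidTransactionId if the slot was idle.
 */
static TransactionId
sketch_observe_new_xid(int slot, TransactionId xid)
{
	TransactionId implied_done;

	assert(slot >= 0 && slot < SKETCH_MAX_BACKENDS);
	implied_done = recovery_procs[slot].xid;
	recovery_procs[slot].xid = xid;
	return implied_done;
}

/* Replay saw a commit or abort record for the xid on this slot. */
static void
sketch_observe_end(int slot, TransactionId xid)
{
	assert(slot >= 0 && slot < SKETCH_MAX_BACKENDS);
	if (recovery_procs[slot].xid == xid)
		recovery_procs[slot].xid = InvalidTransactionId;
}
```

Because there is exactly one entry per slot, the number of xids whose fate is unknown can never exceed the number of slots, whatever the master's backends do.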

(1), (2) xid assignment doesn't appear in WAL. Writing WAL records for
each xid assigned would have a catastrophic effect on performance,
especially if we realise that we would have to do that while holding
XidGenLock. So we have to do lots of pressups to avoid it, in the
following ways: 

We put a flag on the first WAL record written by a new transaction.
(Actually we mark the first WAL record containing the new xid, which
isn't always the first WAL record in the transaction. Weird, huh? Think
Oids, Storage files etc). We add an extra xid onto the WAL record to
hold the parent xid, and use that to maintain subtrans.

This works partially but not completely. It is possible for a
transaction to start a very large number of subtransactions before any
part of the transaction writes WAL. We only have space on the WAL record
for one additional xid. Each subxid must record its immediate parent's
xid in subtrans, so if we assign more than one *subtransaction* at a
time we *must* then write a WAL record for all the xids assigned apart
from the last one.

So that only affects transactions that use two or more subtransactions
*and* that insist on starting subtransactions before anything has been
written, which is not very common. So AssignTransactionId() sometimes
needs to write WAL records.
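
One way to read the rule above is "at most one subxid may be outstanding without WAL". The sketch below encodes that reading only; it is not the patch's logic. XactAssignSketch and the function names are hypothetical, and it assumes the next flagged WAL record is written by the most recently assigned xid.

```c
#include <assert.h>
#include <stdbool.h>

/* Per-transaction bookkeeping; hypothetical, not a PostgreSQL structure. */
typedef struct XactAssignSketch
{
	bool	top_unreported;		/* top-level xid not yet seen in WAL */
	int		subxids_unreported;	/* subxids not yet seen in WAL (0 or 1) */
	int		assignment_records;	/* explicit xid-assignment records written */
} XactAssignSketch;

static void
sketch_assign_top(XactAssignSketch *s)
{
	s->top_unreported = true;
}

/*
 * Assign a new subxid. A flagged WAL record has room for only one extra
 * xid (the parent), so if a subxid is already outstanding we must log
 * everything assigned so far before handing out another one.
 * Returns true if an assignment record had to be written.
 */
static bool
sketch_assign_sub(XactAssignSketch *s)
{
	bool	must_log = (s->subxids_unreported >= 1);

	if (must_log)
	{
		s->assignment_records++;
		s->top_unreported = false;
		s->subxids_unreported = 0;
	}
	s->subxids_unreported++;
	assert(s->subxids_unreported == 1);
	return must_log;
}

/* The newest xid wrote its first WAL record, flagged with xid and parent. */
static void
sketch_wrote_flagged_record(XactAssignSketch *s)
{
	s->top_unreported = false;
	s->subxids_unreported = 0;
}

/* Number of xids this backend holds that WAL knows nothing about. */
static int
sketch_unreported(const XactAssignSketch *s)
{
	return (s->top_unreported ? 1 : 0) + s->subxids_unreported;
}
```

Under these assumptions sketch_unreported() never exceeds 2, which is the bound the UnobservedXids sizing relies on.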

Another problem is that xids flagged on WAL records don't arrive in WAL
in the order they were assigned. So we must cope with out-of-order or
"unobserved xids". When we replay WAL, we keep track of UnobservedXids
in a shared memory array. These UnobservedXids are added onto any
recovery Snapshot taken iff they are earlier than latestCompletedXid. So
in the typical case, no xids will be added to the snapshots. For now, I
do all this work holding ProcArrayLock, but there seems scope to
optimise that also. Later.

UnobservedXids is maintained as a sorted array. This comes for free
since xids are always added in xid assignment order. This allows xids to
be removed via bsearch when WAL records arrive for the missing xids. It
also allows us to stop searching for xids once we reach
latestCompletedXid.
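
A minimal sketch of that array handling follows, under stated assumptions. The names are hypothetical, xid wraparound is ignored (plain integer comparison), and the rule that a gap below a newly seen xid becomes unobserved is my reading of the text rather than code from the patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

typedef uint32_t TransactionId;
#define SKETCH_MAX_BACKENDS 4	/* hypothetical stand-in for MaxBackends */
#define UNOBSERVED_CAPACITY (2 * SKETCH_MAX_BACKENDS)

static TransactionId unobserved[UNOBSERVED_CAPACITY];
static size_t n_unobserved = 0;
static TransactionId sketch_next_xid = 0;	/* next xid replay expects to see */

static int
cmp_xid(const void *a, const void *b)
{
	TransactionId x = *(const TransactionId *) a;
	TransactionId y = *(const TransactionId *) b;

	return (x > y) - (x < y);
}

/* Append only: callers add in assignment order, so the array stays sorted. */
static void
unobserved_add(TransactionId xid)
{
	assert(n_unobserved < UNOBSERVED_CAPACITY);
	assert(n_unobserved == 0 || unobserved[n_unobserved - 1] < xid);
	unobserved[n_unobserved++] = xid;
}

/* Remove an xid once a WAL record for it finally arrives. */
static bool
unobserved_remove(TransactionId xid)
{
	TransactionId *hit = bsearch(&xid, unobserved, n_unobserved,
								 sizeof(TransactionId), cmp_xid);
	size_t		idx;

	if (hit == NULL)
		return false;
	idx = (size_t) (hit - unobserved);
	memmove(hit, hit + 1, (n_unobserved - idx - 1) * sizeof(TransactionId));
	n_unobserved--;
	return true;
}

/* Replay saw xid flagged on a WAL record. */
static void
sketch_observe_xid(TransactionId xid)
{
	if (xid >= sketch_next_xid)
	{
		/* every xid we skipped over was assigned but not yet seen */
		for (TransactionId x = sketch_next_xid; x < xid; x++)
			unobserved_add(x);
		sketch_next_xid = xid + 1;
	}
	else
		(void) unobserved_remove(xid);
}

/* Copy xids earlier than latestCompletedXid; the sort lets us stop early. */
static size_t
unobserved_copy_to_snapshot(TransactionId latestCompletedXid,
							TransactionId *out)
{
	size_t		n = 0;

	for (size_t i = 0;
		 i < n_unobserved && unobserved[i] < latestCompletedXid; i++)
		out[n++] = unobserved[i];
	return n;
}
```

In the common case the loop in unobserved_copy_to_snapshot() exits on its first comparison, so taking a snapshot costs almost nothing extra.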

As a result of the AssignTransaction WAL records we know that each
backend will only ever allocate at most 2 xids before notifying WAL in
some way, either by flagging a WAL entry it makes or by making an entry
when assigning the new xids. As a result the UnobservedXids array will
never overflow if it has 2* MaxBackends entries. (I've added code, but
#ifdef'd it out).

Since UnobservedXids can never overflow, we also see that the Snapshot
can never overflow *because* of UnobservedXids. Each unobserved
top-level xid leaves space for 65 xids, yet we need only 2 to add the
unobserved xids.

I've had to change the way XidInMVCCSnapshot() works. We search the
snapshot even if it has overflowed. This is actually a performance win
in cases where only a few xids have overflowed but most haven't. This is
essential because if we were forced to check in subtrans *and*
unobservedxids existed then the snapshot would be invalid. (I could have
made it this way *just* in recovery, but the change seems better both
ways).
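
The changed lookup order can be sketched as below. This is an illustration rather than the patched XidInMVCCSnapshot(): SnapshotSketch is a simplified hypothetical struct, the subtrans lookup is passed in as a function pointer, toy_topmost() is a toy stand-in for it, and xid wraparound is ignored.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef uint32_t TransactionId;

typedef struct SnapshotSketch
{
	TransactionId xmin;
	TransactionId xmax;
	const TransactionId *xip;		/* running top-level xids */
	size_t		xcnt;
	const TransactionId *subxip;	/* running subxids known to the snapshot */
	size_t		subxcnt;
	bool		suboverflowed;		/* some subxids are missing from subxip */
} SnapshotSketch;

/* Stand-in for a subtrans lookup of the topmost parent. */
typedef TransactionId (*TopmostLookup) (TransactionId xid);

static int	toy_lookups = 0;

/* Toy lookup: pretends xid 205 is a subxid of 200, all others are top-level. */
static TransactionId
toy_topmost(TransactionId xid)
{
	toy_lookups++;
	return (xid == 205) ? 200 : xid;
}

static bool
xid_in_array(TransactionId xid, const TransactionId *arr, size_t n)
{
	for (size_t i = 0; i < n; i++)
		if (arr[i] == xid)
			return true;
	return false;
}

static bool
sketch_xid_in_snapshot(TransactionId xid, const SnapshotSketch *snap,
					   TopmostLookup topmost)
{
	if (xid < snap->xmin)
		return false;
	if (xid >= snap->xmax)
		return true;

	/* Always search the arrays first, even if the snapshot overflowed. */
	if (xid_in_array(xid, snap->xip, snap->xcnt) ||
		xid_in_array(xid, snap->subxip, snap->subxcnt))
		return true;

	/* Only when overflowed do we fall back to the subtrans lookup. */
	if (snap->suboverflowed)
	{
		TransactionId top = topmost(xid);

		if (top < snap->xmin)
			return false;
		return xid_in_array(top, snap->xip, snap->xcnt);
	}
	return false;
}
```

An xid that sits directly in the snapshot, which is what an unobserved xid would be, is reported as running without touching subtrans at all.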

So that's how we maintain info required for Snapshots, but the next part
of the puzzle is how we handle the initial state. Again, subtransactions
are a pain because there can be an extremely large number of them. So
taking a snapshot and copying it to WAL is insufficient. We handle this
by taking a snapshot when we have performed pg_start_backup() (in the
checkpoint we already take) and then taking another snapshot after each
checkpoint. Doing it that way means wherever we restart from we always
have an initial state record close to hand. On the standby, if the first
snapshot we see has overflowed then we wait for a snapshot to arrive
which has not overflowed. (We could also wait for a snapshot whose xmin
is later than the xmax of our first snapshot).
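
The startup decision on the standby might look like the sketch below. All names are hypothetical and xid wraparound is ignored. The second acceptance rule is the optional one from the parenthesis above, included only to show how it would fit.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef uint32_t TransactionId;

/* Snapshot data logged by the master after a checkpoint (simplified). */
typedef struct RunningXactsSketch
{
	TransactionId xmin;
	TransactionId xmax;
	bool		overflowed;	/* too many subxids to list them all */
} RunningXactsSketch;

typedef struct StandbyInitSketch
{
	bool		have_first;	/* seen at least one (overflowed) snapshot */
	TransactionId first_xmax;	/* xmax of that first snapshot */
	bool		ready;		/* initial state is good enough for queries */
} StandbyInitSketch;

/*
 * Feed each logged snapshot to the standby as replay reaches it.
 * Returns true once Hot Standby could start taking its own snapshots.
 */
static bool
sketch_consider_snapshot(StandbyInitSketch *st, const RunningXactsSketch *snap)
{
	if (st->ready)
		return true;

	if (!snap->overflowed)
		st->ready = true;	/* complete initial state */
	else if (st->have_first && snap->xmin > st->first_xmax)
		st->ready = true;	/* everything running at the first snapshot
							 * has since finished (optional rule) */
	else if (!st->have_first)
	{
		st->have_first = true;
		st->first_xmax = snap->xmax;
	}
	return st->ready;
}
```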

This means that there could be a delay in starting Hot Standby mode *if*
we are heavily using subtransactions at the time we take backup.

So overheads of the patch are:
* WAL record extended to completely fill 8-byte alignment; extra 4 bytes
per record on 4-byte alignment. Additional items are:
    uint16           xl_info2;
    TransactionId    xl_xid2;
This takes no additional space on 64-bit servers because of previous
wastage.

* AssignTransactionId must WAL log xid assignment when making multiple
assignments.

* bgwriter writes Snapshot data to WAL after each checkpoint.

* additional shared memory: 
2 * MaxBackends * sizeof(TransactionId) for UnobservedXids
1 * MaxBackends * sizeof(PGPROC) for RecoveryProcs

* additional processing time during recovery to maintain snapshot info
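
The record-header overhead in the first item above can be checked with a small layout sketch. The existing fields are listed from my recollection of the current XLogRecord header and should be treated as an assumption. The position of the two new fields is also an assumption, since only their names and types are given here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uint32_t TransactionId;

/* Simplified stand-in for XLogRecPtr: two 32-bit halves. */
typedef struct XLogRecPtrSketch
{
	uint32_t	xlogid;
	uint32_t	xrecoff;
} XLogRecPtrSketch;

/* Header before the patch (assumed layout): 26 bytes of payload. */
typedef struct OldHeaderSketch
{
	uint32_t	xl_crc;
	XLogRecPtrSketch xl_prev;
	TransactionId xl_xid;
	uint32_t	xl_tot_len;
	uint32_t	xl_len;
	uint8_t		xl_info;
	uint8_t		xl_rmid;
	/* 2 or 6 bytes of padding follow, depending on maximum alignment */
} OldHeaderSketch;

/* Header with the two additional items placed in the old padding. */
typedef struct NewHeaderSketch
{
	uint32_t	xl_crc;
	XLogRecPtrSketch xl_prev;
	TransactionId xl_xid;
	uint32_t	xl_tot_len;
	uint32_t	xl_len;
	uint8_t		xl_info;
	uint8_t		xl_rmid;
	uint16_t	xl_info2;	/* assumed position */
	TransactionId xl_xid2;	/* assumed position */
} NewHeaderSketch;

/* Round len up to a multiple of align (align must be a power of two). */
static size_t
sketch_align(size_t len, size_t align)
{
	assert(align != 0 && (align & (align - 1)) == 0);
	return (len + align - 1) & ~(align - 1);
}

static size_t
sketch_old_header_size(size_t maxalign)
{
	return sketch_align(sizeof(OldHeaderSketch), maxalign);
}

static size_t
sketch_new_header_size(size_t maxalign)
{
	return sketch_align(sizeof(NewHeaderSketch), maxalign);
}
```

With this layout the two sizes are equal at 8-byte alignment and differ by 4 bytes at 4-byte alignment, matching the overhead stated above.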

In the current patch I've put the slotid and flag bits into a uint16. That
means we can manage up to 4096 connections without writing any
additional WAL data. Beyond that we need to write a WAL record for each
AssignTransactionId(), plus add 4 bytes onto each Commit/Abort record.
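
The 4096 limit suggests a split of 12 bits for the slot id and 4 bits for flags, since 2^12 = 4096. That split is my inference and is not stated above. The sketch below shows the packing under that assumption, with hypothetical macro and function names.

```c
#include <assert.h>
#include <stdint.h>

#define SKETCH_SLOT_BITS	12
#define SKETCH_SLOT_MASK	((1u << SKETCH_SLOT_BITS) - 1)	/* 0x0FFF */
#define SKETCH_MAX_SLOTS	(1u << SKETCH_SLOT_BITS)		/* 4096 */
#define SKETCH_MAX_FLAGS	(1u << (16 - SKETCH_SLOT_BITS))	/* 16 values */

/* Pack a slot id (low 12 bits) and flag bits (high 4 bits) into a uint16. */
static uint16_t
sketch_pack_info2(unsigned slot, unsigned flags)
{
	assert(slot < SKETCH_MAX_SLOTS);
	assert(flags < SKETCH_MAX_FLAGS);
	return (uint16_t) ((flags << SKETCH_SLOT_BITS) | slot);
}

static unsigned
sketch_info2_slot(uint16_t info2)
{
	return info2 & SKETCH_SLOT_MASK;
}

static unsigned
sketch_info2_flags(uint16_t info2)
{
	return (unsigned) info2 >> SKETCH_SLOT_BITS;
}
```

A slot id of 4096 or more does not fit in the 12 bits, which is the point at which the extra WAL data described above becomes necessary.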

Note that because of the atomic subxids changes we actually write fewer
WAL records in most cases than we did before and they occupy the same
space they did before.

I'll post patch tomorrow and at least weekly after this.

Patch footprint looks like this prior to cleanup.
 backend/access/transam/varsup.c  |   52 -!
 backend/access/transam/xact.c    |  559 ++++++++++++++++++++++!!!!!
 backend/access/transam/xlog.c    |   49 +-
 backend/postmaster/bgwriter.c    |   11
 backend/storage/ipc/procarray.c  |  721 +++++++++++++++++++++++++++++!!!
 backend/storage/lmgr/proc.c      |  107 +++++
 backend/utils/time/tqual.c       |   27 !
 include/access/heapam.h          |    2
 include/access/htup.h            |    2
 include/access/transam.h         |    2
 include/access/xact.h            |   23 +
 include/access/xlog.h            |   44 +!
 include/access/xlog_internal.h   |    2
 include/catalog/pg_control.h     |    3
 include/storage/proc.h           |    4
 include/storage/procarray.h      |   14
 include/utils/snapshot.h         |   65 +++
 17 files changed, 1432 insertions(+), 44 deletions(-), 211 mods(!)
 

Your comments are welcome, especially questions and thoughts around the
correctness of the approach. Lots more comments in patch.

-- Simon Riggs           www.2ndQuadrant.com
   PostgreSQL Training, Services and Support


