Thread: Hot standby, recovery procs

Hot standby, recovery procs

From
Heikki Linnakangas
Date:
(back to reviewing the main hot standby patch at last)

Why do we need recovery procs? AFAICS the only fields that we use are
xid and the subxid cache. Now that we also have the unobserved xids
array, why don't we use it to track all transactions in the master, not 
just the unobserved ones.

--   Heikki Linnakangas  EnterpriseDB   http://www.enterprisedb.com



Re: Hot standby, recovery procs

From
Simon Riggs
Date:
On Tue, 2009-02-24 at 10:40 +0200, Heikki Linnakangas wrote:
> (back to reviewing the main hot standby patch at last)
> 
> Why do we need recovery procs? AFAICS the only fields that we use are
> xid and the subxid cache. Now that we also have the unobserved xids
> array, why don't we use it to track all transactions in the master, not 
> just the unobserved ones.

We need an array of objects defined in shared memory that has a
top-level xid and a subxid cache. That object also needs an lsn
attribute. We need code that adds these, removes them and adds the data
onto snapshots in almost identical ways to current procarray code.

Those objects live and die completely differently to unobservedxids,
which don't need (nor can they have) the more complex data structure.

I think if I had not made those into procs you would have said that they
are so similar it would aid code readability to have them be the same.

What benefit would we gain from separating them, especially since we now
have working, tested code?

-- Simon Riggs           www.2ndQuadrant.comPostgreSQL Training, Services and Support



Re: Hot standby, recovery procs

From
Heikki Linnakangas
Date:
Simon Riggs wrote:
> On Tue, 2009-02-24 at 10:40 +0200, Heikki Linnakangas wrote:
>> (back to reviewing the main hot standby patch at last)
>>
>> Why do we need recovery procs? AFAICS the only fields that we use are
>> xid and the subxid cache. Now that we also have the unobserved xids
>> array, why don't we use it to track all transactions in the master, not 
>> just the unobserved ones.
> 
> We need an array of objects defined in shared memory that has a
> top-level xid and a subxid cache.

Not really. The other transactions, taking snapshots, don't need to 
distinguish top-level xids and subxids. That's why the unobserved xids 
array works to begin with. We only need a list of running 
(sub)transaction ids. Which is exactly what unobservedxids array is.

The startup process can track the parent-child relationships in private 
memory if it needs to. But I can't immediately see why it would need to: 
commit and abort records list all the subtransactions. To keep the 
unobserved xids array bounded, when we find out about a parent-child 
relationship, via an xact-assignment record or via the xid and top-level 
xid fields in other WAL records, we can simply use SubtransSetParent. To 
keep it real simple, we can stipulate that you always check subtrans in 
XidIdInMVCCSnapshot while in hot standby mode.

> That object also needs an lsn
> attribute. We need code that adds these, removes them and adds the data
> onto snapshots in almost identical ways to current procarray code.

We only need the lsn atrribute because we when we take the snapshot of 
running xids, we don't write it to the WAL immediately, and a new 
transaction might begin after that. If we close that gap in the master, 
we don't need the lsn in recovery procs.

Actually, I think the patch doesn't get that right as it stands:

0. Transactions 1 is running in master
1. Get list of running transactions
2. Transaction 1 commits.
3. List of running xacts is written to WAL

When the standby replays the xl_running_xacts record, it will create a 
recovery proc and mark the transaction as running again, even though it 
has already committed.

PS. This line in the same function (ProcArrayUpdateRecoveryTransactions) 
seems wrong as well:
>             memcpy(proc->subxids.xids, subxip, 
>                         rxact[xid_index].nsubxids * sizeof(TransactionId));

I don't think "subxip" is correct for the 2d argument.

> I think if I had not made those into procs you would have said that they
> are so similar it would aid code readability to have them be the same.

And in fact I suggested earlier that we get rid of the unobserved xids 
array, and only use recovery procs.

> What benefit would we gain from separating them, especially since we now
> have working, tested code?

Simplicity. That matters a lot. Removing the distinction between 
unobserved xids and already-observed running transactions would slash a 
lot of code.

I appreciate your testing, but it's not like it has gone through years 
of usage in the field. This is not the case of "if it ain't broken, 
don't fix it". The code that's in the patch is not in production yet, 
and now is precisely the right time to get it right, before it goes into 
the "if it ain't broke, don't fix it" mode.

--   Heikki Linnakangas  EnterpriseDB   http://www.enterprisedb.com


Re: Hot standby, recovery procs

From
Simon Riggs
Date:
On Tue, 2009-02-24 at 21:59 +0200, Heikki Linnakangas wrote:
> We only need the lsn atrribute because we when we take the snapshot
> of 
> running xids, we don't write it to the WAL immediately, and a new 
> transaction might begin after that. If we close that gap in the
> master, 
> we don't need the lsn in recovery procs.
> 
> Actually, I think the patch doesn't get that right as it stands:
> 
> 0. Transactions 1 is running in master
> 1. Get list of running transactions
> 2. Transaction 1 commits.
> 3. List of running xacts is written to WAL
> 
> When the standby replays the xl_running_xacts record, it will create
> a 
> recovery proc and mark the transaction as running again, even though
> it 
> has already committed.

No, because we check whether TransactionIdDidCommit().

-- Simon Riggs           www.2ndQuadrant.comPostgreSQL Training, Services and Support



Re: Hot standby, recovery procs

From
Heikki Linnakangas
Date:
Simon Riggs wrote:
> On Tue, 2009-02-24 at 21:59 +0200, Heikki Linnakangas wrote:
>> We only need the lsn atrribute because we when we take the snapshot
>> of 
>> running xids, we don't write it to the WAL immediately, and a new 
>> transaction might begin after that. If we close that gap in the
>> master, 
>> we don't need the lsn in recovery procs.
>>
>> Actually, I think the patch doesn't get that right as it stands:
>>
>> 0. Transactions 1 is running in master
>> 1. Get list of running transactions
>> 2. Transaction 1 commits.
>> 3. List of running xacts is written to WAL
>>
>> When the standby replays the xl_running_xacts record, it will create
>> a 
>> recovery proc and mark the transaction as running again, even though
>> it 
>> has already committed.
> 
> No, because we check whether TransactionIdDidCommit().

Oh, right... But we have the same problem with the subtransactions, 
don't we? This block:

>         /*
>          * If our state information is later for this proc, then 
>          * overwrite it. It's possible for a commit and possibly
>          * a new transaction record to have arrived in WAL in between
>          * us doing GetRunningTransactionData() and grabbing the
>          * WALInsertLock, so we musn't assume we always know best.
>          */
>         if (XLByteLT(proc->lsn, lsn))
>         {
>             TransactionId     *subxip = (TransactionId *) &(xlrec->xrun[xlrec->xcnt]);
> 
>             proc->lsn = lsn;
>             /* proc-> pid stays 0 for Recovery Procs */
> 
>             proc->subxids.nxids = rxact[xid_index].nsubxids;
>             proc->subxids.overflowed = rxact[xid_index].overflowed;
> 
>             memcpy(proc->subxids.xids, subxip, 
>                         rxact[xid_index].nsubxids * sizeof(TransactionId));
> 
>             /* Remove subtransactions from UnobservedXids also */
>             if (unobserved)
>             {
>                 for (index = 0; index < rxact[xid_index].nsubxids; index++)
>                     UnobservedTransactionsRemoveXid(subxip[index + rxact[xid_index].subx_offset], false);
>             }
>         }

overwrites subxids array, and will resurrect any already aborted 
subtransaction.

Isn't XLByteLT(proc->lsn, lsn) always true, because 'lsn' is the lsn of 
the WAL record we're redoing, so there can't be any procs with an LSN 
higher than that?

--   Heikki Linnakangas  EnterpriseDB   http://www.enterprisedb.com


Re: Hot standby, recovery procs

From
Simon Riggs
Date:
On Tue, 2009-02-24 at 21:59 +0200, Heikki Linnakangas wrote:

> > I think if I had not made those into procs you would have said that they
> > are so similar it would aid code readability to have them be the same.
> 
> And in fact I suggested earlier that we get rid of the unobserved xids 
> array, and only use recovery procs.

Last week, I think. Why are these tweaks so important?

Checking pg_subtrans for every call to XidInMVCCSnapshot will destroy
performance, as well you know.

> > What benefit would we gain from separating them, especially since we now
> > have working, tested code?
> 
> Simplicity. That matters a lot. Removing the distinction between 
> unobserved xids and already-observed running transactions would slash a 
> lot of code.

It might and it might not, but I don't believe all angles have been
evaluated. But I would say that major changes such as this have resulted
in weeks of work. More bugs have been introduced since feature freeze
than were present beforehand. 

If you want this code to fail, then twisting it in lots of directions
every week is exactly the way to do that. Neither of us will understand
how it works and we'll take more weeks for it to settle down to the
point of reviewability again. We don't have weeks any more.

So far I've made every change you've asked, but there is a reasonable
limit. 

-- Simon Riggs           www.2ndQuadrant.comPostgreSQL Training, Services and Support



Re: Hot standby, recovery procs

From
Simon Riggs
Date:
On Tue, 2009-02-24 at 22:29 +0200, Heikki Linnakangas wrote:

> Oh, right... But we have the same problem with the subtransactions, 
> don't we? This block:
> 
> >         /*
> >          * If our state information is later for this proc, then 
> >          * overwrite it. It's possible for a commit and possibly
> >          * a new transaction record to have arrived in WAL in between
> >          * us doing GetRunningTransactionData() and grabbing the
> >          * WALInsertLock, so we musn't assume we always know best.
> >          */
> >         if (XLByteLT(proc->lsn, lsn))
> >         {
> >             TransactionId     *subxip = (TransactionId *) &(xlrec->xrun[xlrec->xcnt]);
> > 
> >             proc->lsn = lsn;
> >             /* proc-> pid stays 0 for Recovery Procs */
> > 
> >             proc->subxids.nxids = rxact[xid_index].nsubxids;
> >             proc->subxids.overflowed = rxact[xid_index].overflowed;
> > 
> >             memcpy(proc->subxids.xids, subxip, 
> >                         rxact[xid_index].nsubxids * sizeof(TransactionId));
> > 
> >             /* Remove subtransactions from UnobservedXids also */
> >             if (unobserved)
> >             {
> >                 for (index = 0; index < rxact[xid_index].nsubxids; index++)
> >                     UnobservedTransactionsRemoveXid(subxip[index + rxact[xid_index].subx_offset], false);
> >             }
> >         }
> 
> overwrites subxids array, and will resurrect any already aborted 
> subtransaction.
> 
> Isn't XLByteLT(proc->lsn, lsn) always true, because 'lsn' is the lsn of 
> the WAL record we're redoing, so there can't be any procs with an LSN 
> higher than that?

I'm wondering whether we need those circumstances at all.

The main role of ProcArrayUpdateRecoveryTransactions() is two-fold
* initialise snapshot when there isn't one
* reduce possibility of FATAL errors that don't write abort records

Neither of those needs us to update the subxid cache, so we'd be better
off avoiding that altogether in the common case. So we should be able to
ignore the lsn and race conditions altogether.

It might even be more helpful to explicitly separate those twin roles so
the code is clearer.

-- Simon Riggs           www.2ndQuadrant.comPostgreSQL Training, Services and Support



Re: Hot standby, recovery procs

From
Simon Riggs
Date:
On Tue, 2009-02-24 at 23:41 +0000, Simon Riggs wrote:
> On Tue, 2009-02-24 at 22:29 +0200, Heikki Linnakangas wrote:

> > overwrites subxids array, and will resurrect any already aborted 
> > subtransaction.
> > 
> > Isn't XLByteLT(proc->lsn, lsn) always true, because 'lsn' is the lsn of 
> > the WAL record we're redoing, so there can't be any procs with an LSN 
> > higher than that?
> 
> I'm wondering whether we need those circumstances at all.
> 
> The main role of ProcArrayUpdateRecoveryTransactions() is two-fold
> * initialise snapshot when there isn't one
> * reduce possibility of FATAL errors that don't write abort records
> 
> Neither of those needs us to update the subxid cache, so we'd be better
> off avoiding that altogether in the common case. So we should be able to
> ignore the lsn and race conditions altogether.

We still have a race condition for the initial snapshot, so your concern
still holds. Thanks for highlighting it.

I'm in the middle of rewriting ProcArrayUpdateRecoveryTransactions() to
avoid errors caused by these race conditions. The LSN flag was an
attempt to do that, but was insufficient and has now been removed.

I'll discuss it more when I've got it working. Seems like we need
working code now rather than lengthy debates. I see a solution and
almost have it done.

-- Simon Riggs           www.2ndQuadrant.comPostgreSQL Training, Services and Support



Re: Hot standby, recovery procs

From
Heikki Linnakangas
Date:
Simon Riggs wrote:
> On Tue, 2009-02-24 at 21:59 +0200, Heikki Linnakangas wrote:
>>> What benefit would we gain from separating them, especially since we now
>>> have working, tested code?
>> Simplicity. That matters a lot. Removing the distinction between 
>> unobserved xids and already-observed running transactions would slash a 
>> lot of code.
> 
> It might and it might not, but I don't believe all angles have been
> evaluated. But I would say that major changes such as this have resulted
> in weeks of work. More bugs have been introduced since feature freeze
> than were present beforehand. 

Here's a rough sketch of how the transaction tracking could work without 
recovery procs, relying on unobserved xids array only. The "unobserved 
xids" is a complete misnomer now, as it tracks all master-transactions, 
and there's no distinction between observed and unobserved ones.

Another big change in this patch is the way xl_xact_assignment records 
work. Instead of issuing one such WAL record for each subtransaction 
when they're being assigned recursively, we keep track of which xids 
have already been "reported" in the WAL (similar to what you had in an 
earlier version of the patch). Whenever you hit the limit of 64 
unreported subxids, you issue a single WAL record listing all the 
unreported subxids of this top-level transactions, and mark them as 
reported. The limit of 64 is chosen arbitrarily, but it should match the 
number of slots in the unobserved xids array per backend, to avoid 
running out of slots. This eliminates the need for the xl_topxid field 
in the WAL record header. I think one WAL record per 64 assigned 
subtransactions is a small price to pay, considering that a transaction 
with that many subtransactions is probably doing some interesting work 
anyway, and the volume of those assignment WAL records is lost in the 
noise of all the other WAL records the transactions issues.

--   Heikki Linnakangas  EnterpriseDB   http://www.enterprisedb.com


Re: Hot standby, recovery procs

From
Heikki Linnakangas
Date:
Forgot attachment, here it is as a patch against CVS HEAD. It's also in
my git repository.

Heikki Linnakangas wrote:
> Simon Riggs wrote:
>> On Tue, 2009-02-24 at 21:59 +0200, Heikki Linnakangas wrote:
>>>> What benefit would we gain from separating them, especially since we
>>>> now
>>>> have working, tested code?
>>> Simplicity. That matters a lot. Removing the distinction between
>>> unobserved xids and already-observed running transactions would slash
>>> a lot of code.
>>
>> It might and it might not, but I don't believe all angles have been
>> evaluated. But I would say that major changes such as this have resulted
>> in weeks of work. More bugs have been introduced since feature freeze
>> than were present beforehand.
>
> Here's a rough sketch of how the transaction tracking could work without
> recovery procs, relying on unobserved xids array only. The "unobserved
> xids" is a complete misnomer now, as it tracks all master-transactions,
> and there's no distinction between observed and unobserved ones.
>
> Another big change in this patch is the way xl_xact_assignment records
> work. Instead of issuing one such WAL record for each subtransaction
> when they're being assigned recursively, we keep track of which xids
> have already been "reported" in the WAL (similar to what you had in an
> earlier version of the patch). Whenever you hit the limit of 64
> unreported subxids, you issue a single WAL record listing all the
> unreported subxids of this top-level transactions, and mark them as
> reported. The limit of 64 is chosen arbitrarily, but it should match the
> number of slots in the unobserved xids array per backend, to avoid
> running out of slots. This eliminates the need for the xl_topxid field
> in the WAL record header. I think one WAL record per 64 assigned
> subtransactions is a small price to pay, considering that a transaction
> with that many subtransactions is probably doing some interesting work
> anyway, and the volume of those assignment WAL records is lost in the
> noise of all the other WAL records the transactions issues.
>


--
   Heikki Linnakangas
   EnterpriseDB   http://www.enterprisedb.com
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 223911c..b617bb7 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -370,6 +370,12 @@ SET ENABLE_SEQSCAN TO OFF;
         allows. See <xref linkend="sysvipc"> for information on how to
         adjust those parameters, if necessary.
        </para>
+
+       <para>
+    When running a standby server it is strongly recommended that you
+    set this parameter to be the same or higher than the master server,
+    otherwise queries on the standby server may fail.
+       </para>
       </listitem>
      </varlistentry>

@@ -5392,6 +5398,32 @@ plruby.use_strict = true        # generates error: unknown class name
       </listitem>
      </varlistentry>

+     <varlistentry id="guc-trace-recovery-messages" xreflabel="trace_recovery_messages">
+      <term><varname>trace_recovery_messages</varname> (<type>string</type>)</term>
+      <indexterm>
+       <primary><varname>trace_recovery_messages</> configuration parameter</primary>
+      </indexterm>
+      <listitem>
+       <para>
+        Controls which message levels are written to the server log
+        for system modules needed for recovery processing. This allows
+        the user to override the normal setting of log_min_messages,
+        but only for specific messages. This is intended for use in
+        debugging Hot Standby.
+        Valid values are <literal>DEBUG5</>, <literal>DEBUG4</>,
+        <literal>DEBUG3</>, <literal>DEBUG2</>, <literal>DEBUG1</>,
+        <literal>INFO</>, <literal>NOTICE</>, <literal>WARNING</>,
+        <literal>ERROR</>, <literal>LOG</>, <literal>FATAL</>, and
+        <literal>PANIC</>.  Each level includes all the levels that
+        follow it.  The later the level, the fewer messages are sent
+        to the log.  The default is <literal>WARNING</>.  Note that
+        <literal>LOG</> has a different rank here than in
+        <varname>client_min_messages</>.
+        Parameter should be set in the postgresql.conf only.
+       </para>
+      </listitem>
+     </varlistentry>
+
     <varlistentry id="guc-zero-damaged-pages" xreflabel="zero_damaged_pages">
       <term><varname>zero_damaged_pages</varname> (<type>boolean</type>)</term>
       <indexterm>
diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml
index c999f0c..910e0f9 100644
--- a/doc/src/sgml/func.sgml
+++ b/doc/src/sgml/func.sgml
@@ -12963,6 +12963,193 @@ postgres=# select * from pg_xlogfile_name_offset(pg_stop_backup());
     <xref linkend="continuous-archiving">.
    </para>

+   <indexterm>
+    <primary>pg_is_in_recovery</primary>
+   </indexterm>
+   <indexterm>
+    <primary>pg_last_recovered_xact_timestamp</primary>
+   </indexterm>
+   <indexterm>
+    <primary>pg_last_recovered_xid</primary>
+   </indexterm>
+   <indexterm>
+    <primary>pg_last_recovered_xlog_location</primary>
+   </indexterm>
+   <indexterm>
+    <primary>pg_recovery_pause</primary>
+   </indexterm>
+   <indexterm>
+    <primary>pg_recovery_continue</primary>
+   </indexterm>
+   <indexterm>
+    <primary>pg_recovery_pause_xid</primary>
+   </indexterm>
+   <indexterm>
+    <primary>pg_recovery_pause_time</primary>
+   </indexterm>
+   <indexterm>
+    <primary>pg_recovery_stop</primary>
+   </indexterm>
+
+   <para>
+    The functions shown in <xref
+    linkend="functions-admin-recovery-table"> assist in archive recovery.
+    Except for the first three functions, these are restricted to superusers.
+    All of these functions can only be executed during recovery.
+   </para>
+
+   <table id="functions-admin-recovery-table">
+    <title>Recovery Control Functions</title>
+    <tgroup cols="3">
+     <thead>
+      <row><entry>Name</entry> <entry>Return Type</entry> <entry>Description</entry>
+      </row>
+     </thead>
+
+     <tbody>
+      <row>
+       <entry>
+        <literal><function>pg_is_in_recovery</function>()</literal>
+        </entry>
+       <entry><type>bool</type></entry>
+       <entry>True if recovery is still in progress.</entry>
+      </row>
+      <row>
+       <entry>
+        <literal><function>pg_last_recovered_xact_timestamp</function>()</literal>
+        </entry>
+       <entry><type>timestamp with time zone</type></entry>
+       <entry>Returns the original completion timestamp with timezone of the
+        last recovered transaction. If recovery is still in progress this
+        will increase monotonically while if recovery is complete then this
+        value will remain static at the value of the last transaction applied
+        during that recovery. When the server has been started normally this
+        will return a default value.
+       </entry>
+      </row>
+      <row>
+       <entry>
+        <literal><function>pg_last_recovered_xid</function>()</literal>
+        </entry>
+       <entry><type>integer</type></entry>
+       <entry>Returns the transaction id (32-bit) of last completed transaction
+        in the current recovery. Later numbered transaction ids may already have
+        completed, so the value could in some cases be lower than the last time
+        this function executed. If recovery is complete then this value will
+        remain static at the value of the last transaction applied during that
+        recovery. When the server has been started normally this will return
+        InvalidXid (zero).
+       </entry>
+      </row>
+      <row>
+       <entry>
+        <literal><function>pg_last_recovered_xlog_location</function>()</literal>
+        </entry>
+       <entry><type>text</type></entry>
+       <entry>Returns the transaction log location of the last WAL record
+        in the current recovery. If recovery is still in progress this
+        will increase monotonically. If recovery is complete then this value will
+        remain static at the value of the last transaction applied during that
+        recovery. When the server has been started normally this will return
+        InvalidXLogRecPtr (0/0).
+        (zero).
+       </entry>
+      </row>
+
+      <row>
+       <entry>
+        <literal><function>pg_recovery_pause</function>()</literal>
+        </entry>
+       <entry><type>void</type></entry>
+       <entry>Pause recovery processing, unconditionally.</entry>
+      </row>
+      <row>
+       <entry>
+        <literal><function>pg_recovery_continue</function>()</literal>
+        </entry>
+       <entry><type>void</type></entry>
+       <entry>If recovery is paused, continue processing.</entry>
+      </row>
+      <row>
+       <entry>
+        <literal><function>pg_recovery_stop</function>()</literal>
+        </entry>
+       <entry><type>void</type></entry>
+       <entry>End recovery and begin normal processing.</entry>
+      </row>
+      <row>
+       <entry>
+        <literal><function>pg_recovery_pause_xid</function>()</literal>
+        </entry>
+       <entry><type>void</type></entry>
+       <entry>Continue recovery until specified xid completes, if it is ever
+        seen, then pause recovery.
+       </entry>
+      </row>
+      <row>
+       <entry>
+        <literal><function>pg_recovery_pause_time</function>()</literal>
+        </entry>
+       <entry><type>void</type></entry>
+       <entry>Continue recovery until a transaction with specified timestamp
+        completes, if one is ever seen, then pause recovery.
+       </entry>
+      </row>
+      <row>
+       <entry>
+        <literal><function>pg_recovery_advance</function>()</literal>
+        </entry>
+       <entry><type>void</type></entry>
+       <entry>Advance recovery specified number of records then pause.</entry>
+      </row>
+     </tbody>
+    </tgroup>
+   </table>
+
+   <para>
+    <function>pg_recovery_pause</> and <function>pg_recovery_continue</> allow
+    a superuser to control the progress of recovery on the database server.
+    While recovery is paused queries can then be executed to determine how far
+    forwards recovery should progress. Recovery can never go backwards
+    because previous values are overwritten.  If the superuser wishes recovery
+    to complete and normal processing mode to start, execute
+    <function>pg_recovery_stop</>.
+   </para>
+
+   <para>
+    Variations of the pause function exist, mainly to allow PITR to dynamically
+    control where it should progress to. <function>pg_recovery_pause_xid</> and
+    <function>pg_recovery_pause_time</> allow the specification of a trial
+    recovery target, similarly to <xref linkend="recovery-config-settings">.
+    Recovery will then progress to the specified point and then pause, rather
+    than stopping permanently, allowing assessment of whether this is the
+    desired stopping point for recovery.
+   </para>
+
+   <para>
+    <function>pg_recovery_advance</> allows recovery to progress record by
+    record, for very careful analysis or debugging. Step size can be 1 or
+    more records. If recovery is not yet paused then <function>pg_recovery_advance</>
+    will process the specified number of records then pause. If recovery
+    is already paused, recovery will continue for another N records before
+    pausing again.
+   </para>
+
+   <para>
+    If you pause recovery while the server is waiting for a WAL file when
+    operating in standby mode it will have apparently no effect until the
+    file arrives. Once the server begins processing WAL records again it
+    will notice the pause request and will act upon it. This is not a bug.
+    pause.
+   </para>
+
+   <para>
+    Pausing recovery will also prevent restartpoints from starting since they
+    are triggered by events in the WAL stream. In all other ways processing
+    will continue, for example the background writer will continue to clean
+    shared_buffers while paused.
+   </para>
+
    <para>
     The functions shown in <xref linkend="functions-admin-dbsize"> calculate
     the actual disk space usage of database objects.
diff --git a/src/backend/access/gin/ginxlog.c b/src/backend/access/gin/ginxlog.c
index 8382576..9c370d3 100644
--- a/src/backend/access/gin/ginxlog.c
+++ b/src/backend/access/gin/ginxlog.c
@@ -14,6 +14,7 @@
 #include "postgres.h"

 #include "access/gin.h"
+#include "access/xact.h"
 #include "access/xlogutils.h"
 #include "storage/bufmgr.h"
 #include "utils/memutils.h"
@@ -438,6 +439,9 @@ gin_redo(XLogRecPtr lsn, XLogRecord *record)
 {
     uint8        info = record->xl_info & ~XLR_INFO_MASK;

+    if (InArchiveRecovery)
+        (void) RecordKnownAssignedTransactionIds(lsn, record->xl_xid);
+
     RestoreBkpBlocks(lsn, record, false);

     topCtx = MemoryContextSwitchTo(opCtx);
diff --git a/src/backend/access/gist/gistxlog.c b/src/backend/access/gist/gistxlog.c
index 4a20d90..bdcbaf1 100644
--- a/src/backend/access/gist/gistxlog.c
+++ b/src/backend/access/gist/gistxlog.c
@@ -14,6 +14,7 @@
 #include "postgres.h"

 #include "access/gist_private.h"
+#include "access/xact.h"
 #include "access/xlogutils.h"
 #include "miscadmin.h"
 #include "storage/bufmgr.h"
@@ -396,6 +397,9 @@ gist_redo(XLogRecPtr lsn, XLogRecord *record)
     uint8        info = record->xl_info & ~XLR_INFO_MASK;
     MemoryContext oldCxt;

+    if (InArchiveRecovery)
+        (void) RecordKnownAssignedTransactionIds(lsn, record->xl_xid);
+
     RestoreBkpBlocks(lsn, record, false);

     oldCxt = MemoryContextSwitchTo(opCtx);
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index 52115cf..33a87d9 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -3814,19 +3814,78 @@ heap_restrpos(HeapScanDesc scan)
 }

 /*
+ * Update the latestRemovedXid for the current VACUUM. This gets called
+ * only rarely, since we probably already removed rows earlier.
+ * see comments for vacuum_log_cleanup_info().
+ */
+void
+HeapTupleHeaderAdvanceLatestRemovedXid(HeapTupleHeader tuple,
+                                        TransactionId *latestRemovedXid)
+{
+    TransactionId xmin = HeapTupleHeaderGetXmin(tuple);
+    TransactionId xmax = HeapTupleHeaderGetXmax(tuple);
+    TransactionId xvac = HeapTupleHeaderGetXvac(tuple);
+
+    if (tuple->t_infomask & HEAP_MOVED_OFF ||
+        tuple->t_infomask & HEAP_MOVED_IN)
+    {
+        if (TransactionIdPrecedes(*latestRemovedXid, xvac))
+            *latestRemovedXid = xvac;
+    }
+
+    if (TransactionIdPrecedes(*latestRemovedXid, xmax))
+        *latestRemovedXid = xmax;
+
+    if (TransactionIdPrecedes(*latestRemovedXid, xmin))
+        *latestRemovedXid = xmin;
+
+    Assert(TransactionIdIsValid(*latestRemovedXid));
+}
+
+/*
+ * Perform XLogInsert to register a heap cleanup info message. These
+ * messages are sent once per VACUUM and are required because
+ * of the phasing of removal operations during a lazy VACUUM.
+ * see comments for vacuum_log_cleanup_info().
+ */
+XLogRecPtr
+log_heap_cleanup_info(RelFileNode rnode, TransactionId latestRemovedXid)
+{
+    xl_heap_cleanup_info xlrec;
+    XLogRecPtr    recptr;
+    XLogRecData rdata;
+
+    xlrec.node = rnode;
+    xlrec.latestRemovedXid = latestRemovedXid;
+
+    rdata.data = (char *) &xlrec;
+    rdata.len = SizeOfHeapCleanupInfo;
+    rdata.buffer = InvalidBuffer;
+    rdata.next = NULL;
+
+    recptr = XLogInsert(RM_HEAP2_ID, XLOG_HEAP2_CLEANUP_INFO, &rdata);
+
+    return recptr;
+}
+
+/*
  * Perform XLogInsert for a heap-clean operation.  Caller must already
  * have modified the buffer and marked it dirty.
  *
  * Note: prior to Postgres 8.3, the entries in the nowunused[] array were
  * zero-based tuple indexes.  Now they are one-based like other uses
  * of OffsetNumber.
+ *
+ * For 8.4 we also include the latestRemovedXid which allows recovery
+ * processing to cancel long standby queries that would be have their
+ * results changed if we applied these changes.
  */
 XLogRecPtr
 log_heap_clean(Relation reln, Buffer buffer,
                OffsetNumber *redirected, int nredirected,
                OffsetNumber *nowdead, int ndead,
                OffsetNumber *nowunused, int nunused,
-               bool redirect_move)
+               TransactionId latestRemovedXid, bool redirect_move)
 {
     xl_heap_clean xlrec;
     uint8        info;
@@ -3838,6 +3897,7 @@ log_heap_clean(Relation reln, Buffer buffer,

     xlrec.node = reln->rd_node;
     xlrec.block = BufferGetBlockNumber(buffer);
+    xlrec.latestRemovedXid = latestRemovedXid;
     xlrec.nredirected = nredirected;
     xlrec.ndead = ndead;

@@ -4109,6 +4169,29 @@ log_newpage(RelFileNode *rnode, ForkNumber forkNum, BlockNumber blkno,
 }

 /*
+ * Handles CLEANUP_INFO
+ */
+static void
+heap_xlog_cleanup_info(XLogRecPtr lsn, XLogRecord *record)
+{
+    xl_heap_cleanup_info *xlrec = (xl_heap_cleanup_info *) XLogRecGetData(record);
+
+    if (InArchiveRecovery &&
+        RecordKnownAssignedTransactionIds(lsn, record->xl_xid))
+    {
+        SetDeferredRecoveryConflicts(xlrec->latestRemovedXid,
+                                     xlrec->node,
+                                     lsn);
+    }
+
+    /*
+     * Actual operation is a no-op. Record type exists to provide a means
+     * for conflict processing to occur before we begin index vacuum actions.
+     * see vacuumlazy.c
+     */
+}
+
+/*
  * Handles CLEAN and CLEAN_MOVE record types
  */
 static void
@@ -4126,12 +4209,23 @@ heap_xlog_clean(XLogRecPtr lsn, XLogRecord *record, bool clean_move)
     int            nunused;
     Size        freespace;

+    if (InArchiveRecovery &&
+        RecordKnownAssignedTransactionIds(lsn, record->xl_xid))
+    {
+        SetDeferredRecoveryConflicts(xlrec->latestRemovedXid,
+                                     xlrec->node,
+                                     lsn);
+    }
+
+    RestoreBkpBlocks(lsn, record, true);
+
     if (record->xl_info & XLR_BKP_BLOCK_1)
         return;

-    buffer = XLogReadBuffer(xlrec->node, xlrec->block, false);
+    buffer = XLogReadBufferExtended(xlrec->node, MAIN_FORKNUM, xlrec->block, RBM_NORMAL);
     if (!BufferIsValid(buffer))
         return;
+    LockBufferForCleanup(buffer);
     page = (Page) BufferGetPage(buffer);

     if (XLByteLE(lsn, PageGetLSN(page)))
@@ -4186,12 +4280,18 @@ heap_xlog_freeze(XLogRecPtr lsn, XLogRecord *record)
     Buffer        buffer;
     Page        page;

+    if (InArchiveRecovery)
+        (void) RecordKnownAssignedTransactionIds(lsn, record->xl_xid);
+
+    RestoreBkpBlocks(lsn, record, false);
+
     if (record->xl_info & XLR_BKP_BLOCK_1)
         return;

-    buffer = XLogReadBuffer(xlrec->node, xlrec->block, false);
+    buffer = XLogReadBufferExtended(xlrec->node, MAIN_FORKNUM, xlrec->block, RBM_NORMAL);
     if (!BufferIsValid(buffer))
         return;
+    LockBufferForCleanup(buffer);
     page = (Page) BufferGetPage(buffer);

     if (XLByteLE(lsn, PageGetLSN(page)))
@@ -4777,6 +4877,9 @@ heap_redo(XLogRecPtr lsn, XLogRecord *record)
 {
     uint8        info = record->xl_info & ~XLR_INFO_MASK;

+    if (InArchiveRecovery)
+        (void) RecordKnownAssignedTransactionIds(lsn, record->xl_xid);
+
     RestoreBkpBlocks(lsn, record, false);

     switch (info & XLOG_HEAP_OPMASK)
@@ -4818,17 +4921,17 @@ heap2_redo(XLogRecPtr lsn, XLogRecord *record)
     switch (info & XLOG_HEAP_OPMASK)
     {
         case XLOG_HEAP2_FREEZE:
-            RestoreBkpBlocks(lsn, record, false);
             heap_xlog_freeze(lsn, record);
             break;
         case XLOG_HEAP2_CLEAN:
-            RestoreBkpBlocks(lsn, record, true);
             heap_xlog_clean(lsn, record, false);
             break;
         case XLOG_HEAP2_CLEAN_MOVE:
-            RestoreBkpBlocks(lsn, record, true);
             heap_xlog_clean(lsn, record, true);
             break;
+        case XLOG_HEAP2_CLEANUP_INFO:
+            heap_xlog_cleanup_info(lsn, record);
+            break;
         default:
             elog(PANIC, "heap2_redo: unknown op code %u", info);
     }
@@ -4958,17 +5061,26 @@ heap2_desc(StringInfo buf, uint8 xl_info, char *rec)
     {
         xl_heap_clean *xlrec = (xl_heap_clean *) rec;

-        appendStringInfo(buf, "clean: rel %u/%u/%u; blk %u",
+        appendStringInfo(buf, "clean: rel %u/%u/%u; blk %u remxid %u",
                          xlrec->node.spcNode, xlrec->node.dbNode,
-                         xlrec->node.relNode, xlrec->block);
+                         xlrec->node.relNode, xlrec->block,
+                         xlrec->latestRemovedXid);
     }
     else if (info == XLOG_HEAP2_CLEAN_MOVE)
     {
         xl_heap_clean *xlrec = (xl_heap_clean *) rec;

-        appendStringInfo(buf, "clean_move: rel %u/%u/%u; blk %u",
+        appendStringInfo(buf, "clean_move: rel %u/%u/%u; blk %u remxid %u",
                          xlrec->node.spcNode, xlrec->node.dbNode,
-                         xlrec->node.relNode, xlrec->block);
+                         xlrec->node.relNode, xlrec->block,
+                         xlrec->latestRemovedXid);
+    }
+    else if (info == XLOG_HEAP2_CLEANUP_INFO)
+    {
+        xl_heap_cleanup_info *xlrec = (xl_heap_cleanup_info *) rec;
+
+        appendStringInfo(buf, "cleanup info: remxid %u",
+                         xlrec->latestRemovedXid);
     }
     else
         appendStringInfo(buf, "UNKNOWN");
diff --git a/src/backend/access/heap/pruneheap.c b/src/backend/access/heap/pruneheap.c
index 2691666..00eb502 100644
--- a/src/backend/access/heap/pruneheap.c
+++ b/src/backend/access/heap/pruneheap.c
@@ -30,6 +30,7 @@
 typedef struct
 {
     TransactionId new_prune_xid;    /* new prune hint value for page */
+    TransactionId latestRemovedXid; /* latest xid to be removed by this prune */
     int            nredirected;        /* numbers of entries in arrays below */
     int            ndead;
     int            nunused;
@@ -85,6 +86,14 @@ heap_page_prune_opt(Relation relation, Buffer buffer, TransactionId OldestXmin)
         return;

     /*
+     * We can't write WAL in recovery mode, so there's no point trying to
+     * clean the page. The master will likely issue a cleaning WAL record
+     * soon anyway, so this is no particular loss.
+     */
+    if (RecoveryInProgress())
+        return;
+
+    /*
      * We prune when a previous UPDATE failed to find enough space on the page
      * for a new tuple version, or when free space falls below the relation's
      * fill-factor target (but not less than 10%).
@@ -176,6 +185,7 @@ heap_page_prune(Relation relation, Buffer buffer, TransactionId OldestXmin,
      * Also initialize the rest of our working state.
      */
     prstate.new_prune_xid = InvalidTransactionId;
+    prstate.latestRemovedXid = InvalidTransactionId;
     prstate.nredirected = prstate.ndead = prstate.nunused = 0;
     memset(prstate.marked, 0, sizeof(prstate.marked));

@@ -258,7 +268,7 @@ heap_page_prune(Relation relation, Buffer buffer, TransactionId OldestXmin,
                                     prstate.redirected, prstate.nredirected,
                                     prstate.nowdead, prstate.ndead,
                                     prstate.nowunused, prstate.nunused,
-                                    redirect_move);
+                                    prstate.latestRemovedXid, redirect_move);

             PageSetLSN(BufferGetPage(buffer), recptr);
             PageSetTLI(BufferGetPage(buffer), ThisTimeLineID);
@@ -396,6 +406,8 @@ heap_prune_chain(Relation relation, Buffer buffer, OffsetNumber rootoffnum,
                 == HEAPTUPLE_DEAD && !HeapTupleHeaderIsHotUpdated(htup))
             {
                 heap_prune_record_unused(prstate, rootoffnum);
+                HeapTupleHeaderAdvanceLatestRemovedXid(htup,
+                                                       &prstate->latestRemovedXid);
                 ndeleted++;
             }

@@ -521,7 +533,11 @@ heap_prune_chain(Relation relation, Buffer buffer, OffsetNumber rootoffnum,
          * find another DEAD tuple is a fairly unusual corner case.)
          */
         if (tupdead)
+        {
             latestdead = offnum;
+            HeapTupleHeaderAdvanceLatestRemovedXid(htup,
+                                                   &prstate->latestRemovedXid);
+        }
         else if (!recent_dead)
             break;

diff --git a/src/backend/access/index/genam.c b/src/backend/access/index/genam.c
index 88baa7c..fb2b06a 100644
--- a/src/backend/access/index/genam.c
+++ b/src/backend/access/index/genam.c
@@ -89,8 +89,19 @@ RelationGetIndexScan(Relation indexRelation,
     else
         scan->keyData = NULL;

+    /*
+     * During recovery we ignore killed tuples and don't bother to kill them
+     * either. We do this because the xmin on the primary node could easily
+     * be later than the xmin on the standby node, so that what the primary
+     * thinks is killed is supposed to be visible on standby. So for correct
+     * MVCC for queries during recovery we must ignore these hints and check
+     * all tuples. Do *not* set ignore_killed_tuples to true when running
+     * in a transaction that was started during recovery. AMs can set it to
+     * false at any time. xactStartedInRecovery should not be touched by AMs.
+     */
     scan->kill_prior_tuple = false;
-    scan->ignore_killed_tuples = true;    /* default setting */
+    scan->xactStartedInRecovery = TransactionStartedDuringRecovery();
+    scan->ignore_killed_tuples = !scan->xactStartedInRecovery;

     scan->opaque = NULL;

diff --git a/src/backend/access/index/indexam.c b/src/backend/access/index/indexam.c
index 92eec92..09da208 100644
--- a/src/backend/access/index/indexam.c
+++ b/src/backend/access/index/indexam.c
@@ -455,9 +455,12 @@ index_getnext(IndexScanDesc scan, ScanDirection direction)

             /*
              * If we scanned a whole HOT chain and found only dead tuples,
-             * tell index AM to kill its entry for that TID.
+             * tell index AM to kill its entry for that TID. We do not do
+             * this when in recovery because it may violate MVCC to do so.
+             * see comments in RelationGetIndexScan().
              */
-            scan->kill_prior_tuple = scan->xs_hot_dead;
+            if (!scan->xactStartedInRecovery)
+                scan->kill_prior_tuple = scan->xs_hot_dead;

             /*
              * The AM's gettuple proc finds the next index entry matching the
diff --git a/src/backend/access/nbtree/README b/src/backend/access/nbtree/README
index 81d56b3..aee8f8f 100644
--- a/src/backend/access/nbtree/README
+++ b/src/backend/access/nbtree/README
@@ -401,6 +401,27 @@ of the WAL entry.)  If the parent page becomes half-dead but is not
 immediately deleted due to a subsequent crash, there is no loss of
 consistency, and the empty page will be picked up by the next VACUUM.

+Scans during Recovery
+---------------------
+
+The btree index type can be safely used during recovery. During recovery
+we have at most one writer and potentially many readers. In that
+situation the locking requirements can be relaxed and we do not need
+double locking during block splits. Each WAL record makes changes to a
+single level of the btree using the correct locking sequence and so
+is safe for concurrent readers. Some readers may observe a block split
+in progress as they descend the tree, but they will simple move right
+onto the correct page.
+
+During recovery all index scans start with ignore_killed_tuples = false
+and we never set kill_prior_tuple. We do this because the oldest xmin
+on the standby server can be older than the oldest xmin on the master
+server, which means tuples can be marked as killed even when they are
+still visible on the standby. We don't WAL log tuple killed bits, but
+they can still appear in the standby because of full page writes. So
+we must always ignore them and that means it's not worth setting them
+either.
+
 Other Things That Are Handy to Know
 -----------------------------------

diff --git a/src/backend/access/nbtree/nbtinsert.c b/src/backend/access/nbtree/nbtinsert.c
index 69a2ed3..7b4ce9e 100644
--- a/src/backend/access/nbtree/nbtinsert.c
+++ b/src/backend/access/nbtree/nbtinsert.c
@@ -1924,7 +1924,7 @@ _bt_vacuum_one_page(Relation rel, Buffer buffer)
     }

     if (ndeletable > 0)
-        _bt_delitems(rel, buffer, deletable, ndeletable);
+        _bt_delitems(rel, buffer, deletable, ndeletable, false, 0);

     /*
      * Note: if we didn't find any LP_DEAD items, then the page's
diff --git a/src/backend/access/nbtree/nbtpage.c b/src/backend/access/nbtree/nbtpage.c
index 23026c2..829c070 100644
--- a/src/backend/access/nbtree/nbtpage.c
+++ b/src/backend/access/nbtree/nbtpage.c
@@ -652,7 +652,8 @@ _bt_page_recyclable(Page page)
  */
 void
 _bt_delitems(Relation rel, Buffer buf,
-             OffsetNumber *itemnos, int nitems)
+             OffsetNumber *itemnos, int nitems, bool isVacuum,
+             BlockNumber lastBlockVacuumed)
 {
     Page        page = BufferGetPage(buf);
     BTPageOpaque opaque;
@@ -684,15 +685,35 @@ _bt_delitems(Relation rel, Buffer buf,
     /* XLOG stuff */
     if (!rel->rd_istemp)
     {
-        xl_btree_delete xlrec;
         XLogRecPtr    recptr;
         XLogRecData rdata[2];

-        xlrec.node = rel->rd_node;
-        xlrec.block = BufferGetBlockNumber(buf);
+        if (isVacuum)
+        {
+            xl_btree_vacuum xlrec_vacuum;
+            xlrec_vacuum.node = rel->rd_node;
+            xlrec_vacuum.block = BufferGetBlockNumber(buf);
+
+            xlrec_vacuum.lastBlockVacuumed = lastBlockVacuumed;
+            rdata[0].data = (char *) &xlrec_vacuum;
+            rdata[0].len = SizeOfBtreeVacuum;
+        }
+        else
+        {
+            xl_btree_delete xlrec_delete;
+            xlrec_delete.node = rel->rd_node;
+            xlrec_delete.block = BufferGetBlockNumber(buf);
+
+            /*
+             * We would like to set an accurate latestRemovedXid, but there
+             * is no easy way of obtaining a useful value. So we use the
+             * probably far too conservative value of RecentGlobalXmin instead.
+             */
+            xlrec_delete.latestRemovedXid = InvalidTransactionId;
+            rdata[0].data = (char *) &xlrec_delete;
+            rdata[0].len = SizeOfBtreeDelete;
+        }

-        rdata[0].data = (char *) &xlrec;
-        rdata[0].len = SizeOfBtreeDelete;
         rdata[0].buffer = InvalidBuffer;
         rdata[0].next = &(rdata[1]);

@@ -715,7 +736,10 @@ _bt_delitems(Relation rel, Buffer buf,
         rdata[1].buffer_std = true;
         rdata[1].next = NULL;

-        recptr = XLogInsert(RM_BTREE_ID, XLOG_BTREE_DELETE, rdata);
+        if (isVacuum)
+            recptr = XLogInsert(RM_BTREE_ID, XLOG_BTREE_VACUUM, rdata);
+        else
+            recptr = XLogInsert(RM_BTREE_ID, XLOG_BTREE_DELETE, rdata);

         PageSetLSN(page, recptr);
         PageSetTLI(page, ThisTimeLineID);
diff --git a/src/backend/access/nbtree/nbtree.c b/src/backend/access/nbtree/nbtree.c
index 59680cd..b1a8a57 100644
--- a/src/backend/access/nbtree/nbtree.c
+++ b/src/backend/access/nbtree/nbtree.c
@@ -58,7 +58,8 @@ typedef struct
     IndexBulkDeleteCallback callback;
     void       *callback_state;
     BTCycleId    cycleid;
-    BlockNumber lastUsedPage;
+    BlockNumber lastBlockVacuumed;     /* last blkno reached by Vacuum scan */
+    BlockNumber lastUsedPage;        /* blkno of last page that is in use */
     BlockNumber totFreePages;    /* true total # of free pages */
     MemoryContext pagedelcontext;
 } BTVacState;
@@ -626,6 +627,7 @@ btvacuumscan(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
     vstate.callback = callback;
     vstate.callback_state = callback_state;
     vstate.cycleid = cycleid;
+    vstate.lastBlockVacuumed = BTREE_METAPAGE; /* Initialise at first block */
     vstate.lastUsedPage = BTREE_METAPAGE;
     vstate.totFreePages = 0;

@@ -855,7 +857,19 @@ restart:
          */
         if (ndeletable > 0)
         {
-            _bt_delitems(rel, buf, deletable, ndeletable);
+            BlockNumber    lastBlockVacuumed = BufferGetBlockNumber(buf);
+
+            _bt_delitems(rel, buf, deletable, ndeletable, true, vstate->lastBlockVacuumed);
+
+            /*
+             * Keep track of the block number of the lastBlockVacuumed, so
+             * we can scan those blocks as well during WAL replay. This then
+             * provides concurrency protection and allows btrees to be used
+             * while in recovery.
+             */
+            if (lastBlockVacuumed > vstate->lastBlockVacuumed)
+                vstate->lastBlockVacuumed = lastBlockVacuumed;
+
             stats->tuples_removed += ndeletable;
             /* must recompute maxoff */
             maxoff = PageGetMaxOffsetNumber(page);
diff --git a/src/backend/access/nbtree/nbtxlog.c b/src/backend/access/nbtree/nbtxlog.c
index 517c4b9..db7a216 100644
--- a/src/backend/access/nbtree/nbtxlog.c
+++ b/src/backend/access/nbtree/nbtxlog.c
@@ -16,7 +16,10 @@

 #include "access/nbtree.h"
 #include "access/transam.h"
+#include "access/xact.h"
 #include "storage/bufmgr.h"
+#include "storage/procarray.h"
+#include "utils/inval.h"

 /*
  * We must keep track of expected insertions due to page splits, and apply
@@ -459,6 +462,86 @@ btree_xlog_split(bool onleft, bool isroot,
 }

 static void
+btree_xlog_vacuum(XLogRecPtr lsn, XLogRecord *record)
+{
+    xl_btree_vacuum *xlrec;
+    Buffer        buffer;
+    Page        page;
+    BTPageOpaque opaque;
+
+    if (record->xl_info & XLR_BKP_BLOCK_1)
+        return;
+
+    xlrec = (xl_btree_vacuum *) XLogRecGetData(record);
+
+    /*
+     * We need to ensure every block is unpinned between the
+     * lastBlockVacuumed and the current block, if there are any.
+     * This ensures that every block in the index is touched during
+     * VACUUM as required to ensure scans work correctly.
+     */
+    if ((xlrec->lastBlockVacuumed + 1) != xlrec->block)
+    {
+        BlockNumber blkno = xlrec->lastBlockVacuumed + 1;
+
+        for (; blkno < xlrec->block; blkno++)
+        {
+            /*
+             * XXXHS we don't actually need to read the block, we
+             * just need to confirm it is unpinned. If we had a special call
+             * into the buffer manager we could optimise this so that
+             * if the block is not in shared_buffers we confirm it as unpinned.
+             */
+            buffer = XLogReadBufferExtended(xlrec->node, MAIN_FORKNUM, blkno, RBM_NORMAL);
+            if (BufferIsValid(buffer))
+            {
+                LockBufferForCleanup(buffer);
+                UnlockReleaseBuffer(buffer);
+            }
+        }
+    }
+
+    /*
+     * We need to take a cleanup lock to apply these changes.
+     * See nbtree/README for details.
+     */
+    buffer = XLogReadBufferExtended(xlrec->node, MAIN_FORKNUM, xlrec->block, RBM_NORMAL);
+    if (!BufferIsValid(buffer))
+        return;
+    LockBufferForCleanup(buffer);
+    page = (Page) BufferGetPage(buffer);
+
+    if (XLByteLE(lsn, PageGetLSN(page)))
+    {
+        UnlockReleaseBuffer(buffer);
+        return;
+    }
+
+    if (record->xl_len > SizeOfBtreeVacuum)
+    {
+        OffsetNumber *unused;
+        OffsetNumber *unend;
+
+        unused = (OffsetNumber *) ((char *) xlrec + SizeOfBtreeVacuum);
+        unend = (OffsetNumber *) ((char *) xlrec + record->xl_len);
+
+        PageIndexMultiDelete(page, unused, unend - unused);
+    }
+
+    /*
+     * Mark the page as not containing any LP_DEAD items --- see comments in
+     * _bt_delitems().
+     */
+    opaque = (BTPageOpaque) PageGetSpecialPointer(page);
+    opaque->btpo_flags &= ~BTP_HAS_GARBAGE;
+
+    PageSetLSN(page, lsn);
+    PageSetTLI(page, ThisTimeLineID);
+    MarkBufferDirty(buffer);
+    UnlockReleaseBuffer(buffer);
+}
+
+static void
 btree_xlog_delete(XLogRecPtr lsn, XLogRecord *record)
 {
     xl_btree_delete *xlrec;
@@ -470,6 +553,11 @@ btree_xlog_delete(XLogRecPtr lsn, XLogRecord *record)
         return;

     xlrec = (xl_btree_delete *) XLogRecGetData(record);
+
+    /*
+     * We don't need to take a cleanup lock to apply these changes.
+     * See nbtree/README for details.
+     */
     buffer = XLogReadBuffer(xlrec->node, xlrec->block, false);
     if (!BufferIsValid(buffer))
         return;
@@ -714,6 +802,35 @@ btree_redo(XLogRecPtr lsn, XLogRecord *record)
 {
     uint8        info = record->xl_info & ~XLR_INFO_MASK;

+    /*
+     * Btree delete records can conflict with standby queries. You might
+     * think that Vacuum records would conflict as well, but they don't
+     * because XLOG_HEAP2_CLEANUP_INFO exist specifically to ensure that
+     * we perform all conflict for the whole index, rather than block by
+     * block.
+     */
+    if (InArchiveRecovery)
+    {
+        (void) RecordKnownAssignedTransactionIds(lsn, record->xl_xid);
+        if (info == XLOG_BTREE_DELETE)
+        {
+            xl_btree_delete *xlrec = (xl_btree_delete *) XLogRecGetData(record);
+
+            /*
+             * XXXHS: Currently we put everybody on death row, because
+             * currently _bt_delitems() supplies InvalidTransactionId. We
+             * should be able to do better than that with some thought.
+             */
+            SetDeferredRecoveryConflicts(xlrec->latestRemovedXid,
+                                         xlrec->node,
+                                         lsn);
+        }
+    }
+
+    /*
+     * Exclusive lock on a btree block is as good as a Cleanup lock,
+     * so need to special case btree delete and vacuum.
+     */
     RestoreBkpBlocks(lsn, record, false);

     switch (info)
@@ -739,6 +856,9 @@ btree_redo(XLogRecPtr lsn, XLogRecord *record)
         case XLOG_BTREE_SPLIT_R_ROOT:
             btree_xlog_split(false, true, lsn, record);
             break;
+        case XLOG_BTREE_VACUUM:
+            btree_xlog_vacuum(lsn, record);
+            break;
         case XLOG_BTREE_DELETE:
             btree_xlog_delete(lsn, record);
             break;
@@ -843,13 +963,24 @@ btree_desc(StringInfo buf, uint8 xl_info, char *rec)
                                  xlrec->level, xlrec->firstright);
                 break;
             }
+        case XLOG_BTREE_VACUUM:
+            {
+                xl_btree_vacuum *xlrec = (xl_btree_vacuum *) rec;
+
+                appendStringInfo(buf, "vacuum: rel %u/%u/%u; blk %u, lastBlockVacuumed %u",
+                                 xlrec->node.spcNode, xlrec->node.dbNode,
+                                 xlrec->node.relNode, xlrec->block,
+                                 xlrec->lastBlockVacuumed);
+                break;
+            }
         case XLOG_BTREE_DELETE:
             {
                 xl_btree_delete *xlrec = (xl_btree_delete *) rec;

-                appendStringInfo(buf, "delete: rel %u/%u/%u; blk %u",
+                appendStringInfo(buf, "delete: rel %u/%u/%u; blk %u, latestRemovedXid %u",
                                  xlrec->node.spcNode, xlrec->node.dbNode,
-                                 xlrec->node.relNode, xlrec->block);
+                                 xlrec->node.relNode, xlrec->block,
+                                 xlrec->latestRemovedXid);
                 break;
             }
         case XLOG_BTREE_DELETE_PAGE:
diff --git a/src/backend/access/transam/README b/src/backend/access/transam/README
index a88563e..f7926d2 100644
--- a/src/backend/access/transam/README
+++ b/src/backend/access/transam/README
@@ -195,7 +195,8 @@ they first do something that requires one --- typically, insert/update/delete
 a tuple, though there are a few other places that need an XID assigned.
 If a subtransaction requires an XID, we always first assign one to its
 parent.  This maintains the invariant that child transactions have XIDs later
-than their parents, which is assumed in a number of places.
+than their parents, which is assumed in a number of places. In 8.4 onwards,
+some corner cases exist that require XID assignment to be WAL logged.

 The subsidiary actions of obtaining a lock on the XID and and entering it into
 pg_subtrans and PG_PROC are done at the time it is assigned.
@@ -649,3 +650,33 @@ fsync it down to disk without any sort of interlock, as soon as it finishes
 the bulk update.  However, all these paths are designed to write data that
 no other transaction can see until after T1 commits.  The situation is thus
 not different from ordinary WAL-logged updates.
+
+Transaction Emulation during Recovery
+-------------------------------------
+
+During Recovery we replay transaction changes in the order they occurred.
+As part of this replay we emulate some transactional behaviour, so that
+read only backends can take MVCC snapshots. We do this by maintaining
+Recovery Procs, so that each transaction that has recorded WAL records for
+database writes will exist in the procarray until it commits. Further
+details are given in comments in procarray.c.
+
+Many actions write no WAL records at all, for example read only transactions.
+These have no effect on MVCC in recovery and we can pretend they never
+occurred at all. Subtransaction commit does not write a WAL record either
+and has very little effect, since lock waiters need to wait for the
+parent transaction to complete.
+
+Not all transactional behaviour is emulated, for example we do not insert
+a transaction entry into the lock table, nor do we maintain the transaction
+stack in memory. Clog entries are made normally. Multitrans is not maintained
+because its purpose is to record tuple level locks that an application has
+requested to prevent write locks. Since write locks cannot be obtained at all,
+there is never any conflict and so there is no reason to update multitrans.
+Subtrans is maintained during recovery but the details of the transaction
+tree are ignored and all subtransactions reference the top-level TransactionId
+directly. Since commit is atomic this provides correct lock wait behaviour
+yet simplifies emulation of subtransactions considerably.
+
+Further details on locking mechanics in recovery are given in comments
+with the Lock rmgr code.
diff --git a/src/backend/access/transam/clog.c b/src/backend/access/transam/clog.c
index 528a219..81315a6 100644
--- a/src/backend/access/transam/clog.c
+++ b/src/backend/access/transam/clog.c
@@ -35,6 +35,7 @@
 #include "access/clog.h"
 #include "access/slru.h"
 #include "access/transam.h"
+#include "access/xact.h"
 #include "pg_trace.h"
 #include "postmaster/bgwriter.h"

@@ -687,6 +688,9 @@ clog_redo(XLogRecPtr lsn, XLogRecord *record)
     /* Backup blocks are not used in clog records */
     Assert(!(record->xl_info & XLR_BKP_BLOCK_MASK));

+    if (InArchiveRecovery)
+        (void) RecordKnownAssignedTransactionIds(lsn, record->xl_xid);
+
     if (info == CLOG_ZEROPAGE)
     {
         int            pageno;
diff --git a/src/backend/access/transam/multixact.c b/src/backend/access/transam/multixact.c
index 7314341..06d0273 100644
--- a/src/backend/access/transam/multixact.c
+++ b/src/backend/access/transam/multixact.c
@@ -1413,8 +1413,11 @@ ZeroMultiXactMemberPage(int pageno, bool writeXlog)
  * MultiXactSetNextMXact and/or MultiXactAdvanceNextMXact.    Note that we
  * may already have replayed WAL data into the SLRU files.
  *
- * We don't need any locks here, really; the SLRU locks are taken
- * only because slru.c expects to be called with locks held.
+ * We want this operation to be atomic to ensure that other processes can
+ * use MultiXact while we complete recovery. We access one page only from the
+ * offset and members buffers, so once locks are acquired they will not be
+ * dropped and re-acquired by SLRU code. So we take both locks at start, then
+ * hold them all the way to the end.
  */
 void
 StartupMultiXact(void)
@@ -1426,6 +1429,7 @@ StartupMultiXact(void)

     /* Clean up offsets state */
     LWLockAcquire(MultiXactOffsetControlLock, LW_EXCLUSIVE);
+    LWLockAcquire(MultiXactMemberControlLock, LW_EXCLUSIVE);

     /*
      * Initialize our idea of the latest page number.
@@ -1452,10 +1456,7 @@ StartupMultiXact(void)
         MultiXactOffsetCtl->shared->page_dirty[slotno] = true;
     }

-    LWLockRelease(MultiXactOffsetControlLock);
-
     /* And the same for members */
-    LWLockAcquire(MultiXactMemberControlLock, LW_EXCLUSIVE);

     /*
      * Initialize our idea of the latest page number.
@@ -1483,6 +1484,7 @@ StartupMultiXact(void)
     }

     LWLockRelease(MultiXactMemberControlLock);
+    LWLockRelease(MultiXactOffsetControlLock);

     /*
      * Initialize lastTruncationPoint to invalid, ensuring that the first
@@ -1542,8 +1544,9 @@ CheckPointMultiXact(void)
      * isn't valid (because StartupMultiXact hasn't been called yet) and so
      * SimpleLruTruncate would get confused.  It seems best not to risk
      * removing any data during recovery anyway, so don't truncate.
+     * We are executing in the bgwriter, so we must access shared status.
      */
-    if (!InRecovery)
+    if (!RecoveryInProgress())
         TruncateMultiXact();

     TRACE_POSTGRESQL_MULTIXACT_CHECKPOINT_DONE(true);
@@ -1873,6 +1876,9 @@ multixact_redo(XLogRecPtr lsn, XLogRecord *record)
     /* Backup blocks are not used in multixact records */
     Assert(!(record->xl_info & XLR_BKP_BLOCK_MASK));

+    if (InArchiveRecovery)
+        (void) RecordKnownAssignedTransactionIds(lsn, record->xl_xid);
+
     if (info == XLOG_MULTIXACT_ZERO_OFF_PAGE)
     {
         int            pageno;
diff --git a/src/backend/access/transam/rmgr.c b/src/backend/access/transam/rmgr.c
index 0273b0e..252f4ee 100644
--- a/src/backend/access/transam/rmgr.c
+++ b/src/backend/access/transam/rmgr.c
@@ -20,6 +20,7 @@
 #include "commands/dbcommands.h"
 #include "commands/sequence.h"
 #include "commands/tablespace.h"
+#include "storage/sinval.h"
 #include "storage/freespace.h"


@@ -32,7 +33,7 @@ const RmgrData RmgrTable[RM_MAX_ID + 1] = {
     {"Tablespace", tblspc_redo, tblspc_desc, NULL, NULL, NULL},
     {"MultiXact", multixact_redo, multixact_desc, NULL, NULL, NULL},
     {"Reserved 7", NULL, NULL, NULL, NULL, NULL},
-    {"Reserved 8", NULL, NULL, NULL, NULL, NULL},
+    {"Relation", relation_redo, relation_desc, NULL, NULL, NULL},
     {"Heap2", heap2_redo, heap2_desc, NULL, NULL, NULL},
     {"Heap", heap_redo, heap_desc, NULL, NULL, NULL},
     {"Btree", btree_redo, btree_desc, btree_xlog_startup, btree_xlog_cleanup, btree_safe_restartpoint},
diff --git a/src/backend/access/transam/slru.c b/src/backend/access/transam/slru.c
index 68e3869..f337e18 100644
--- a/src/backend/access/transam/slru.c
+++ b/src/backend/access/transam/slru.c
@@ -598,7 +598,8 @@ SlruPhysicalReadPage(SlruCtl ctl, int pageno, int slotno)
      * commands to set the commit status of transactions whose bits are in
      * already-truncated segments of the commit log (see notes in
      * SlruPhysicalWritePage).    Hence, if we are InRecovery, allow the case
-     * where the file doesn't exist, and return zeroes instead.
+     * where the file doesn't exist, and return zeroes instead. We also
+     * return a zeroed page when seek and read fails.
      */
     fd = BasicOpenFile(path, O_RDWR | PG_BINARY, S_IRUSR | S_IWUSR);
     if (fd < 0)
@@ -619,6 +620,14 @@ SlruPhysicalReadPage(SlruCtl ctl, int pageno, int slotno)

     if (lseek(fd, (off_t) offset, SEEK_SET) < 0)
     {
+        if (InRecovery)
+        {
+            ereport(LOG,
+                    (errmsg("file \"%s\" doesn't exist, reading as zeroes",
+                            path)));
+            MemSet(shared->page_buffer[slotno], 0, BLCKSZ);
+            return true;
+        }
         slru_errcause = SLRU_SEEK_FAILED;
         slru_errno = errno;
         close(fd);
@@ -628,6 +637,14 @@ SlruPhysicalReadPage(SlruCtl ctl, int pageno, int slotno)
     errno = 0;
     if (read(fd, shared->page_buffer[slotno], BLCKSZ) != BLCKSZ)
     {
+        if (InRecovery)
+        {
+            ereport(LOG,
+                    (errmsg("file \"%s\" doesn't exist, reading as zeroes",
+                            path)));
+            MemSet(shared->page_buffer[slotno], 0, BLCKSZ);
+            return true;
+        }
         slru_errcause = SLRU_READ_FAILED;
         slru_errno = errno;
         close(fd);
diff --git a/src/backend/access/transam/subtrans.c b/src/backend/access/transam/subtrans.c
index 0dbd216..9003daa 100644
--- a/src/backend/access/transam/subtrans.c
+++ b/src/backend/access/transam/subtrans.c
@@ -31,6 +31,7 @@
 #include "access/slru.h"
 #include "access/subtrans.h"
 #include "access/transam.h"
+#include "miscadmin.h"
 #include "pg_trace.h"
 #include "utils/snapmgr.h"

@@ -44,7 +45,8 @@
  * 0xFFFFFFFF/SUBTRANS_XACTS_PER_PAGE, and segment numbering at
  * 0xFFFFFFFF/SUBTRANS_XACTS_PER_PAGE/SLRU_SEGMENTS_PER_PAGE.  We need take no
  * explicit notice of that fact in this module, except when comparing segment
- * and page numbers in TruncateSUBTRANS (see SubTransPagePrecedes).
+ * and page numbers in TruncateSUBTRANS (see SubTransPagePrecedes)
+ * and in recovery when we do ExtendSUBTRANS.
  */

 /* We need four bytes per xact */
@@ -83,8 +85,12 @@ SubTransSetParent(TransactionId xid, TransactionId parent)
     ptr = (TransactionId *) SubTransCtl->shared->page_buffer[slotno];
     ptr += entryno;

-    /* Current state should be 0 */
-    Assert(*ptr == InvalidTransactionId);
+    /*
+     * Current state should be 0, except in recovery where we may
+     * need to reset the value multiple times
+     */
+    Assert(*ptr == InvalidTransactionId ||
+            (InRecovery && *ptr == parent));

     *ptr = parent;

@@ -223,33 +229,19 @@ ZeroSUBTRANSPage(int pageno)
 /*
  * This must be called ONCE during postmaster or standalone-backend startup,
  * after StartupXLOG has initialized ShmemVariableCache->nextXid.
- *
- * oldestActiveXID is the oldest XID of any prepared transaction, or nextXid
- * if there are none.
  */
 void
 StartupSUBTRANS(TransactionId oldestActiveXID)
 {
-    int            startPage;
-    int            endPage;
+    TransactionId xid = ShmemVariableCache->nextXid;
+    int            pageno = TransactionIdToPage(xid);

-    /*
-     * Since we don't expect pg_subtrans to be valid across crashes, we
-     * initialize the currently-active page(s) to zeroes during startup.
-     * Whenever we advance into a new page, ExtendSUBTRANS will likewise zero
-     * the new page without regard to whatever was previously on disk.
-     */
     LWLockAcquire(SubtransControlLock, LW_EXCLUSIVE);

-    startPage = TransactionIdToPage(oldestActiveXID);
-    endPage = TransactionIdToPage(ShmemVariableCache->nextXid);
-
-    while (startPage != endPage)
-    {
-        (void) ZeroSUBTRANSPage(startPage);
-        startPage++;
-    }
-    (void) ZeroSUBTRANSPage(startPage);
+    /*
+     * Initialize our idea of the latest page number.
+     */
+    SubTransCtl->shared->latest_page_number = pageno;

     LWLockRelease(SubtransControlLock);
 }
@@ -302,16 +294,42 @@ void
 ExtendSUBTRANS(TransactionId newestXact)
 {
     int            pageno;
+    static int last_pageno = 0;

-    /*
-     * No work except at first XID of a page.  But beware: just after
-     * wraparound, the first XID of page zero is FirstNormalTransactionId.
-     */
-    if (TransactionIdToEntry(newestXact) != 0 &&
-        !TransactionIdEquals(newestXact, FirstNormalTransactionId))
-        return;
+    Assert(TransactionIdIsNormal(newestXact));

-    pageno = TransactionIdToPage(newestXact);
+    if (!InRecovery)
+    {
+        /*
+         * No work except at first XID of a page.  But beware: just after
+         * wraparound, the first XID of page zero is FirstNormalTransactionId.
+         */
+        if (TransactionIdToEntry(newestXact) != 0 &&
+            !TransactionIdEquals(newestXact, FirstNormalTransactionId))
+            return;
+
+        pageno = TransactionIdToPage(newestXact);
+    }
+    else
+    {
+        /*
+         * InRecovery we keep track of the last page we extended, so
+         * we can compare that against incoming XIDs. This will only
+         * ever be run by startup process, so keep it as a static variable
+         * rather than hiding behind the SubtransControlLock.
+         */
+        pageno = TransactionIdToPage(newestXact);
+
+        if (pageno == last_pageno ||
+            SubTransPagePrecedes(pageno, last_pageno))
+            return;
+
+        elog(trace_recovery(DEBUG1),
+                        "extend subtrans  xid %u page %d last_page %d",
+                        newestXact, pageno, last_pageno);
+
+        last_pageno = pageno;
+    }

     LWLockAcquire(SubtransControlLock, LW_EXCLUSIVE);

diff --git a/src/backend/access/transam/transam.c b/src/backend/access/transam/transam.c
index 2a1eab4..6fb2d3f 100644
--- a/src/backend/access/transam/transam.c
+++ b/src/backend/access/transam/transam.c
@@ -35,9 +35,6 @@ static TransactionId cachedFetchXid = InvalidTransactionId;
 static XidStatus cachedFetchXidStatus;
 static XLogRecPtr cachedCommitLSN;

-/* Handy constant for an invalid xlog recptr */
-static const XLogRecPtr InvalidXLogRecPtr = {0, 0};
-
 /* Local functions */
 static XidStatus TransactionLogFetch(TransactionId transactionId);

diff --git a/src/backend/access/transam/twophase.c b/src/backend/access/transam/twophase.c
index eb3f341..9f1681c 100644
--- a/src/backend/access/transam/twophase.c
+++ b/src/backend/access/transam/twophase.c
@@ -359,7 +359,7 @@ MarkAsPrepared(GlobalTransaction gxact)
      * Put it into the global ProcArray so TransactionIdIsInProgress considers
      * the XID as still running.
      */
-    ProcArrayAdd(&gxact->proc);
+    ProcArrayAdd(&gxact->proc, true);
 }

 /*
@@ -1198,7 +1198,7 @@ FinishPreparedTransaction(const char *gid, bool isCommit)
                                        hdr->nsubxacts, children,
                                        hdr->nabortrels, abortrels);

-    ProcArrayRemove(&gxact->proc, latestXid);
+    ProcArrayRemove(&gxact->proc, latestXid, 0, NULL);

     /*
      * In case we fail while running the callbacks, mark the gxact invalid so
@@ -1690,6 +1690,34 @@ RecoverPreparedTransactions(void)
     FreeDir(cldir);
 }

+void
+ProcessTwoPhaseStandbyRecords(TransactionId xid)
+{
+    char       *buf;
+    char       *bufptr;
+    TwoPhaseFileHeader *hdr;
+
+    /* Read and validate file, if possible */
+    buf = ReadTwoPhaseFile(xid);
+    if (buf != NULL)
+    {
+        /* Deconstruct header */
+        hdr = (TwoPhaseFileHeader *) buf;
+        Assert(TransactionIdEquals(hdr->xid, xid));
+        bufptr = buf + MAXALIGN(sizeof(TwoPhaseFileHeader));
+        bufptr += MAXALIGN(hdr->nsubxacts * sizeof(TransactionId));
+        bufptr += MAXALIGN(hdr->ncommitrels * sizeof(RelFileNode));
+        bufptr += MAXALIGN(hdr->nabortrels * sizeof(RelFileNode));
+
+        /*
+         * Recover other state using resource managers
+         */
+        ProcessRecords(bufptr, xid, twophase_postcommit_standby_callbacks);
+
+        pfree(buf);
+    }
+}
+
 /*
  *    RecordTransactionCommitPrepared
  *
@@ -1719,8 +1747,11 @@ RecordTransactionCommitPrepared(TransactionId xid,
     /* Emit the XLOG commit record */
     xlrec.xid = xid;
     xlrec.crec.xact_time = GetCurrentTimestamp();
+    xlrec.crec.xinfo = 0;
+    xlrec.crec.nmsgs = 0;
     xlrec.crec.nrels = nrels;
     xlrec.crec.nsubxacts = nchildren;
+
     rdata[0].data = (char *) (&xlrec);
     rdata[0].len = MinSizeOfXactCommitPrepared;
     rdata[0].buffer = InvalidBuffer;
diff --git a/src/backend/access/transam/twophase_rmgr.c b/src/backend/access/transam/twophase_rmgr.c
index 90f3c0e..a2500cd 100644
--- a/src/backend/access/transam/twophase_rmgr.c
+++ b/src/backend/access/transam/twophase_rmgr.c
@@ -21,6 +21,15 @@
 #include "utils/flatfiles.h"
 #include "utils/inval.h"

+const TwoPhaseCallback twophase_postcommit_standby_callbacks[TWOPHASE_RM_MAX_ID + 1] =
+{
+    NULL,                        /* END ID */
+    NULL,                        /* Lock */
+    inval_twophase_postcommit,    /* Inval */
+    flatfile_twophase_postcommit,        /* flat file update */
+    NULL,                        /* notify/listen */
+    NULL                        /* pgstat */
+};

 const TwoPhaseCallback twophase_recover_callbacks[TWOPHASE_RM_MAX_ID + 1] =
 {
diff --git a/src/backend/access/transam/varsup.c b/src/backend/access/transam/varsup.c
index 16a7534..4c15505 100644
--- a/src/backend/access/transam/varsup.c
+++ b/src/backend/access/transam/varsup.c
@@ -277,6 +277,16 @@ SetTransactionIdLimit(TransactionId oldest_datfrozenxid,
     curXid = ShmemVariableCache->nextXid;
     LWLockRelease(XidGenLock);

+    /*
+     * If we are in recovery then we are just replaying what has happened on
+     * the master. If we do need to trigger an autovacuum then it will happen
+     * on the master and changes will be fed through to the standby.
+     * So we have nothing to do here but be patient. We may be called during
+     * recovery by Startup process when updating db flat files.
+     */
+    if (InRecovery)
+        return;
+
     /* Log the info */
     ereport(DEBUG1,
        (errmsg("transaction ID wrap limit is %u, limited by database \"%s\"",
diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index c94e2a2..3ab78f4 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -40,6 +40,7 @@
 #include "storage/fd.h"
 #include "storage/lmgr.h"
 #include "storage/procarray.h"
+#include "storage/sinval.h"
 #include "storage/sinvaladt.h"
 #include "storage/smgr.h"
 #include "utils/combocid.h"
@@ -137,10 +138,13 @@ typedef struct TransactionStateData
     ResourceOwner curTransactionOwner;    /* my query resources */
     TransactionId *childXids;    /* subcommitted child XIDs, in XID order */
     int            nChildXids;        /* # of subcommitted child XIDs */
+    int            nReportedChildXids;
     int            maxChildXids;    /* allocated size of childXids[] */
     Oid            prevUser;        /* previous CurrentUserId setting */
     bool        prevSecDefCxt;    /* previous SecurityDefinerContext setting */
     bool        prevXactReadOnly;        /* entry-time xact r/o state */
+    bool        startedInRecovery;    /* did we start in recovery? */
+    bool        reportedXid;
     struct TransactionStateData *parent;        /* back link to parent */
 } TransactionStateData;

@@ -165,10 +169,13 @@ static TransactionStateData TopTransactionStateData = {
     NULL,                        /* cur transaction resource owner */
     NULL,                        /* subcommitted child Xids */
     0,                            /* # of subcommitted child Xids */
+    0,                            /* # of reported child Xids */
     0,                            /* allocated size of childXids[] */
     InvalidOid,                    /* previous CurrentUserId setting */
     false,                        /* previous SecurityDefinerContext setting */
     false,                        /* entry-time xact r/o state */
+    false,                        /* startedInRecovery */
+    false,                        /* reportedXid */
     NULL                        /* link to parent state block */
 };

@@ -212,6 +219,11 @@ static bool forceSyncCommit = false;
 static MemoryContext TransactionAbortContext = NULL;

 /*
+ * Local state to optimise recovery conflict resolution
+ */
+static    TransactionId    latestRemovedXid = InvalidTransactionId;
+
+/*
  * List of add-on start- and end-of-xact callbacks
  */
 typedef struct XactCallbackItem
@@ -276,6 +288,9 @@ static const char *BlockStateAsString(TBlockState blockState);
 static const char *TransStateAsString(TransState state);


+static TransactionId *xactGetUnreportedChildren(int threshold, int *nxids);
+static TransactionId *xactCollectUnreportedChildren(TransactionState s, TransactionId *xids);
+
 /* ----------------------------------------------------------------
  *    transaction state accessors
  * ----------------------------------------------------------------
@@ -394,6 +409,9 @@ AssignTransactionId(TransactionState s)
     bool        isSubXact = (s->parent != NULL);
     ResourceOwner currentOwner;

+    if (RecoveryInProgress())
+        elog(ERROR, "cannot assign TransactionIds during recovery");
+
     /* Assert that caller didn't screw up */
     Assert(!TransactionIdIsValid(s->transactionId));
     Assert(s->state == TRANS_INPROGRESS);
@@ -437,8 +455,58 @@ AssignTransactionId(TransactionState s)
     }
     PG_END_TRY();
     CurrentResourceOwner = currentOwner;
-}

+    /*
+     * Every 64th assigned transaction id, within the top-level transaction,
+     * issues a WAL record with the top-level xid and all the subxids not
+     * yet WAL-logged.
+     * This is only needed to limit the shared memory usage of a hot standby
+     * server. In hot standby, the list of running transactions in the master
+     * is kept in a fixed size UnobservedXids array, and we reserve
+     * 64 * max_connections slots there. As soon as the standby server knows
+     * that a transaction is a subtransaction, and knows its parent, it
+     * can mark the subtransaction in pg_subtrans, and remove the entry from
+     * the unobserved xids array, making room for new entries.
+     *
+     * XXX: We don't actually keep track of the parent of each subtransaction,
+     * but only of the top-level transaction that each subxact belongs to.
+     * I think that's enough, but perhaps it would be better to store the
+     * exact relationships for debugging purposes.
+     */
+    if (isSubXact)
+    {
+        int nchildren;
+        TransactionId *children;
+        children = xactGetUnreportedChildren(64, &nchildren);
+
+        if (children != NULL)
+        {
+            XLogRecData rdata[2];
+            xl_xact_assignment    xlrec;
+
+            xlrec.xtop = GetTopTransactionIdIfAny();
+            Assert(TransactionIdIsValid(xlrec.xtop));
+
+            elog(trace_recovery(DEBUG2),
+                 "AssignTransactionId xtop %u nest %d hasParent %s",
+                 xlrec.xtop,
+                 GetCurrentTransactionNestLevel(),
+                 isSubXact ? "t" : "f");
+
+            rdata[0].data = (char *) (&xlrec);
+            rdata[0].len = MinSizeOfXactAssignment;
+            rdata[0].buffer = InvalidBuffer;
+            rdata[0].next = &rdata[1];
+
+            rdata[1].data = (char *) children;
+            rdata[1].len = sizeof(TransactionId) * nchildren;
+            rdata[1].buffer = InvalidBuffer;
+            rdata[1].next = NULL;
+
+            (void) XLogInsert(RM_XACT_ID, XLOG_XACT_ASSIGNMENT, rdata);
+        }
+    }
+}

 /*
  *    GetCurrentSubTransactionId
@@ -597,6 +665,16 @@ TransactionIdIsCurrentTransactionId(TransactionId xid)
     return false;
 }

+/*
+ *    TransactionStartedDuringRecovery, used during index scans
+ */
+bool
+TransactionStartedDuringRecovery(void)
+{
+    TransactionState s = CurrentTransactionState;
+
+    return s->startedInRecovery;
+}

 /*
  *    CommandCounterIncrement
@@ -824,11 +902,15 @@ RecordTransactionCommit(void)
     bool        haveNonTemp;
     int            nchildren;
     TransactionId *children;
+    int            nmsgs;
+    SharedInvalidationMessage *invalMessages = NULL;
+    bool        RelcacheInitFileInval;

     /* Get data needed for commit record */
     nrels = smgrGetPendingDeletes(true, &rels, &haveNonTemp);
     nchildren = xactGetCommittedChildren(&children);
-
+    nmsgs = xactGetCommittedInvalidationMessages(&invalMessages,
+                                                 &RelcacheInitFileInval);
     /*
      * If we haven't been assigned an XID yet, we neither can, nor do we want
      * to write a COMMIT record.
@@ -862,7 +944,7 @@ RecordTransactionCommit(void)
         /*
          * Begin commit critical section and insert the commit XLOG record.
          */
-        XLogRecData rdata[3];
+        XLogRecData rdata[4];
         int            lastrdata = 0;
         xl_xact_commit xlrec;

@@ -870,6 +952,19 @@ RecordTransactionCommit(void)
         BufmgrCommit();

         /*
+         * Set flags required for recovery processing of commits.
+         * Nothing too critical here that we would want to include this
+         * within the critical section following.
+         */
+        xlrec.xinfo = 0;
+        if (AtEOXact_Database_FlatFile_Update_Needed())
+            xlrec.xinfo |= XACT_COMPLETION_UPDATE_DB_FILE;
+        if (AtEOXact_Auth_FlatFile_Update_Needed())
+            xlrec.xinfo |= XACT_COMPLETION_UPDATE_AUTH_FILE;
+        if (RelcacheInitFileInval)
+            xlrec.xinfo |= XACT_COMPLETION_UPDATE_RELCACHE_FILE;
+
+        /*
          * Mark ourselves as within our "commit critical section".    This
          * forces any concurrent checkpoint to wait until we've updated
          * pg_clog.  Without this, it is possible for the checkpoint to set
@@ -893,6 +988,8 @@ RecordTransactionCommit(void)
         xlrec.xact_time = xactStopTimestamp;
         xlrec.nrels = nrels;
         xlrec.nsubxacts = nchildren;
+        xlrec.nmsgs = nmsgs;
+
         rdata[0].data = (char *) (&xlrec);
         rdata[0].len = MinSizeOfXactCommit;
         rdata[0].buffer = InvalidBuffer;
@@ -914,6 +1011,15 @@ RecordTransactionCommit(void)
             rdata[2].buffer = InvalidBuffer;
             lastrdata = 2;
         }
+        /* dump shared cache invalidation messages */
+        if (nmsgs > 0)
+        {
+            rdata[lastrdata].next = &(rdata[3]);
+            rdata[3].data = (char *) invalMessages;
+            rdata[3].len = nmsgs * sizeof(SharedInvalidationMessage);
+            rdata[3].buffer = InvalidBuffer;
+            lastrdata = 3;
+        }
         rdata[lastrdata].next = NULL;

         (void) XLogInsert(RM_XACT_ID, XLOG_XACT_COMMIT, rdata);
@@ -1147,6 +1253,8 @@ AtSubCommit_childXids(void)
     s->childXids = NULL;
     s->nChildXids = 0;
     s->maxChildXids = 0;
+    s->nReportedChildXids = 0;
+    s->reportedXid = false;
 }

 /* ----------------------------------------------------------------
@@ -1266,7 +1374,7 @@ RecordTransactionAbort(bool isSubXact)
      * main xacts, the equivalent happens just after this function returns.
      */
     if (isSubXact)
-        XidCacheRemoveRunningXids(xid, nchildren, children, latestXid);
+        XidCacheRemoveRunningXids(MyProc, xid, nchildren, children, latestXid);

     /* Reset XactLastRecEnd until the next transaction writes something */
     if (!isSubXact)
@@ -1355,6 +1463,8 @@ AtSubAbort_childXids(void)
     s->childXids = NULL;
     s->nChildXids = 0;
     s->maxChildXids = 0;
+    s->nReportedChildXids = 0;
+    s->reportedXid = false;
 }

 /* ----------------------------------------------------------------
@@ -1524,7 +1634,10 @@ StartTransaction(void)
     s->gucNestLevel = 1;
     s->childXids = NULL;
     s->nChildXids = 0;
+    s->nReportedChildXids = 0;
+    s->reportedXid = false;
     s->maxChildXids = 0;
+    s->startedInRecovery = RecoveryInProgress();
     GetUserIdAndContext(&s->prevUser, &s->prevSecDefCxt);
     /* SecurityDefinerContext should never be set outside a transaction */
     Assert(!s->prevSecDefCxt);
@@ -1727,6 +1840,8 @@ CommitTransaction(void)
     s->childXids = NULL;
     s->nChildXids = 0;
     s->maxChildXids = 0;
+    s->nReportedChildXids = 0;
+    s->reportedXid = false;

     /*
      * done with commit processing, set current transaction state back to
@@ -1962,6 +2077,8 @@ PrepareTransaction(void)
     s->childXids = NULL;
     s->nChildXids = 0;
     s->maxChildXids = 0;
+    s->nReportedChildXids = 0;
+    s->reportedXid = false;

     /*
      * done with 1st phase commit processing, set current transaction state
@@ -2134,6 +2251,8 @@ CleanupTransaction(void)
     s->childXids = NULL;
     s->nChildXids = 0;
     s->maxChildXids = 0;
+    s->nReportedChildXids = 0;
+    s->reportedXid = false;

     /*
      * done with abort processing, set current transaction state back to
@@ -4213,33 +4332,299 @@ xactGetCommittedChildren(TransactionId **ptr)
     return s->nChildXids;
 }

+static TransactionId *
+xactGetUnreportedChildren(int threshold, int *nxids)
+{
+    TransactionState s;
+    int nTotalUnreportedXids;
+    TransactionId *xids;
+
+    /* Count unreported xids in the tree */
+    for (s = CurrentTransactionState; s != NULL; s = s->parent)
+    {
+        if (!s->reportedXid)
+            nTotalUnreportedXids++;
+        nTotalUnreportedXids += s->nChildXids - s->nReportedChildXids;
+        if (s->reportedXid)
+            break;
+    }
+
+    *nxids = nTotalUnreportedXids;
+
+    if (nTotalUnreportedXids < threshold)
+        return NULL;
+
+    xids = (TransactionId *) palloc(sizeof(TransactionId) * nTotalUnreportedXids);
+    xactCollectUnreportedChildren(CurrentTransactionState, xids);
+    return xids;
+}
+
+/* Helper function for xactGetUnreportedChildren */
+static TransactionId *
+xactCollectUnreportedChildren(TransactionState s, TransactionId *xids)
+{
+    int nUnreportedChildXids;
+
+    if (s->parent != NULL)
+    {
+        xids = xactCollectUnreportedChildren(s->parent, xids);
+        if (!s->reportedXid)
+        {
+            s->reportedXid = true;
+            *(xids++) = s->transactionId;
+        }
+    }
+
+    nUnreportedChildXids = s->nChildXids - s->nReportedChildXids;
+    memcpy(xids, &s->childXids[s->nReportedChildXids],
+           nUnreportedChildXids * sizeof(TransactionId));
+    xids += nUnreportedChildXids;
+
+    s->nReportedChildXids = s->nChildXids;
+
+    return xids;
+}
+
+/*
+ * Record an enhanced snapshot of running transactions into WAL.
+ */
+void
+LogCurrentRunningXacts(void)
+{
+    RunningTransactions        CurrRunningXacts = GetRunningTransactionData();
+    xl_xact_running_xacts    xlrec;
+    XLogRecData             rdata[3];
+    int                        lastrdata = 0;
+    XLogRecPtr                recptr;
+
+    xlrec.xcnt = CurrRunningXacts->xcnt;
+    xlrec.subxcnt = CurrRunningXacts->subxcnt;
+    xlrec.latestRunningXid = CurrRunningXacts->latestRunningXid;
+    xlrec.oldestRunningXid = CurrRunningXacts->oldestRunningXid;
+    xlrec.latestCompletedXid = CurrRunningXacts->latestCompletedXid;
+
+    /* Header */
+    rdata[0].data = (char *) (&xlrec);
+    rdata[0].len = MinSizeOfXactRunningXacts;
+    rdata[0].buffer = InvalidBuffer;
+
+    /* array of RunningXact */
+    if (xlrec.xcnt > 0)
+    {
+        rdata[0].next = &(rdata[1]);
+        rdata[1].data = (char *) CurrRunningXacts->xrun;
+        rdata[1].len = xlrec.xcnt * sizeof(RunningXact);
+        rdata[1].buffer = InvalidBuffer;
+        lastrdata = 1;
+    }
+
+    /* array of TransactionIds */
+    if (xlrec.subxcnt > 0)
+    {
+        rdata[lastrdata].next = &(rdata[2]);
+        rdata[2].data = (char *) CurrRunningXacts->subxip;
+        rdata[2].len = xlrec.subxcnt * sizeof(TransactionId);
+        rdata[2].buffer = InvalidBuffer;
+        lastrdata = 2;
+    }
+
+    rdata[lastrdata].next = NULL;
+
+    recptr = XLogInsert(RM_XACT_ID, XLOG_XACT_RUNNING_XACTS, rdata);
+
+    elog(trace_recovery(DEBUG2), "captured snapshot of running xacts %X/%X", recptr.xlogid, recptr.xrecoff);
+}
+
+/*
+ * We need to issue shared invalidations and hold locks. Holding locks
+ * means others may want to wait on us, so we need to make lock table
+ * inserts to appear like a transaction. We could create and delete
+ * lock table entries for each transaction but its simpler just to create
+ * one permanent entry and leave it there all the time. Locks are then
+ * acquired and released as needed. Yes, this means you can see the
+ * Startup process in pg_locks once we have run this.
+ */
+void
+InitRecoveryTransactionEnvironment(void)
+{
+    VirtualTransactionId vxid;
+
+    /*
+     * Initialise shared invalidation management for Startup process,
+     * being careful to register ourselves as a sendOnly process so
+     * we don't need to read messages, nor will we get signalled
+     * when the queue starts filling up.
+     */
+    SharedInvalBackendInit(true);
+
+    /*
+     * Additional initialisation tasks. Most of this was performed
+     * during initial stages of startup.
+     */
+    ProcArrayInitRecoveryEnvironment();
+
+    /*
+     * Lock a virtual transaction id for Startup process.
+     *
+     * We need to do GetNextLocalTransactionId() because
+     * SharedInvalBackendInit() leaves localTransactionid invalid and
+     * the lock manager doesn't like that at all.
+     *
+     * Note that we don't need to run XactLockTableInsert() because nobody
+     * needs to wait on xids. That sounds a little strange, but table locks
+     * are held by vxids and row level locks are held by xids. All queries
+     * hold AccessShareLocks so never block while we write or lock new rows.
+     */
+    vxid.backendId = MyBackendId;
+    vxid.localTransactionId = GetNextLocalTransactionId();
+    VirtualXactLockTableInsert(vxid);
+
+    /*
+     * Now that the database is consistent we can create a valid copy of
+     * the flat files required for connection and authentication. This
+     * may already have been executed at appropriate commit points, but
+     * we cannot trust that those executions were correct, so force it
+     * again now just to be safe.
+     */
+    BuildFlatFiles(false);
+}
+
+void
+XactClearRecoveryTransactions(void)
+{
+    /*
+     * Remove entries from shared data structures
+     */
+    UnobservedTransactionsClearXids();
+    RelationClearRecoveryLocks();
+}
+
+/*
+ * LatestRemovedXidAdvances - returns true if latestRemovedXid is moved
+ *                                 forwards by the latest provided value
+ */
+bool
+LatestRemovedXidAdvances(TransactionId latestXid)
+{
+    /*
+     * Don't bother checking for conflicts for cleanup records earlier than
+     * we have already tested for.
+     */
+    if (!TransactionIdIsValid(latestRemovedXid) ||
+        (TransactionIdIsValid(latestRemovedXid) &&
+        TransactionIdPrecedes(latestRemovedXid, latestXid)))
+    {
+        latestRemovedXid = latestXid;
+        return true;
+    }
+
+    return false;
+}
+
 /*
  *    XLOG support routines
  */

+/*
+ * Before 8.4 this was a fairly short function, but now it performs many
+ * actions for which the order of execution is critical.
+ */
 static void
-xact_redo_commit(xl_xact_commit *xlrec, TransactionId xid)
+xact_redo_commit(xl_xact_commit *xlrec, TransactionId xid, bool preparedXact)
 {
     TransactionId *sub_xids;
     TransactionId max_xid;
     int            i;

-    /* Mark the transaction committed in pg_clog */
     sub_xids = (TransactionId *) &(xlrec->xnodes[xlrec->nrels]);
+
+    max_xid = TransactionIdLatest(xid, xlrec->nsubxacts, sub_xids);
+
+    /* XXX: Is there any reason to mark (sub)transactions in unobserved
+     * array that we're just about to mark as committed anyway?
+      RecordKnownAssignedSubTransactionIds(max_xid, xlrec->nsubxacts, sub_xids);
+    */
+
+    /* Mark the transaction committed in pg_clog */
     TransactionIdCommitTree(xid, xlrec->nsubxacts, sub_xids);

-    /* Make sure nextXid is beyond any XID mentioned in the record */
-    max_xid = xid;
-    for (i = 0; i < xlrec->nsubxacts; i++)
+    if (InArchiveRecovery)
     {
-        if (TransactionIdPrecedes(max_xid, sub_xids[i]))
-            max_xid = sub_xids[i];
+        /*
+         * We must mark clog before we update the ProcArray.
+         */
+        UnobservedTransactionsRemoveXids(xid, xlrec->nsubxacts, sub_xids, false);
+
+        if (preparedXact)
+        {
+            /*
+             * Commit prepared xlog records do not carry invalidation data,
+             * since this is already held within the two phase state file.
+             * So we read it from there instead, with much the same effects.
+             */
+            ProcessTwoPhaseStandbyRecords(xid);
+        }
+        else
+        {
+            /*
+             * If requested, update the flat files for DB and Auth Files by
+             * reading the catalog tables. Needs to be the first action taken
+             * after marking transaction complete to minimise race conditions.
+             * This is the opposite way round to the original actions, which
+             * update the files and then mark committed, so there is a race
+             * condition in both places.
+             */
+            if (XactCompletionUpdateDBFile(xlrec) ||
+                XactCompletionUpdateAuthFile(xlrec))
+            {
+                if (XactCompletionUpdateAuthFile(xlrec))
+                    BuildFlatFiles(false);
+                else
+                    BuildFlatFiles(true);
+            }
+
+            /*
+             * Send any cache invalidations attached to the commit. We must
+             * maintain the same order of invalidation then release locks
+             * as occurs in RecordTransactionCommit.
+             */
+            if (xlrec->nmsgs > 0)
+            {
+                int    offset = OffsetSharedInvalInXactCommit();
+                SharedInvalidationMessage *msgs = (SharedInvalidationMessage *)
+                                (((char *) xlrec) + offset);
+
+                SendSharedInvalidMessages(msgs, xlrec->nmsgs);
+            }
+        }
+
+        /*
+         * Release locks, if any.
+         */
+        RelationReleaseRecoveryLockTree(xid, xlrec->nsubxacts, sub_xids);
     }
+
+    /* Make sure nextXid is beyond any XID mentioned in the record */
+    /* XXX: We don't expect anyone else to modify nextXid, hence we
+     * don't need to hold a lock while checking this. We still acquire
+     * the lock to modify it, though.
+     */
     if (TransactionIdFollowsOrEquals(max_xid,
                                      ShmemVariableCache->nextXid))
     {
+        LWLockAcquire(XidGenLock, LW_EXCLUSIVE);
         ShmemVariableCache->nextXid = max_xid;
         TransactionIdAdvance(ShmemVariableCache->nextXid);
+        LWLockRelease(XidGenLock);
+    }
+
+    /* XXX: Same here, don't use lock to test, but need one to modify */
+    if (TransactionIdFollowsOrEquals(max_xid,
+                                     ShmemVariableCache->latestCompletedXid))
+    {
+        LWLockAcquire(ProcArrayLock, LW_EXCLUSIVE);
+        ShmemVariableCache->latestCompletedXid = max_xid;
+        LWLockRelease(ProcArrayLock);
     }

     /* Make sure files supposed to be dropped are dropped */
@@ -4260,6 +4645,15 @@ xact_redo_commit(xl_xact_commit *xlrec, TransactionId xid)
     }
 }

+/*
+ * Be careful with the order of execution, as with xact_redo_commit().
+ * The two functions are similar but differ in key places.
+ *
+ * Note also that an abort can be for a subtransaction and its children,
+ * not just for a top level abort. That means we have to consider
+ * topxid != xid, whereas in commit we would find topxid == xid always
+ * because subtransaction commit is never WAL logged.
+ */
 static void
 xact_redo_abort(xl_xact_abort *xlrec, TransactionId xid)
 {
@@ -4267,21 +4661,40 @@ xact_redo_abort(xl_xact_abort *xlrec, TransactionId xid)
     TransactionId max_xid;
     int            i;

-    /* Mark the transaction aborted in pg_clog */
     sub_xids = (TransactionId *) &(xlrec->xnodes[xlrec->nrels]);
+    max_xid = TransactionIdLatest(xid, xlrec->nsubxacts, sub_xids);
+
+/* XXX same as commit
+    RecordKnownAssignedSubTransactionIds(max_xid, xlrec->nsubxacts, sub_xids);
+*/
+
+    /* Mark the transaction aborted in pg_clog */
     TransactionIdAbortTree(xid, xlrec->nsubxacts, sub_xids);

-    /* Make sure nextXid is beyond any XID mentioned in the record */
-    max_xid = xid;
-    for (i = 0; i < xlrec->nsubxacts; i++)
+    if (InArchiveRecovery)
     {
-        if (TransactionIdPrecedes(max_xid, sub_xids[i]))
-            max_xid = sub_xids[i];
+        /*
+         * We must mark clog before we update the ProcArray.
+         */
+        UnobservedTransactionsRemoveXids(xid, xlrec->nsubxacts, sub_xids, false);
+
+        /*
+         * There are no flat files that need updating, nor invalidation
+         * messages to send or undo.
+         */
+
+        /*
+         * Release locks, if any. There are no invalidations to send.
+         */
+        RelationReleaseRecoveryLockTree(xid, xlrec->nsubxacts, sub_xids);
     }
+
+    /* Make sure nextXid is beyond any XID mentioned in the record */
     if (TransactionIdFollowsOrEquals(max_xid,
                                      ShmemVariableCache->nextXid))
     {
         ShmemVariableCache->nextXid = max_xid;
+        ShmemVariableCache->latestCompletedXid = ShmemVariableCache->nextXid;
         TransactionIdAdvance(ShmemVariableCache->nextXid);
     }

@@ -4303,6 +4716,54 @@ xact_redo_abort(xl_xact_abort *xlrec, TransactionId xid)
     }
 }

+static void
+xact_redo_assignment(XLogRecPtr lsn, XLogRecord *record)
+{
+    xl_xact_assignment *xlrec = (xl_xact_assignment *) XLogRecGetData(record);
+    int nsubxids = (record->xl_len - offsetof(xl_xact_assignment, xsub)) /
+        sizeof(TransactionId);
+    int i;
+
+    Assert(nsubxids > 0);
+
+    if (!InArchiveRecovery)
+        return;
+
+    /*
+     * Notice that we update pg_subtrans with the top-level xid, rather
+     * than the parent xid. This is a difference between normal
+     * processing and recovery, yet is still correct in all cases. The
+     * reason is that subtransaction commit is not marked in clog until
+     * commit processing, so all aborted subtransactions have already been
+     * clearly marked in clog. As a result we are able to refer directly
+     * to the top-level transaction's state rather than skipping through
+     * all the intermediate states in the subtransaction tree.
+     */
+    for (i = 0; i < nsubxids; i++)
+    {
+        TransactionId subxid = xlrec->xsub[i];
+
+        ExtendSUBTRANS(subxid);
+        SubTransSetParent(subxid, xlrec->xtop);
+        /*
+         * XXX: As long as there's room in the unobserved xids array, we
+         * could add entries there. But we don't know if there is, and
+         * then we'd have to keep track of subxids in the array that we
+         * could remove later on if the array fills up
+         *
+          RecordKnownAssignedTransactionXids(lsn, subxid);
+        */
+    }
+
+    /*
+     * Remove the subxids from the array, now that they have their parents
+     * set correctly in subtrans.
+     */
+    AdvanceLastOverflowedUnobservedXid(xlrec->xsub[nsubxids - 1]);
+    UnobservedTransactionsRemoveXids(InvalidTransactionId, nsubxids,
+                                    xlrec->xsub, false);
+}
+
 void
 xact_redo(XLogRecPtr lsn, XLogRecord *record)
 {
@@ -4311,11 +4772,43 @@ xact_redo(XLogRecPtr lsn, XLogRecord *record)
     /* Backup blocks are not used in xact records */
     Assert(!(record->xl_info & XLR_BKP_BLOCK_MASK));

+    if (info == XLOG_XACT_ASSIGNMENT)
+    {
+        xact_redo_assignment(lsn, record);
+        return;
+    }
+    else if (info == XLOG_XACT_RUNNING_XACTS)
+    {
+        xl_xact_running_xacts *xlrec = (xl_xact_running_xacts *) XLogRecGetData(record);
+
+        /*
+         * If RunningXact data is complete then apply it.
+         *
+         * XLOG_XACT_RUNNING_XACTS initialises the first snapshot, if
+         * InHotStandby. It also cross-checks
+         * the contents of recovery procs in case of FATAL errors as
+         * recovery progresses.
+         */
+        if (InHotStandby &&
+            TransactionIdIsValid(xlrec->latestRunningXid))
+            ProcArrayUpdateRecoveryTransactions(lsn, xlrec);
+
+        return;
+    }
+
+    if (InArchiveRecovery)
+    {
+        /*
+         * No conflict resolution is required for transaction completion records
+         */
+        (void) RecordKnownAssignedTransactionIds(lsn, record->xl_xid);
+    }
+
     if (info == XLOG_XACT_COMMIT)
     {
         xl_xact_commit *xlrec = (xl_xact_commit *) XLogRecGetData(record);

-        xact_redo_commit(xlrec, record->xl_xid);
+        xact_redo_commit(xlrec, record->xl_xid, false);
     }
     else if (info == XLOG_XACT_ABORT)
     {
@@ -4333,7 +4826,7 @@ xact_redo(XLogRecPtr lsn, XLogRecord *record)
     {
         xl_xact_commit_prepared *xlrec = (xl_xact_commit_prepared *) XLogRecGetData(record);

-        xact_redo_commit(&xlrec->crec, xlrec->xid);
+        xact_redo_commit(&xlrec->crec, xlrec->xid, true);
         RemoveTwoPhaseFile(xlrec->xid, false);
     }
     else if (info == XLOG_XACT_ABORT_PREPARED)
@@ -4352,10 +4845,19 @@ xact_desc_commit(StringInfo buf, xl_xact_commit *xlrec)
 {
     int            i;

+    if (XactCompletionUpdateDBFile(xlrec))
+        appendStringInfo(buf, "; update db file");
+
+    if (XactCompletionUpdateDBFile(xlrec))
+        appendStringInfo(buf, "; update auth file");
+
+    if (XactCompletionRelcacheInitFileInval(xlrec))
+        appendStringInfo(buf, "; relcache init file inval");
+
     appendStringInfoString(buf, timestamptz_to_str(xlrec->xact_time));
     if (xlrec->nrels > 0)
     {
-        appendStringInfo(buf, "; rels:");
+        appendStringInfo(buf, "; %d rels:", xlrec->nrels);
         for (i = 0; i < xlrec->nrels; i++)
         {
             char *path = relpath(xlrec->xnodes[i], MAIN_FORKNUM);
@@ -4366,12 +4868,34 @@ xact_desc_commit(StringInfo buf, xl_xact_commit *xlrec)
     if (xlrec->nsubxacts > 0)
     {
         TransactionId *xacts = (TransactionId *)
-        &xlrec->xnodes[xlrec->nrels];
-
-        appendStringInfo(buf, "; subxacts:");
+                                    &xlrec->xnodes[xlrec->nrels];
+        appendStringInfo(buf, "; %d subxacts:", xlrec->nsubxacts);
         for (i = 0; i < xlrec->nsubxacts; i++)
             appendStringInfo(buf, " %u", xacts[i]);
     }
+    if (xlrec->nmsgs > 0)
+    {
+        /*
+         * The invalidation messages are the third variable length array
+         * from the start of the record. The record header has everything
+         * we need to calculate where that starts.
+         */
+        int    offset = OffsetSharedInvalInXactCommit();
+        SharedInvalidationMessage *msgs = (SharedInvalidationMessage *)
+                        (((char *) xlrec) + offset);
+        appendStringInfo(buf, "; %d inval msgs:", xlrec->nmsgs);
+        for (i = 0; i < xlrec->nmsgs; i++)
+        {
+            SharedInvalidationMessage *msg = msgs + i;
+
+            if (msg->id >= 0)
+                appendStringInfo(buf,  "catcache id%d ", msg->id);
+            else if (msg->id == SHAREDINVALRELCACHE_ID)
+                appendStringInfo(buf,  "relcache ");
+            else if (msg->id == SHAREDINVALSMGR_ID)
+                appendStringInfo(buf,  "smgr ");
+        }
+    }
 }

 static void
@@ -4393,14 +4917,51 @@ xact_desc_abort(StringInfo buf, xl_xact_abort *xlrec)
     if (xlrec->nsubxacts > 0)
     {
         TransactionId *xacts = (TransactionId *)
-        &xlrec->xnodes[xlrec->nrels];
+                                    &xlrec->xnodes[xlrec->nrels];

-        appendStringInfo(buf, "; subxacts:");
+        appendStringInfo(buf, "; %d subxacts:", xlrec->nsubxacts);
         for (i = 0; i < xlrec->nsubxacts; i++)
             appendStringInfo(buf, " %u", xacts[i]);
     }
 }

+static void
+xact_desc_running_xacts(StringInfo buf, xl_xact_running_xacts *xlrec)
+{
+    int                xid_index,
+                    subxid_index;
+    TransactionId     *subxip = (TransactionId *) &(xlrec->xrun[xlrec->xcnt]);
+
+    appendStringInfo(buf, "nxids %u nsubxids %u latestRunningXid %d",
+                                xlrec->xcnt,
+                                xlrec->subxcnt,
+                                xlrec->latestRunningXid);
+
+    appendStringInfo(buf, " oldestRunningXid %d latestCompletedXid %d",
+                                xlrec->oldestRunningXid,
+                                xlrec->latestCompletedXid);
+
+    for (xid_index = 0; xid_index < xlrec->xcnt; xid_index++)
+    {
+        RunningXact        *rxact = (RunningXact *) xlrec->xrun;
+
+        appendStringInfo(buf, "; xid %d", rxact[xid_index].xid);
+
+        if (rxact[xid_index].nsubxids > 0)
+        {
+            appendStringInfo(buf, " nsubxids %u offset %d ovflow? %s",
+                                    rxact[xid_index].nsubxids,
+                                    rxact[xid_index].subx_offset,
+                                    (rxact[xid_index].overflowed ? "t" : "f"));
+
+            appendStringInfo(buf, "; subxacts: ");
+            for (subxid_index = 0; subxid_index < rxact[xid_index].nsubxids; subxid_index++)
+                appendStringInfo(buf, " %u",
+                        subxip[subxid_index + rxact[xid_index].subx_offset]);
+        }
+    }
+}
+
 void
 xact_desc(StringInfo buf, uint8 xl_info, char *rec)
 {
@@ -4428,16 +4989,31 @@ xact_desc(StringInfo buf, uint8 xl_info, char *rec)
     {
         xl_xact_commit_prepared *xlrec = (xl_xact_commit_prepared *) rec;

-        appendStringInfo(buf, "commit %u: ", xlrec->xid);
+        appendStringInfo(buf, "commit prepared %u: ", xlrec->xid);
         xact_desc_commit(buf, &xlrec->crec);
     }
     else if (info == XLOG_XACT_ABORT_PREPARED)
     {
         xl_xact_abort_prepared *xlrec = (xl_xact_abort_prepared *) rec;

-        appendStringInfo(buf, "abort %u: ", xlrec->xid);
+        appendStringInfo(buf, "abort prepared %u: ", xlrec->xid);
         xact_desc_abort(buf, &xlrec->arec);
     }
+    else if (info == XLOG_XACT_ASSIGNMENT)
+    {
+        xl_xact_assignment *xlrec = (xl_xact_assignment *) rec;
+
+        /* ignore the main xid, it may be Invalid and misleading */
+        appendStringInfo(buf, "assignment: xtop %u XXX subxids",
+                            xlrec->xtop);
+    }
+    else if (info == XLOG_XACT_RUNNING_XACTS)
+    {
+        xl_xact_running_xacts *xlrec = (xl_xact_running_xacts *) rec;
+
+        appendStringInfo(buf, "running xacts: ");
+        xact_desc_running_xacts(buf, xlrec);
+    }
     else
         appendStringInfo(buf, "UNKNOWN");
 }
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 97fb148..e4d50c2 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -25,6 +25,7 @@

 #include "access/clog.h"
 #include "access/multixact.h"
+#include "access/nbtree.h"
 #include "access/subtrans.h"
 #include "access/transam.h"
 #include "access/tuptoaster.h"
@@ -45,6 +46,7 @@
 #include "storage/ipc.h"
 #include "storage/pmsignal.h"
 #include "storage/procarray.h"
+#include "storage/sinval.h"
 #include "storage/smgr.h"
 #include "storage/spin.h"
 #include "utils/builtins.h"
@@ -53,13 +55,14 @@
 #include "utils/ps_status.h"
 #include "pg_trace.h"

-
 /* File path names (all relative to $PGDATA) */
 #define BACKUP_LABEL_FILE        "backup_label"
 #define BACKUP_LABEL_OLD        "backup_label.old"
 #define RECOVERY_COMMAND_FILE    "recovery.conf"
 #define RECOVERY_COMMAND_DONE    "recovery.done"

+/* copied from tcopprot.h rather than include whole file */
+extern    int    PostAuthDelay;

 /* User-settable parameters */
 int            CheckPointSegments = 3;
@@ -71,6 +74,8 @@ bool        fullPageWrites = true;
 bool        log_checkpoints = false;
 int         sync_method = DEFAULT_SYNC_METHOD;

+#define WAL_DEBUG
+
 #ifdef WAL_DEBUG
 bool        XLOG_DEBUG = false;
 #endif
@@ -134,7 +139,9 @@ TimeLineID    ThisTimeLineID = 0;
 bool        InRecovery = false;

 /* Are we recovering using offline XLOG archives? */
-static bool InArchiveRecovery = false;
+bool         InArchiveRecovery = false;
+
+static     XLogRecPtr    LastRec;

 /*
  * Local copy of SharedRecoveryInProgress variable. True actually means "not
@@ -142,17 +149,40 @@ static bool InArchiveRecovery = false;
  */
 static bool LocalRecoveryInProgress = true;

+/* is the database proven consistent yet? */
+bool    reachedSafeStartPoint = false;
+
 /* Was the last xlog file restored from archive, or local? */
 static bool restoredFromArchive = false;

 /* options taken from recovery.conf */
 static char *recoveryRestoreCommand = NULL;
-static bool recoveryTarget = false;
 static bool recoveryTargetExact = false;
 static bool recoveryTargetInclusive = true;
 static TransactionId recoveryTargetXid;
 static TimestampTz recoveryTargetTime;
+static XLogRecPtr recoveryTargetLSN;
+static int recoveryTargetAdvance = 0;
+bool InHotStandby = true;
+
+/* recovery target modes */
+#define    RECOVERY_TARGET_NONE                0
+#define RECOVERY_TARGET_PAUSE_ALL            1
+#define RECOVERY_TARGET_PAUSE_XID            2
+#define RECOVERY_TARGET_PAUSE_TIME            3
+#define RECOVERY_TARGET_PAUSE_LSN            4
+#define RECOVERY_TARGET_ADVANCE                5
+#define RECOVERY_TARGET_STOP_IMMEDIATE        6
+#define RECOVERY_TARGET_STOP_XID            7
+#define RECOVERY_TARGET_STOP_TIME            8
+static int recoveryTargetMode = RECOVERY_TARGET_NONE;
+static bool recoveryStartsPaused = false;
+
+#define DEFAULT_MAX_STANDBY_DELAY     30
+int maxStandbyDelay = DEFAULT_MAX_STANDBY_DELAY;
+
 static TimestampTz recoveryLastXTime = 0;
+static TransactionId recoveryLastXid = InvalidTransactionId;

 /* if recoveryStopsHere returns true, it saves actual stop xid/time here */
 static TransactionId recoveryStopXid;
@@ -347,6 +377,16 @@ typedef struct XLogCtlData
     /* end+1 of the last record replayed (or being replayed) */
     XLogRecPtr    replayEndRecPtr;

+    int                recoveryTargetMode;
+    TransactionId    recoveryTargetXid;
+    TimestampTz        recoveryTargetTime;
+    int                recoveryTargetAdvance;
+    XLogRecPtr        recoveryTargetLSN;
+
+    TimestampTz     recoveryLastXTime;
+    TransactionId     recoveryLastXid;
+    XLogRecPtr        recoveryLastRecPtr;
+
     slock_t        info_lck;        /* locks shared variables shown above */
 } XLogCtlData;

@@ -432,7 +472,7 @@ static bool InRedo = false;
 static volatile sig_atomic_t shutdown_requested = false;
 /*
  * Flag set when executing a restore command, to tell SIGTERM signal handler
- * that it's safe to just proc_exit(0).
+ * that it's safe to just proc_exit.
  */
 static volatile sig_atomic_t in_restore_command = false;

@@ -879,25 +919,6 @@ begin:;
     FIN_CRC32(rdata_crc);
     record->xl_crc = rdata_crc;

-#ifdef WAL_DEBUG
-    if (XLOG_DEBUG)
-    {
-        StringInfoData buf;
-
-        initStringInfo(&buf);
-        appendStringInfo(&buf, "INSERT @ %X/%X: ",
-                         RecPtr.xlogid, RecPtr.xrecoff);
-        xlog_outrec(&buf, record);
-        if (rdata->data != NULL)
-        {
-            appendStringInfo(&buf, " - ");
-            RmgrTable[record->xl_rmid].rm_desc(&buf, record->xl_info, rdata->data);
-        }
-        elog(LOG, "%s", buf.data);
-        pfree(buf.data);
-    }
-#endif
-
     /* Record begin of record in appropriate places */
     ProcLastRecPtr = RecPtr;
     Insert->PrevRecord = RecPtr;
@@ -2752,7 +2773,7 @@ RestoreArchivedFile(char *path, const char *xlogfname,
      */
     in_restore_command = true;
     if (shutdown_requested)
-        proc_exit(0);
+        proc_exit(1);

     /*
      * Copy xlog from archival storage to XLOGDIR
@@ -2818,7 +2839,7 @@ RestoreArchivedFile(char *path, const char *xlogfname,
      * On SIGTERM, assume we have received a fast shutdown request, and exit
      * cleanly. It's pure chance whether we receive the SIGTERM first, or the
      * child process. If we receive it first, the signal handler will call
-     * proc_exit(0), otherwise we do it here. If we or the child process
+     * proc_exit, otherwise we do it here. If we or the child process
      * received SIGTERM for any other reason than a fast shutdown request,
      * postmaster will perform an immediate shutdown when it sees us exiting
      * unexpectedly.
@@ -2829,7 +2850,7 @@ RestoreArchivedFile(char *path, const char *xlogfname,
      * too.
      */
     if (WTERMSIG(rc) == SIGTERM)
-        proc_exit(0);
+        proc_exit(1);

     signaled = WIFSIGNALED(rc) || WEXITSTATUS(rc) > 125;

@@ -4695,7 +4716,7 @@ readRecoveryCommandFile(void)
             ereport(LOG,
                     (errmsg("recovery_target_xid = %u",
                             recoveryTargetXid)));
-            recoveryTarget = true;
+            recoveryTargetMode = RECOVERY_TARGET_STOP_XID;
             recoveryTargetExact = true;
         }
         else if (strcmp(tok1, "recovery_target_time") == 0)
@@ -4706,7 +4727,7 @@ readRecoveryCommandFile(void)
              */
             if (recoveryTargetExact)
                 continue;
-            recoveryTarget = true;
+            recoveryTargetMode = RECOVERY_TARGET_STOP_TIME;
             recoveryTargetExact = false;

             /*
@@ -4733,6 +4754,51 @@ readRecoveryCommandFile(void)
             ereport(LOG,
                     (errmsg("recovery_target_inclusive = %s", tok2)));
         }
+        else if (strcmp(tok1, "recovery_connections") == 0)
+        {
+            /*
+             * enables/disables snapshot processing and user connections
+             */
+            if (!parse_bool(tok2, &InHotStandby))
+                  ereport(ERROR,
+                            (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+                      errmsg("parameter \"recovery_connections\" requires a Boolean value")));
+            ereport(LOG,
+                    (errmsg("recovery_connections = %s", tok2)));
+        }
+        else if (strcmp(tok1, "recovery_starts_paused") == 0)
+        {
+            /*
+             * enables/disables snapshot processing and user connections
+             */
+            if (!parse_bool(tok2, &recoveryStartsPaused))
+                  ereport(ERROR,
+                            (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+                      errmsg("parameter \"recovery_starts_paused\" requires a Boolean value")));
+
+            ereport(LOG,
+                    (errmsg("recovery_starts_paused = %s", tok2)));
+        }
+        else if (strcmp(tok1, "max_standby_delay") == 0)
+        {
+            errno = 0;
+            maxStandbyDelay = (TransactionId) strtoul(tok2, NULL, 0);
+            if (errno == EINVAL || errno == ERANGE)
+                ereport(FATAL,
+                 (errmsg("max_standby_delay is not a valid number: \"%s\"",
+                         tok2)));
+            /*
+             * 2E6 seconds is about 23 days. Allows us to measure delay in
+             * milliseconds.
+             */
+            if (maxStandbyDelay > INT_MAX || maxStandbyDelay < -1)
+                ereport(FATAL,
+                 (errmsg("max_standby_delay must be between -1 (wait forever) and 2 000 000 secs")));
+
+            ereport(LOG,
+                    (errmsg("max_standby_delay = %u",
+                            maxStandbyDelay)));
+        }
         else
             ereport(FATAL,
                     (errmsg("unrecognized recovery parameter \"%s\"",
@@ -4882,8 +4948,8 @@ exitArchiveRecovery(TimeLineID endTLI, uint32 endLogId, uint32 endLogSeg)
 }

 /*
- * For point-in-time recovery, this function decides whether we want to
- * stop applying the XLOG at or after the current record.
+ * For archive recovery, this function decides whether we want to
+ * pause or stop applying the XLOG at or after the current record.
  *
  * Returns TRUE if we are stopping, FALSE otherwise.  On TRUE return,
  * *includeThis is set TRUE if we should apply this record before stopping.
@@ -4896,72 +4962,285 @@ exitArchiveRecovery(TimeLineID endTLI, uint32 endLogId, uint32 endLogSeg)
 static bool
 recoveryStopsHere(XLogRecord *record, bool *includeThis)
 {
-    bool        stopsHere;
-    uint8        record_info;
-    TimestampTz recordXtime;
-
+    bool        stopsHere = false;
+    bool        pauseHere = false;
+    static bool        paused = false;
+    uint8        record_info = 0;        /* valid iff (is_xact_completion_record) */
+    TimestampTz recordXtime = 0;
+    bool        is_xact_completion_record = false;
+
     /* We only consider stopping at COMMIT or ABORT records */
-    if (record->xl_rmid != RM_XACT_ID)
-        return false;
-    record_info = record->xl_info & ~XLR_INFO_MASK;
-    if (record_info == XLOG_XACT_COMMIT)
+    if (record->xl_rmid == RM_XACT_ID)
     {
-        xl_xact_commit *recordXactCommitData;
+        record_info = record->xl_info & ~XLR_INFO_MASK;
+        if (record_info == XLOG_XACT_COMMIT)
+        {
+            xl_xact_commit *recordXactCommitData;

-        recordXactCommitData = (xl_xact_commit *) XLogRecGetData(record);
-        recordXtime = recordXactCommitData->xact_time;
-    }
-    else if (record_info == XLOG_XACT_ABORT)
-    {
-        xl_xact_abort *recordXactAbortData;
+            recordXactCommitData = (xl_xact_commit *) XLogRecGetData(record);
+            recordXtime = recordXactCommitData->xact_time;
+            is_xact_completion_record = true;
+        }
+        else if (record_info == XLOG_XACT_ABORT)
+        {
+            xl_xact_abort *recordXactAbortData;

-        recordXactAbortData = (xl_xact_abort *) XLogRecGetData(record);
-        recordXtime = recordXactAbortData->xact_time;
-    }
-    else
-        return false;
+            recordXactAbortData = (xl_xact_abort *) XLogRecGetData(record);
+            recordXtime = recordXactAbortData->xact_time;
+            is_xact_completion_record = true;
+        }

-    /* Do we have a PITR target at all? */
-    if (!recoveryTarget)
-    {
-        recoveryLastXTime = recordXtime;
-        return false;
+        /* Remember the most recent COMMIT/ABORT time for logging purposes */
+        if (is_xact_completion_record)
+        {
+            recoveryLastXTime = recordXtime;
+            recoveryLastXid = record->xl_xid;
+        }
     }

-    if (recoveryTargetExact)
+    do
     {
+        int    prevRecoveryTargetMode = recoveryTargetMode;
+
+        CHECK_FOR_INTERRUPTS();
+
         /*
-         * there can be only one transaction end record with this exact
-         * transactionid
-         *
-         * when testing for an xid, we MUST test for equality only, since
-         * transactions are numbered in the order they start, not the order
-         * they complete. A higher numbered xid will complete before you about
-         * 50% of the time...
+         * Check if we were requested to exit without finishing
+         * recovery.
          */
-        stopsHere = (record->xl_xid == recoveryTargetXid);
-        if (stopsHere)
-            *includeThis = recoveryTargetInclusive;
-    }
-    else
-    {
+        if (shutdown_requested)
+            proc_exit(1);
+
         /*
-         * there can be many transactions that share the same commit time, so
-         * we stop after the last one, if we are inclusive, or stop at the
-         * first one if we are exclusive
+         * Let's see if user has updated our recoveryTargetMode.
          */
-        if (recoveryTargetInclusive)
-            stopsHere = (recordXtime > recoveryTargetTime);
-        else
-            stopsHere = (recordXtime >= recoveryTargetTime);
-        if (stopsHere)
-            *includeThis = false;
+        {
+            /* use volatile pointer to prevent code rearrangement */
+            volatile XLogCtlData *xlogctl = XLogCtl;
+
+            SpinLockAcquire(&xlogctl->info_lck);
+            recoveryTargetMode = xlogctl->recoveryTargetMode;
+            if (recoveryTargetMode != RECOVERY_TARGET_NONE)
+            {
+                recoveryTargetXid = xlogctl->recoveryTargetXid;
+                recoveryTargetTime = xlogctl->recoveryTargetTime;
+
+                /* Don't reset counter while we're advancing */
+                if (recoveryTargetAdvance <= 0)
+                {
+                    recoveryTargetAdvance = xlogctl->recoveryTargetAdvance;
+                    xlogctl->recoveryTargetAdvance = 0;
+                }
+            }
+            if (is_xact_completion_record)
+            {
+                xlogctl->recoveryLastXTime = recordXtime;
+                xlogctl->recoveryLastXid = record->xl_xid;
+            }
+            xlogctl->recoveryLastRecPtr = LastRec;
+            SpinLockRelease(&xlogctl->info_lck);
+        }
+
+        /* Decide how to act on any pause target */
+        switch (recoveryTargetMode)
+        {
+            case RECOVERY_TARGET_PAUSE_LSN:
+                    return false;
+
+            case RECOVERY_TARGET_NONE:
+                    /*
+                     * If we aren't paused and we're not looking to stop,
+                     * just exit out quickly and get on with recovery.
+                     */
+                    if (paused)
+                    {
+                        ereport(LOG,
+                                (errmsg("recovery restarting after pause")));
+                        set_ps_display("recovery continues", false);
+                        paused = false;
+                    }
+                    return false;
+
+            case RECOVERY_TARGET_PAUSE_ALL:
+                    pauseHere = true;
+                    break;
+
+            case RECOVERY_TARGET_ADVANCE:
+                    if (paused)
+                    {
+                        if (recoveryTargetAdvance-- > 0)
+                        {
+                            elog(LOG, "recovery advancing 1 record");
+                            return false;
+                        }
+                        else
+                            break;
+                    }
+
+                    if (recoveryTargetAdvance-- <= 0)
+                        pauseHere = true;
+                    break;
+
+            case RECOVERY_TARGET_STOP_IMMEDIATE:
+            case RECOVERY_TARGET_STOP_XID:
+            case RECOVERY_TARGET_STOP_TIME:
+                    paused = false;
+                    break;
+
+            /*
+             * If we're paused, and mode has changed reset to allow new settings
+             * to apply and maybe allow us to continue.
+             */
+            if (paused && prevRecoveryTargetMode != recoveryTargetMode)
+                paused = false;
+
+            case RECOVERY_TARGET_PAUSE_XID:
+                    /*
+                     * there can be only one transaction end record with this exact
+                     * transactionid
+                     *
+                     * when testing for an xid, we MUST test for equality only, since
+                     * transactions are numbered in the order they start, not the order
+                     * they complete. A higher numbered xid will complete before you about
+                     * 50% of the time...
+                     */
+                    if (is_xact_completion_record)
+                        pauseHere = (record->xl_xid == recoveryTargetXid);
+                    break;
+
+            case RECOVERY_TARGET_PAUSE_TIME:
+                    /*
+                     * there can be many transactions that share the same commit time, so
+                     * we pause after the last one, if we are inclusive, or pause at the
+                     * first one if we are exclusive
+                     */
+                    if (is_xact_completion_record)
+                    {
+                        if (recoveryTargetInclusive)
+                            pauseHere = (recoveryLastXTime > recoveryTargetTime);
+                        else
+                            pauseHere = (recoveryLastXTime >= recoveryTargetTime);
+                    }
+                    break;
+
+            default:
+                    ereport(WARNING,
+                            (errmsg("unknown recovery mode %d, continuing recovery",
+                                            recoveryTargetMode)));
+                    return false;
+        }
+
+        /*
+         * If we just entered pause, issue log messages
+         */
+        if (pauseHere && !paused)
+        {
+            if (is_xact_completion_record)
+            {
+                if (record_info == XLOG_XACT_COMMIT)
+                    ereport(LOG,
+                        (errmsg("recovery pausing before commit of transaction %u, log time %s",
+                                    record->xl_xid,
+                                    timestamptz_to_str(recoveryLastXTime))));
+                else
+                    ereport(LOG,
+                        (errmsg("recovery pausing before abort of transaction %u, log time %s",
+                                    record->xl_xid,
+                                    timestamptz_to_str(recoveryLastXTime))));
+            }
+            else
+                ereport(LOG,
+                        (errmsg("recovery pausing; last recovered transaction %u, "
+                                "last recovered xact timestamp %s",
+                                    recoveryLastXid,
+                                    timestamptz_to_str(recoveryLastXTime))));
+
+            set_ps_display("recovery paused", false);
+
+            paused = true;
+        }
+
+        /*
+         * Pause for a while before rechecking mode at top of loop.
+         */
+        if (paused)
+        {
+            recoveryTargetAdvance = 0;
+
+            /*
+             * Update the recoveryTargetMode
+             */
+            {
+                /* use volatile pointer to prevent code rearrangement */
+                volatile XLogCtlData *xlogctl = XLogCtl;
+
+                SpinLockAcquire(&xlogctl->info_lck);
+                xlogctl->recoveryTargetMode = RECOVERY_TARGET_PAUSE_ALL;
+                xlogctl->recoveryTargetAdvance = 0;
+                SpinLockRelease(&xlogctl->info_lck);
+            }
+
+            pg_usleep(200000L);
+        }
+
+        /*
+         * We leave the loop at the bottom only if our recovery mode is
+         * set (or has been recently reset) to one of the stop options.
+         */
+    } while (paused);
+
+    /*
+     * Decide how to act if stop target mode set. We run this separately from
+     * pause to allow user to reset their stop target while paused.
+     */
+    switch (recoveryTargetMode)
+    {
+        case RECOVERY_TARGET_STOP_IMMEDIATE:
+                ereport(LOG,
+                        (errmsg("recovery stopping immediately due to user request")));
+                return true;
+
+        case RECOVERY_TARGET_STOP_XID:
+                /*
+                 * there can be only one transaction end record with this exact
+                 * transactionid
+                 *
+                 * when testing for an xid, we MUST test for equality only, since
+                 * transactions are numbered in the order they start, not the order
+                 * they complete. A higher numbered xid will complete before you about
+                 * 50% of the time...
+                 */
+                if (is_xact_completion_record)
+                {
+                    stopsHere = (record->xl_xid == recoveryTargetXid);
+                    if (stopsHere)
+                        *includeThis = recoveryTargetInclusive;
+                }
+                break;
+
+        case RECOVERY_TARGET_STOP_TIME:
+                /*
+                 * there can be many transactions that share the same commit time, so
+                 * we stop after the last one, if we are inclusive, or stop at the
+                 * first one if we are exclusive
+                 */
+                if (is_xact_completion_record)
+                {
+                    if (recoveryTargetInclusive)
+                        stopsHere = (recoveryLastXTime > recoveryTargetTime);
+                    else
+                        stopsHere = (recoveryLastXTime >= recoveryTargetTime);
+                    if (stopsHere)
+                        *includeThis = false;
+                }
+                break;
     }

     if (stopsHere)
     {
+        Assert(is_xact_completion_record);
         recoveryStopXid = record->xl_xid;
-        recoveryStopTime = recordXtime;
+        recoveryStopTime = recoveryLastXTime;
         recoveryStopAfter = *includeThis;

         if (record_info == XLOG_XACT_COMMIT)
@@ -4990,14 +5269,340 @@ recoveryStopsHere(XLogRecord *record, bool *includeThis)
                                 recoveryStopXid,
                                 timestamptz_to_str(recoveryStopTime))));
         }
+    }

-        if (recoveryStopAfter)
-            recoveryLastXTime = recordXtime;
+    return stopsHere;
+}
+
+static void
+recoveryPausesAfterLSN(void)
+{
+    while (recoveryTargetMode == RECOVERY_TARGET_PAUSE_LSN &&
+            XLByteLE(recoveryTargetLSN, LastRec))
+    {
+        {
+            /* use volatile pointer to prevent code rearrangement */
+            volatile XLogCtlData *xlogctl = XLogCtl;
+
+            SpinLockAcquire(&xlogctl->info_lck);
+            recoveryTargetMode = xlogctl->recoveryTargetMode;
+            recoveryTargetLSN = xlogctl->recoveryTargetLSN;
+            SpinLockRelease(&xlogctl->info_lck);
+        }
+
+        pg_usleep(25000L);
+    }
+}
+
+/*
+ * Utility function used by various user functions to set the recovery
+ * target mode. This allows user control over the progress of recovery.
+ */
+static void
+SetRecoveryTargetMode(int mode, TransactionId xid, TimestampTz ts,
+                        XLogRecPtr lsn, int advance)
+{
+    if (!RecoveryInProgress())
+        ereport(ERROR,
+                (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+                 errmsg("recovery is not in progress"),
+                 errhint("WAL control functions can only be executed during recovery.")));
+
+    if (!InRecovery && !superuser())
+        ereport(ERROR,
+                (errcode(ERRCODE_INSUFFICIENT_PRIVILEGE),
+                 errmsg("must be superuser to control recovery")));
+
+
+    {
+        /* use volatile pointer to prevent code rearrangement */
+        volatile XLogCtlData *xlogctl = XLogCtl;
+
+        SpinLockAcquire(&xlogctl->info_lck);
+        xlogctl->recoveryTargetMode = mode;
+
+        if (mode == RECOVERY_TARGET_STOP_XID ||
+            mode == RECOVERY_TARGET_PAUSE_XID)
+            xlogctl->recoveryTargetXid = xid;
+        else if (mode == RECOVERY_TARGET_STOP_TIME ||
+                  mode == RECOVERY_TARGET_PAUSE_TIME)
+            xlogctl->recoveryTargetTime = ts;
+        else if (mode == RECOVERY_TARGET_ADVANCE)
+            xlogctl->recoveryTargetAdvance = advance;
+        else if (mode == RECOVERY_TARGET_PAUSE_LSN)
+            xlogctl->recoveryTargetLSN = lsn;
+
+        SpinLockRelease(&xlogctl->info_lck);
+    }
+}
+
+/*
+ * Forces recovery mode to reset to unfrozen.
+ * Returns void.
+ */
+Datum
+pg_recovery_continue(PG_FUNCTION_ARGS)
+{
+    SetRecoveryTargetMode(RECOVERY_TARGET_NONE,
+                            InvalidTransactionId, 0, InvalidXLogRecPtr, 0);
+
+    PG_RETURN_VOID();
+}
+
+/*
+ * Pause recovery immediately. Stays paused until asked to play again.
+ * Returns void.
+ */
+Datum
+pg_recovery_pause(PG_FUNCTION_ARGS)
+{
+    SetRecoveryTargetMode(RECOVERY_TARGET_PAUSE_ALL,
+                            InvalidTransactionId, 0, InvalidXLogRecPtr, 0);
+
+    PG_RETURN_VOID();
+}
+
+/*
+ * Pause recovery at stated xid, if ever seen. Once paused, stays paused
+ * until asked to play again.
+ */
+Datum
+pg_recovery_pause_xid(PG_FUNCTION_ARGS)
+{
+    int              xidi = PG_GETARG_INT32(0);
+    TransactionId xid = (TransactionId) xidi;
+
+    if (xid < 3)
+        elog(ERROR, "cannot specify special values for transaction id");
+
+    SetRecoveryTargetMode(RECOVERY_TARGET_PAUSE_XID,
+                            xid, 0, InvalidXLogRecPtr, 0);
+
+    PG_RETURN_VOID();
+}
+
+/*
+ * Pause recovery at stated timestamp, if ever reached. Once paused, stays paused
+ * until asked to play again.
+ */
+Datum
+pg_recovery_pause_time(PG_FUNCTION_ARGS)
+{
+    TimestampTz ts = PG_GETARG_TIMESTAMPTZ(0);
+
+    SetRecoveryTargetMode(RECOVERY_TARGET_PAUSE_TIME,
+                            InvalidTransactionId, ts, InvalidXLogRecPtr, 0);
+
+    PG_RETURN_VOID();
+}
+
+/*
+ * Pause recovery after stated LSN, if ever reached. Once paused, stays paused
+ * until asked to play again.
+ */
+Datum
+pg_recovery_pause_lsn(PG_FUNCTION_ARGS)
+{
+    XLogRecPtr lsn;
+
+    lsn.xlogid = PG_GETARG_INT32(0);
+    lsn.xrecoff = PG_GETARG_INT32(1);
+
+    SetRecoveryTargetMode(RECOVERY_TARGET_PAUSE_LSN,
+                            InvalidTransactionId, 0, lsn, 0);
+
+    PG_RETURN_VOID();
+}
+
+/*
+ * If paused, advance N records.
+ */
+Datum
+pg_recovery_advance(PG_FUNCTION_ARGS)
+{
+    int adv = PG_GETARG_INT32(0);
+
+    if (adv < 1)
+        elog(ERROR, "recovery advance must be greater than or equal to 1");
+
+    SetRecoveryTargetMode(RECOVERY_TARGET_ADVANCE,
+                            InvalidTransactionId, 0, InvalidXLogRecPtr, adv);
+
+    PG_RETURN_VOID();
+}
+
+/*
+ * Forces recovery to stop now if paused, or at end of next record if playing.
+ */
+Datum
+pg_recovery_stop(PG_FUNCTION_ARGS)
+{
+    SetRecoveryTargetMode(RECOVERY_TARGET_STOP_IMMEDIATE,
+                            InvalidTransactionId, 0, InvalidXLogRecPtr, 0);
+
+    PG_RETURN_VOID();
+}
+
+Datum
+pg_current_recovery_target(PG_FUNCTION_ARGS)
+{
+    StringInfoData buf;
+
+    initStringInfo(&buf);
+
+    {
+        /* use volatile pointer to prevent code rearrangement */
+        volatile XLogCtlData *xlogctl = XLogCtl;
+
+        SpinLockAcquire(&xlogctl->info_lck);
+
+        recoveryTargetMode = xlogctl->recoveryTargetMode;
+        if (recoveryTargetMode != RECOVERY_TARGET_NONE)
+        {
+            recoveryTargetXid = xlogctl->recoveryTargetXid;
+            recoveryTargetTime = xlogctl->recoveryTargetTime;
+            recoveryTargetAdvance = xlogctl->recoveryTargetAdvance;
+        }
+
+        SpinLockRelease(&xlogctl->info_lck);
+    }
+
+    switch (recoveryTargetMode)
+    {
+        case RECOVERY_TARGET_NONE:
+                appendStringInfo(&buf, "No recovery target has been set");
+                break;
+        case RECOVERY_TARGET_PAUSE_ALL:
+                appendStringInfo(&buf, "Recovery paused");
+                break;
+        case RECOVERY_TARGET_PAUSE_XID:
+                appendStringInfo(&buf, "Recovery will pause after commit of transaction %u", recoveryTargetXid);
+                break;
+        case RECOVERY_TARGET_PAUSE_TIME:
+                appendStringInfo(&buf, "Recovery will pause after transaction completion timestamp %s",
+                                        timestamptz_to_str(recoveryTargetTime));
+                break;
+        case RECOVERY_TARGET_PAUSE_LSN:
+                appendStringInfo(&buf, "Recovery will pause after applying record at xlog location %X/%X",
+                                        recoveryTargetLSN.xlogid,
+                                        recoveryTargetLSN.xrecoff);
+                break;
+        case RECOVERY_TARGET_ADVANCE:
+                appendStringInfo(&buf, "Recovery will advance");
+                break;
+        case RECOVERY_TARGET_STOP_IMMEDIATE:
+                appendStringInfo(&buf, "No recovery target has been set");
+                break;
+        case RECOVERY_TARGET_STOP_XID:
+                appendStringInfo(&buf, "Recovery will stop after commit of transaction %u", recoveryTargetXid);
+                break;
+        case RECOVERY_TARGET_STOP_TIME:
+                appendStringInfo(&buf, "Recovery will stop after transaction completion timestamp %s",
+                                        timestamptz_to_str(recoveryTargetTime));
+                break;
     }
+
+    PG_RETURN_TEXT_P(cstring_to_text(buf.data));
+}
+
+/*
+ * Returns bool with current recovery mode, a global state.
+ */
+Datum
+pg_is_in_recovery(PG_FUNCTION_ARGS)
+{
+    PG_RETURN_BOOL(RecoveryInProgress());
+}
+
+/*
+ * Returns timestamp of last completed transaction
+ */
+Datum
+pg_last_recovered_xact_timestamp(PG_FUNCTION_ARGS)
+{
+    {
+        /* use volatile pointer to prevent code rearrangement */
+        volatile XLogCtlData *xlogctl = XLogCtl;
+
+        SpinLockAcquire(&xlogctl->info_lck);
+
+        recoveryLastXTime = xlogctl->recoveryLastXTime;
+
+        SpinLockRelease(&xlogctl->info_lck);
+    }
+
+    PG_RETURN_TIMESTAMPTZ(recoveryLastXTime);
+}
+
+/*
+ * Returns xid of last completed transaction
+ */
+Datum
+pg_last_recovered_xid(PG_FUNCTION_ARGS)
+{
+    {
+        /* use volatile pointer to prevent code rearrangement */
+        volatile XLogCtlData *xlogctl = XLogCtl;
+
+        SpinLockAcquire(&xlogctl->info_lck);
+
+        recoveryLastXid = xlogctl->recoveryLastXid;
+
+        SpinLockRelease(&xlogctl->info_lck);
+    }
+
+    PG_RETURN_INT32(recoveryLastXid);
+}
+
+/*
+ * Returns xlog location of last recovered WAL record.
+ */
+Datum
+pg_last_recovered_xlog_location(PG_FUNCTION_ARGS)
+{
+    char        location[MAXFNAMELEN];
+
+    {
+        /* use volatile pointer to prevent code rearrangement */
+        volatile XLogCtlData *xlogctl = XLogCtl;
+
+        SpinLockAcquire(&xlogctl->info_lck);
+
+        LastRec = xlogctl->recoveryLastRecPtr;
+
+        SpinLockRelease(&xlogctl->info_lck);
+    }
+
+    snprintf(location, sizeof(location), "%X/%X",
+             LastRec.xlogid, LastRec.xrecoff);
+    PG_RETURN_TEXT_P(cstring_to_text(location));
+}
+
+/*
+ * Returns delay in milliseconds, or -1 if delay too large
+ */
+int
+GetLatestReplicationDelay(void)
+{
+    long        delay_secs;
+    int            delay_usecs;
+    int            delay;
+    TimestampTz currTz = GetCurrentTimestamp();
+
+    TimestampDifference(recoveryLastXTime, currTz,
+                        &delay_secs, &delay_usecs);
+
+    /*
+     * If delay is very large we probably aren't looking at
+     * a replication situation at all, just a recover from backup.
+     * So return a special value instead.
+     */
+    if (delay_secs > (long)(INT_MAX / 1000))
+        delay = -1;
     else
-        recoveryLastXTime = recordXtime;
+        delay = (int)(delay_secs * 1000) + (delay_usecs / 1000);

-    return stopsHere;
+    return delay;
 }

 /*
@@ -5012,7 +5617,6 @@ StartupXLOG(void)
     bool        reachedStopPoint = false;
     bool        haveBackupLabel = false;
     XLogRecPtr    RecPtr,
-                LastRec,
                 checkPointLoc,
                 backupStopLoc,
                 EndOfLog;
@@ -5088,6 +5692,16 @@ StartupXLOG(void)
      */
     readRecoveryCommandFile();

+    /*
+     * PostAuthDelay is a debugging aid for investigating problems in startup
+     * and/or recovery: it can be set in postgresql.conf to allow time to
+     * attach to the newly-forked backend with a debugger. It can also be set
+     * using the postmaster -W switch, which can be specified using the -o
+     * option of pg_ctl, e.g. pg_ctl -D data -o "-W 30"
+     */
+    if (PostAuthDelay > 0)
+        pg_usleep(PostAuthDelay * 1000000L);
+
     /* Now we can determine the list of expected TLIs */
     expectedTLIs = readTimeLineHistory(recoveryTargetTLI);

@@ -5344,7 +5958,9 @@ StartupXLOG(void)
             do
             {
 #ifdef WAL_DEBUG
-                if (XLOG_DEBUG)
+                if (XLOG_DEBUG ||
+                    (rmid == RM_XACT_ID && trace_recovery_messages <= DEBUG2) ||
+                    (rmid != RM_XACT_ID && trace_recovery_messages <= DEBUG3))
                 {
                     StringInfoData buf;

@@ -5361,14 +5977,6 @@ StartupXLOG(void)
                     pfree(buf.data);
                 }
 #endif
-
-                /*
-                 * Check if we were requested to exit without finishing
-                 * recovery.
-                 */
-                if (shutdown_requested)
-                    proc_exit(0);
-
                 /*
                  * Have we reached our safe starting point? If so, we can
                  * tell postmaster that the database is consistent now.
@@ -5381,8 +5989,17 @@ StartupXLOG(void)
                     {
                         ereport(LOG,
                                 (errmsg("consistent recovery state reached")));
+                        if (InHotStandby && IsRunningXactDataValid())
+                        {
+                            InitRecoveryTransactionEnvironment();
+                            StartCleanupDelayStats();
+                        }
                         if (IsUnderPostmaster)
                             SendPostmasterSignal(PMSIGNAL_RECOVERY_CONSISTENT);
+                        if (InHotStandby && IsRunningXactDataValid() &&
+                            recoveryStartsPaused)
+                            SetRecoveryTargetMode(RECOVERY_TARGET_PAUSE_ALL,
+                                InvalidTransactionId, 0, InvalidXLogRecPtr, 0);
                     }
                 }

@@ -5427,6 +6044,8 @@ StartupXLOG(void)

                 LastRec = ReadRecPtr;

+                recoveryPausesAfterLSN();
+
                 record = ReadRecord(NULL, LOG);
             } while (record != NULL && recoveryContinue);

@@ -5629,6 +6248,9 @@ StartupXLOG(void)
     ShmemVariableCache->latestCompletedXid = ShmemVariableCache->nextXid;
     TransactionIdRetreat(ShmemVariableCache->latestCompletedXid);

+    /* Shutdown the recovery environment. */
+    XactClearRecoveryTransactions();
+
     /* Start up the commit log and related stuff, too */
     StartupCLOG();
     StartupSUBTRANS(oldestActiveXID);
@@ -5686,7 +6308,8 @@ RecoveryInProgress(void)

         /*
          * Initialize TimeLineID and RedoRecPtr the first time we see that
-         * recovery is finished.
+         * recovery is finished. InitPostgres() relies upon this behaviour
+         * to ensure that InitXLOGAccess() is called at backend startup.
          */
         if (!LocalRecoveryInProgress)
             InitXLOGAccess();
@@ -5824,7 +6447,7 @@ InitXLOGAccess(void)
 {
     /* ThisTimeLineID doesn't change so we need no lock to copy it */
     ThisTimeLineID = XLogCtl->ThisTimeLineID;
-    Assert(ThisTimeLineID != 0);
+    Assert(ThisTimeLineID != 0 || IsBootstrapProcessingMode());

     /* Use GetRedoRecPtr to copy the RedoRecPtr safely */
     (void) GetRedoRecPtr();
@@ -6380,8 +7003,19 @@ CreateCheckPoint(int flags)
                                 CheckpointStats.ckpt_segs_recycled);

     LWLockRelease(CheckpointLock);
-}

+    /*
+     * Take a snapshot of running transactions and write this to WAL.
+     * This allows us to reconstruct the state of running transactions
+     * during archive recovery, if required.
+     *
+     * If we are shutting down, or Startup process is completing crash
+     * recovery we don't need to write running xact data.
+     */
+    if (!shutdown && !RecoveryInProgress())
+        LogCurrentRunningXacts();
+}
+
 /*
  * Flush all data in shared memory to disk, and fsync
  *
@@ -6413,6 +7047,11 @@ RecoveryRestartPoint(const CheckPoint *checkPoint)
     volatile XLogCtlData *xlogctl = XLogCtl;

     /*
+     * Regular reports of wait statistics. Unrelated to restartpoints.
+     */
+    ReportCleanupDelayStats();
+
+    /*
      * Is it safe to checkpoint?  We must ask each of the resource managers
      * whether they have any partial state information that might prevent a
      * correct restart from this point.  If so, we skip this opportunity, but
@@ -6423,7 +7062,7 @@ RecoveryRestartPoint(const CheckPoint *checkPoint)
         if (RmgrTable[rmid].rm_safe_restartpoint != NULL)
             if (!(RmgrTable[rmid].rm_safe_restartpoint()))
             {
-                elog(DEBUG2, "RM %d not safe to record restart point at %X/%X",
+                elog(trace_recovery(DEBUG2), "RM %d not safe to record restart point at %X/%X",
                      rmid,
                      checkPoint->redo.xlogid,
                      checkPoint->redo.xrecoff);
@@ -6511,7 +7150,7 @@ CreateRestartPoint(int flags)

     if (log_checkpoints)
     {
-        /*
+          /*
          * Prepare to accumulate statistics.
          */
         MemSet(&CheckpointStats, 0, sizeof(CheckpointStats));
@@ -6555,7 +7194,7 @@ CreateRestartPoint(int flags)
     LWLockRelease(CheckpointLock);
     return true;
 }
-
+
 /*
  * Write a NEXTOID log record
  */
@@ -6635,6 +7274,9 @@ xlog_redo(XLogRecPtr lsn, XLogRecord *record)
     {
         Oid            nextOid;

+        if (InArchiveRecovery)
+            (void) RecordKnownAssignedTransactionIds(lsn, record->xl_xid);
+
         memcpy(&nextOid, XLogRecGetData(record), sizeof(Oid));
         if (ShmemVariableCache->nextOid < nextOid)
         {
@@ -6654,6 +7296,9 @@ xlog_redo(XLogRecPtr lsn, XLogRecord *record)
         MultiXactSetNextMXact(checkPoint.nextMulti,
                               checkPoint.nextMultiOffset);

+        /* We know nothing was running on the master at this point */
+        XactClearRecoveryTransactions();
+
         /* ControlFile->checkPointCopy always tracks the latest ckpt XID */
         ControlFile->checkPointCopy.nextXidEpoch = checkPoint.nextXidEpoch;
         ControlFile->checkPointCopy.nextXid = checkPoint.nextXid;
@@ -6764,6 +7409,9 @@ xlog_outrec(StringInfo buf, XLogRecord *record)
                      record->xl_prev.xlogid, record->xl_prev.xrecoff,
                      record->xl_xid);

+    appendStringInfo(buf, "; len %u",
+                     record->xl_len);
+
     for (i = 0; i < XLR_MAX_BKP_BLOCKS; i++)
     {
         if (record->xl_info & XLR_SET_BKP_BLOCK(i))
@@ -6919,6 +7567,12 @@ pg_start_backup(PG_FUNCTION_ARGS)
                 (errcode(ERRCODE_INSUFFICIENT_PRIVILEGE),
                  errmsg("must be superuser to run a backup")));

+    if (RecoveryInProgress())
+        ereport(ERROR,
+                (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+                 errmsg("recovery is in progress"),
+                 errhint("WAL control functions cannot be executed during recovery.")));
+
     if (!XLogArchivingActive())
         ereport(ERROR,
                 (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
@@ -7091,6 +7745,12 @@ pg_stop_backup(PG_FUNCTION_ARGS)
                 (errcode(ERRCODE_INSUFFICIENT_PRIVILEGE),
                  (errmsg("must be superuser to run a backup"))));

+    if (RecoveryInProgress())
+        ereport(ERROR,
+                (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+                 errmsg("recovery is in progress"),
+                 errhint("WAL control functions cannot be executed during recovery.")));
+
     if (!XLogArchivingActive())
         ereport(ERROR,
                 (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
@@ -7252,6 +7912,12 @@ pg_switch_xlog(PG_FUNCTION_ARGS)
                 (errcode(ERRCODE_INSUFFICIENT_PRIVILEGE),
              (errmsg("must be superuser to switch transaction log files"))));

+    if (RecoveryInProgress())
+        ereport(ERROR,
+                (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+                 errmsg("recovery is in progress"),
+                 errhint("WAL control functions cannot be executed during recovery.")));
+
     switchpoint = RequestXLogSwitch();

     /*
@@ -7274,6 +7940,12 @@ pg_current_xlog_location(PG_FUNCTION_ARGS)
 {
     char        location[MAXFNAMELEN];

+    if (RecoveryInProgress())
+        ereport(ERROR,
+                (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+                 errmsg("recovery is in progress"),
+                 errhint("WAL control functions cannot be executed during recovery.")));
+
     /* Make sure we have an up-to-date local LogwrtResult */
     {
         /* use volatile pointer to prevent code rearrangement */
@@ -7301,6 +7973,12 @@ pg_current_xlog_insert_location(PG_FUNCTION_ARGS)
     XLogRecPtr    current_recptr;
     char        location[MAXFNAMELEN];

+    if (RecoveryInProgress())
+        ereport(ERROR,
+                (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+                 errmsg("recovery is in progress"),
+                 errhint("WAL control functions cannot be executed during recovery.")));
+
     /*
      * Get the current end-of-WAL position ... shared lock is sufficient
      */
@@ -7646,7 +8324,7 @@ static void
 StartupProcShutdownHandler(SIGNAL_ARGS)
 {
     if (in_restore_command)
-        proc_exit(0);
+        proc_exit(1);
     else
         shutdown_requested = true;
 }
@@ -7694,9 +8372,9 @@ StartupProcessMain(void)

     BuildFlatFiles(false);

-    /* Let postmaster know that startup is finished */
-    SendPostmasterSignal(PMSIGNAL_RECOVERY_COMPLETED);
-
-    /* exit normally */
+    /*
+     * Exit normally. Exit code 0 tells postmaster that we completed
+     * recovery successfully.
+     */
     proc_exit(0);
 }
diff --git a/src/backend/catalog/storage.c b/src/backend/catalog/storage.c
index 309fa46..ab7feef 100644
--- a/src/backend/catalog/storage.c
+++ b/src/backend/catalog/storage.c
@@ -404,6 +404,9 @@ smgr_redo(XLogRecPtr lsn, XLogRecord *record)
     /* Backup blocks are not used in smgr records */
     Assert(!(record->xl_info & XLR_BKP_BLOCK_MASK));

+    if (InArchiveRecovery)
+        (void) RecordKnownAssignedTransactionIds(lsn, record->xl_xid);
+
     if (info == XLOG_SMGR_CREATE)
     {
         xl_smgr_create *xlrec = (xl_smgr_create *) XLogRecGetData(record);
diff --git a/src/backend/commands/dbcommands.c b/src/backend/commands/dbcommands.c
index d08e3d0..5bd61f6 100644
--- a/src/backend/commands/dbcommands.c
+++ b/src/backend/commands/dbcommands.c
@@ -26,6 +26,7 @@

 #include "access/genam.h"
 #include "access/heapam.h"
+#include "access/transam.h"
 #include "access/xact.h"
 #include "access/xlogutils.h"
 #include "catalog/catalog.h"
@@ -52,6 +53,7 @@
 #include "utils/flatfiles.h"
 #include "utils/fmgroids.h"
 #include "utils/guc.h"
+#include "utils/inval.h"
 #include "utils/lsyscache.h"
 #include "utils/pg_locale.h"
 #include "utils/snapmgr.h"
@@ -1966,6 +1968,14 @@ dbase_redo(XLogRecPtr lsn, XLogRecord *record)
         src_path = GetDatabasePath(xlrec->src_db_id, xlrec->src_tablespace_id);
         dst_path = GetDatabasePath(xlrec->db_id, xlrec->tablespace_id);

+        if (InArchiveRecovery)
+        {
+            /*
+             * No conflict resolution is required for a create database record
+             */
+            (void) RecordKnownAssignedTransactionIds(lsn, record->xl_xid);
+        }
+
         /*
          * Our theory for replaying a CREATE is to forcibly drop the target
          * subdirectory if present, then re-copy the source data. This may be
@@ -1999,6 +2009,28 @@ dbase_redo(XLogRecPtr lsn, XLogRecord *record)

         dst_path = GetDatabasePath(xlrec->db_id, xlrec->tablespace_id);

+        if (InArchiveRecovery &&
+            RecordKnownAssignedTransactionIds(lsn, record->xl_xid))
+        {
+            VirtualTransactionId *database_users;
+
+            /*
+             * Find all users connected to this database and ask them
+             * politely to kill themselves before processing the
+             * drop database record, after the usual grace period.
+             * We don't wait for commit because drop database is
+             * non-transactional.
+             */
+            database_users = GetConflictingVirtualXIDs(InvalidTransactionId,
+                                                        xlrec->db_id,
+                                                        InvalidTransactionId);
+
+            ResolveRecoveryConflictWithVirtualXIDs(database_users,
+                                                    "drop database",
+                                                    FATAL,
+                                                    InvalidXLogRecPtr);
+        }
+
         /* Drop pages for this database that are in the shared buffer cache */
         DropDatabaseBuffers(xlrec->db_id);

diff --git a/src/backend/commands/discard.c b/src/backend/commands/discard.c
index 348e6e0..613dbc1 100644
--- a/src/backend/commands/discard.c
+++ b/src/backend/commands/discard.c
@@ -65,7 +65,8 @@ DiscardAll(bool isTopLevel)
     ResetAllOptions();
     DropAllPreparedStatements();
     PortalHashTableDeleteAll();
-    Async_UnlistenAll();
+    if (!RecoveryInProgress())
+        Async_UnlistenAll();
     LockReleaseAll(USER_LOCKMETHOD, true);
     ResetPlanCache();
     ResetTempTableNamespace();
diff --git a/src/backend/commands/indexcmds.c b/src/backend/commands/indexcmds.c
index 8c36d7d..a4ca7a2 100644
--- a/src/backend/commands/indexcmds.c
+++ b/src/backend/commands/indexcmds.c
@@ -648,7 +648,7 @@ DefineIndex(RangeVar *heapRelation,
      * Also, GetCurrentVirtualXIDs never reports our own vxid, so we need not
      * check for that.
      */
-    old_snapshots = GetCurrentVirtualXIDs(snapshot->xmax, false,
+    old_snapshots = GetCurrentVirtualXIDs(snapshot->xmax, MyDatabaseId,
                                           PROC_IS_AUTOVACUUM | PROC_IN_VACUUM);

     while (VirtualTransactionIdIsValid(*old_snapshots))
diff --git a/src/backend/commands/lockcmds.c b/src/backend/commands/lockcmds.c
index e32b184..fe1e518 100644
--- a/src/backend/commands/lockcmds.c
+++ b/src/backend/commands/lockcmds.c
@@ -48,6 +48,16 @@ LockTableCommand(LockStmt *lockstmt)

         reloid = RangeVarGetRelid(relation, false);

+        /*
+         * During recovery we only accept these variations:
+         *
+         * LOCK TABLE foo       -- parser translates as AccessEclusiveLock request
+         * LOCK TABLE foo IN AccessShareLock MODE
+         * LOCK TABLE foo IN AccessExclusiveLock MODE
+         */
+        if (!(lockstmt->mode == AccessShareLock || lockstmt->mode == AccessExclusiveLock))
+            PreventCommandDuringRecovery();
+
         if (recurse)
             children_and_self = find_all_inheritors(reloid);
         else
diff --git a/src/backend/commands/sequence.c b/src/backend/commands/sequence.c
index 2f17805..1812edd 100644
--- a/src/backend/commands/sequence.c
+++ b/src/backend/commands/sequence.c
@@ -457,6 +457,8 @@ nextval_internal(Oid relid)
                 rescnt = 0;
     bool        logit = false;

+    PreventCommandDuringRecovery();
+
     /* open and AccessShareLock sequence */
     init_sequence(relid, &elm, &seqrel);

@@ -1342,6 +1344,11 @@ seq_redo(XLogRecPtr lsn, XLogRecord *record)
     /* Backup blocks are not used in seq records */
     Assert(!(record->xl_info & XLR_BKP_BLOCK_MASK));

+    if (InArchiveRecovery)
+        (void) RecordKnownAssignedTransactionIds(lsn, record->xl_xid);
+
+    RestoreBkpBlocks(lsn, record, false);
+
     if (info != XLOG_SEQ_LOG)
         elog(PANIC, "seq_redo: unknown op code %u", info);

diff --git a/src/backend/commands/tablespace.c b/src/backend/commands/tablespace.c
index 75f772f..b298148 100644
--- a/src/backend/commands/tablespace.c
+++ b/src/backend/commands/tablespace.c
@@ -51,6 +51,7 @@
 #include "access/heapam.h"
 #include "access/sysattr.h"
 #include "access/xact.h"
+#include "access/transam.h"
 #include "catalog/catalog.h"
 #include "catalog/dependency.h"
 #include "catalog/indexing.h"
@@ -60,10 +61,12 @@
 #include "miscadmin.h"
 #include "postmaster/bgwriter.h"
 #include "storage/fd.h"
+#include "storage/procarray.h"
 #include "utils/acl.h"
 #include "utils/builtins.h"
 #include "utils/fmgroids.h"
 #include "utils/guc.h"
+#include "utils/inval.h"
 #include "utils/lsyscache.h"
 #include "utils/memutils.h"
 #include "utils/rel.h"
@@ -1285,6 +1288,14 @@ tblspc_redo(XLogRecPtr lsn, XLogRecord *record)
         char       *location = xlrec->ts_path;
         char       *linkloc;

+        if (InArchiveRecovery)
+        {
+            /*
+             * No conflict resolution is required for a create database record
+             */
+            (void) RecordKnownAssignedTransactionIds(lsn, record->xl_xid);
+        }
+
         /*
          * Attempt to coerce target directory to safe permissions.    If this
          * fails, it doesn't exist or has the wrong owner.
@@ -1316,12 +1327,70 @@ tblspc_redo(XLogRecPtr lsn, XLogRecord *record)
     else if (info == XLOG_TBLSPC_DROP)
     {
         xl_tblspc_drop_rec *xlrec = (xl_tblspc_drop_rec *) XLogRecGetData(record);
+        bool                process_conflicts = false;

+        /*
+         * Process recovery transaction information
+         */
+        if (InArchiveRecovery)
+            process_conflicts = RecordKnownAssignedTransactionIds(lsn,
+                                                    record->xl_xid);
+        /*
+         * If we issued a WAL record for a drop tablespace it is
+         * because there were no files in it at all. That means that
+         * no permanent objects can exist in it at this point.
+         *
+         * It is possible for standby users to be using this tablespace
+         * as a location for their temporary files, so if we fail to
+         * remove all files then do conflict processing and try again,
+         * if currently enabled.
+         */
         if (!remove_tablespace_directories(xlrec->ts_id, true))
-            ereport(ERROR,
+        {
+            if (process_conflicts)
+            {
+                VirtualTransactionId *temp_file_users;
+
+                /*
+                 * Standby users may be currently using this tablespace for
+                 * for their temporary files. We only care about current
+                 * users because temp_tablespace parameter will just ignore
+                 * tablespaces that no longer exist.
+                 *
+                 * We can work out the pids of currently active backends using
+                 * this tablespace by examining the temp filenames in the
+                 * directory. We then convert the pids into VirtualXIDs before
+                 * attempting to cancel them.
+                 *
+                 * We don't wait for commit because drop database is
+                 * non-transactional.
+                 *
+                 * XXXHS: that's the theory, but right now we choose to nuke the
+                 * entire site from orbit, cos its the only way to be sure,
+                 * after the usual grace period.
+                 */
+                temp_file_users = GetConflictingVirtualXIDs(InvalidTransactionId,
+                                                            InvalidOid,
+                                                            InvalidOid);
+
+                ResolveRecoveryConflictWithVirtualXIDs(temp_file_users,
+                                                        "drop tablespace",
+                                                        ERROR,
+                                                        InvalidXLogRecPtr);
+            }
+
+            /*
+             * If we did recovery processing then hopefully the
+             * backends who wrote temp files should have cleaned up and
+             * exited by now. So lets recheck before we throw an error.
+             * If !process_conflicts then this will just fail again.
+             */
+            if (!remove_tablespace_directories(xlrec->ts_id, true))
+                ereport(ERROR,
                     (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
                      errmsg("tablespace %u is not empty",
-                            xlrec->ts_id)));
+                                    xlrec->ts_id)));
+        }
     }
     else
         elog(PANIC, "tblspc_redo: unknown op code %u", info);
diff --git a/src/backend/commands/vacuum.c b/src/backend/commands/vacuum.c
index 9b46c85..1599506 100644
--- a/src/backend/commands/vacuum.c
+++ b/src/backend/commands/vacuum.c
@@ -141,6 +141,7 @@ typedef struct VRelStats
     /* vtlinks array for tuple chain following - sorted by new_tid */
     int            num_vtlinks;
     VTupleLink    vtlinks;
+    TransactionId    latestRemovedXid;
 } VRelStats;

 /*----------------------------------------------------------------------
@@ -224,7 +225,7 @@ static void scan_heap(VRelStats *vacrelstats, Relation onerel,
 static void repair_frag(VRelStats *vacrelstats, Relation onerel,
             VacPageList vacuum_pages, VacPageList fraged_pages,
             int nindexes, Relation *Irel);
-static void move_chain_tuple(Relation rel,
+static void move_chain_tuple(VRelStats *vacrelstats, Relation rel,
                  Buffer old_buf, Page old_page, HeapTuple old_tup,
                  Buffer dst_buf, Page dst_page, VacPage dst_vacpage,
                  ExecContext ec, ItemPointer ctid, bool cleanVpd);
@@ -237,7 +238,7 @@ static void update_hint_bits(Relation rel, VacPageList fraged_pages,
                  int num_moved);
 static void vacuum_heap(VRelStats *vacrelstats, Relation onerel,
             VacPageList vacpagelist);
-static void vacuum_page(Relation onerel, Buffer buffer, VacPage vacpage);
+static void vacuum_page(VRelStats *vacrelstats, Relation onerel, Buffer buffer, VacPage vacpage);
 static void vacuum_index(VacPageList vacpagelist, Relation indrel,
              double num_tuples, int keep_tuples);
 static void scan_index(Relation indrel, double num_tuples);
@@ -1271,6 +1272,7 @@ full_vacuum_rel(Relation onerel, VacuumStmt *vacstmt)
     vacrelstats->rel_tuples = 0;
     vacrelstats->rel_indexed_tuples = 0;
     vacrelstats->hasindex = false;
+    vacrelstats->latestRemovedXid = InvalidTransactionId;

     /* scan the heap */
     vacuum_pages.num_pages = fraged_pages.num_pages = 0;
@@ -1674,6 +1676,9 @@ scan_heap(VRelStats *vacrelstats, Relation onerel,
             {
                 ItemId        lpp;

+                HeapTupleHeaderAdvanceLatestRemovedXid(tuple.t_data,
+                                            &vacrelstats->latestRemovedXid);
+
                 /*
                  * Here we are building a temporary copy of the page with dead
                  * tuples removed.    Below we will apply
@@ -1987,7 +1992,7 @@ repair_frag(VRelStats *vacrelstats, Relation onerel,
                 /* there are dead tuples on this page - clean them */
                 Assert(!isempty);
                 LockBuffer(buf, BUFFER_LOCK_EXCLUSIVE);
-                vacuum_page(onerel, buf, last_vacuum_page);
+                vacuum_page(vacrelstats, onerel, buf, last_vacuum_page);
                 LockBuffer(buf, BUFFER_LOCK_UNLOCK);
             }
             else
@@ -2476,7 +2481,7 @@ repair_frag(VRelStats *vacrelstats, Relation onerel,
                     tuple.t_data = (HeapTupleHeader) PageGetItem(Cpage, Citemid);
                     tuple_len = tuple.t_len = ItemIdGetLength(Citemid);

-                    move_chain_tuple(onerel, Cbuf, Cpage, &tuple,
+                    move_chain_tuple(vacrelstats, onerel, Cbuf, Cpage, &tuple,
                                      dst_buffer, dst_page, destvacpage,
                                      &ec, &Ctid, vtmove[ti].cleanVpd);

@@ -2562,7 +2567,7 @@ repair_frag(VRelStats *vacrelstats, Relation onerel,
                 dst_page = BufferGetPage(dst_buffer);
                 /* if this page was not used before - clean it */
                 if (!PageIsEmpty(dst_page) && dst_vacpage->offsets_used == 0)
-                    vacuum_page(onerel, dst_buffer, dst_vacpage);
+                    vacuum_page(vacrelstats, onerel, dst_buffer, dst_vacpage);
             }
             else
                 LockBuffer(dst_buffer, BUFFER_LOCK_EXCLUSIVE);
@@ -2739,7 +2744,7 @@ repair_frag(VRelStats *vacrelstats, Relation onerel,
             LockBuffer(buf, BUFFER_LOCK_EXCLUSIVE);
             page = BufferGetPage(buf);
             if (!PageIsEmpty(page))
-                vacuum_page(onerel, buf, *curpage);
+                vacuum_page(vacrelstats, onerel, buf, *curpage);
             UnlockReleaseBuffer(buf);
         }
     }
@@ -2875,7 +2880,7 @@ repair_frag(VRelStats *vacrelstats, Relation onerel,
                 recptr = log_heap_clean(onerel, buf,
                                         NULL, 0, NULL, 0,
                                         unused, uncnt,
-                                        false);
+                                        vacrelstats->latestRemovedXid, false);
                 PageSetLSN(page, recptr);
                 PageSetTLI(page, ThisTimeLineID);
             }
@@ -2925,7 +2930,7 @@ repair_frag(VRelStats *vacrelstats, Relation onerel,
  *        already too long and almost unreadable.
  */
 static void
-move_chain_tuple(Relation rel,
+move_chain_tuple(VRelStats *vacrelstats, Relation rel,
                  Buffer old_buf, Page old_page, HeapTuple old_tup,
                  Buffer dst_buf, Page dst_page, VacPage dst_vacpage,
                  ExecContext ec, ItemPointer ctid, bool cleanVpd)
@@ -2981,7 +2986,7 @@ move_chain_tuple(Relation rel,
         int            sv_offsets_used = dst_vacpage->offsets_used;

         dst_vacpage->offsets_used = 0;
-        vacuum_page(rel, dst_buf, dst_vacpage);
+        vacuum_page(vacrelstats, rel, dst_buf, dst_vacpage);
         dst_vacpage->offsets_used = sv_offsets_used;
     }

@@ -3305,7 +3310,7 @@ vacuum_heap(VRelStats *vacrelstats, Relation onerel, VacPageList vacuum_pages)
             buf = ReadBufferExtended(onerel, MAIN_FORKNUM, (*vacpage)->blkno,
                                      RBM_NORMAL, vac_strategy);
             LockBuffer(buf, BUFFER_LOCK_EXCLUSIVE);
-            vacuum_page(onerel, buf, *vacpage);
+            vacuum_page(vacrelstats, onerel, buf, *vacpage);
             UnlockReleaseBuffer(buf);
         }
     }
@@ -3335,7 +3340,7 @@ vacuum_heap(VRelStats *vacrelstats, Relation onerel, VacPageList vacuum_pages)
  * Caller must hold pin and lock on buffer.
  */
 static void
-vacuum_page(Relation onerel, Buffer buffer, VacPage vacpage)
+vacuum_page(VRelStats *vacrelstats, Relation onerel, Buffer buffer, VacPage vacpage)
 {
     Page        page = BufferGetPage(buffer);
     int            i;
@@ -3364,7 +3369,7 @@ vacuum_page(Relation onerel, Buffer buffer, VacPage vacpage)
         recptr = log_heap_clean(onerel, buffer,
                                 NULL, 0, NULL, 0,
                                 vacpage->offsets, vacpage->offsets_free,
-                                false);
+                                vacrelstats->latestRemovedXid, false);
         PageSetLSN(page, recptr);
         PageSetTLI(page, ThisTimeLineID);
     }
diff --git a/src/backend/commands/vacuumlazy.c b/src/backend/commands/vacuumlazy.c
index 59c02e2..a48c51a 100644
--- a/src/backend/commands/vacuumlazy.c
+++ b/src/backend/commands/vacuumlazy.c
@@ -97,6 +97,7 @@ typedef struct LVRelStats
     ItemPointer dead_tuples;    /* array of ItemPointerData */
     int            num_index_scans;
     bool        scanned_all;    /* have we scanned all pages (this far)? */
+    TransactionId latestRemovedXid;
 } LVRelStats;


@@ -246,6 +247,36 @@ lazy_vacuum_rel(Relation onerel, VacuumStmt *vacstmt,
         *scanned_all = vacrelstats->scanned_all;
 }

+/*
+ * For Hot Standby we need to know the highest transaction id that will
+ * be removed by any change. VACUUM proceeds in a number of passes so
+ * we need to consider how each pass operates. The first pass runs
+ * heap_page_prune(), which can issue XLOG_HEAP2_CLEAN records as it
+ * progresses - these will have a latestRemovedXid on each record.
+ * In many cases this removes all of the tuples to be removed.
+ * Then we look at tuples to be removed, but do not actually remove them
+ * until phase three. However, index records for those rows are removed
+ * in phase two and index blocks do not have MVCC information attached.
+ * So before we can allow removal of *any* index tuples we need to issue
+ * a WAL record indicating what the latestRemovedXid will be at the end
+ * of phase three. This then allows Hot Standby queries to block at the
+ * correct place, i.e. before phase two, rather than during phase three
+ * as we issue more XLOG_HEAP2_CLEAN records. If we need to run multiple
+ * phase two/three because of memory constraints we need to issue multiple
+ * log records also.
+ */
+static void
+vacuum_log_cleanup_info(Relation rel, LVRelStats *vacrelstats)
+{
+    /*
+     * No need to log changes for temp tables, they do not contain
+     * data visible on the standby server.
+     */
+    if (rel->rd_istemp)
+        return;
+
+    (void) log_heap_cleanup_info(rel->rd_node, vacrelstats->latestRemovedXid);
+}

 /*
  *    lazy_scan_heap() -- scan an open heap relation
@@ -296,6 +327,7 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
     nblocks = RelationGetNumberOfBlocks(onerel);
     vacrelstats->rel_pages = nblocks;
     vacrelstats->nonempty_pages = 0;
+    vacrelstats->latestRemovedXid = InvalidTransactionId;

     lazy_space_alloc(vacrelstats, nblocks);

@@ -354,6 +386,9 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
         if ((vacrelstats->max_dead_tuples - vacrelstats->num_dead_tuples) < MaxHeapTuplesPerPage &&
             vacrelstats->num_dead_tuples > 0)
         {
+            /* Log cleanup info before we touch indexes */
+            vacuum_log_cleanup_info(onerel, vacrelstats);
+
             /* Remove index entries */
             for (i = 0; i < nindexes; i++)
                 lazy_vacuum_index(Irel[i],
@@ -363,6 +398,7 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
             lazy_vacuum_heap(onerel, vacrelstats);
             /* Forget the now-vacuumed tuples, and press on */
             vacrelstats->num_dead_tuples = 0;
+            vacrelstats->latestRemovedXid = InvalidTransactionId;
             vacrelstats->num_index_scans++;
         }

@@ -593,6 +629,8 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
             if (tupgone)
             {
                 lazy_record_dead_tuple(vacrelstats, &(tuple.t_self));
+                HeapTupleHeaderAdvanceLatestRemovedXid(tuple.t_data,
+                                                &vacrelstats->latestRemovedXid);
                 tups_vacuumed += 1;
             }
             else
@@ -641,6 +679,7 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
             lazy_vacuum_page(onerel, blkno, buf, 0, vacrelstats);
             /* Forget the now-vacuumed tuples, and press on */
             vacrelstats->num_dead_tuples = 0;
+            vacrelstats->latestRemovedXid = InvalidTransactionId;
             vacuumed_pages++;
         }

@@ -703,6 +742,9 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
     /* XXX put a threshold on min number of tuples here? */
     if (vacrelstats->num_dead_tuples > 0)
     {
+        /* Log cleanup info before we touch indexes */
+        vacuum_log_cleanup_info(onerel, vacrelstats);
+
         /* Remove index entries */
         for (i = 0; i < nindexes; i++)
             lazy_vacuum_index(Irel[i],
@@ -847,7 +889,7 @@ lazy_vacuum_page(Relation onerel, BlockNumber blkno, Buffer buffer,
         recptr = log_heap_clean(onerel, buffer,
                                 NULL, 0, NULL, 0,
                                 unused, uncnt,
-                                false);
+                                vacrelstats->latestRemovedXid, false);
         PageSetLSN(page, recptr);
         PageSetTLI(page, ThisTimeLineID);
     }
diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c
index 90b2ad7..cff40cf 100644
--- a/src/backend/postmaster/postmaster.c
+++ b/src/backend/postmaster/postmaster.c
@@ -225,15 +225,7 @@ static pid_t StartupPID = 0,
 static int    Shutdown = NoShutdown;

 static bool FatalError = false; /* T if recovering from backend crash */
-static bool RecoveryError = false; /* T if recovery failed */
-
-/* State of WAL redo */
-#define            NoRecovery            0
-#define            RecoveryStarted        1
-#define            RecoveryConsistent    2
-#define            RecoveryCompleted    3
-
-static int    RecoveryStatus = NoRecovery;
+static bool RecoveryError = false; /* T if WAL recovery failed */

 /*
  * We use a simple state machine to control startup, shutdown, and
@@ -252,13 +244,14 @@ static int    RecoveryStatus = NoRecovery;
  * could start accepting connections to perform read-only queries at this
  * point, if we had the infrastructure to do that.
  *
- * When the WAL redo is finished, the startup process signals us the third
- * time, and we switch to PM_RUN state. The startup process can also skip the
+ * When WAL redo is finished, the startup process exits with exit code 0
+ * and we switch to PM_RUN state. Startup process can also skip the
  * recovery and consistent recovery phases altogether, as it will during
  * normal startup when there's no recovery to be done, for example.
  *
- * Normal child backends can only be launched when we are in PM_RUN state.
- * (We also allow it in PM_WAIT_BACKUP state, but only for superusers.)
+ * Normal child backends can only be launched when we are in PM_RUN or
+ * PM_RECOVERY_CONSISTENT state.  (We also allow launch of normal
+ * child backends in PM_WAIT_BACKUP state, but only for superusers.)
  * In other states we handle connection requests by launching "dead_end"
  * child processes, which will simply send the client an error message and
  * quit.  (We track these in the BackendList so that we can know when they
@@ -338,7 +331,6 @@ static void pmdie(SIGNAL_ARGS);
 static void reaper(SIGNAL_ARGS);
 static void sigusr1_handler(SIGNAL_ARGS);
 static void dummy_handler(SIGNAL_ARGS);
-static void CheckRecoverySignals(void);
 static void CleanupBackend(int pid, int exitstatus);
 static void HandleChildCrash(int pid, int exitstatus, const char *procname);
 static void LogChildExit(int lev, const char *procname,
@@ -1685,11 +1677,6 @@ retry1:
                     (errcode(ERRCODE_CANNOT_CONNECT_NOW),
                      errmsg("the database system is shutting down")));
             break;
-        case CAC_RECOVERY:
-            ereport(FATAL,
-                    (errcode(ERRCODE_CANNOT_CONNECT_NOW),
-                     errmsg("the database system is in recovery mode")));
-            break;
         case CAC_TOOMANY:
             ereport(FATAL,
                     (errcode(ERRCODE_TOO_MANY_CONNECTIONS),
@@ -1698,6 +1685,7 @@ retry1:
         case CAC_WAITBACKUP:
             /* OK for now, will check in InitPostgres */
             break;
+        case CAC_RECOVERY:
         case CAC_OK:
             break;
     }
@@ -1774,7 +1762,7 @@ static enum CAC_state
 canAcceptConnections(void)
 {
     /*
-     * Can't start backends when in startup/shutdown/recovery state.
+     * Can't start backends when in startup/shutdown/inconsistent recovery state.
      *
      * In state PM_WAIT_BACKUP only superusers can connect (this must be
      * allowed so that a superuser can end online backup mode); we return
@@ -1788,9 +1776,11 @@ canAcceptConnections(void)
             return CAC_SHUTDOWN;    /* shutdown is pending */
         if (!FatalError &&
             (pmState == PM_STARTUP ||
-             pmState == PM_RECOVERY ||
-             pmState == PM_RECOVERY_CONSISTENT))
+             pmState == PM_RECOVERY))
             return CAC_STARTUP; /* normal startup */
+        if (!FatalError &&
+             pmState == PM_RECOVERY_CONSISTENT)
+            return CAC_OK; /* connection OK during recovery */
         return CAC_RECOVERY;    /* else must be crash recovery */
     }

@@ -2019,10 +2009,12 @@ pmdie(SIGNAL_ARGS)
             ereport(LOG,
                     (errmsg("received smart shutdown request")));

-            if (pmState == PM_RUN || pmState == PM_RECOVERY || pmState == PM_RECOVERY_CONSISTENT)
+            if (pmState == PM_RUN || pmState == PM_RECOVERY ||
+                pmState == PM_RECOVERY_CONSISTENT)
             {
                 /* autovacuum workers are told to shut down immediately */
-                SignalAutovacWorkers(SIGTERM);
+                if (pmState == PM_RUN)
+                    SignalAutovacWorkers(SIGTERM);
                 /* and the autovac launcher too */
                 if (AutoVacPID != 0)
                     signal_child(AutoVacPID, SIGTERM);
@@ -2162,28 +2154,30 @@ reaper(SIGNAL_ARGS)
             StartupPID = 0;

             /*
-             * Check if we've received a signal from the startup process
-             * first. This can change pmState. If the startup process sends
-             * a signal and exits immediately after that, we might not have
-             * processed the signal yet. We need to know if it completed
-             * recovery before it exited.
-             */
-            CheckRecoverySignals();
-
-            /*
              * Unexpected exit of startup process (including FATAL exit)
              * during PM_STARTUP is treated as catastrophic. There is no
-             * other processes running yet.
+             * other processes running yet, so we can just exit.
              */
-            if (pmState == PM_STARTUP)
+            if (pmState == PM_STARTUP && !EXIT_STATUS_0(exitstatus))
             {
                 LogChildExit(LOG, _("startup process"),
                              pid, exitstatus);
                 ereport(LOG,
-                (errmsg("aborting startup due to startup process failure")));
+                        (errmsg("aborting startup due to startup process failure")));
                 ExitPostmaster(1);
             }
             /*
+             * Startup process exited in response to a shutdown request (or
+             * it completed normally regardless of the shutdown request).
+             */
+            if (Shutdown > NoShutdown &&
+                (EXIT_STATUS_0(exitstatus) || EXIT_STATUS_1(exitstatus)))
+            {
+                pmState = PM_WAIT_BACKENDS;
+                /* PostmasterStateMachine logic does the rest */
+                continue;
+            }
+            /*
              * Any unexpected exit (including FATAL exit) of the startup
              * process is treated as a crash, except that we don't want
              * to reinitialize.
@@ -2195,18 +2189,46 @@ reaper(SIGNAL_ARGS)
                                  _("startup process"));
                 continue;
             }
+
             /*
-             * Startup process exited normally, but didn't finish recovery.
-             * This can happen if someone else than postmaster kills the
-             * startup process with SIGTERM. Treat it like a crash.
+             * Startup succeeded, commence normal operations
              */
-            if (pmState == PM_RECOVERY || pmState == PM_RECOVERY_CONSISTENT)
-            {
-                RecoveryError = true;
-                HandleChildCrash(pid, exitstatus,
-                                 _("startup process"));
-                continue;
-            }
+            FatalError = false;
+            pmState = PM_RUN;
+
+            /*
+             * Load the flat authorization file into postmaster's cache. The
+             * startup process has recomputed this from the database contents,
+             * so we wait till it finishes before loading it.
+             */
+            load_role();
+
+            /*
+             * Crank up the background writer, if we didn't do that already
+             * when we entered consistent recovery phase.  It doesn't matter
+             * if this fails, we'll just try again later.
+             */
+            if (BgWriterPID == 0)
+                BgWriterPID = StartBackgroundWriter();
+
+            /*
+             * Likewise, start other special children as needed.  In a restart
+             * situation, some of them may be alive already.
+             */
+            if (WalWriterPID == 0)
+                WalWriterPID = StartWalWriter();
+            if (AutoVacuumingActive() && AutoVacPID == 0)
+                AutoVacPID = StartAutoVacLauncher();
+            if (XLogArchivingActive() && PgArchPID == 0)
+                PgArchPID = pgarch_start();
+            if (PgStatPID == 0)
+                PgStatPID = pgstat_start();
+
+            /* at this point we are really open for business */
+            ereport(LOG,
+                 (errmsg("database system is ready to accept connections")));
+
+            continue;
         }

         /*
@@ -2622,124 +2644,6 @@ LogChildExit(int lev, const char *procname, int pid, int exitstatus)
 static void
 PostmasterStateMachine(void)
 {
-    /* Startup states */
-
-    if (pmState == PM_STARTUP && RecoveryStatus > NoRecovery)
-    {
-        /* WAL redo has started. We're out of reinitialization. */
-        FatalError = false;
-
-        /*
-         * Go to shutdown mode if a shutdown request was pending.
-         */
-        if (Shutdown > NoShutdown)
-        {
-            pmState = PM_WAIT_BACKENDS;
-            /* PostmasterStateMachine logic does the rest */
-        }
-        else
-        {
-            /*
-             * Crank up the background writer.    It doesn't matter if this
-             * fails, we'll just try again later.
-             */
-            Assert(BgWriterPID == 0);
-            BgWriterPID = StartBackgroundWriter();
-
-            pmState = PM_RECOVERY;
-        }
-    }
-    if (pmState == PM_RECOVERY && RecoveryStatus >= RecoveryConsistent)
-    {
-        /*
-         * Recovery has reached a consistent recovery point. Go to shutdown
-         * mode if a shutdown request was pending.
-         */
-        if (Shutdown > NoShutdown)
-        {
-            pmState = PM_WAIT_BACKENDS;
-            /* PostmasterStateMachine logic does the rest */
-        }
-        else
-        {
-            pmState = PM_RECOVERY_CONSISTENT;
-
-            /*
-             * Load the flat authorization file into postmaster's cache. The
-             * startup process won't have recomputed this from the database yet,
-             * so we it may change following recovery.
-             */
-            load_role();
-
-            /*
-             * Likewise, start other special children as needed.
-             */
-            Assert(PgStatPID == 0);
-            PgStatPID = pgstat_start();
-
-            /* XXX at this point we could accept read-only connections */
-            ereport(DEBUG1,
-                 (errmsg("database system is in consistent recovery mode")));
-        }
-    }
-    if ((pmState == PM_RECOVERY ||
-         pmState == PM_RECOVERY_CONSISTENT ||
-         pmState == PM_STARTUP) &&
-        RecoveryStatus == RecoveryCompleted)
-    {
-        /*
-         * Startup succeeded.
-         *
-         * Go to shutdown mode if a shutdown request was pending.
-         */
-        if (Shutdown > NoShutdown)
-        {
-            pmState = PM_WAIT_BACKENDS;
-            /* PostmasterStateMachine logic does the rest */
-        }
-        else
-        {
-            /*
-             * Otherwise, commence normal operations.
-             */
-            pmState = PM_RUN;
-
-            /*
-             * Load the flat authorization file into postmaster's cache. The
-             * startup process has recomputed this from the database contents,
-             * so we wait till it finishes before loading it.
-             */
-            load_role();
-
-            /*
-             * Crank up the background writer, if we didn't do that already
-             * when we entered consistent recovery phase.  It doesn't matter
-             * if this fails, we'll just try again later.
-             */
-            if (BgWriterPID == 0)
-                BgWriterPID = StartBackgroundWriter();
-
-            /*
-             * Likewise, start other special children as needed.  In a restart
-             * situation, some of them may be alive already.
-             */
-            if (WalWriterPID == 0)
-                WalWriterPID = StartWalWriter();
-            if (AutoVacuumingActive() && AutoVacPID == 0)
-                AutoVacPID = StartAutoVacLauncher();
-            if (XLogArchivingActive() && PgArchPID == 0)
-                PgArchPID = pgarch_start();
-            if (PgStatPID == 0)
-                PgStatPID = pgstat_start();
-
-            /* at this point we are really open for business */
-            ereport(LOG,
-                (errmsg("database system is ready to accept connections")));
-        }
-    }
-
-    /* Shutdown states */
-
     if (pmState == PM_WAIT_BACKUP)
     {
         /*
@@ -2901,8 +2805,6 @@ PostmasterStateMachine(void)
         shmem_exit(1);
         reset_shared(PostPortNumber);

-        RecoveryStatus = NoRecovery;
-
         StartupPID = StartupDataBase();
         Assert(StartupPID != 0);
         pmState = PM_STARTUP;
@@ -3067,7 +2969,8 @@ BackendStartup(Port *port)
     bn->pid = pid;
     bn->cancel_key = MyCancelKey;
     bn->is_autovacuum = false;
-    bn->dead_end = (port->canAcceptConnections != CAC_OK &&
+    bn->dead_end = (!(port->canAcceptConnections == CAC_RECOVERY ||
+                      port->canAcceptConnections == CAC_OK) &&
                     port->canAcceptConnections != CAC_WAITBACKUP);
     DLAddHead(BackendList, DLNewElem(bn));
 #ifdef EXEC_BACKEND
@@ -4007,47 +3910,58 @@ ExitPostmaster(int status)
 }

 /*
- * common code used in sigusr1_handler() and reaper() to handle
- * recovery-related signals from startup process
+ * sigusr1_handler - handle signal conditions from child processes
  */
 static void
-CheckRecoverySignals(void)
+sigusr1_handler(SIGNAL_ARGS)
 {
-    bool changed = false;
+    int            save_errno = errno;

-    if (CheckPostmasterSignal(PMSIGNAL_RECOVERY_STARTED))
-    {
-        Assert(pmState == PM_STARTUP);
+    PG_SETMASK(&BlockSig);

-        RecoveryStatus = RecoveryStarted;
-        changed = true;
-    }
-    if (CheckPostmasterSignal(PMSIGNAL_RECOVERY_CONSISTENT))
+    /*
+     * RECOVERY_STARTED and RECOVERY_CONSISTENT signals are ignored in
+     * unexpected states. If the startup process quickly starts up, completes
+     * recovery, exits, we might process the death of the startup process
+     * first. We don't want to go back to recovery in that case.
+     */
+    if (CheckPostmasterSignal(PMSIGNAL_RECOVERY_STARTED) &&
+        pmState == PM_STARTUP)
     {
-        RecoveryStatus = RecoveryConsistent;
-        changed = true;
+        /* WAL redo has started. We're out of reinitialization. */
+        FatalError = false;
+
+        /*
+         * Crank up the background writer.    It doesn't matter if this
+         * fails, we'll just try again later.
+         */
+        Assert(BgWriterPID == 0);
+        BgWriterPID = StartBackgroundWriter();
+
+        pmState = PM_RECOVERY;
     }
-    if (CheckPostmasterSignal(PMSIGNAL_RECOVERY_COMPLETED))
+    if (CheckPostmasterSignal(PMSIGNAL_RECOVERY_CONSISTENT) &&
+        pmState == PM_RECOVERY)
     {
-        RecoveryStatus = RecoveryCompleted;
-        changed = true;
-    }
-
-    if (changed)
-        PostmasterStateMachine();
-}
+        /*
+         * Load the flat authorization file into postmaster's cache. The
+         * startup process won't have recomputed this from the database yet,
+         * so we it may change following recovery.
+         */
+        load_role();

-/*
- * sigusr1_handler - handle signal conditions from child processes
- */
-static void
-sigusr1_handler(SIGNAL_ARGS)
-{
-    int            save_errno = errno;
+        /*
+         * Likewise, start other special children as needed.
+         */
+        Assert(PgStatPID == 0);
+        PgStatPID = pgstat_start();

-    PG_SETMASK(&BlockSig);
+        /* XXX at this point we could accept read-only connections */
+        ereport(DEBUG1,
+                (errmsg("database system is in consistent recovery mode")));

-    CheckRecoverySignals();
+        pmState = PM_RECOVERY_CONSISTENT;
+    }

     if (CheckPostmasterSignal(PMSIGNAL_PASSWORD_CHANGE))
     {
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index bd053d5..6ee9b4e 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -33,6 +33,9 @@
 #include <sys/file.h>
 #include <unistd.h>

+#include "access/transam.h"
+#include "access/xact.h"
+#include "access/xlogdefs.h"
 #include "catalog/catalog.h"
 #include "miscadmin.h"
 #include "pg_trace.h"
@@ -78,7 +81,9 @@ static bool IsForInput;

 /* local state for LockBufferForCleanup */
 static volatile BufferDesc *PinCountWaitBuf = NULL;
-
+static long        CleanupWaitSecs = 0;
+static int        CleanupWaitUSecs = 0;
+static bool        CleanupWaitStats = false;

 static Buffer ReadBuffer_common(SMgrRelation reln, bool isLocalBuf,
                     ForkNumber forkNum, BlockNumber blockNum,
@@ -100,7 +105,8 @@ static volatile BufferDesc *BufferAlloc(SMgrRelation smgr, ForkNumber forkNum,
             bool *foundPtr);
 static void FlushBuffer(volatile BufferDesc *buf, SMgrRelation reln);
 static void AtProcExit_Buffers(int code, Datum arg);
-
+static void BufferProcessRecoveryConflicts(volatile  BufferDesc *bufHdr);
+

 /*
  * PrefetchBuffer -- initiate asynchronous read of a block of a relation
@@ -1580,6 +1586,82 @@ SyncOneBuffer(int buf_id, bool skip_recently_used)
     return result | BUF_WRITTEN;
 }

+#define BufferProcessRecoveryConflictsIfAny(b) \
+{ \
+    if (MyProc->rconflicts.nConflicts > 0) \
+        BufferProcessRecoveryConflicts(b); \
+}
+
+/*
+ * BufferProcessRecoveryConflicts -- cancels recovery query, if required
+ *
+ * We could do an "if in recovery" test here, but there is no need. We don't
+ * set RecoveryConflicts unless we're in recovery.
+ *
+ * Called on locked buffer, lock held at release
+ */
+static void
+BufferProcessRecoveryConflicts(volatile BufferDesc *bufHdr)
+{
+    /*
+     * We already have the buffer locked, so just check nConflicts without
+     * acquiring spinlock for speed. It won't be possible for recovery
+     * to cleanup this buffer until we are finished with it, so any
+     * concurrent changes to the RecoveryConflictCache can be ignored until
+     * the next time we are here. To ensure we get this right, always fetch
+     * nConflicts first, then don't fetch again while looking at other vars.
+     */
+    XLogRecPtr    bufLSN;
+    int            nConflicts = MyProc->rconflicts.nConflicts;
+
+    Assert(nConflicts > 0);
+
+    /*
+     * If we have a non-overflowed conflict cache, check it.
+     */
+    if (nConflicts <= PGPROC_MAX_CACHED_CONFLICT_RELS)
+    {
+        int        i;
+        bool    found = false;
+
+        /*
+         * Search cache to see if it is already listed.
+         */
+        for (i = 0; i < nConflicts; i++)
+        {
+            if (bufHdr->tag.rnode.relNode == MyProc->rconflicts.rels[i])
+                found = true;
+        }
+
+        /*
+         * If we have not overflowed and the current rel isn't in the cache
+         * then there is no conflict and we can exit.
+         */
+        if (!found)
+            return;
+    }
+
+    /*
+     * If the buffer is recent we may need to cancel ourselves
+     * rather than risk returning a wrong answer. This test is
+     * too conservative, but that's OK. The correct LSN would be
+     * the buffer's latestCleanedLSN rather than latestModifiedLSN,
+     * but that isn't recorded anywhere.
+     *
+     * We only need to cancel the current subtransaction.
+     * Once we've handled the error then other subtransactions can
+     * continue processing. Note that we do *not* reset the
+     * BufferRecoveryConflictLSN at subcommit/abort, but we do
+     * reset it if we release our last remaining sbapshot.
+     * see SnapshotResetXmin()
+     */
+    bufLSN = BufferGetLSN(bufHdr);
+    if (XLByteLE(bufLSN, MyProc->rconflicts.lsn))
+        ereport(ERROR,
+            (errcode(IsXactIsoLevelSerializable ? ERRCODE_T_R_SERIALIZATION_FAILURE
+                                                : ERRCODE_QUERY_CANCELED),
+             errmsg("canceling statement due to recent buffer changes during recovery")));
+}

 /*
  * Return a palloc'd string containing buffer usage statistics.
@@ -2338,9 +2420,15 @@ LockBuffer(Buffer buffer, int mode)
     if (mode == BUFFER_LOCK_UNLOCK)
         LWLockRelease(buf->content_lock);
     else if (mode == BUFFER_LOCK_SHARE)
+    {
         LWLockAcquire(buf->content_lock, LW_SHARED);
+        BufferProcessRecoveryConflictsIfAny(buf);
+    }
     else if (mode == BUFFER_LOCK_EXCLUSIVE)
+    {
         LWLockAcquire(buf->content_lock, LW_EXCLUSIVE);
+        BufferProcessRecoveryConflictsIfAny(buf);
+    }
     else
         elog(ERROR, "unrecognized buffer lock mode: %d", mode);
 }
@@ -2361,7 +2449,62 @@ ConditionalLockBuffer(Buffer buffer)

     buf = &(BufferDescriptors[buffer - 1]);

-    return LWLockConditionalAcquire(buf->content_lock, LW_EXCLUSIVE);
+    if (LWLockConditionalAcquire(buf->content_lock, LW_EXCLUSIVE))
+    {
+        BufferProcessRecoveryConflictsIfAny(buf);
+        return true;
+    }
+
+    return false;
+}
+
+/*
+ * On standby servers only the Startup process applies Cleanup. As a result
+ * a single buffer pin can be enough to effectively halt recovery for short
+ * periods. We need special instrumentation to monitor this so we can judge
+ * whether additional measures are required to control the negative effects.
+ */
+void
+StartCleanupDelayStats(void)
+{
+    CleanupWaitSecs = 0;
+    CleanupWaitUSecs = 0;
+    CleanupWaitStats = true;
+}
+
+void
+EndCleanupDelayStats(void)
+{
+    CleanupWaitStats = false;
+}
+
+/*
+ * Called by Startup process whenever we request restartpoint
+ */
+void
+ReportCleanupDelayStats(void)
+{
+    Assert(InRecovery);
+
+    elog(trace_recovery(DEBUG2), "cleanup wait total=%ld.%03d s",
+                 CleanupWaitSecs, CleanupWaitUSecs / 1000);
+}
+
+static void
+CleanupDelayStats(TimestampTz start_ts, TimestampTz end_ts)
+{
+    long            wait_secs;
+    int                wait_usecs;
+
+    TimestampDifference(start_ts, end_ts, &wait_secs, &wait_usecs);
+
+    CleanupWaitSecs +=wait_secs;
+    CleanupWaitUSecs +=wait_usecs;
+    if (CleanupWaitUSecs > 999999)
+    {
+        CleanupWaitSecs += 1;
+        CleanupWaitUSecs -= 1000000;
+    }
 }

 /*
@@ -2407,6 +2550,8 @@ LockBufferForCleanup(Buffer buffer)

     for (;;)
     {
+        TimestampTz     start_ts = 0;
+
         /* Try to acquire lock */
         LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE);
         LockBufHdr(bufHdr);
@@ -2429,9 +2574,14 @@ LockBufferForCleanup(Buffer buffer)
         PinCountWaitBuf = bufHdr;
         UnlockBufHdr(bufHdr);
         LockBuffer(buffer, BUFFER_LOCK_UNLOCK);
+        if (CleanupWaitStats)
+            start_ts = GetCurrentTimestamp();
         /* Wait to be signaled by UnpinBuffer() */
         ProcWaitForSignal();
         PinCountWaitBuf = NULL;
+        if (CleanupWaitStats)
+            CleanupDelayStats(start_ts, GetCurrentTimestamp());
+
         /* Loop back and try again */
     }
 }
diff --git a/src/backend/storage/ipc/procarray.c b/src/backend/storage/ipc/procarray.c
index 06f8ad8..59f1ce0 100644
--- a/src/backend/storage/ipc/procarray.c
+++ b/src/backend/storage/ipc/procarray.c
@@ -17,6 +17,32 @@
  * as are the myProcLocks lists.  They can be distinguished from regular
  * backend PGPROCs at need by checking for pid == 0.
  *
+ * The process array now also includes PGPROC structures representing
+ * transactions being recovered. The xid and subxids fields of these are valid,
+ * though few other fields are.  They can be distinguished from regular backend
+ * PGPROCs by checking for pid == 0.  The proc array also has an
+ * secondary array of UnobservedXids representing transactions that are
+ * known to be running on the master but for which we do not yet have
+ * a recovery proc. We infer the existence of UnobservedXids by watching
+ * the sequence of arriving xids. This is very important because if we leave
+ * those xids out of the snapshot then they will appear to be already complete.
+ * Later, when they have actually completed this could lead to confusion as to
+ * whether those xids are visible or not, blowing a huge hole in MVCC.
+ * We need 'em.
+ *
+ * Although we have max_connections procs during recovery, they will only
+ * be used when the master is running a write transaction. Read only
+ * transactions never show up in WAL at all and it is valid to ignore them.
+ * So we would only ever use all max_connections procs is we were running
+ * a write transaction on every session at once. As a result, we may be
+ * able to continue running normally even if max_connections is set lower
+ * on the standby than on the master.
+ *
+ * It is theoretically possible for a FATAL error to explode before writing
+ * an abort record. This would then tie up a recovery proc until the next
+ * WAL record containing a valid list of running xids arrives. This is
+ * relatively unlikely, so considered both a minor and an acceptable flaw
+ * in the emulation of transactions during recovery.
  *
  * Portions Copyright (c) 1996-2009, PostgreSQL Global Development Group
  * Portions Copyright (c) 1994, Regents of the University of California
@@ -33,28 +59,47 @@

 #include "access/subtrans.h"
 #include "access/transam.h"
-#include "access/xact.h"
+#include "access/xlog.h"
 #include "access/twophase.h"
 #include "miscadmin.h"
+#include "storage/proc.h"
 #include "storage/procarray.h"
 #include "utils/snapmgr.h"

+static RunningXactsData    CurrentRunningXactsData;

 /* Our shared memory area */
 typedef struct ProcArrayStruct
 {
     int            numProcs;        /* number of valid procs entries */
-    int            maxProcs;        /* allocated size of procs array */
+    int            maxProcs;            /* allocated size of total procs array */
+
+    int            numUnobservedXids;    /* number of valid unobserved xids */
+    int            maxUnobservedXids;    /* allocated size of unobserved array */
+    /*
+     * Last subxid that overflowed unobserved xids array. Similar to
+     * overflowing subxid cached in PGPROC entries.
+     */
+    TransactionId    lastOverflowedUnobservedXid;
+
+    bool        allowStandbySnapshots;    /* can queries take snapshots? */

     /*
      * We declare procs[] as 1 entry because C wants a fixed-size array, but
      * actually it is maxProcs entries long.
      */
     PGPROC       *procs[1];        /* VARIABLE LENGTH ARRAY */
+
+    /* ARRAY OF UNOBSERVED TRANSACTION XIDs FOLLOWS */
 } ProcArrayStruct;

 static ProcArrayStruct *procArray;

+/*
+ * Bookkeeping for tracking emulated transactions in Recovery Procs.
+ */
+static TransactionId    latestObservedXid = InvalidTransactionId;
+static bool                RunningXactIsValid = false;

 #ifdef XIDCACHE_DEBUG

@@ -100,8 +145,16 @@ ProcArrayShmemSize(void)
     Size        size;

     size = offsetof(ProcArrayStruct, procs);
-    size = add_size(size, mul_size(sizeof(PGPROC *),
-                                 add_size(MaxBackends, max_prepared_xacts)));
+
+    /* Normal processing */
+    /* MyProc slots */
+    size = add_size(size, mul_size(sizeof(PGPROC *), MaxBackends));
+    size = add_size(size, mul_size(sizeof(PGPROC *), max_prepared_xacts));
+
+    /* Recovery processing */
+
+    /* UnobservedXids */
+    size = add_size(size, mul_size(sizeof(TransactionId), 65 * MaxBackends));

     return size;
 }
@@ -123,8 +176,21 @@ CreateSharedProcArray(void)
         /*
          * We're the first - initialize.
          */
+        /* Normal processing */
         procArray->numProcs = 0;
         procArray->maxProcs = MaxBackends + max_prepared_xacts;
+
+        /* Recovery processing */
+        procArray->maxProcs += MaxBackends;
+
+        procArray->allowStandbySnapshots = false;
+
+        /*
+         * If you change this, also change ProcArrayShmemSize()
+         */
+        procArray->maxUnobservedXids = 65 * MaxBackends;
+        procArray->numUnobservedXids = 0;
+        procArray->lastOverflowedUnobservedXid = InvalidTransactionId;
     }
 }

@@ -132,11 +198,12 @@ CreateSharedProcArray(void)
  * Add the specified PGPROC to the shared array.
  */
 void
-ProcArrayAdd(PGPROC *proc)
+ProcArrayAdd(PGPROC *proc, bool need_lock)
 {
     ProcArrayStruct *arrayP = procArray;

-    LWLockAcquire(ProcArrayLock, LW_EXCLUSIVE);
+    if (need_lock)
+        LWLockAcquire(ProcArrayLock, LW_EXCLUSIVE);

     if (arrayP->numProcs >= arrayP->maxProcs)
     {
@@ -154,13 +221,15 @@ ProcArrayAdd(PGPROC *proc)
     arrayP->procs[arrayP->numProcs] = proc;
     arrayP->numProcs++;

-    LWLockRelease(ProcArrayLock);
+    if (need_lock)
+        LWLockRelease(ProcArrayLock);
 }

 /*
  * Remove the specified PGPROC from the shared array.
  *
- * When latestXid is a valid XID, we are removing a live 2PC gxact from the
+ * When latestXid is a valid XID, it is either an emulated transaction during
+ * recovery or removing a live 2PC gxact that we wish to remove from the
  * array, and thus causing it to appear as "not running" anymore.  In this
  * case we must advance latestCompletedXid.  (This is essentially the same
  * as ProcArrayEndTransaction followed by removal of the PGPROC, but we take
@@ -168,7 +237,8 @@ ProcArrayAdd(PGPROC *proc)
  * twophase.c depends on the latter.)
  */
 void
-ProcArrayRemove(PGPROC *proc, TransactionId latestXid)
+ProcArrayRemove(PGPROC *proc, TransactionId latestXid,
+                int nsubxids, TransactionId *subxids)
 {
     ProcArrayStruct *arrayP = procArray;
     int            index;
@@ -181,6 +251,18 @@ ProcArrayRemove(PGPROC *proc, TransactionId latestXid)

     LWLockAcquire(ProcArrayLock, LW_EXCLUSIVE);

+    /*
+     * Remove any UnobservedXids remaining
+     */
+    if (RecoveryInProgress())
+    {
+        /* XXX: How can any of the known subxids be in the unobserved xids
+         * array?
+        for (index = 0; index < nsubxids; index++)
+            UnobservedTransactionsRemoveXid(subxids[index], false);
+        */
+    }
+
     if (TransactionIdIsValid(latestXid))
     {
         Assert(TransactionIdIsValid(proc->xid));
@@ -193,7 +275,7 @@ ProcArrayRemove(PGPROC *proc, TransactionId latestXid)
     else
     {
         /* Shouldn't be trying to remove a live transaction here */
-        Assert(!TransactionIdIsValid(proc->xid));
+        Assert(RecoveryInProgress() || !TransactionIdIsValid(proc->xid));
     }

     for (index = 0; index < arrayP->numProcs; index++)
@@ -210,9 +292,19 @@ ProcArrayRemove(PGPROC *proc, TransactionId latestXid)
     /* Ooops */
     LWLockRelease(ProcArrayLock);

-    elog(LOG, "failed to find proc %p in ProcArray", proc);
+    elog(RecoveryInProgress() ? ERROR : LOG,
+            "failed to find proc %p in ProcArray", proc);
 }

+/*
+ * Initialisation when we switch into PM_RECOVERY mode.
+ * Expected caller is InitRecoveryTransactionEnvironment()
+ */
+void
+ProcArrayInitRecoveryEnvironment(void)
+{
+    PublishStartupProcessInformation();
+}

 /*
  * ProcArrayEndTransaction -- mark a transaction as no longer running
@@ -301,6 +393,7 @@ ProcArrayClearTransaction(PGPROC *proc)
     proc->xid = InvalidTransactionId;
     proc->lxid = InvalidLocalTransactionId;
     proc->xmin = InvalidTransactionId;
+    proc->lsn = InvalidXLogRecPtr;

     /* redundant, but just in case */
     proc->vacuumFlags &= ~PROC_VACUUM_STATE_MASK;
@@ -313,6 +406,247 @@ ProcArrayClearTransaction(PGPROC *proc)


 /*
+ * xidComparator
+ *        qsort comparison function for XIDs
+ */
+static int
+xidComparator(const void *arg1, const void *arg2)
+{
+    TransactionId xid1 = *(const TransactionId *) arg1;
+    TransactionId xid2 = *(const TransactionId *) arg2;
+
+    if (TransactionIdFollows(xid1, xid2))
+        return 1;
+    if (TransactionIdPrecedes(xid1, xid2))
+        return -1;
+    return 0;
+}
+
+void
+ProcArrayDisplay(int trace_level)
+{
+    ProcArrayStruct *arrayP = procArray;
+    int            index, i;
+    StringInfoData buf;
+
+    for (index = 0; index < arrayP->numProcs; index++)
+    {
+        PGPROC    *proc = arrayP->procs[index];
+        TransactionId    xid = proc->xid;
+        int nsubxids = proc->subxids.nxids;
+
+        initStringInfo(&buf);
+        appendStringInfo(&buf, "procarray %d xid %d lsn %X/%X ",
+                            index, xid,
+                            proc->lsn.xlogid, proc->lsn.xrecoff);
+        if (nsubxids > 0)
+        {
+            appendStringInfo(&buf, "nsubxids %u : ", nsubxids);
+
+            for (i = 0;    i < nsubxids; i++)
+                appendStringInfo(&buf, "%u ", proc->subxids.xids[i]);
+        }
+
+        elog(trace_level, "%s", buf.data);
+    }
+
+    UnobservedTransactionsDisplay(trace_level);
+}
+
+/*
+ * ProcArrayUpdateRecoveryTransactions -- initialise the proc array in recovery
+ *
+ * Use the data about running transactions on master to either create the
+ * initial state of the recovery procs, or maintain correctness of their
+ * state. In a sense this is almost the opposite of GetSnapshotData(),
+ * since we are updating the proc array based upon the snapshot. We do this
+ * as a cross-check that the proc array is correctly maintained, because
+ * we know it is possible that some transactions with FATAL errors do not
+ * write abort records and also to create the initial state of the procarray.
+ *
+ * Only used during recovery. Notice the signature is very similar to a
+ * _redo function.
+ */
+void
+ProcArrayUpdateRecoveryTransactions(XLogRecPtr lsn, xl_xact_running_xacts *xlrec)
+{
+    int                xid_index;    /* main loop */
+    int             index;
+    TransactionId  *xids;
+    int                nxids;
+
+    if (TransactionIdPrecedes(latestObservedXid, xlrec->latestRunningXid))
+        latestObservedXid = xlrec->latestRunningXid;
+
+    xids = palloc(sizeof(TransactionId) * (xlrec->xcnt + xlrec->subxcnt));
+    nxids = 0;
+
+    ProcArrayDisplay(trace_recovery(DEBUG3));
+
+    /*
+     * Scan through the incoming array of RunningXacts and collect xids.
+     * XXX: mark subtransactions with SubtransSetParent here too.
+     */
+    for (xid_index = 0; xid_index < xlrec->xcnt; xid_index++)
+    {
+        RunningXact        *rxact = (RunningXact *) xlrec->xrun;
+        TransactionId    xid = rxact[xid_index].xid;
+        TransactionId   *subxip = (TransactionId *) &(xlrec->xrun[xlrec->xcnt]);
+        int i;
+
+        xids[nxids++] = xid;
+        for(i = 0; i < rxact[xid_index].nsubxids; i++)
+            xids[nxids++] = subxip[index + rxact[xid_index].subx_offset + i];
+
+        elog(trace_recovery(DEBUG5),
+            "running xact lsn %X/%X xid %d",
+                lsn.xlogid, lsn.xrecoff, rxact[xid_index].xid);
+    }
+
+    /* We keep the unobserved xids array sorted at all times */
+    qsort(xids, nxids, sizeof(TransactionId), xidComparator);
+
+    LWLockAcquire(ProcArrayLock, LW_EXCLUSIVE);
+
+    if (TransactionIdPrecedes(ShmemVariableCache->latestCompletedXid,
+                              xlrec->latestCompletedXid))
+        ShmemVariableCache->latestCompletedXid = xlrec->latestCompletedXid;
+
+    /*
+     * Left prune the UnobservedXids array up to latestRunningXid.
+     * This is correct because at the time we take this snapshot, all
+     * completed transactions prior to latestRunningXid will be marked in
+     * WAL or they are explicitly present here.
+     *
+     * We can't clear the array completely because race conditions allow
+     * things to slip through sometimes. XXX: We do anyway
+     */
+    UnobservedTransactionsClearXids();
+
+    UnobservedTransactionsAddXids(InvalidTransactionId, nxids, xids);
+
+    /* Advance global latestCompletedXid while holding the lock */
+    if (TransactionIdPrecedes(ShmemVariableCache->latestCompletedXid,
+                              xlrec->latestCompletedXid))
+        ShmemVariableCache->latestCompletedXid = xlrec->latestCompletedXid;
+
+    /*
+     * If we fully applied the RunningXact data then we can (re)open
+     * for business. Whatever state we were in before.
+     */
+    procArray->allowStandbySnapshots = true;
+    RunningXactIsValid = true;
+
+    ProcArrayDisplay(trace_recovery(DEBUG3));
+
+    LWLockRelease(ProcArrayLock);
+
+    elog(DEBUG2, "Running xact data applied -- standby snapshots now enabled");
+}
+
+/*
+ * During recovery we maintain ProcArray with incoming xids when we first
+ * observe them in use. Uses local variables, so should only be called
+ * by Startup process.
+ *
+ * RecordKnownAssignedTransactionIds() should be run for *every* WAL record
+ * type apart from XLOG_XACT_RUNNING_XACTS, since that initialises the
+ * first snapshot so that RecordKnownAssignedTransactionIds() can be
+ * called. We don't currently check that rmgrs have called us.
+ * XXXHS: Perhaps we should?
+ *
+ * We record all xids that we know have been assigned. That includes
+ * all the xids on the WAL record, plus all unobserved xids that
+ * we can deduce have been assigned. We can deduce the existence of
+ * unobserved xids because we know xids are in sequence, with no gaps.
+ */
+bool
+RecordKnownAssignedTransactionIds(XLogRecPtr lsn, TransactionId xid)
+{
+    /*
+     * Skip processing if the current snapshot is invalid. If you're
+     * thinking of removing this, think again. We must have a valid
+     * initial state before we try to modify it.
+     */
+    if (!IsRunningXactDataValid())
+        return false;
+
+    /*
+     * VACUUM records are always sent with InvalidTransactionId, so
+     * invoke conflict processing if we see a record like this, even
+     * if there is no xid data to record.
+     */
+    if (!TransactionIdIsValid(xid))
+        return true;
+
+    /*
+     * When a newly observed xid arrives, it is frequently the case
+     * that it is *not* the next xid in sequence. When this occurs, we
+     * must treat the intervening xids as running also. So we maintain
+     * a special list of these UnobservedXids, so that snapshots can
+     * see the missing xids as in-progress.
+     *
+     * We maintain both recovery Procs *and* UnobservedXids because we
+     * need them both. Recovery procs allow us to store top-level xids
+     * and subtransactions separately, otherwise we wouldn't know
+     * when to overflow the subxid cache. UnobservedXids allow us to
+     * make sense of the out-of-order arrival of xids.
+     *
+     * Some examples:
+     * 1)    latestObservedXid = 647
+     *        next xid observed in WAL = 651 (a top-level transaction)
+     *        so we add 648, 649, 650 to UnobservedXids
+     *        and add 651 as a recovery proc
+     *
+     * 2)    latestObservedXid = 769
+     *        next xid observed in WAL = 771 (a subtransaction)
+     *        so we add 770 to UnobservedXids
+     *        and add 771 into the subxid cache of its top-level xid
+     *
+     * 3)    latestObservedXid = 769
+     *        next xid observed in WAL = 810 (a subtransaction)
+     *        810's parent had not yet recorded WAL = 807
+     *        so we add 770 thru 809 inclusive to UnobservedXids
+     *        then remove 807
+     *
+     * 4)    latestObservedXid = 769
+     *        next xid observed in WAL = 771 (a subtransaction)
+     *        771's parent had not yet recorded WAL = 770
+     *        so do nothing
+     *
+     * 5)    latestObservedXid = 7747
+     *        next xid observed in WAL = 7748 (a subtransaction)
+     *        7748's parent had not yet recorded WAL = 7742
+     *        so we add 7748 and removed 7742
+     */
+    if (TransactionIdFollows(xid, latestObservedXid))
+    {
+        TransactionId    next_expected_xid = latestObservedXid;
+        TransactionIdAdvance(next_expected_xid);
+
+        LWLockAcquire(ProcArrayLock, LW_EXCLUSIVE);
+        UnobservedTransactionsAddRange(next_expected_xid, xid);
+        LWLockRelease(ProcArrayLock);
+
+        latestObservedXid = xid;
+    }
+
+    elog(trace_recovery(DEBUG4),
+         "record known xact %u latestObservedXid %u",
+         xid, latestObservedXid);
+    return true;
+}
+
+/*
+ * Is the data available to allow valid snapshots?
+ */
+bool
+IsRunningXactDataValid(void)
+{
+    return RunningXactIsValid;
+}
+
+/*
  * TransactionIdIsInProgress -- is given transaction running in some backend
  *
  * Aside from some shortcuts such as checking RecentXmin and our own Xid,
@@ -589,6 +923,9 @@ GetOldestXmin(bool allDbs, bool ignoreVacuum)
     TransactionId result;
     int            index;

+    /* Cannot look for individual databases during recovery */
+    Assert(allDbs || !RecoveryInProgress());
+
     LWLockAcquire(ProcArrayLock, LW_SHARED);

     /*
@@ -655,7 +992,7 @@ GetOldestXmin(bool allDbs, bool ignoreVacuum)
  * but since PGPROC has only a limited cache area for subxact XIDs, full
  * information may not be available.  If we find any overflowed subxid arrays,
  * we have to mark the snapshot's subxid data as overflowed, and extra work
- * will need to be done to determine what's running (see XidInMVCCSnapshot()
+ * *may* need to be done to determine what's running (see XidInMVCCSnapshot()
  * in tqual.c).
  *
  * We also update the following backend-global variables:
@@ -697,17 +1034,23 @@ GetSnapshotData(Snapshot snapshot)
     if (snapshot->xip == NULL)
     {
         /*
-         * First call for this snapshot
+         * First call for this snapshot. In recovery we need an additional
+         * space allocation to allow for UnobservedXids, which never occur
+         * in normal running.
          */
-        snapshot->xip = (TransactionId *)
-            malloc(arrayP->maxProcs * sizeof(TransactionId));
+        if (RecoveryInProgress())
+            snapshot->xip = (TransactionId *)
+                malloc(3 * arrayP->maxProcs * sizeof(TransactionId));
+        else
+            snapshot->xip = (TransactionId *)
+                malloc(arrayP->maxProcs * sizeof(TransactionId));
         if (snapshot->xip == NULL)
             ereport(ERROR,
                     (errcode(ERRCODE_OUT_OF_MEMORY),
                      errmsg("out of memory")));
         Assert(snapshot->subxip == NULL);
         snapshot->subxip = (TransactionId *)
-            malloc(arrayP->maxProcs * PGPROC_MAX_CACHED_SUBXIDS * sizeof(TransactionId));
+            malloc((arrayP->maxProcs * PGPROC_MAX_CACHED_SUBXIDS) * sizeof(TransactionId));
         if (snapshot->subxip == NULL)
             ereport(ERROR,
                     (errcode(ERRCODE_OUT_OF_MEMORY),
@@ -720,6 +1063,16 @@ GetSnapshotData(Snapshot snapshot)
      */
     LWLockAcquire(ProcArrayLock, LW_SHARED);

+    if (RecoveryInProgress() && !arrayP->allowStandbySnapshots)
+    {
+        LWLockRelease(ProcArrayLock);
+        ereport(ERROR,
+            (errcode(ERRCODE_QUERY_CANCELED),
+             errmsg("canceling statement because standby snapshots are currently disabled"),
+             errdetail("Valid MVCC snapshot cannot be taken at this time."),
+             errhint("Contact your administrator if this error recurs frequently")));
+    }
+
     /* xmax is always latestCompletedXid + 1 */
     xmax = ShmemVariableCache->latestCompletedXid;
     Assert(TransactionIdIsNormal(xmax));
@@ -789,18 +1142,64 @@ GetSnapshotData(Snapshot snapshot)
             if (proc->subxids.overflowed)
                 subcount = -1;    /* overflowed */
             else
-            {
-                int            nxids = proc->subxids.nxids;
+              {
+                  int            nxids = proc->subxids.nxids;
+
+                  if (nxids > 0)
+                  {
+                      memcpy(snapshot->subxip + subcount,
+                             (void *) proc->subxids.xids,
+                             nxids * sizeof(TransactionId));
+                      subcount += nxids;
+                  }
+              }
+          }
+    }
+
+    /*
+     * Also check for unobserved xids. There is no need for us to specify
+     * that this is only if RecoveryInProgress(), since the list will
+     * always be empty when normal processing begins and the test will be
+     * optimised to nearly nothing very quickly.
+     */
+    if (RecoveryInProgress())
+    {
+        volatile TransactionId    *UnobservedXids;
+        UnobservedXids = (TransactionId *) &(arrayP->procs[arrayP->maxProcs]);
+        for (index = 0; index < arrayP->numUnobservedXids; index++)
+        {
+            TransactionId     xid;
+
+            /* Fetch xid just once - see GetNewTransactionId */
+            xid = UnobservedXids[index];
+
+            /*
+             * If there are no more visible xids, we're done. This works
+             * because UnobservedXids is maintained in strict ascending order.
+             */
+            if (!TransactionIdIsNormal(xid) || TransactionIdPrecedes(xid, xmax))
+                break;
+
+            /*
+             * Add unobserved xids onto the main xip array.
+             */
+            snapshot->xip[count++] = xid;
+
+            /*
+             * Check to see if this changes xmin. It is possible that an unobserved
+             * xid could be xmin if there is contention between long-lived
+             * transactions.
+             */
+            if (TransactionIdPrecedes(xid, xmin))
+                xmin = xid;

-                if (nxids > 0)
-                {
-                    memcpy(snapshot->subxip + subcount,
-                           (void *) proc->subxids.xids,
-                           nxids * sizeof(TransactionId));
-                    subcount += nxids;
-                }
-            }
         }
+        /*
+         * See if we have removed any subxids from the unobserved xids array
+         * that we might need to see.
+         */
+        if (!TransactionIdPrecedes(arrayP->lastOverflowedUnobservedXid, xmin))
+            subcount = -1;
     }

     if (!TransactionIdIsValid(MyProc->xmin))
@@ -839,6 +1238,197 @@ GetSnapshotData(Snapshot snapshot)
 }

 /*
+ * GetRunningTransactionData -- returns information about running transactions.
+ *
+ * Similar to GetSnapshotData but returning more information. We include
+ * all PGPROCs with an assigned TransactionId, even VACUUM processes. We
+ * also keep track of which subtransactions go with each PGPROC. All of this
+ * looks very similar to GetSnapshotData, but we have more procs and more info
+ * about each proc.
+ *
+ * This is never executed during recovery so there is no need to look at
+ * UnobservedXids.
+ *
+ * We don't worry about updating other counters, we want to keep this as
+ * simple as possible and leave GetSnapshotData() as the primary code for
+ * that bookkeeping.
+ */
+RunningTransactions
+GetRunningTransactionData(void)
+{
+    ProcArrayStruct *arrayP = procArray;
+    static RunningTransactions CurrentRunningXacts = (RunningTransactions) &CurrentRunningXactsData;
+    RunningXact    *rxact;
+    TransactionId *subxip;
+    TransactionId latestRunningXid = InvalidTransactionId;
+    TransactionId latestCompletedXid;
+    TransactionId oldestRunningXid = InvalidTransactionId;
+    int            index;
+    int            count = 0;
+    int            subcount = 0;
+    bool        suboverflowed = false;
+
+    /*
+     * Allocating space for maxProcs xids is usually overkill; numProcs would
+     * be sufficient.  But it seems better to do the malloc while not holding
+     * the lock, so we can't look at numProcs.  Likewise, we allocate much
+     * more subxip storage than is probably needed.
+     *
+     * Should only be allocated for bgwriter, since only ever executed
+     * during checkpoints.
+     */
+    if (CurrentRunningXacts->xrun == NULL)
+    {
+        /*
+         * First call
+         */
+        CurrentRunningXacts->xrun = (RunningXact *)
+            malloc(arrayP->maxProcs * sizeof(RunningXact));
+        if (CurrentRunningXacts->xrun == NULL)
+            ereport(ERROR,
+                    (errcode(ERRCODE_OUT_OF_MEMORY),
+                     errmsg("out of memory")));
+        Assert(CurrentRunningXacts->subxip == NULL);
+        CurrentRunningXacts->subxip = (TransactionId *)
+            malloc((arrayP->maxProcs * PGPROC_MAX_CACHED_SUBXIDS) * sizeof(TransactionId));
+        if (CurrentRunningXacts->subxip == NULL)
+            ereport(ERROR,
+                    (errcode(ERRCODE_OUT_OF_MEMORY),
+                     errmsg("out of memory")));
+    }
+
+    rxact = CurrentRunningXacts->xrun;
+    subxip = CurrentRunningXacts->subxip;
+
+    count = 0;
+    subcount = 0;
+    suboverflowed = false;
+
+    LWLockAcquire(ProcArrayLock, LW_SHARED);
+
+    latestCompletedXid = ShmemVariableCache->latestCompletedXid;
+
+    /*
+     * Spin over procArray checking xid, and subxids. Shared lock is enough
+     * because new transactions don't use locks at all, so LW_EXCLUSIVE
+     * wouldn't be enough to prevent them, so don't bother.
+     */
+    for (index = 0; index < arrayP->numProcs; index++)
+    {
+        volatile PGPROC *proc = arrayP->procs[index];
+        TransactionId xid;
+        int            nxids;
+
+        /* Fetch xid just once - see GetNewTransactionId */
+        xid = proc->xid;
+
+        /*
+         * We store all xids, even XIDs >= xmax and our own XID, if any.
+         * But we don't store transactions that don't have a TransactionId
+         * yet because they will not show as running on a standby server.
+         */
+        if (!TransactionIdIsValid(xid))
+            continue;
+
+        rxact[count].xid = xid;
+
+        if (TransactionIdPrecedes(latestRunningXid, xid))
+            latestRunningXid = xid;
+
+        if (!TransactionIdIsValid(oldestRunningXid) ||
+            TransactionIdPrecedes(xid, oldestRunningXid))
+            oldestRunningXid = xid;
+
+        /*
+         * Save subtransaction XIDs.
+         *
+         * The other backend can add more subxids concurrently, but cannot
+         * remove any.    Hence it's important to fetch nxids just once. Should
+         * be safe to use memcpy, though.  (We needn't worry about missing any
+         * xids added concurrently, because they must postdate xmax.)
+         *
+         * Again, our own XIDs *are* included in the snapshot.
+         */
+        nxids = proc->subxids.nxids;
+
+        if (nxids > 0)
+        {
+            TransactionId *subxids = (TransactionId *) proc->subxids.xids;
+
+            rxact[count].subx_offset = subcount;
+
+            memcpy(subxip + subcount,
+                   (void *) proc->subxids.xids,
+                   nxids * sizeof(TransactionId));
+            subcount += nxids;
+
+            if (proc->subxids.overflowed)
+            {
+                rxact[count].overflowed = true;
+                suboverflowed = true;
+            }
+
+            if (TransactionIdPrecedes(latestRunningXid, subxids[nxids - 1]))
+                latestRunningXid = subxids[nxids - 1];
+        }
+        else
+        {
+            rxact[count].subx_offset = 0;
+            rxact[count].overflowed = false;
+        }
+
+        rxact[count].nsubxids = nxids;
+        count++;
+    }
+
+    LWLockRelease(ProcArrayLock);
+
+    /*
+     * When there are no transactions running, just use the value
+     * of the last completed transaction. No need to check
+     * ReadNewTransactionId().
+     */
+    if (count == 0)
+        latestRunningXid = latestCompletedXid;
+
+    CurrentRunningXacts->xcnt = count;
+    CurrentRunningXacts->subxcnt = subcount;
+    CurrentRunningXacts->latestCompletedXid = latestCompletedXid;
+    CurrentRunningXacts->oldestRunningXid = oldestRunningXid;
+    if (suboverflowed)
+        CurrentRunningXacts->latestRunningXid = InvalidTransactionId;
+    else
+        CurrentRunningXacts->latestRunningXid = latestRunningXid;
+
+#ifdef RUNNING_XACT_DEBUG
+    elog(trace_recovery(DEBUG3),
+                    "logging running xacts xcnt %d subxcnt %d latestCompletedXid %d latestRunningXid %d",
+                    CurrentRunningXacts->xcnt,
+                    CurrentRunningXacts->subxcnt,
+                    CurrentRunningXacts->latestCompletedXid,
+                    CurrentRunningXacts->latestRunningXid);
+
+    for (index = 0; index < CurrentRunningXacts->xcnt; index++)
+    {
+        int j;
+        elog(trace_recovery(DEBUG3),
+                    "xid %d nsubxids %d offset %d, ovflow %s",
+                    CurrentRunningXacts->xrun[index].xid,
+                    CurrentRunningXacts->xrun[index].nsubxids,
+                    CurrentRunningXacts->xrun[index].subx_offset,
+                    CurrentRunningXacts->xrun[index].overflowed ? "t" : "f");
+        for (j = 0; j < CurrentRunningXacts->xrun[index].nsubxids; j++)
+            elog(trace_recovery(DEBUG3),
+                    "subxid offset %d j %d xid %d",
+                    CurrentRunningXacts->xrun[index].subx_offset, j,
+                    CurrentRunningXacts->subxip[j + CurrentRunningXacts->xrun[index].subx_offset]);
+    }
+#endif
+
+    return CurrentRunningXacts;
+}
+
+/*
  * GetTransactionsInCommit -- Get the XIDs of transactions that are committing
  *
  * Constructs an array of XIDs of transactions that are currently in commit
@@ -968,6 +1558,41 @@ BackendPidGetProc(int pid)
 }

 /*
+ * BackendXidGetProc -- get a backend's PGPROC given its XID
+ *
+ * Returns NULL if not found.  Note that it is up to the caller to be
+ * sure that the question remains meaningful for long enough for the
+ * answer to be used ...
+ */
+PGPROC *
+BackendXidGetProc(TransactionId xid)
+{
+    PGPROC       *result = NULL;
+    ProcArrayStruct *arrayP = procArray;
+    int            index;
+
+    if (xid == InvalidTransactionId)    /* never match invalid xid */
+        return 0;
+
+    LWLockAcquire(ProcArrayLock, LW_SHARED);
+
+    for (index = 0; index < arrayP->numProcs; index++)
+    {
+        PGPROC       *proc = arrayP->procs[index];
+
+        if (proc->xid == xid)
+        {
+            result = proc;
+            break;
+        }
+    }
+
+    LWLockRelease(ProcArrayLock);
+
+    return result;
+}
+
+/*
  * BackendXidGetPid -- get a backend's pid given its XID
  *
  * Returns 0 if not found or it's a prepared transaction.  Note that
@@ -1024,13 +1649,14 @@ IsBackendPid(int pid)
  * The array is palloc'd and is terminated with an invalid VXID.
  *
  * If limitXmin is not InvalidTransactionId, we skip any backends
- * with xmin >= limitXmin.    If allDbs is false, we skip backends attached
+ * with xmin >= limitXmin.    If dbOid is valid we skip backends attached
  * to other databases.  If excludeVacuum isn't zero, we skip processes for
  * which (excludeVacuum & vacuumFlags) is not zero.  Also, our own process
  * is always skipped.
+ *
  */
 VirtualTransactionId *
-GetCurrentVirtualXIDs(TransactionId limitXmin, bool allDbs, int excludeVacuum)
+GetCurrentVirtualXIDs(TransactionId limitXmin, Oid dbOid, int excludeVacuum)
 {
     VirtualTransactionId *vxids;
     ProcArrayStruct *arrayP = procArray;
@@ -1047,13 +1673,13 @@ GetCurrentVirtualXIDs(TransactionId limitXmin, bool allDbs, int excludeVacuum)
     {
         volatile PGPROC *proc = arrayP->procs[index];

-        if (proc == MyProc)
+        if (proc == MyProc || proc->pid == 0)
             continue;

         if (excludeVacuum & proc->vacuumFlags)
             continue;

-        if (allDbs || proc->databaseId == MyDatabaseId)
+        if (!OidIsValid(dbOid) || proc->databaseId == dbOid)
         {
             /* Fetch xmin just once - might change on us? */
             TransactionId pxmin = proc->xmin;
@@ -1083,6 +1709,219 @@ GetCurrentVirtualXIDs(TransactionId limitXmin, bool allDbs, int excludeVacuum)
     return vxids;
 }

+/*
+ * GetConflictingVirtualXIDs -- returns an array of currently active VXIDs.
+ *
+ * The array is palloc'd and is terminated with an invalid VXID.
+ *
+ * If limitXmin is not InvalidTransactionId, we skip any backends
+ * with xmin >= limitXmin.    If dbOid is valid we skip backends attached
+ * to other databases.  If roleId is valid we skip backends attached
+ * as other roles.
+ *
+ * Be careful to *not* pfree the result from this function. We reuse
+ * this array sufficiently often that we use malloc for the result.
+ * We only ever call
+ */
+VirtualTransactionId *
+GetConflictingVirtualXIDs(TransactionId limitXmin, Oid dbOid, Oid roleId)
+{
+    static VirtualTransactionId *vxids;
+    ProcArrayStruct *arrayP = procArray;
+    int            count = 0;
+    int            index;
+
+    /*
+     * If not first time through, get workspace to remember main XIDs in. We
+     * malloc it permanently to avoid repeated palloc/pfree overhead.
+     * Allow result space, remembering room for a terminator.
+     */
+    if (vxids == NULL)
+    {
+        vxids = (VirtualTransactionId *)
+            malloc(sizeof(VirtualTransactionId) * (arrayP->maxProcs + 1));
+        if (vxids == NULL)
+            ereport(ERROR,
+                    (errcode(ERRCODE_OUT_OF_MEMORY),
+                     errmsg("out of memory")));
+    }
+
+    LWLockAcquire(ProcArrayLock, LW_EXCLUSIVE);
+
+    for (index = 0; index < arrayP->numProcs; index++)
+    {
+        volatile PGPROC *proc = arrayP->procs[index];
+
+        /* Exclude recovery procs and prepared transactions */
+        if (proc->pid == 0)
+            continue;
+
+        if ((!OidIsValid(dbOid) && !OidIsValid(roleId)) ||
+            (proc->databaseId == dbOid && !OidIsValid(roleId)) ||
+            (OidIsValid(dbOid) && proc->roleId == roleId))
+        {
+            /* Fetch xmin just once - can't change on us, but good coding */
+            TransactionId pxmin = proc->xmin;
+
+            /*
+             * If limitXmin is set we explicitly choose to ignore an invalid
+             * pxmin because this means that backend has no snapshot and
+             * cannot get another one while we hold exclusive lock.
+             */
+            if (!TransactionIdIsValid(limitXmin) ||
+                (TransactionIdPrecedes(pxmin, limitXmin) && TransactionIdIsValid(pxmin)))
+            {
+                VirtualTransactionId vxid;
+
+                GET_VXID_FROM_PGPROC(vxid, *proc);
+                if (VirtualTransactionIdIsValid(vxid))
+                    vxids[count++] = vxid;
+            }
+        }
+    }
+
+    LWLockRelease(ProcArrayLock);
+
+    /* add the terminator */
+    vxids[count].backendId = InvalidBackendId;
+    vxids[count].localTransactionId = InvalidLocalTransactionId;
+
+    return vxids;
+}
+
+void
+SetDeferredRecoveryConflicts(TransactionId latestRemovedXid, RelFileNode node,
+                             XLogRecPtr conflict_lsn)
+{
+    ProcArrayStruct *arrayP = procArray;
+    int            index;
+    Oid            dbOid = node.dbNode;
+
+    Assert(InRecovery);
+
+    if (!LatestRemovedXidAdvances(latestRemovedXid))
+        return;
+
+    LWLockAcquire(ProcArrayLock, LW_EXCLUSIVE);
+
+    for (index = 0; index < arrayP->numProcs; index++)
+    {
+        volatile PGPROC *proc = arrayP->procs[index];
+
+        /* Exclude recovery procs and prepared transactions */
+        if (proc->pid == 0)
+            continue;
+
+        if (!OidIsValid(dbOid) || proc->databaseId == dbOid)
+        {
+            /* Fetch xmin just once - can't change on us, but good coding */
+            TransactionId pxmin = proc->xmin;
+
+            /*
+             * We ignore an invalid pxmin because this means that backend
+             * has no snapshot and cannot get another one while we hold
+             * exclusive lock on ProcArrayLock.
+             */
+            if (!TransactionIdIsValid(latestRemovedXid) ||
+                (TransactionIdPrecedes(pxmin, latestRemovedXid) &&
+                 TransactionIdIsValid(pxmin)))
+            {
+                /*
+                 * Fetch nConflicts just once and make sure we update
+                 * it last to ensure this needs no spinlocks
+                 */
+                int    nConflicts = proc->rconflicts.nConflicts;
+
+                if (nConflicts <= PGPROC_MAX_CACHED_CONFLICT_RELS)
+                {
+                    bool     found = false;
+                    int        i;
+
+                    if (nConflicts == 0)
+                    {
+                        /*
+                         * Record the first LSN we come across for *all* rels
+                         * this backend may access since we expect the cache
+                         * to overflow eventually and we'll end up needing to
+                         * use this LSN anyway.
+                         */
+                        proc->rconflicts.lsn = conflict_lsn;
+
+                        /*
+                         * Zero the cache before we start using it. We do this
+                         * here to avoid doing it during SnapshotResetXmin()
+                         * which would impact the non-recovery code path.
+                         */
+                        for (i = 0; i < PGPROC_MAX_CACHED_CONFLICT_RELS; i++)
+                            proc->rconflicts.rels[i] = 0;
+                    }
+                    else
+                    {
+                        /*
+                         * Search cache to see if it is already listed.
+                         */
+                        for (i = 0; i < nConflicts; i++)
+                        {
+                            if (proc->rconflicts.rels[i] == node.relNode)
+                                found = true;
+                        }
+                    }
+
+                    /* If we already have a conflict with this rel, continue */
+                    if (found)
+                        continue;
+
+                    /*
+                     * Add to cache, if there is still space
+                     */
+                    if (nConflicts < PGPROC_MAX_CACHED_CONFLICT_RELS)
+                        proc->rconflicts.rels[nConflicts] = node.relNode;
+
+                    /*
+                     * If nConflicts goes above PGPROC_MAX_CACHED_CONFLICT_RELS
+                     * that means we have overflowed and we won't store any
+                     * more rels.
+                     */
+                    proc->rconflicts.nConflicts = ++nConflicts;
+                }
+            }
+        }
+    }
+
+    LWLockRelease(ProcArrayLock);
+}
+
+PGPROC *
+VirtualTransactionIdGetProc(VirtualTransactionId vxid)
+{
+    ProcArrayStruct *arrayP = procArray;
+    PGPROC         *result = NULL;
+    int            index;
+
+    if (!VirtualTransactionIdIsValid(vxid))
+        return NULL;
+
+    LWLockAcquire(ProcArrayLock, LW_SHARED);
+
+    for (index = 0; index < arrayP->numProcs; index++)
+    {
+        VirtualTransactionId procvxid;
+        PGPROC       *proc = arrayP->procs[index];
+
+        GET_VXID_FROM_PGPROC(procvxid, *proc);
+
+        if (procvxid.backendId == vxid.backendId &&
+            procvxid.localTransactionId == vxid.localTransactionId)
+        {
+            result = proc;
+            break;
+        }
+    }
+
+    LWLockRelease(ProcArrayLock);
+
+    return result;
+}

 /*
  * CountActiveBackends --- count backends (other than myself) that are in
@@ -1111,7 +1950,7 @@ CountActiveBackends(void)
         if (proc == MyProc)
             continue;            /* do not count myself */
         if (proc->pid == 0)
-            continue;            /* do not count prepared xacts */
+            continue;            /* do not count prepared xacts or recovery procs */
         if (proc->xid == InvalidTransactionId)
             continue;            /* do not count if no XID assigned */
         if (proc->waitLock != NULL)
@@ -1139,7 +1978,7 @@ CountDBBackends(Oid databaseid)
         volatile PGPROC *proc = arrayP->procs[index];

         if (proc->pid == 0)
-            continue;            /* do not count prepared xacts */
+            continue;            /* do not count prepared xacts or recovery procs */
         if (proc->databaseId == databaseid)
             count++;
     }
@@ -1166,7 +2005,7 @@ CountUserBackends(Oid roleid)
         volatile PGPROC *proc = arrayP->procs[index];

         if (proc->pid == 0)
-            continue;            /* do not count prepared xacts */
+            continue;            /* do not count prepared xacts or recovery procs */
         if (proc->roleId == roleid)
             count++;
     }
@@ -1207,6 +2046,9 @@ CountOtherDBBackends(Oid databaseId, int *nbackends, int *nprepared)
     int            autovac_pids[MAXAUTOVACPIDS];
     int            tries;

+    /* Gives wrong answer in recovery, so make sure we don't use it */
+    Assert(!RecoveryInProgress());
+
     /* 50 tries with 100ms sleep between tries makes 5 sec total wait */
     for (tries = 0; tries < 50; tries++)
     {
@@ -1266,8 +2108,8 @@ CountOtherDBBackends(Oid databaseId, int *nbackends, int *nprepared)

 #define XidCacheRemove(i) \
     do { \
-        MyProc->subxids.xids[i] = MyProc->subxids.xids[MyProc->subxids.nxids - 1]; \
-        MyProc->subxids.nxids--; \
+        proc->subxids.xids[i] = proc->subxids.xids[proc->subxids.nxids - 1]; \
+        proc->subxids.nxids--; \
     } while (0)

 /*
@@ -1279,7 +2121,7 @@ CountOtherDBBackends(Oid databaseId, int *nbackends, int *nprepared)
  * latestXid must be the latest XID among the group.
  */
 void
-XidCacheRemoveRunningXids(TransactionId xid,
+XidCacheRemoveRunningXids(PGPROC *proc, TransactionId xid,
                           int nxids, const TransactionId *xids,
                           TransactionId latestXid)
 {
@@ -1306,9 +2148,9 @@ XidCacheRemoveRunningXids(TransactionId xid,
     {
         TransactionId anxid = xids[i];

-        for (j = MyProc->subxids.nxids - 1; j >= 0; j--)
+        for (j = proc->subxids.nxids - 1; j >= 0; j--)
         {
-            if (TransactionIdEquals(MyProc->subxids.xids[j], anxid))
+            if (TransactionIdEquals(proc->subxids.xids[j], anxid))
             {
                 XidCacheRemove(j);
                 break;
@@ -1322,21 +2164,21 @@ XidCacheRemoveRunningXids(TransactionId xid,
          * error during AbortSubTransaction.  So instead of Assert, emit a
          * debug warning.
          */
-        if (j < 0 && !MyProc->subxids.overflowed)
-            elog(WARNING, "did not find subXID %u in MyProc", anxid);
+        if (j < 0 && !proc->subxids.overflowed)
+            elog(WARNING, "did not find subXID %u in proc", anxid);
     }

-    for (j = MyProc->subxids.nxids - 1; j >= 0; j--)
+    for (j = proc->subxids.nxids - 1; j >= 0; j--)
     {
-        if (TransactionIdEquals(MyProc->subxids.xids[j], xid))
+        if (TransactionIdEquals(proc->subxids.xids[j], xid))
         {
             XidCacheRemove(j);
             break;
         }
     }
     /* Ordinarily we should have found it, unless the cache has overflowed */
-    if (j < 0 && !MyProc->subxids.overflowed)
-        elog(WARNING, "did not find subXID %u in MyProc", xid);
+    if (j < 0 && !proc->subxids.overflowed)
+        elog(WARNING, "did not find subXID %u in proc", xid);

     /* Also advance global latestCompletedXid while holding the lock */
     if (TransactionIdPrecedes(ShmemVariableCache->latestCompletedXid,
@@ -1367,3 +2209,225 @@ DisplayXidCache(void)
 }

 #endif   /* XIDCACHE_DEBUG */
+
+/* ----------------------------------------------
+ *         UnobservedTransactions sub-module
+ * ----------------------------------------------
+ *
+ * All functions must be called holding ProcArrayLock.
+ */
+void
+UnobservedTransactionsAddRange(TransactionId firstXid, TransactionId lastXid)
+{
+    TransactionId xid;
+
+    xid = firstXid;
+    while(TransactionIdPrecedes(xid, lastXid))
+    {
+        UnobservedTransactionsAddXids(firstXid, 0, NULL);
+        TransactionIdAdvance(xid);
+    }
+}
+
+/*
+ * Add unobserved xids to end of UnobservedXids array
+ */
+void
+UnobservedTransactionsAddXids(TransactionId xid, int nsubxids,
+                              TransactionId *subxid)
+{
+    int   index = procArray->numUnobservedXids;
+    TransactionId *UnobservedXids;
+
+    UnobservedXids = (TransactionId *) &(procArray->procs[procArray->maxProcs]);
+
+    /*
+     * UnobservedXids is maintained as a ascending list of xids, with no gaps.
+     * Incoming xids are always higher than previous entries, so we just add
+     * them directly to the end of the array.
+     */
+    for(;;)
+    {
+        if (TransactionIdIsValid(xid))
+        {
+            /*
+             * check to see if we have space to store more UnobservedXids
+             */
+            if (index >= procArray->maxUnobservedXids)
+            {
+                UnobservedTransactionsDisplay(WARNING);
+                elog(FATAL, "no more room in UnobservedXids array");
+            }
+
+            /*
+             * append xid to UnobservedXids
+             */
+#ifdef USE_ASSERT_CHECKING
+            if (TransactionIdIsValid(UnobservedXids[index]))
+            {
+                UnobservedTransactionsDisplay(LOG);
+                elog(FATAL, "unobservedxids leak: adding xid %u onto existing entry %d",
+                     xid, UnobservedXids[index]);
+            }
+
+            if ((index > 0 && TransactionIdPrecedes(xid, UnobservedXids[index - 1])))
+            {
+                UnobservedTransactionsDisplay(LOG);
+                elog(FATAL, "UnobservedXids leak: adding xid %u out of order at index %d",
+                     xid, index);
+            }
+#endif
+
+            elog(trace_recovery(DEBUG4), "adding unobservedxid %u (numxids %d min %u max %u)",
+                 xid, procArray->numUnobservedXids,
+                 UnobservedXids[0],
+                 UnobservedXids[procArray->numUnobservedXids]);
+
+            UnobservedXids[index++] = xid;
+        }
+
+        if (nsubxids <= 0)
+            break;
+        xid = *(subxid++);
+        nsubxids--;
+    }
+
+    procArray->numUnobservedXids = index;
+}
+
+static void UnobservedTransactionsRemoveXid(TransactionId xid,
+                                            bool missing_is_error);
+
+void
+UnobservedTransactionsRemoveXids(TransactionId xid, int nsubxids,
+                                 TransactionId *subxids, bool missing_is_error)
+{
+    int i;
+    if (TransactionIdIsValid(xid))
+        UnobservedTransactionsRemoveXid(xid, missing_is_error);
+    for (i = 0; i < nsubxids; i++)
+        UnobservedTransactionsRemoveXid(subxids[i], missing_is_error);
+}
+
+/*
+ * Remove one unobserved xid from anywhere on UnobservedXids array.
+ * If xid has already been pruned away, no need to report as missing.
+ */
+static void
+UnobservedTransactionsRemoveXid(TransactionId xid, bool missing_is_error)
+{
+    int             index;
+    bool            found = false;
+    TransactionId    *UnobservedXids;
+
+    UnobservedXids = (TransactionId *) &(procArray->procs[procArray->maxProcs]);
+
+    /*
+     * If we haven't initialised array yet, or if we've already cleared it
+     * ignore this and get on with it. If it's missing after this it is an
+     * ERROR if removal is requested and the value isn't present.
+     */
+    if (procArray->numUnobservedXids == 0 ||
+        (procArray->numUnobservedXids > 0 &&
+        TransactionIdPrecedes(xid, UnobservedXids[0])))
+        return;
+
+    elog(trace_recovery(DEBUG4), "remove unobservedxid %u (numxids %d min %u max %u)",
+                                        xid, procArray->numUnobservedXids,
+                                        UnobservedXids[0],
+                                        UnobservedXids[procArray->numUnobservedXids]);
+
+    /*
+     * Locate our xid, and if found shunt others sideways to close the gap.
+     */
+    for (index = 0; index < procArray->numUnobservedXids; index++)
+    {
+        if (!found)
+        {
+            if (UnobservedXids[index] == xid)
+                found = true;
+        }
+        else
+        {
+            Assert(index > 0);
+            UnobservedXids[index - 1] = UnobservedXids[index];
+        }
+    }
+
+    if (found)
+    {
+        procArray->numUnobservedXids--;
+        UnobservedXids[procArray->numUnobservedXids] = InvalidTransactionId;
+    }
+
+    elog(trace_recovery(DEBUG4), "finished removing unobservedxid %u (numxids %d min %u max %u)",
+                                        xid, procArray->numUnobservedXids,
+                                        UnobservedXids[0],
+                                        UnobservedXids[procArray->numUnobservedXids]);
+
+
+    if (!found && missing_is_error)
+    {
+        UnobservedTransactionsDisplay(LOG);
+        elog(ERROR, "could not remove unobserved xid = %d", xid);
+    }
+}
+
+/*
+ * Clear the whole array.
+ */
+void
+UnobservedTransactionsClearXids(void)
+{
+    int             index;
+    TransactionId    *UnobservedXids;
+
+    elog(trace_recovery(DEBUG4), "clear UnobservedXids");
+    UnobservedTransactionsDisplay(DEBUG4);
+
+    UnobservedXids = (TransactionId *) &(procArray->procs[procArray->maxProcs]);
+
+    /*
+     * UnobservedTransactionsAddXids() asserts that array will be empty
+     * when we add new values. so it must be zeroes here each time.
+     * That needs to be fast and accurate, this can be slowish.
+     */
+    for (index = 0; index < procArray->numUnobservedXids; index++)
+    {
+        UnobservedXids[index] = 0;
+    }
+
+    procArray->numUnobservedXids = 0;
+
+    /* Note that this is cleared too */
+    procArray->lastOverflowedUnobservedXid = InvalidTransactionId;
+}
+
+void
+AdvanceLastOverflowedUnobservedXid(TransactionId xid)
+{
+    if (TransactionIdPrecedes(procArray->lastOverflowedUnobservedXid, xid))
+        procArray->lastOverflowedUnobservedXid = xid;
+}
+
+void
+UnobservedTransactionsDisplay(int trace_level)
+{
+    int                index;
+    TransactionId    *UnobservedXids;
+    StringInfoData buf;
+
+    initStringInfo(&buf);
+
+    UnobservedXids = (TransactionId *) &(procArray->procs[procArray->maxProcs]);
+
+    for (index = 0; index < procArray->numUnobservedXids; index++)
+    {
+        if (TransactionIdIsValid(UnobservedXids[index]))
+            appendStringInfo(&buf, "%u ", UnobservedXids[index]);
+    }
+
+    elog(trace_level, "%d unobserved xids %s", procArray->numUnobservedXids, buf.data);
+
+    pfree(buf.data);
+}
diff --git a/src/backend/storage/ipc/sinvaladt.c b/src/backend/storage/ipc/sinvaladt.c
index cb4e0a9..8e0a60f 100644
--- a/src/backend/storage/ipc/sinvaladt.c
+++ b/src/backend/storage/ipc/sinvaladt.c
@@ -142,6 +142,7 @@ typedef struct ProcState
     int            nextMsgNum;        /* next message number to read */
     bool        resetState;        /* backend needs to reset its state */
     bool        signaled;        /* backend has been sent catchup signal */
+    bool        sendOnly;        /* backend only sends, never receives */

     /*
      * Next LocalTransactionId to use for each idle backend slot.  We keep
@@ -248,7 +249,7 @@ CreateSharedInvalidationState(void)
  *        Initialize a new backend to operate on the sinval buffer
  */
 void
-SharedInvalBackendInit(void)
+SharedInvalBackendInit(bool sendOnly)
 {
     int            index;
     ProcState  *stateP = NULL;
@@ -307,6 +308,7 @@ SharedInvalBackendInit(void)
     stateP->nextMsgNum = segP->maxMsgNum;
     stateP->resetState = false;
     stateP->signaled = false;
+    stateP->sendOnly = sendOnly;

     LWLockRelease(SInvalWriteLock);

@@ -578,7 +580,9 @@ SICleanupQueue(bool callerHasWriteLock, int minFree)
     /*
      * Recompute minMsgNum = minimum of all backends' nextMsgNum, identify
      * the furthest-back backend that needs signaling (if any), and reset
-     * any backends that are too far back.
+     * any backends that are too far back. Note that because we ignore
+     * sendOnly backends here it is possible for them to keep sending
+     * messages without a problem even when they are the only active backend.
      */
     min = segP->maxMsgNum;
     minsig = min - SIG_THRESHOLD;
@@ -590,7 +594,7 @@ SICleanupQueue(bool callerHasWriteLock, int minFree)
         int        n = stateP->nextMsgNum;

         /* Ignore if inactive or already in reset state */
-        if (stateP->procPid == 0 || stateP->resetState)
+        if (stateP->procPid == 0 || stateP->resetState || stateP->sendOnly)
             continue;

         /*
diff --git a/src/backend/storage/lmgr/lock.c b/src/backend/storage/lmgr/lock.c
index 7c8b1f5..0154900 100644
--- a/src/backend/storage/lmgr/lock.c
+++ b/src/backend/storage/lmgr/lock.c
@@ -35,9 +35,11 @@
 #include "access/transam.h"
 #include "access/twophase.h"
 #include "access/twophase_rmgr.h"
+#include "access/xact.h"
 #include "miscadmin.h"
 #include "pg_trace.h"
 #include "pgstat.h"
+#include "storage/sinval.h"
 #include "utils/memutils.h"
 #include "utils/ps_status.h"
 #include "utils/resowner.h"
@@ -490,6 +492,15 @@ LockAcquire(const LOCKTAG *locktag,
     if (lockmode <= 0 || lockmode > lockMethodTable->numLockModes)
         elog(ERROR, "unrecognized lock mode: %d", lockmode);

+    if (RecoveryInProgress() &&
+        locktag->locktag_type == LOCKTAG_OBJECT &&
+        lockmode > AccessShareLock)
+        ereport(ERROR,
+                (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+                 errmsg("cannot acquire lockmode %s on database objects while recovery is in progress",
+                                    lockMethodTable->lockModeNames[lockmode]),
+                 errhint("Only AccessShareLock can be acquired on database objects during recovery.")));
+
 #ifdef LOCK_DEBUG
     if (LOCK_DEBUG_ENABLED(locktag))
         elog(LOG, "LockAcquire: lock [%u,%u] %s",
@@ -817,6 +828,54 @@ LockAcquire(const LOCKTAG *locktag,

     LWLockRelease(partitionLock);

+    /*
+     * We made it all the way here. We've got the lock and we've got
+     * it for the first time in this transaction. So now it's time
+     * to send a WAL message so that standby servers can see this event,
+     * if its an AccessExclusiveLock on a relation.
+     */
+    if (!RecoveryInProgress() && lockmode >= AccessExclusiveLock &&
+        locktag->locktag_type == LOCKTAG_RELATION)
+    {
+        XLogRecData        rdata;
+        xl_rel_lock        xlrec;
+        TransactionId    xid;
+
+        /*
+         * First thing we do is ensure that a TransactionId has been
+         * assigned to this transaction. We don't actually need the xid
+         * but if we don't do this then RecordTransactionCommit() and
+         * RecordTransactionAbort() will optimise away the transaction
+         * completion record which recovery relies upon to release locks.
+         * It's a hack, but for a corner case not worth adding code for
+         * into the main commit path.
+         */
+        xid = GetTopTransactionId();
+        Assert(TransactionIdIsValid(xid));
+
+        Assert(OidIsValid(locktag->locktag_field2));
+
+        START_CRIT_SECTION();
+
+        /*
+         * Decode the locktag back to the original values, to avoid
+         * sending lots of empty bytes with every message.  See
+         * lock.h to check how a locktag is defined  for LOCKTAG_RELATION
+         */
+        xlrec.xid = xid;
+        xlrec.dbOid = locktag->locktag_field1;
+        xlrec.relOid = locktag->locktag_field2;
+
+        rdata.data = (char *) (&xlrec);
+        rdata.len = sizeof(xl_rel_lock);
+        rdata.buffer = InvalidBuffer;
+        rdata.next = NULL;
+
+        (void) XLogInsert(RM_RELATION_ID, XLOG_RELATION_LOCK, &rdata);
+
+        END_CRIT_SECTION();
+    }
+
     return LOCKACQUIRE_OK;
 }

diff --git a/src/backend/storage/lmgr/proc.c b/src/backend/storage/lmgr/proc.c
index 9e871ef..905064c 100644
--- a/src/backend/storage/lmgr/proc.c
+++ b/src/backend/storage/lmgr/proc.c
@@ -28,6 +28,10 @@
  *
  * ProcKill -- destroys the shared memory state (and locks)
  * associated with the process.
+ *
+ * In 8.4 we introduce the idea of recovery procs which hold state
+ * information for transactions currently being replayed. Many of the
+ * functions here apply only real procs representing connected users.
  */
 #include "postgres.h"

@@ -103,6 +107,8 @@ ProcGlobalShmemSize(void)
     size = add_size(size, mul_size(NUM_AUXILIARY_PROCS, sizeof(PGPROC)));
     /* MyProcs, including autovacuum */
     size = add_size(size, mul_size(MaxBackends, sizeof(PGPROC)));
+    /* RecoveryProcs, including recovery actions by autovacuum */
+    size = add_size(size, mul_size(MaxBackends, sizeof(PGPROC)));
     /* ProcStructLock */
     size = add_size(size, sizeof(slock_t));

@@ -172,6 +178,7 @@ InitProcGlobal(void)
      */
     ProcGlobal->freeProcs = NULL;
     ProcGlobal->autovacFreeProcs = NULL;
+    ProcGlobal->freeRecoveryProcs = NULL;

     ProcGlobal->spins_per_delay = DEFAULT_SPINS_PER_DELAY;

@@ -204,6 +211,35 @@ InitProcGlobal(void)
         ProcGlobal->autovacFreeProcs = &procs[i];
     }

+    /*
+     * Create enough recovery procs so there is a shadow proc for every
+     * proc on the master, including both normal procs, autovac procs
+     * and anything else that might run transactions and write WAL.
+     * Bgwriter writes WAL but does not have a TransactionId, so ignore.
+     * We use the same procs for prepared transactions whether we are
+     * in recovery or not, so no space required for them either.
+     *
+     * Recovery procs are just ghosts which store just enough information
+     * to make them look real to anyone requesting a snapshot from the
+     * procarray. So recovery procs don't need semaphores because they
+     * aren't actually performing any work.
+     *
+     * Although the recovery procs tie up some shared memory they will
+     * not be part of the ProcArray once the database has fully started
+     * up, so there is little performance effect during normal running.
+     */
+    procs = (PGPROC *) ShmemAlloc((MaxBackends) * sizeof(PGPROC));
+    if (!procs)
+        ereport(FATAL,
+                (errcode(ERRCODE_OUT_OF_MEMORY),
+                 errmsg("out of shared memory")));
+    MemSet(procs, 0, MaxBackends * sizeof(PGPROC));
+    for (i = 0; i < MaxBackends; i++)
+    {
+        procs[i].links.next = (SHM_QUEUE *) ProcGlobal->freeRecoveryProcs;
+        ProcGlobal->freeRecoveryProcs = &procs[i];
+    }
+
     MemSet(AuxiliaryProcs, 0, NUM_AUXILIARY_PROCS * sizeof(PGPROC));
     for (i = 0; i < NUM_AUXILIARY_PROCS; i++)
     {
@@ -342,7 +378,7 @@ InitProcessPhase2(void)
     /*
      * Add our PGPROC to the PGPROC array in shared memory.
      */
-    ProcArrayAdd(MyProc);
+    ProcArrayAdd(MyProc, true);

     /*
      * Arrange to clean that up at backend exit.
@@ -363,6 +399,11 @@ InitProcessPhase2(void)
  * to the ProcArray or the sinval messaging mechanism, either.    They also
  * don't get a VXID assigned, since this is only useful when we actually
  * hold lockmgr locks.
+ *
+ * Startup process however uses locks but never waits for them in the
+ * normal backend sense. Startup process also takes part in sinval messaging
+ * as a sendOnly process, so never reads messages from sinval queue. So
+ * Startup process does have a VXID and does show up in pg_locks.
  */
 void
 InitAuxiliaryProcess(void)
@@ -452,6 +493,27 @@ InitAuxiliaryProcess(void)
 }

 /*
+ * Additional initialisation for Startup process
+ */
+void
+PublishStartupProcessInformation(void)
+{
+    /* use volatile pointer to prevent code rearrangement */
+    volatile PROC_HDR *procglobal = ProcGlobal;
+
+    SpinLockAcquire(ProcStructLock);
+
+    /*
+     * Record Startup process information, for use in ProcSendSignal().
+     * See comments there for further explanation.
+     */
+    procglobal->startupProc = MyProc;
+    procglobal->startupProcPid = MyProcPid;
+
+    SpinLockRelease(ProcStructLock);
+}
+
+/*
  * Check whether there are at least N free PGPROC objects.
  *
  * Note: this is designed on the assumption that N will generally be small.
@@ -565,17 +627,21 @@ ProcReleaseLocks(bool isCommit)

 /*
  * RemoveProcFromArray() -- Remove this process from the shared ProcArray.
+ *
+ * Only intended for use with real procs, not recovery procs.
  */
 static void
 RemoveProcFromArray(int code, Datum arg)
 {
     Assert(MyProc != NULL);
-    ProcArrayRemove(MyProc, InvalidTransactionId);
+    ProcArrayRemove(MyProc, InvalidTransactionId, 0, NULL);
 }

 /*
  * ProcKill() -- Destroy the per-proc data structure for
  *        this process. Release any of its held LW locks.
+ *
+ * Only intended for use with real procs, not recovery procs.
  */
 static void
 ProcKill(int code, Datum arg)
@@ -1271,7 +1337,31 @@ ProcWaitForSignal(void)
 void
 ProcSendSignal(int pid)
 {
-    PGPROC       *proc = BackendPidGetProc(pid);
+    PGPROC       *proc = NULL;
+
+    if (RecoveryInProgress())
+    {
+        /* use volatile pointer to prevent code rearrangement */
+        volatile PROC_HDR *procglobal = ProcGlobal;
+
+        SpinLockAcquire(ProcStructLock);
+
+        /*
+         * Check to see whether it is the Startup process we wish to signal.
+         * This call is made by the buffer manager when it wishes to wake
+         * up a process that has been waiting for a pin in so it can obtain a
+         * cleanup lock using LockBufferForCleanup(). Startup is not a normal
+         * backend, so BackendPidGetProc() will not return any pid at all.
+         * So we remember the information for this special case.
+         */
+        if (pid == procglobal->startupProcPid)
+            proc = procglobal->startupProc;
+
+        SpinLockRelease(ProcStructLock);
+    }
+
+    if (proc == NULL)
+        proc = BackendPidGetProc(pid);

     if (proc != NULL)
         PGSemaphoreUnlock(&proc->sem);
diff --git a/src/backend/tcop/postgres.c b/src/backend/tcop/postgres.c
index 3781b55..e0d0412 100644
--- a/src/backend/tcop/postgres.c
+++ b/src/backend/tcop/postgres.c
@@ -2579,8 +2579,8 @@ StatementCancelHandler(SIGNAL_ARGS)
          * the interrupt immediately.  No point in interrupting if we're
          * waiting for input, however.
          */
-        if (ImmediateInterruptOK && InterruptHoldoffCount == 0 &&
-            CritSectionCount == 0 && !DoingCommandRead)
+        if (InterruptHoldoffCount == 0 && CritSectionCount == 0 &&
+            (DoingCommandRead || ImmediateInterruptOK))
         {
             /* bump holdoff count to make ProcessInterrupts() a no-op */
             /* until we are done getting ready for it */
@@ -2660,10 +2660,50 @@ ProcessInterrupts(void)
             ereport(ERROR,
                     (errcode(ERRCODE_QUERY_CANCELED),
                      errmsg("canceling autovacuum task")));
-        else
+        else
+        {
+            int cancelMode = MyProc->rconflicts.cancelMode;
+
+            /*
+             * XXXHS: We don't yet have a clean way to cancel an
+             * idle-in-transaction session, so make it FATAL instead.
+             */
+            if (DoingCommandRead && IsTransactionBlock() && cancelMode == ERROR)
+                cancelMode = FATAL;
+
+            switch (cancelMode)
+            {
+                case FATAL:
+                        Assert(RecoveryInProgress());
+                        ereport(FATAL,
+                            (errcode(ERRCODE_QUERY_CANCELED),
+                             errmsg("canceling session due to conflict with recovery")));
+                case ERROR:
+                        /*
+                         * We are aborting because we need to release
+                         * locks. So we need to abort out of all
+                         * subtransactions to make sure we release
+                         * all locks at whatever their level.
+                         *
+                         * XXXHS: Should we try to examine the
+                         * transaction tree and remove just enough
+                         * subxacts to remove locks? Doubt it.
+                         */
+                        Assert(RecoveryInProgress());
+                        AbortOutOfAnyTransaction();
+                        ereport(ERROR,
+                            (errcode(ERRCODE_QUERY_CANCELED),
+                             errmsg("canceling statement due to conflict with recovery")));
+                        return;
+                default:
+                        /* No conflict pending, so fall through */
+                        break;
+            }
+
             ereport(ERROR,
                     (errcode(ERRCODE_QUERY_CANCELED),
                      errmsg("canceling statement due to user request")));
+        }
     }
     /* If we get here, do nothing (probably, QueryCancelPending was reset) */
 }
@@ -3293,12 +3333,6 @@ PostgresMain(int argc, char *argv[], const char *username)
          */
         StartupXLOG();
         on_shmem_exit(ShutdownXLOG, 0);
-
-        /*
-         * We have to build the flat file for pg_database, but not for the
-         * user and group tables, since we won't try to do authentication.
-         */
-        BuildFlatFiles(true);
     }

     /*
@@ -3315,6 +3349,15 @@ PostgresMain(int argc, char *argv[], const char *username)
 #endif

     /*
+     * We have to build the flat file for pg_database, but not for the
+     * user and group tables, since we won't try to do authentication.
+     * We do this after PGPROCs have been initialised, since we read
+     * database buffers to do this.
+     */
+    if (!IsUnderPostmaster)
+        BuildFlatFiles(true);
+
+    /*
      * General initialization.
      *
      * NOTE: if you are tempted to add code in this vicinity, consider putting
diff --git a/src/backend/tcop/utility.c b/src/backend/tcop/utility.c
index 80070e3..65a9f2f 100644
--- a/src/backend/tcop/utility.c
+++ b/src/backend/tcop/utility.c
@@ -288,10 +288,22 @@ ProcessUtility(Node *parsetree,
                                     SetPGVariable("transaction_isolation",
                                                   list_make1(item->arg),
                                                   true);
+
                                 else if (strcmp(item->defname, "transaction_read_only") == 0)
+                                {
+                                    A_Const       *con;
+
+                                    Assert(IsA(item->arg, A_Const));
+                                    con = (A_Const *) item->arg;
+                                    Assert(nodeTag(&con->val) == T_Integer);
+
+                                    if (!intVal(&con->val))
+                                        PreventCommandDuringRecovery();
+
                                     SetPGVariable("transaction_read_only",
                                                   list_make1(item->arg),
                                                   true);
+                                }
                             }
                         }
                         break;
@@ -306,6 +318,7 @@ ProcessUtility(Node *parsetree,
                         break;

                     case TRANS_STMT_PREPARE:
+                        PreventCommandDuringRecovery();
                         if (!PrepareTransactionBlock(stmt->gid))
                         {
                             /* report unsuccessful commit in completionTag */
@@ -315,11 +328,13 @@ ProcessUtility(Node *parsetree,
                         break;

                     case TRANS_STMT_COMMIT_PREPARED:
+                        PreventCommandDuringRecovery();
                         PreventTransactionChain(isTopLevel, "COMMIT PREPARED");
                         FinishPreparedTransaction(stmt->gid, true);
                         break;

                     case TRANS_STMT_ROLLBACK_PREPARED:
+                        PreventCommandDuringRecovery();
                         PreventTransactionChain(isTopLevel, "ROLLBACK PREPARED");
                         FinishPreparedTransaction(stmt->gid, false);
                         break;
@@ -690,6 +705,7 @@ ProcessUtility(Node *parsetree,
             break;

         case T_GrantStmt:
+            PreventCommandDuringRecovery();
             ExecuteGrantStmt((GrantStmt *) parsetree);
             break;

@@ -860,6 +876,7 @@ ProcessUtility(Node *parsetree,
         case T_NotifyStmt:
             {
                 NotifyStmt *stmt = (NotifyStmt *) parsetree;
+                PreventCommandDuringRecovery();

                 Async_Notify(stmt->conditionname);
             }
@@ -868,6 +885,7 @@ ProcessUtility(Node *parsetree,
         case T_ListenStmt:
             {
                 ListenStmt *stmt = (ListenStmt *) parsetree;
+                PreventCommandDuringRecovery();

                 Async_Listen(stmt->conditionname);
             }
@@ -876,6 +894,7 @@ ProcessUtility(Node *parsetree,
         case T_UnlistenStmt:
             {
                 UnlistenStmt *stmt = (UnlistenStmt *) parsetree;
+                PreventCommandDuringRecovery();

                 if (stmt->conditionname)
                     Async_Unlisten(stmt->conditionname);
@@ -895,10 +914,12 @@ ProcessUtility(Node *parsetree,
             break;

         case T_ClusterStmt:
+            PreventCommandDuringRecovery();
             cluster((ClusterStmt *) parsetree, isTopLevel);
             break;

         case T_VacuumStmt:
+            PreventCommandDuringRecovery();
             vacuum((VacuumStmt *) parsetree, InvalidOid, true, NULL, false,
                    isTopLevel);
             break;
@@ -1014,12 +1035,14 @@ ProcessUtility(Node *parsetree,
                 ereport(ERROR,
                         (errcode(ERRCODE_INSUFFICIENT_PRIVILEGE),
                          errmsg("must be superuser to do CHECKPOINT")));
+            PreventCommandDuringRecovery();
             RequestCheckpoint(CHECKPOINT_IMMEDIATE | CHECKPOINT_FORCE | CHECKPOINT_WAIT);
             break;

         case T_ReindexStmt:
             {
                 ReindexStmt *stmt = (ReindexStmt *) parsetree;
+                PreventCommandDuringRecovery();

                 switch (stmt->kind)
                 {
@@ -2504,3 +2527,12 @@ GetCommandLogLevel(Node *parsetree)

     return lev;
 }
+
+void
+PreventCommandDuringRecovery(void)
+{
+    if (RecoveryInProgress())
+        ereport(ERROR,
+            (errcode(ERRCODE_READ_ONLY_SQL_TRANSACTION),
+             errmsg("cannot be run until recovery completes")));
+}
diff --git a/src/backend/utils/adt/txid.c b/src/backend/utils/adt/txid.c
index 7e51f9e..0a4ee4b 100644
--- a/src/backend/utils/adt/txid.c
+++ b/src/backend/utils/adt/txid.c
@@ -338,6 +338,12 @@ txid_current(PG_FUNCTION_ARGS)
     txid        val;
     TxidEpoch    state;

+    if (RecoveryInProgress())
+        ereport(ERROR,
+                (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+                 errmsg("cannot assign txid while recovery is in progress"),
+                 errhint("only read only queries can execute during recovery")));
+
     load_xid_epoch(&state);

     val = convert_xid(GetTopTransactionId(), &state);
diff --git a/src/backend/utils/cache/inval.c b/src/backend/utils/cache/inval.c
index 9738fa1..541e0d9 100644
--- a/src/backend/utils/cache/inval.c
+++ b/src/backend/utils/cache/inval.c
@@ -86,10 +86,16 @@
  */
 #include "postgres.h"

+#include <signal.h>
+
+#include "access/transam.h"
 #include "access/twophase_rmgr.h"
 #include "access/xact.h"
 #include "catalog/catalog.h"
 #include "miscadmin.h"
+#include "storage/lmgr.h"
+#include "storage/procarray.h"
+#include "storage/proc.h"
 #include "storage/sinval.h"
 #include "storage/smgr.h"
 #include "utils/inval.h"
@@ -155,6 +161,14 @@ typedef struct TransInvalidationInfo

 static TransInvalidationInfo *transInvalInfo = NULL;

+static SharedInvalidationMessage *SharedInvalidMessagesArray;
+static int                     numSharedInvalidMessagesArray;
+static int                     maxSharedInvalidMessagesArray;
+
+static List *RecoveryLockList;
+static MemoryContext    RelationLockContext;
+
+
 /*
  * Dynamically-registered callback functions.  Current implementation
  * assumes there won't be very many of these at once; could improve if needed.
@@ -741,6 +755,8 @@ AtStart_Inval(void)
         MemoryContextAllocZero(TopTransactionContext,
                                sizeof(TransInvalidationInfo));
     transInvalInfo->my_level = GetCurrentTransactionNestLevel();
+    SharedInvalidMessagesArray = NULL;
+    numSharedInvalidMessagesArray = 0;
 }

 /*
@@ -851,6 +867,126 @@ inval_twophase_postcommit(TransactionId xid, uint16 info,
     }
 }

+static void
+MakeSharedInvalidMessagesArray(const SharedInvalidationMessage *msgs, int n)
+{
+    /*
+     * Initialise array first time through in each commit
+     */
+    if (SharedInvalidMessagesArray == NULL)
+    {
+        maxSharedInvalidMessagesArray = FIRSTCHUNKSIZE;
+        numSharedInvalidMessagesArray = 0;
+
+        /*
+         * Although this is being palloc'd we don't actually free it directly.
+         * We're so close to EOXact that we now we're going to lose it anyhow.
+         */
+        SharedInvalidMessagesArray = palloc(maxSharedInvalidMessagesArray
+                                            * sizeof(SharedInvalidationMessage));
+    }
+
+    if ((numSharedInvalidMessagesArray + n) > maxSharedInvalidMessagesArray)
+    {
+        while ((numSharedInvalidMessagesArray + n) > maxSharedInvalidMessagesArray)
+            maxSharedInvalidMessagesArray *= 2;
+
+        SharedInvalidMessagesArray = repalloc(SharedInvalidMessagesArray,
+                                            maxSharedInvalidMessagesArray
+                                            * sizeof(SharedInvalidationMessage));
+    }
+
+    /*
+     * Append the next chunk onto the array
+     */
+    memcpy(SharedInvalidMessagesArray + numSharedInvalidMessagesArray,
+            msgs, n * sizeof(SharedInvalidationMessage));
+    numSharedInvalidMessagesArray += n;
+}
+
+/*
+ * xactGetCommittedInvalidationMessages() is executed by
+ * RecordTransactionCommit() to add invalidation messages onto the
+ * commit record. This applies only to commit message types, never to
+ * abort records. Must always run before AtEOXact_Inval(), since that
+ * removes the data we need to see.
+ *
+ * Remember that this runs before we have officially committed, so we
+ * must not do anything here to change what might occur *if* we should
+ * fail between here and the actual commit.
+ *
+ * Note that transactional validation does *not* write a invalidation
+ * WAL message using XLOG_RELATION_INVAL messages. Those are only used
+ * by non-transactional invalidation. see comments in
+ * EndNonTransactionalInvalidation().
+ *
+ * see also xact_redo_commit() and xact_desc_commit()
+ */
+int
+xactGetCommittedInvalidationMessages(SharedInvalidationMessage **msgs,
+                                        bool *RelcacheInitFileInval)
+{
+    MemoryContext oldcontext;
+
+    /* Must be at top of stack */
+    Assert(transInvalInfo != NULL && transInvalInfo->parent == NULL);
+
+    /*
+     * Relcache init file invalidation requires processing both before and
+     * after we send the SI messages.  However, we need not do anything
+     * unless we committed.
+     */
+    if (transInvalInfo->RelcacheInitFileInval)
+        *RelcacheInitFileInval = true;
+    else
+        *RelcacheInitFileInval = false;
+
+    /*
+     * Walk through TransInvalidationInfo to collect all the messages
+     * into a single contiguous array of invalidation messages. It must
+     * be contiguous so we can copy directly into WAL message. Maintain the
+     * order that they would be processed in by AtEOXact_Inval(), to ensure
+     * emulated behaviour in redo is as similar as possible to original.
+     * We want the same bugs, if any, not new ones.
+     */
+    oldcontext = MemoryContextSwitchTo(CurTransactionContext);
+
+    ProcessInvalidationMessagesMulti(&transInvalInfo->CurrentCmdInvalidMsgs,
+                                     MakeSharedInvalidMessagesArray);
+    ProcessInvalidationMessagesMulti(&transInvalInfo->PriorCmdInvalidMsgs,
+                                     MakeSharedInvalidMessagesArray);
+    MemoryContextSwitchTo(oldcontext);
+
+#ifdef STANDBY_INVAL_DEBUG
+    if (numSharedInvalidMessagesArray > 0)
+    {
+        int i;
+
+        elog(LOG, "numSharedInvalidMessagesArray = %d", numSharedInvalidMessagesArray);
+
+        Assert(SharedInvalidMessagesArray != NULL);
+
+        for (i = 0; i < numSharedInvalidMessagesArray; i++)
+        {
+            SharedInvalidationMessage *msg = SharedInvalidMessagesArray + i;
+
+            if (msg->id >= 0)
+                elog(LOG, "catcache id %d", msg->id);
+            else if (msg->id == SHAREDINVALRELCACHE_ID)
+                elog(LOG, "relcache id %d", msg->id);
+            else if (msg->id == SHAREDINVALSMGR_ID)
+                elog(LOG, "smgr cache id %d", msg->id);
+        }
+    }
+#endif
+
+    if (numSharedInvalidMessagesArray > 0)
+        Assert(SharedInvalidMessagesArray != NULL);
+
+    *msgs = SharedInvalidMessagesArray;
+
+    return numSharedInvalidMessagesArray;
+}

 /*
  * AtEOXact_Inval
@@ -1041,6 +1177,42 @@ BeginNonTransactionalInvalidation(void)
     Assert(transInvalInfo->CurrentCmdInvalidMsgs.cclist == NULL);
     Assert(transInvalInfo->CurrentCmdInvalidMsgs.rclist == NULL);
     Assert(transInvalInfo->RelcacheInitFileInval == false);
+
+    SharedInvalidMessagesArray = NULL;
+    numSharedInvalidMessagesArray = 0;
+}
+
+/*
+ * General function to log the SharedInvalidMessagesArray. Only current
+ * caller is EndNonTransactionalInvalidation(), but that may change.
+ */
+static void
+LogSharedInvalidMessagesArray(void)
+{
+    XLogRecData        rdata[2];
+    xl_rel_inval    xlrec;
+
+    if (numSharedInvalidMessagesArray == 0)
+        return;
+
+    START_CRIT_SECTION();
+
+    xlrec.nmsgs = numSharedInvalidMessagesArray;
+
+    rdata[0].data = (char *) (&xlrec);
+    rdata[0].len = MinSizeOfRelationInval;
+    rdata[0].buffer = InvalidBuffer;
+
+    rdata[0].next = &(rdata[1]);
+    rdata[1].data = (char *) SharedInvalidMessagesArray;
+    rdata[1].len = numSharedInvalidMessagesArray *
+                                sizeof(SharedInvalidationMessage);
+    rdata[1].buffer = InvalidBuffer;
+    rdata[1].next = NULL;
+
+    (void) XLogInsert(RM_RELATION_ID, XLOG_RELATION_INVAL, rdata);
+
+    END_CRIT_SECTION();
 }

 /*
@@ -1081,7 +1253,25 @@ EndNonTransactionalInvalidation(void)
     ProcessInvalidationMessagesMulti(&transInvalInfo->CurrentCmdInvalidMsgs,
                                      SendSharedInvalidMessages);

+    /*
+     * Write invalidation messages to WAL. This is not required for
+     * recovery, it is only required for standby servers. It's fairly
+     * low overhead so don't worry. This allows us to trigger inval
+     * messages on the standby as soon as we see these records.
+     * see relation_redo_inval()
+     *
+     * Note that transactional validation uses an array attached to
+     * a WAL commit record, so these messages are rare.
+     */
+    ProcessInvalidationMessagesMulti(&transInvalInfo->CurrentCmdInvalidMsgs,
+                                     MakeSharedInvalidMessagesArray);
+    LogSharedInvalidMessagesArray();
+
     /* Clean up and release memory */
+
+    /* XXXHS: Think some more on memory allocation and freeing.
+     */
+
     for (chunk = transInvalInfo->CurrentCmdInvalidMsgs.cclist;
          chunk != NULL;
          chunk = next)
@@ -1235,3 +1425,439 @@ CacheRegisterRelcacheCallback(RelcacheCallbackFunction func,

     ++relcache_callback_count;
 }
+
+/*
+ * -----------------------------------------------------
+ *         Standby wait timers and backend cancel logic
+ * -----------------------------------------------------
+ */
+
+static void
+InitStandbyDelayTimers(int *currentDelay_ms, int *standbyWait_ms)
+{
+    *currentDelay_ms = GetLatestReplicationDelay();
+
+    /*
+     * If replication delay is enormously huge, just treat that as
+     * zero and work up from there. This prevents us from acting
+     * foolishly when replaying old log files.
+     */
+    if (*currentDelay_ms < 0)
+        *currentDelay_ms = 0;
+
+#define STANDBY_INITIAL_WAIT_MS  1
+    *standbyWait_ms = STANDBY_INITIAL_WAIT_MS;
+}
+
+/*
+ * Standby wait logic for ResolveRecoveryConflictWithVirtualXIDs.
+ * We wait here for a while then return. If we decide we can't wait any
+ * more then we return true, if we can wait some more return false.
+ */
+static bool
+WaitExceedsMaxStandbyDelay(int *currentDelay_ms, int *standbyWait_ms)
+{
+    int        maxStandbyDelay_ms = maxStandbyDelay * 1000;
+
+    /*
+     * If the server is already further behind than we would
+     * like then no need to wait or do more complex logic.
+     * max_standby_delay = -1 means wait for ever, if necessary
+     */
+    if (maxStandbyDelay >= 0 &&
+        *currentDelay_ms >= maxStandbyDelay_ms)
+        return true;
+
+    /*
+     * Sleep, then do bookkeeping.
+     */
+    pg_usleep(*standbyWait_ms * 1000L);
+    *currentDelay_ms += *standbyWait_ms;
+
+    /*
+     * Progressively increase the sleep times.
+     */
+    *standbyWait_ms *= 2;
+    if (*standbyWait_ms > 1000)
+        *standbyWait_ms = 1000;
+
+    /*
+     * Re-test our exit criteria
+     */
+    if (maxStandbyDelay >= 0 &&
+        *currentDelay_ms >= maxStandbyDelay_ms)
+        return true;
+
+    return false;
+}
+
+/*
+ * This is the main executioner for any query backend that conflicts with
+ * recovery processing. Judgement has already been passed on it within
+ * a specific rmgr. Here we just issue the orders to the procs. The procs
+ * then throw the required error as instructed.
+ *
+ * We may ask for a specific cancel_mode, typically ERROR or FATAL.
+ *
+ * If we want an ERROR, we may defer that until the buffer manager
+ * sees a recently changed block. If we want this we must specify a
+ * valid conflict_lsn.
+ */
+void
+ResolveRecoveryConflictWithVirtualXIDs(VirtualTransactionId *waitlist,
+                                        char *reason, int cancel_mode,
+                                        XLogRecPtr conflict_lsn)
+{
+    int                standbyWait_ms;
+    int             currentDelay_ms;
+    bool            logged;
+    int                wontDieWait = 1;
+
+    InitStandbyDelayTimers(¤tDelay_ms, &standbyWait_ms);
+    logged = false;
+
+    while (VirtualTransactionIdIsValid(*waitlist))
+    {
+        /*
+         * log that we have been waiting for a while now...
+         */
+        if (!logged && standbyWait_ms > 500)
+        {
+            elog(trace_recovery(DEBUG5),
+                    "virtual transaction %u/%u is blocking %s",
+                        waitlist->backendId,
+                        waitlist->localTransactionId,
+                        reason);
+            logged = true;
+        }
+
+        if (ConditionalVirtualXactLockTableWait(*waitlist))
+        {
+            waitlist++;
+            InitStandbyDelayTimers(¤tDelay_ms, &standbyWait_ms);
+            logged = false;
+        }
+        else if (WaitExceedsMaxStandbyDelay(¤tDelay_ms,
+                                             &standbyWait_ms))
+        {
+            /*
+             * Now find out who to throw out of the balloon.
+             */
+            PGPROC *proc;
+
+            Assert(VirtualTransactionIdIsValid(*waitlist));
+            proc = VirtualTransactionIdGetProc(*waitlist);
+
+            /*
+             * Kill the pid if it's still here. If not, that's what we wanted
+             * so ignore any errors.
+             */
+            if (proc)
+            {
+                /*
+                 * Startup process debug messages
+                 */
+                switch (cancel_mode)
+                {
+                    case FATAL:
+                        elog(trace_recovery(DEBUG1),
+                            "recovery disconnects session with pid %d "
+                            "because of conflict with %s (current delay %d secs)",
+                                proc->pid,
+                                reason,
+                                currentDelay_ms / 1000);
+                            break;
+                    case ERROR:
+                        elog(trace_recovery(DEBUG1),
+                            "recovery cancels virtual transaction %u/%u pid %d "
+                            "because of conflict with %s (current delay %d secs)",
+                                waitlist->backendId,
+                                waitlist->localTransactionId,
+                                proc->pid,
+                                reason,
+                                currentDelay_ms / 1000);
+                            break;
+                    default:
+                            /* No conflict pending, so fall through */
+                            break;
+                }
+
+                Assert(proc->pid != 0);
+
+                /*
+                 * Issue orders for the proc to read next time it receives SIGINT
+                 */
+                proc->rconflicts.cancelMode = cancel_mode;
+
+                /*
+                 * Do we expect it to talk? No, Mr. Bond, we expect it to die.
+                 */
+                kill(proc->pid, SIGINT);
+
+                /* wait awhile for it to die */
+                pg_usleep(wontDieWait * 5000L);
+                wontDieWait *= 2;
+            }
+        }
+    }
+}
+
+/*
+ * -----------------------------------------------------
+ * Locking in Recovery Mode
+ * -----------------------------------------------------
+ *
+ * All locks are held by the Startup process using a single virtual
+ * transaction. This implementation is both simpler and in some senses,
+ * more correct. The locks held mean "some original transaction held
+ * this lock, so query access is not allowed at this time". So the Startup
+ * process is the proxy by which the original locks are implemented.
+ *
+ * We only keep track of AccessExclusiveLocks, which are only ever held by
+ * one transaction on one relation, and don't worry about lock queuing.
+ *
+ * We keep a single dynamically expandible list of locks in local memory,
+ * RelationLockList, so we can keep track of the various entried made by
+ * the Startup process's virtual xid in the shared lock table.
+ *
+ * List elements use type xl_rel_lock, since the WAL record type exactly
+ * matches the information that we need to keep track of.
+ *
+ * We use session locks rather than normal locks so we don't need
+ * ResourceOwners.
+ */
+
+/* called by relation_redo_lock() */
+static void
+RelationAddRecoveryLock(xl_rel_lock *lockRequest)
+{
+    xl_rel_lock     *newlock;
+    LOCKTAG            locktag;
+    MemoryContext     old_context;
+
+    elog(trace_recovery(DEBUG4),
+            "adding recovery lock: db %d rel %d",
+                lockRequest->dbOid, lockRequest->relOid);
+
+    /*
+     * dbOid is InvalidOid when we are locking a shared relation.
+     */
+    Assert(OidIsValid(lockRequest->relOid));
+
+    if (RelationLockContext == NULL)
+        RelationLockContext = AllocSetContextCreate(TopMemoryContext,
+                                                        "RelationLocks",
+                                                        ALLOCSET_DEFAULT_MINSIZE,
+                                                        ALLOCSET_DEFAULT_INITSIZE,
+                                                        ALLOCSET_DEFAULT_MAXSIZE);
+
+    old_context = MemoryContextSwitchTo(RelationLockContext);
+    newlock = palloc(sizeof(xl_rel_lock));
+    MemoryContextSwitchTo(old_context);
+
+    newlock->xid = lockRequest->xid;
+    newlock->dbOid = lockRequest->dbOid;
+    newlock->relOid = lockRequest->relOid;
+    RecoveryLockList = lappend(RecoveryLockList, newlock);
+
+    /*
+     * Attempt to acquire the lock as requested.
+     */
+    SET_LOCKTAG_RELATION(locktag, newlock->dbOid, newlock->relOid);
+
+    /*
+     * Waiting for lock to clear or kill anyone in our way. Not a
+     * completely foolproof way of getting the lock, but we cannot
+     * afford to sit and wait for the lock indefinitely. This is
+     * one reason to reduce strengths of various locks in 8.4.
+     */
+    while (LockAcquire(&locktag, AccessExclusiveLock, true, true)
+                                            == LOCKACQUIRE_NOT_AVAIL)
+    {
+        VirtualTransactionId *old_lockholders;
+
+        old_lockholders = GetLockConflicts(&locktag, AccessExclusiveLock);
+        ResolveRecoveryConflictWithVirtualXIDs(old_lockholders,
+                                                "exclusive lock",
+                                                ERROR,
+                                                InvalidXLogRecPtr);
+    }
+}
+
+static void
+RelationRemoveRecoveryLocks(TransactionId xid)
+{
+    ListCell   *l;
+    LOCKTAG        locktag;
+    List        *deletionList = NIL;
+
+    /*
+     * Release all matching locks and identify list elements to remove
+     */
+    foreach(l, RecoveryLockList)
+    {
+        xl_rel_lock *lock = (xl_rel_lock *) lfirst(l);
+
+        elog(trace_recovery(DEBUG4),
+                "releasing recovery lock: xid %u db %d rel %d",
+                        lock->xid, lock->dbOid, lock->relOid);
+
+        if (!TransactionIdIsValid(xid) || lock->xid == xid)
+        {
+            SET_LOCKTAG_RELATION(locktag, lock->dbOid, lock->relOid);
+            if (!LockRelease(&locktag, AccessExclusiveLock, true))
+                elog(trace_recovery(LOG),
+                    "RecoveryLockList contains entry for lock "
+                    "no longer recorded by lock manager "
+                    "xid %u database %d relation %d",
+                        lock->xid, lock->dbOid, lock->relOid);
+            deletionList = lappend(deletionList, lock);
+        }
+    }
+
+    /*
+     * Now remove the elements from RecoveryLockList. We can't navigate
+     * the list at the same time as deleting multiple elements from it.
+     */
+    foreach(l, deletionList)
+    {
+        xl_rel_lock *lock = (xl_rel_lock *) lfirst(l);
+
+        RecoveryLockList = list_delete_ptr(RecoveryLockList, lock);
+        pfree(lock);
+    }
+}
+
+/*
+ * Called during xact_commit_redo() and xact_commit_abort when InArchiveRecovery
+ * to remove any AccessExclusiveLocks requested by a transaction.
+ *
+ * Remove the lock tree, starting at xid down, from the RecoveryLockList.
+ */
+void
+RelationReleaseRecoveryLockTree(TransactionId xid, int nsubxids, TransactionId *subxids)
+{
+    int i;
+
+    RelationRemoveRecoveryLocks(xid);
+
+    for (i = 0; i < nsubxids; i++)
+        RelationRemoveRecoveryLocks(subxids[i]);
+}
+
+/*
+ * Called at end of recovery and when we see a shutdown checkpoint.
+ */
+void
+RelationClearRecoveryLocks(void)
+{
+    elog(trace_recovery(DEBUG2), "clearing recovery locks");
+    RelationRemoveRecoveryLocks(InvalidTransactionId);
+}
+
+/*
+ * --------------------------------------------------
+ *         Recovery handling for Rmgr RM_RELATION_ID
+ * --------------------------------------------------
+ */
+
+/*
+ * Redo for relation lock messages
+ */
+static void
+relation_redo_lock(xl_rel_lock *xlrec)
+{
+    RelationAddRecoveryLock(xlrec);
+}
+
+/*
+ * Redo for relation invalidation messages
+ */
+static void
+relation_redo_inval(xl_rel_inval *xlrec)
+{
+    SharedInvalidationMessage *msgs = &(xlrec->msgs[0]);
+    int        nmsgs = xlrec->nmsgs;
+
+    Assert(nmsgs > 0);        /* else we should not have written a record */
+
+    /*
+     * Smack them straight onto the queue and we're done. This is safe
+     * because the only writer of these messages is non-transactional
+     * invalidation.
+     */
+    SendSharedInvalidMessages(msgs, nmsgs);
+}
+
+void
+relation_redo(XLogRecPtr lsn, XLogRecord *record)
+{
+    uint8        info = record->xl_info & ~XLR_INFO_MASK;
+
+    if (InArchiveRecovery)
+        (void) RecordKnownAssignedTransactionIds(lsn, record->xl_xid);
+
+    if (info == XLOG_RELATION_INVAL)
+    {
+        xl_rel_inval *xlrec = (xl_rel_inval *) XLogRecGetData(record);
+
+        relation_redo_inval(xlrec);
+    }
+    else if (info == XLOG_RELATION_LOCK)
+    {
+        xl_rel_lock *xlrec = (xl_rel_lock *) XLogRecGetData(record);
+
+        relation_redo_lock(xlrec);
+    }
+    else
+        elog(PANIC, "relation_redo: unknown op code %u", info);
+}
+
+static void
+relation_desc_inval(StringInfo buf, xl_rel_inval *xlrec)
+{
+    SharedInvalidationMessage *msgs = &(xlrec->msgs[0]);
+    int                            nmsgs = xlrec->nmsgs;
+
+    appendStringInfo(buf, "nmsgs %d;", nmsgs);
+
+    if (nmsgs > 0)
+    {
+        int i;
+
+        for (i = 0; i < nmsgs; i++)
+        {
+            SharedInvalidationMessage *msg = msgs + i;
+
+            if (msg->id >= 0)
+                appendStringInfo(buf,  "catcache id %d", msg->id);
+            else if (msg->id == SHAREDINVALRELCACHE_ID)
+                appendStringInfo(buf,  "relcache ");
+            else if (msg->id == SHAREDINVALSMGR_ID)
+                appendStringInfo(buf,  "smgr ");
+        }
+    }
+}
+
+void
+relation_desc(StringInfo buf, uint8 xl_info, char *rec)
+{
+    uint8        info = xl_info & ~XLR_INFO_MASK;
+
+    if (info == XLOG_RELATION_INVAL)
+    {
+        xl_rel_inval *xlrec = (xl_rel_inval *) rec;
+
+        appendStringInfo(buf, "inval: ");
+        relation_desc_inval(buf, xlrec);
+    }
+    else if (info == XLOG_RELATION_LOCK)
+    {
+        xl_rel_lock *xlrec = (xl_rel_lock *) rec;
+
+        appendStringInfo(buf, "exclusive relation lock: xid %u db %d rel %d",
+                                xlrec->xid, xlrec->dbOid, xlrec->relOid);
+    }
+    else
+        appendStringInfo(buf, "UNKNOWN");
+}
diff --git a/src/backend/utils/error/elog.c b/src/backend/utils/error/elog.c
index a33c94e..67adc7a 100644
--- a/src/backend/utils/error/elog.c
+++ b/src/backend/utils/error/elog.c
@@ -2579,3 +2579,20 @@ is_log_level_output(int elevel, int log_min_level)

     return false;
 }
+
+/*
+ * If trace_recovery_messages is set to make this visible, then show as LOG,
+ * else display as whatever level is set. It may still be shown, but only
+ * if log_min_messages is set lower than trace_recovery_messages.
+ *
+ * Intention is to keep this for at least the whole of the 8.4 production
+ * release, so we can more easily diagnose production problems in the field.
+ */
+int
+trace_recovery(int trace_level)
+{
+    if (trace_level >= trace_recovery_messages)
+        return LOG;
+
+    return trace_level;
+}
diff --git a/src/backend/utils/init/flatfiles.c b/src/backend/utils/init/flatfiles.c
index 9dbc53c..f81421a 100644
--- a/src/backend/utils/init/flatfiles.c
+++ b/src/backend/utils/init/flatfiles.c
@@ -678,9 +678,10 @@ write_auth_file(Relation rel_authid, Relation rel_authmem)
 /*
  * This routine is called once during database startup, after completing
  * WAL replay if needed.  Its purpose is to sync the flat files with the
- * current state of the database tables.  This is particularly important
- * during PITR operation, since the flat files will come from the
- * base backup which may be far out of sync with the current state.
+ * current state of the database tables.
+ *
+ * In 8.4 we also run this during xact_redo_commit() if the transaction
+ * wrote a new database or auth flat file.
  *
  * In theory we could skip rebuilding the flat files if no WAL replay
  * occurred, but it seems best to just do it always.  We have to
@@ -716,8 +717,6 @@ BuildFlatFiles(bool database_only)
     /*
      * We don't have any hope of running a real relcache, but we can use the
      * same fake-relcache facility that WAL replay uses.
-     *
-     * No locking is needed because no one else is alive yet.
      */
     rel_db = CreateFakeRelcacheEntry(rnode);
     write_database_file(rel_db, true);
@@ -744,6 +743,12 @@ BuildFlatFiles(bool database_only)

     CurrentResourceOwner = NULL;
     ResourceOwnerDelete(owner);
+
+    /*
+     * Signal the postmaster to reload its caches.
+     */
+    if (IsUnderPostmaster)
+        SendPostmasterSignal(PMSIGNAL_PASSWORD_CHANGE);
 }


@@ -832,14 +837,14 @@ AtEOXact_UpdateFlatFiles(bool isCommit)
     /* Okay to write the files */
     if (database_file_update_subid != InvalidSubTransactionId)
     {
-        database_file_update_subid = InvalidSubTransactionId;
+        /* reset database_file_update_subid later during commit */
         write_database_file(drel, false);
         heap_close(drel, NoLock);
     }

     if (auth_file_update_subid != InvalidSubTransactionId)
     {
-        auth_file_update_subid = InvalidSubTransactionId;
+        /* reset auth_file_update_subid later during commit */
         write_auth_file(arel, mrel);
         heap_close(arel, NoLock);
         heap_close(mrel, NoLock);
@@ -859,6 +864,30 @@ AtEOXact_UpdateFlatFiles(bool isCommit)
     ForceSyncCommit();
 }

+/*
+ * Exported to allow transaction commit to set flags to perform flat file
+ * update in redo. Reset per-transaction flags. For abort case they were
+ * already set during AtEOXact_UpdateFlatFiles().
+ */
+bool
+AtEOXact_Database_FlatFile_Update_Needed(void)
+{
+    bool result = TransactionIdIsValid(database_file_update_subid);
+
+    database_file_update_subid = InvalidSubTransactionId;
+
+    return result;
+}
+
+bool
+AtEOXact_Auth_FlatFile_Update_Needed(void)
+{
+    bool result = TransactionIdIsValid(auth_file_update_subid);
+
+    auth_file_update_subid = InvalidSubTransactionId;
+
+    return result;
+}

 /*
  * This routine is called during transaction prepare.
diff --git a/src/backend/utils/init/postinit.c b/src/backend/utils/init/postinit.c
index b359395..8bc5909 100644
--- a/src/backend/utils/init/postinit.c
+++ b/src/backend/utils/init/postinit.c
@@ -440,7 +440,7 @@ InitPostgres(const char *in_dbname, Oid dboid, const char *username,
      */
     MyBackendId = InvalidBackendId;

-    SharedInvalBackendInit();
+    SharedInvalBackendInit(false);

     if (MyBackendId > MaxBackends || MyBackendId <= 0)
         elog(FATAL, "bad backend id: %d", MyBackendId);
@@ -452,10 +452,11 @@ InitPostgres(const char *in_dbname, Oid dboid, const char *username,

     /*
      * Initialize local process's access to XLOG.  In bootstrap case we may
-     * skip this since StartupXLOG() was run instead.
+     * skip this since StartupXLOG() was run instead. InitXLOGAccess() will
+     * be called here if we are not in recovery processing mode.
      */
     if (!bootstrap)
-        InitXLOGAccess();
+        (void) RecoveryInProgress();

     /*
      * Initialize the relation cache and the system catalog caches.  Note that
@@ -489,9 +490,15 @@ InitPostgres(const char *in_dbname, Oid dboid, const char *username,
      * Start a new transaction here before first access to db, and get a
      * snapshot.  We don't have a use for the snapshot itself, but we're
      * interested in the secondary effect that it sets RecentGlobalXmin.
+     * If we are connecting during recovery, make sure the initial
+     * transaction is read only and force all subsequent transactions
+     * that way also.
      */
     if (!bootstrap)
     {
+        if (RecoveryInProgress())
+            SetConfigOption("default_transaction_read_only", "true",
+                PGC_POSTMASTER, PGC_S_OVERRIDE);
         StartTransactionCommand();
         (void) GetTransactionSnapshot();
     }
@@ -515,7 +522,7 @@ InitPostgres(const char *in_dbname, Oid dboid, const char *username,
      */
     if (!bootstrap)
         LockSharedObject(DatabaseRelationId, MyDatabaseId, 0,
-                         RowExclusiveLock);
+                (RecoveryInProgress() ? AccessShareLock : RowExclusiveLock));

     /*
      * Recheck the flat file copy of pg_database to make sure the target
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index 90f077a..bd44494 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -115,6 +115,8 @@ extern char *temp_tablespaces;
 extern bool synchronize_seqscans;
 extern bool fullPageWrites;

+int    trace_recovery_messages = DEBUG1; /* XXXHS set to LOG for production */
+
 #ifdef TRACE_SORT
 extern bool trace_sort;
 #endif
@@ -2635,6 +2637,16 @@ static struct config_enum ConfigureNamesEnum[] =
     },

     {
+        {"trace_recovery_messages", PGC_SUSET, LOGGING_WHEN,
+            gettext_noop("Sets the message levels that are logged during recovery."),
+            gettext_noop("Each level includes all the levels that follow it. The later"
+                         " the level, the fewer messages are sent.")
+        },
+        &trace_recovery_messages,
+        DEBUG1, server_message_level_options, NULL, NULL
+    },
+
+    {
         {"track_functions", PGC_SUSET, STATS_COLLECTOR,
             gettext_noop("Collects function-level statistics on database activity."),
             NULL
@@ -5501,8 +5513,19 @@ ExecSetVariableStmt(VariableSetStmt *stmt)
                         SetPGVariable("transaction_isolation",
                                       list_make1(item->arg), stmt->is_local);
                     else if (strcmp(item->defname, "transaction_read_only") == 0)
+                    {
+                        A_Const       *con;
+
+                        Assert(IsA(item->arg, A_Const));
+                        con = (A_Const *) item->arg;
+                        Assert(nodeTag(&con->val) == T_Integer);
+
+                        if (!intVal(&con->val))
+                            PreventCommandDuringRecovery();
+
                         SetPGVariable("transaction_read_only",
                                       list_make1(item->arg), stmt->is_local);
+                    }
                     else
                         elog(ERROR, "unexpected SET TRANSACTION element: %s",
                              item->defname);
@@ -5520,8 +5543,19 @@ ExecSetVariableStmt(VariableSetStmt *stmt)
                         SetPGVariable("default_transaction_isolation",
                                       list_make1(item->arg), stmt->is_local);
                     else if (strcmp(item->defname, "transaction_read_only") == 0)
+                    {
+                        A_Const       *con;
+
+                        Assert(IsA(item->arg, A_Const));
+                        con = (A_Const *) item->arg;
+                        Assert(nodeTag(&con->val) == T_Integer);
+
+                        if (!intVal(&con->val))
+                            PreventCommandDuringRecovery();
+
                         SetPGVariable("default_transaction_read_only",
                                       list_make1(item->arg), stmt->is_local);
+                    }
                     else
                         elog(ERROR, "unexpected SET SESSION element: %s",
                              item->defname);
diff --git a/src/backend/utils/time/snapmgr.c b/src/backend/utils/time/snapmgr.c
index 9992895..7dd8a70 100644
--- a/src/backend/utils/time/snapmgr.c
+++ b/src/backend/utils/time/snapmgr.c
@@ -27,6 +27,7 @@

 #include "access/transam.h"
 #include "access/xact.h"
+#include "storage/bufmgr.h"
 #include "storage/proc.h"
 #include "storage/procarray.h"
 #include "utils/memutils.h"
@@ -433,7 +434,11 @@ static void
 SnapshotResetXmin(void)
 {
     if (RegisteredSnapshots == 0 && ActiveSnapshot == NULL)
+    {
         MyProc->xmin = InvalidTransactionId;
+        MyProc->rconflicts.nConflicts = 0;
+        /* Don't bother to reset other aspects of RecoveryConflictCache */
+    }
 }

 /*
diff --git a/src/backend/utils/time/tqual.c b/src/backend/utils/time/tqual.c
index dbfbb02..e54f537 100644
--- a/src/backend/utils/time/tqual.c
+++ b/src/backend/utils/time/tqual.c
@@ -86,7 +86,7 @@ static inline void
 SetHintBits(HeapTupleHeader tuple, Buffer buffer,
             uint16 infomask, TransactionId xid)
 {
-    if (TransactionIdIsValid(xid))
+    if (!RecoveryInProgress() && TransactionIdIsValid(xid))
     {
         /* NB: xid must be known committed here! */
         XLogRecPtr    commitLSN = TransactionIdGetCommitLSN(xid);
@@ -1252,13 +1252,13 @@ XidInMVCCSnapshot(TransactionId xid, Snapshot snapshot)
         for (j = 0; j < snapshot->subxcnt; j++)
         {
             if (TransactionIdEquals(xid, snapshot->subxip[j]))
-                return true;
+            return true;
         }
-
+
         /* not there, fall through to search xip[] */
-    }
+      }
     else
-    {
+      {
         /* overflowed, so convert xid to top-level */
         xid = SubTransGetTopmostTransaction(xid);

diff --git a/src/include/access/heapam.h b/src/include/access/heapam.h
index a5d9769..e3f94ed 100644
--- a/src/include/access/heapam.h
+++ b/src/include/access/heapam.h
@@ -130,11 +130,13 @@ extern void heap2_desc(StringInfo buf, uint8 xl_info, char *rec);
 extern XLogRecPtr log_heap_move(Relation reln, Buffer oldbuf,
               ItemPointerData from,
               Buffer newbuf, HeapTuple newtup);
+extern XLogRecPtr log_heap_cleanup_info(RelFileNode rnode,
+              TransactionId latestRemovedXid);
 extern XLogRecPtr log_heap_clean(Relation reln, Buffer buffer,
                OffsetNumber *redirected, int nredirected,
                OffsetNumber *nowdead, int ndead,
                OffsetNumber *nowunused, int nunused,
-               bool redirect_move);
+               TransactionId latestRemovedXid, bool redirect_move);
 extern XLogRecPtr log_heap_freeze(Relation reln, Buffer buffer,
                 TransactionId cutoff_xid,
                 OffsetNumber *offsets, int offcnt);
diff --git a/src/include/access/htup.h b/src/include/access/htup.h
index 54264bd..96fb89d 100644
--- a/src/include/access/htup.h
+++ b/src/include/access/htup.h
@@ -580,6 +580,7 @@ typedef HeapTupleData *HeapTuple;
 #define XLOG_HEAP2_FREEZE        0x00
 #define XLOG_HEAP2_CLEAN        0x10
 #define XLOG_HEAP2_CLEAN_MOVE    0x20
+#define XLOG_HEAP2_CLEANUP_INFO 0x30

 /*
  * All what we need to find changed tuple
@@ -668,6 +669,7 @@ typedef struct xl_heap_clean
 {
     RelFileNode node;
     BlockNumber block;
+    TransactionId    latestRemovedXid;
     uint16        nredirected;
     uint16        ndead;
     /* OFFSET NUMBERS FOLLOW */
@@ -675,6 +677,19 @@ typedef struct xl_heap_clean

 #define SizeOfHeapClean (offsetof(xl_heap_clean, ndead) + sizeof(uint16))

+/*
+ * Cleanup_info is required in some cases during a lazy VACUUM.
+ * Used for reporting the results of HeapTupleHeaderAdvanceLatestRemovedXid()
+ * see vacuumlazy.c for full explanation
+ */
+typedef struct xl_heap_cleanup_info
+{
+    RelFileNode     node;
+    TransactionId    latestRemovedXid;
+} xl_heap_cleanup_info;
+
+#define SizeOfHeapCleanupInfo (sizeof(xl_heap_cleanup_info))
+
 /* This is for replacing a page's contents in toto */
 /* NB: this is used for indexes as well as heaps */
 typedef struct xl_heap_newpage
@@ -718,6 +733,9 @@ typedef struct xl_heap_freeze

 #define SizeOfHeapFreeze (offsetof(xl_heap_freeze, cutoff_xid) + sizeof(TransactionId))

+extern void HeapTupleHeaderAdvanceLatestRemovedXid(HeapTupleHeader tuple,
+                                        TransactionId *latestRemovedXid);
+
 /* HeapTupleHeader functions implemented in utils/time/combocid.c */
 extern CommandId HeapTupleHeaderGetCmin(HeapTupleHeader tup);
 extern CommandId HeapTupleHeaderGetCmax(HeapTupleHeader tup);
diff --git a/src/include/access/nbtree.h b/src/include/access/nbtree.h
index 2df34f5..8028fce 100644
--- a/src/include/access/nbtree.h
+++ b/src/include/access/nbtree.h
@@ -214,12 +214,13 @@ typedef struct BTMetaPageData
 #define XLOG_BTREE_SPLIT_R        0x40    /* as above, new item on right */
 #define XLOG_BTREE_SPLIT_L_ROOT 0x50    /* add tuple with split of root */
 #define XLOG_BTREE_SPLIT_R_ROOT 0x60    /* as above, new item on right */
-#define XLOG_BTREE_DELETE        0x70    /* delete leaf index tuple */
+#define XLOG_BTREE_DELETE        0x70    /* delete leaf index tuples for a page */
 #define XLOG_BTREE_DELETE_PAGE    0x80    /* delete an entire page */
 #define XLOG_BTREE_DELETE_PAGE_META 0x90        /* same, and update metapage */
 #define XLOG_BTREE_NEWROOT        0xA0    /* new root page */
 #define XLOG_BTREE_DELETE_PAGE_HALF 0xB0        /* page deletion that makes
                                                  * parent half-dead */
+#define XLOG_BTREE_VACUUM        0xC0    /* delete entries on a page during vacuum */

 /*
  * All that we need to find changed index tuple
@@ -306,16 +307,53 @@ typedef struct xl_btree_split
 /*
  * This is what we need to know about delete of individual leaf index tuples.
  * The WAL record can represent deletion of any number of index tuples on a
- * single index page.
+ * single index page when *not* executed by VACUUM.
  */
 typedef struct xl_btree_delete
 {
     RelFileNode node;
     BlockNumber block;
+    TransactionId    latestRemovedXid;
+    int            numItems;         /* number of items in the offset array */
+
     /* TARGET OFFSET NUMBERS FOLLOW AT THE END */
 } xl_btree_delete;

-#define SizeOfBtreeDelete    (offsetof(xl_btree_delete, block) + sizeof(BlockNumber))
+#define SizeOfBtreeDelete    (offsetof(xl_btree_delete, latestRemovedXid) + sizeof(TransactionId))
+
+/*
+ * This is what we need to know about vacuum of individual leaf index tuples.
+ * The WAL record can represent deletion of any number of index tuples on a
+ * single index page when executed by VACUUM.
+ *
+ * The correctness requirement for applying these changes during recovery is
+ * that we must do one of these two things for every block in the index:
+ *         * lock the block for cleanup and apply any required changes
+ *        * EnsureBlockUnpinned()
+ * The purpose of this is to ensure that no index scans started before we
+ * finish scanning the index are still running by the time we begin to remove
+ * heap tuples.
+ *
+ * Any changes to any one block are registered on just one WAL record. All
+ * blocks that we need to run EnsureBlockUnpinned() before we touch the changed
+ * block are also given on this record as a variable length array. The array
+ * is compressed by way of storing an array of block ranges, rather than an
+ * actual array of blockids.
+ *
+ * Note that the *last* WAL record in any vacuum of an index is allowed to
+ * have numItems == 0. All other WAL records must have numItems > 0.
+ */
+typedef struct xl_btree_vacuum
+{
+    RelFileNode node;
+    BlockNumber block;
+    BlockNumber lastBlockVacuumed;
+    int            numItems;         /* number of items in the offset array */
+
+    /* TARGET OFFSET NUMBERS FOLLOW */
+} xl_btree_vacuum;
+
+#define SizeOfBtreeVacuum    (offsetof(xl_btree_vacuum, lastBlockVacuumed) + sizeof(BlockNumber))

 /*
  * This is what we need to know about deletion of a btree page.  The target
@@ -498,6 +536,10 @@ typedef BTScanOpaqueData *BTScanOpaque;
 #define SK_BT_DESC            (INDOPTION_DESC << SK_BT_INDOPTION_SHIFT)
 #define SK_BT_NULLS_FIRST    (INDOPTION_NULLS_FIRST << SK_BT_INDOPTION_SHIFT)

+/* XXX probably needs new RMgr call to do this cleanly */
+extern bool btree_is_cleanup_record(uint8 info);
+extern bool btree_needs_cleanup_lock(uint8 info);
+
 /*
  * prototypes for functions in nbtree.c (external entry points for btree)
  */
@@ -537,7 +579,8 @@ extern void _bt_relbuf(Relation rel, Buffer buf);
 extern void _bt_pageinit(Page page, Size size);
 extern bool _bt_page_recyclable(Page page);
 extern void _bt_delitems(Relation rel, Buffer buf,
-             OffsetNumber *itemnos, int nitems);
+             OffsetNumber *itemnos, int nitems, bool isVacuum,
+             BlockNumber lastBlockVacuumed);
 extern int _bt_pagedel(Relation rel, Buffer buf,
             BTStack stack, bool vacuum_full);

diff --git a/src/include/access/relscan.h b/src/include/access/relscan.h
index 47b95c2..55cb8d3 100644
--- a/src/include/access/relscan.h
+++ b/src/include/access/relscan.h
@@ -68,6 +68,7 @@ typedef struct IndexScanDescData
     /* signaling to index AM about killing index tuples */
     bool        kill_prior_tuple;        /* last-returned tuple is dead */
     bool        ignore_killed_tuples;    /* do not return killed entries */
+    bool        xactStartedInRecovery;    /* prevents killing/seeing killed tuples */

     /* index access method's private state */
     void       *opaque;            /* access-method-specific info */
diff --git a/src/include/access/rmgr.h b/src/include/access/rmgr.h
index 5702f5f..8ab1148 100644
--- a/src/include/access/rmgr.h
+++ b/src/include/access/rmgr.h
@@ -23,6 +23,7 @@ typedef uint8 RmgrId;
 #define RM_DBASE_ID                4
 #define RM_TBLSPC_ID            5
 #define RM_MULTIXACT_ID            6
+#define RM_RELATION_ID            8
 #define RM_HEAP2_ID                9
 #define RM_HEAP_ID                10
 #define RM_BTREE_ID                11
diff --git a/src/include/access/subtrans.h b/src/include/access/subtrans.h
index 6ff25fc..6a19621 100644
--- a/src/include/access/subtrans.h
+++ b/src/include/access/subtrans.h
@@ -11,6 +11,9 @@
 #ifndef SUBTRANS_H
 #define SUBTRANS_H

+/* included solely to allow recovery-code to access InRecovery state */
+#include "access/xlog.h"
+
 /* Number of SLRU buffers to use for subtrans */
 #define NUM_SUBTRANS_BUFFERS    32

diff --git a/src/include/access/transam.h b/src/include/access/transam.h
index 2b796b6..b625e3e 100644
--- a/src/include/access/transam.h
+++ b/src/include/access/transam.h
@@ -129,6 +129,9 @@ typedef VariableCacheData *VariableCache;
  * ----------------
  */

+/* in transam/xact.c */
+extern bool TransactionStartedDuringRecovery(void);
+
 /* in transam/varsup.c */
 extern VariableCache ShmemVariableCache;

diff --git a/src/include/access/twophase.h b/src/include/access/twophase.h
index 64dba0c..d09eb24 100644
--- a/src/include/access/twophase.h
+++ b/src/include/access/twophase.h
@@ -42,6 +42,8 @@ extern void EndPrepare(GlobalTransaction gxact);
 extern TransactionId PrescanPreparedTransactions(void);
 extern void RecoverPreparedTransactions(void);

+extern void ProcessTwoPhaseStandbyRecords(TransactionId xid);
+
 extern void RecreateTwoPhaseFile(TransactionId xid, void *content, int len);
 extern void RemoveTwoPhaseFile(TransactionId xid, bool giveWarning);

diff --git a/src/include/access/twophase_rmgr.h b/src/include/access/twophase_rmgr.h
index 5d6c028..890e9c2 100644
--- a/src/include/access/twophase_rmgr.h
+++ b/src/include/access/twophase_rmgr.h
@@ -29,6 +29,7 @@ typedef uint8 TwoPhaseRmgrId;
 #define TWOPHASE_RM_PGSTAT_ID        5
 #define TWOPHASE_RM_MAX_ID            TWOPHASE_RM_PGSTAT_ID

+extern const TwoPhaseCallback twophase_postcommit_standby_callbacks[];
 extern const TwoPhaseCallback twophase_recover_callbacks[];
 extern const TwoPhaseCallback twophase_postcommit_callbacks[];
 extern const TwoPhaseCallback twophase_postabort_callbacks[];
diff --git a/src/include/access/xact.h b/src/include/access/xact.h
index f255d88..bcc96a2 100644
--- a/src/include/access/xact.h
+++ b/src/include/access/xact.h
@@ -17,6 +17,7 @@
 #include "access/xlog.h"
 #include "nodes/pg_list.h"
 #include "storage/relfilenode.h"
+#include "utils/snapshot.h"
 #include "utils/timestamp.h"


@@ -84,18 +85,60 @@ typedef void (*SubXactCallback) (SubXactEvent event, SubTransactionId mySubid,
 #define XLOG_XACT_ABORT                0x20
 #define XLOG_XACT_COMMIT_PREPARED    0x30
 #define XLOG_XACT_ABORT_PREPARED    0x40
+#define XLOG_XACT_ASSIGNMENT        0x50
+#define XLOG_XACT_RUNNING_XACTS        0x60
+/* 0x70 can also be used, if required */
+
+typedef struct xl_xact_assignment
+{
+    TransactionId    xtop;        /* assigned xids top-level xid, if any */
+    TransactionId    xsub[1];    /* assigned subxids */
+} xl_xact_assignment;
+
+#define MinSizeOfXactAssignment offsetof(xl_xact_assignment, xsub)
+
+/*
+ * xl_xact_running_xacts is in utils/snapshot.h so it can be passed
+ * around to the same places as snapshots. Not snapmgr.h
+ */

 typedef struct xl_xact_commit
 {
-    TimestampTz xact_time;        /* time of commit */
-    int            nrels;            /* number of RelFileNodes */
-    int            nsubxacts;        /* number of subtransaction XIDs */
-    /* Array of RelFileNode(s) to drop at commit */
-    RelFileNode    xnodes[1];        /* VARIABLE LENGTH ARRAY */
-    /* ARRAY OF COMMITTED SUBTRANSACTION XIDs FOLLOWS */
+      TimestampTz xact_time;        /* time of commit */
+     uint32        xinfo;            /* info flags */
+      int            nrels;            /* number of RelFileForks */
+      int            nsubxacts;        /* number of subtransaction XIDs */
+    int            nmsgs;            /* number of shared inval msgs */
+      /* Array of RelFileFork(s) to drop at commit */
+      RelFileNode    xnodes[1];        /* VARIABLE LENGTH ARRAY */
+      /* ARRAY OF COMMITTED SUBTRANSACTION XIDs FOLLOWS */
+    /* ARRAY OF SHARED INVALIDATION MESSAGES FOLLOWS */
 } xl_xact_commit;

 #define MinSizeOfXactCommit offsetof(xl_xact_commit, xnodes)
+#define OffsetSharedInvalInXactCommit() \
+( \
+    MinSizeOfXactCommit +  \
+    (xlrec->nsubxacts * sizeof(TransactionId)) + \
+    (xlrec->nrels * sizeof(RelFileNode)) \
+)
+
+/*
+ * These flags are set in the xinfo fields of WAL commit records,
+ * indicating a variety of additional actions that need to occur
+ * when emulating transaction effects during recovery.
+ * They are named XactCompletion... to differentiate them from
+ * EOXact... routines which run at the end of the original
+ * transaction completion.
+ */
+#define XACT_COMPLETION_UPDATE_DB_FILE            0x01
+#define XACT_COMPLETION_UPDATE_AUTH_FILE        0x02
+#define XACT_COMPLETION_UPDATE_RELCACHE_FILE    0x04
+
+/* Access macros for above flags */
+#define XactCompletionUpdateDBFile(xlrec)             ((xlrec)->xinfo & XACT_COMPLETION_UPDATE_DB_FILE)
+#define XactCompletionUpdateAuthFile(xlrec)         ((xlrec)->xinfo & XACT_COMPLETION_UPDATE_AUTH_FILE)
+#define XactCompletionRelcacheInitFileInval(xlrec)    ((xlrec)->xinfo & XACT_COMPLETION_UPDATE_RELCACHE_FILE)

 typedef struct xl_xact_abort
 {
@@ -106,6 +149,7 @@ typedef struct xl_xact_abort
     RelFileNode    xnodes[1];        /* VARIABLE LENGTH ARRAY */
     /* ARRAY OF ABORTED SUBTRANSACTION XIDs FOLLOWS */
 } xl_xact_abort;
+/* Note the intentional lack of an invalidation message array c.f. commit */

 #define MinSizeOfXactAbort offsetof(xl_xact_abort, xnodes)

@@ -185,6 +229,13 @@ extern TransactionId RecordTransactionCommit(void);

 extern int    xactGetCommittedChildren(TransactionId **ptr);

+extern void LogCurrentRunningXacts(void);
+extern void InitRecoveryTransactionEnvironment(void);
+extern void XactClearRecoveryTransactions(void);
+/* XXX: Double definition, in procarray.h too! */
+extern bool RecordKnownAssignedTransactionIds(XLogRecPtr lsn, TransactionId xid);
+extern bool LatestRemovedXidAdvances(TransactionId latestXid);
+
 extern void xact_redo(XLogRecPtr lsn, XLogRecord *record);
 extern void xact_desc(StringInfo buf, uint8 xl_info, char *rec);

diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h
index f8720bb..1774172 100644
--- a/src/include/access/xlog.h
+++ b/src/include/access/xlog.h
@@ -18,7 +18,9 @@
 #include "utils/pg_crc.h"
 #include "utils/timestamp.h"

-
+/* Handy constant for an invalid xlog recptr */
+static const XLogRecPtr InvalidXLogRecPtr = {0, 0};
+#define XLogRecPtrIsValid(xp)    (!(xp.xlogid ==0 && xp.xrecoff == 0))
 /*
  * The overall layout of an XLOG record is:
  *        Fixed-size header (XLogRecord struct)
@@ -46,10 +48,10 @@ typedef struct XLogRecord
     TransactionId xl_xid;        /* xact id */
     uint32        xl_tot_len;        /* total len of entire record */
     uint32        xl_len;            /* total len of rmgr data */
-    uint8        xl_info;        /* flag bits, see below */
+    uint8        xl_info;        /* flag bits, see below (XLR_ entries) */
     RmgrId        xl_rmid;        /* resource manager for this record */

-    /* Depending on MAXALIGN, there are either 2 or 6 wasted bytes here */
+    /* Above structure has 2 bytes spare in both 4 byte and 8 byte alignment */

     /* ACTUAL LOG DATA FOLLOWS AT END OF STRUCT */

@@ -133,7 +135,17 @@ typedef struct XLogRecData
 } XLogRecData;

 extern TimeLineID ThisTimeLineID;        /* current TLI */
-extern bool InRecovery;
+
+/*
+ * Prior to 8.4, all activity during recovery were carried out by Startup
+ * process. This local variable continues to be used in many parts of the
+ * code to indicate actions taken by RecoveryManagers. Other processes who
+ * potentially perform work during recovery should check
+ * IsRecoveryProcessingMode(), see XLogCtl notes in xlog.c
+ */
+extern bool InRecovery;
+extern bool InArchiveRecovery;
+extern bool InHotStandby;
 extern XLogRecPtr XactLastRecEnd;

 /* these variables are GUC parameters related to XLOG */
@@ -143,6 +155,7 @@ extern bool XLogArchiveMode;
 extern char *XLogArchiveCommand;
 extern int    XLogArchiveTimeout;
 extern bool log_checkpoints;
+extern int maxStandbyDelay;

 #define XLogArchivingActive()    (XLogArchiveMode)
 #define XLogArchiveCommandSet() (XLogArchiveCommand[0] != '\0')
@@ -200,6 +213,9 @@ extern void xlog_redo(XLogRecPtr lsn, XLogRecord *record);
 extern void xlog_desc(StringInfo buf, uint8 xl_info, char *rec);

 extern bool RecoveryInProgress(void);
+extern int GetLatestReplicationDelay(void);
+
+extern void RestoreBkpBlocks(XLogRecPtr lsn, XLogRecord *record, bool cleanup);

 extern void UpdateControlFile(void);
 extern Size XLOGShmemSize(void);
diff --git a/src/include/access/xlog_internal.h b/src/include/access/xlog_internal.h
index 5675bfb..f6b1ca5 100644
--- a/src/include/access/xlog_internal.h
+++ b/src/include/access/xlog_internal.h
@@ -71,7 +71,7 @@ typedef struct XLogContRecord
 /*
  * Each page of XLOG file has a header like this:
  */
-#define XLOG_PAGE_MAGIC 0xD063    /* can be used as WAL version indicator */
+#define XLOG_PAGE_MAGIC 0x5352    /* can be used as WAL version indicator */

 typedef struct XLogPageHeaderData
 {
@@ -255,5 +255,18 @@ extern Datum pg_current_xlog_location(PG_FUNCTION_ARGS);
 extern Datum pg_current_xlog_insert_location(PG_FUNCTION_ARGS);
 extern Datum pg_xlogfile_name_offset(PG_FUNCTION_ARGS);
 extern Datum pg_xlogfile_name(PG_FUNCTION_ARGS);
+extern Datum pg_recovery_continue(PG_FUNCTION_ARGS);
+extern Datum pg_recovery_pause(PG_FUNCTION_ARGS);
+extern Datum pg_recovery_pause_cleanup(PG_FUNCTION_ARGS);
+extern Datum pg_recovery_pause_xid(PG_FUNCTION_ARGS);
+extern Datum pg_recovery_pause_time(PG_FUNCTION_ARGS);
+extern Datum pg_recovery_pause_lsn(PG_FUNCTION_ARGS);
+extern Datum pg_recovery_advance(PG_FUNCTION_ARGS);
+extern Datum pg_recovery_stop(PG_FUNCTION_ARGS);
+extern Datum pg_current_recovery_target(PG_FUNCTION_ARGS);
+extern Datum pg_is_in_recovery(PG_FUNCTION_ARGS);
+extern Datum pg_last_recovered_xact_timestamp(PG_FUNCTION_ARGS);
+extern Datum pg_last_recovered_xid(PG_FUNCTION_ARGS);
+extern Datum pg_last_recovered_xlog_location(PG_FUNCTION_ARGS);

 #endif   /* XLOG_INTERNAL_H */
diff --git a/src/include/catalog/pg_control.h b/src/include/catalog/pg_control.h
index 400f32c..8174294 100644
--- a/src/include/catalog/pg_control.h
+++ b/src/include/catalog/pg_control.h
@@ -101,6 +101,10 @@ typedef struct ControlFileData

     CheckPoint    checkPointCopy; /* copy of last check point record */

+    /*
+     * Next two sound very similar, yet are distinct and necessary.
+     * Check comments in xlog.c for a full explanation not easily repeated.
+     */
     XLogRecPtr    minRecoveryPoint;        /* must replay xlog to here */

     /*
diff --git a/src/include/catalog/pg_proc.h b/src/include/catalog/pg_proc.h
index 9a054a2..c07989b 100644
--- a/src/include/catalog/pg_proc.h
+++ b/src/include/catalog/pg_proc.h
@@ -3267,6 +3267,33 @@ DESCR("xlog filename and byte offset, given an xlog location");
 DATA(insert OID = 2851 ( pg_xlogfile_name            PGNSP PGUID 12 1 0 0 f f f t f i 1 0 25 "25" _null_ _null_ _null_
_null_pg_xlogfile_name _null_ _null_ _null_ )); 
 DESCR("xlog filename, given an xlog location");

+DATA(insert OID = 3801 (  pg_recovery_continue        PGNSP PGUID 12 1 0 0 f f f t f v 0 0 2278 "" _null_ _null_
_null__null_ pg_recovery_continue _null_ _null_ _null_ )); 
+DESCR("if recovery is paused, continue with recovery");
+DATA(insert OID = 3802 (  pg_recovery_pause        PGNSP PGUID 12 1 0 0 f f f t f v 0 0 2278 "" _null_ _null_ _null_
_null_pg_recovery_pause _null_ _null_ _null_ )); 
+DESCR("pause recovery until recovery target reset");
+
+DATA(insert OID = 3804 (  pg_recovery_pause_xid        PGNSP PGUID 12 1 0 0 f f f t f v 1 0 2278 "23" _null_ _null_
_null__null_ pg_recovery_pause_xid _null_ _null_ _null_ )); 
+DESCR("continue recovery until specified xid completes, if ever seen, then pause recovery");
+DATA(insert OID = 3805 (  pg_recovery_pause_time        PGNSP PGUID 12 1 0 0 f f f t f v 1 0 2278 "1184" _null_ _null_
_null__null_ pg_recovery_pause_time _null_ _null_ _null_ )); 
+DESCR("continue recovery until a transaction with specified timestamp completes, if ever seen, then pause recovery");
+DATA(insert OID = 3806 (  pg_recovery_advance        PGNSP PGUID 12 1 0 0 f f f t f v 1 0 2278 "23" _null_ _null_
_null__null_ pg_recovery_advance _null_ _null_ _null_ )); 
+DESCR("continue recovery exactly specified number of records, then pause recovery");
+DATA(insert OID = 3807 (  pg_recovery_stop        PGNSP PGUID 12 1 0 0 f f f t f v 0 0 2278 "" _null_ _null_ _null_
_null_pg_recovery_stop _null_ _null_ _null_ )); 
+DESCR("stop recovery immediately");
+DATA(insert OID = 3808 (  pg_current_recovery_target        PGNSP PGUID 12 1 0 0 f f f t f v 0 0 25 "" _null_ _null_
_null__null_ pg_current_recovery_target _null_ _null_ _null_ )); 
+DESCR("get current recovery target state and target values, if any");
+DATA(insert OID = 3809 (  pg_recovery_pause_lsn        PGNSP PGUID 12 1 0 0 f f f t f v 2 0 2278 "23 23" _null_ _null_
_null__null_ pg_recovery_pause_lsn _null_ _null_ _null_ )); 
+DESCR("continue recovery until a transaction with specified timestamp completes, if ever seen, then pause recovery");
+
+DATA(insert OID = 3810 (  pg_is_in_recovery     PGNSP PGUID 12 1 0 0 f f f t f v 0 0 16 "" _null_ _null_ _null_ _null_
pg_is_in_recovery_null_ _null_ _null_ )); 
+DESCR("true if server is in recovery");
+DATA(insert OID = 3811 (  pg_last_recovered_xact_timestamp     PGNSP PGUID 12 1 0 0 f f f t f v 0 0 1184 "" _null_
_null__null_ _null_ pg_last_recovered_xact_timestamp _null_ _null_ _null_ )); 
+DESCR("timestamp of last commit or abort xlog record that arrived during recovery, if any");
+DATA(insert OID = 3812 (  pg_last_recovered_xid     PGNSP PGUID 12 1 0 0 f f f t f v 0 0 28 "" _null_ _null_ _null_
_null_pg_last_recovered_xid _null_ _null_ _null_ )); 
+DESCR("xid of last commit or abort xlog record that arrived during recovery, if any");
+DATA(insert OID = 3813 (  pg_last_recovered_xlog_location     PGNSP PGUID 12 1 0 0 f f f t f v 0 0 25 "" _null_ _null_
_null__null_ pg_last_recovered_xlog_location _null_ _null_ _null_ )); 
+DESCR("xlog location of last xlog record that arrived during recovery, if any");
+
 DATA(insert OID = 2621 ( pg_reload_conf            PGNSP PGUID 12 1 0 0 f f f t f v 0 0 16 "" _null_ _null_ _null_
_null_pg_reload_conf _null_ _null_ _null_ )); 
 DESCR("reload configuration files");
 DATA(insert OID = 2622 ( pg_rotate_logfile        PGNSP PGUID 12 1 0 0 f f f t f v 0 0 16 "" _null_ _null_ _null_
_null_pg_rotate_logfile _null_ _null_ _null_ )); 
diff --git a/src/include/miscadmin.h b/src/include/miscadmin.h
index 465261a..de849ca 100644
--- a/src/include/miscadmin.h
+++ b/src/include/miscadmin.h
@@ -235,6 +235,12 @@ extern bool VacuumCostActive;
 /* in tcop/postgres.c */
 extern void check_stack_depth(void);

+/* in tcop/utility.c */
+extern void PreventCommandDuringRecovery(void);
+
+/* in utils/misc/guc.c */
+extern int trace_recovery_messages;
+int trace_recovery(int trace_level);

 /*****************************************************************************
  *      pdir.h --                                                                 *
diff --git a/src/include/storage/bufmgr.h b/src/include/storage/bufmgr.h
index 42766e9..39ab4e8 100644
--- a/src/include/storage/bufmgr.h
+++ b/src/include/storage/bufmgr.h
@@ -68,6 +68,9 @@ extern PGDLLIMPORT int32 *LocalRefCount;
 #define BUFFER_LOCK_SHARE        1
 #define BUFFER_LOCK_EXCLUSIVE    2

+/* Not used by LockBuffer, but is used by XLogReadBuffer... */
+#define BUFFER_LOCK_CLEANUP        3
+
 /*
  * These routines are beaten on quite heavily, hence the macroization.
  */
@@ -169,6 +172,8 @@ extern void IncrBufferRefCount(Buffer buffer);
 extern Buffer ReleaseAndReadBuffer(Buffer buffer, Relation relation,
                      BlockNumber blockNum);

+extern bool SetBufferRecoveryConflictLSN(XLogRecPtr conflict_LSN);
+
 extern void InitBufferPool(void);
 extern void InitBufferPoolAccess(void);
 extern void InitBufferPoolBackend(void);
@@ -200,6 +205,10 @@ extern bool ConditionalLockBuffer(Buffer buffer);
 extern void LockBufferForCleanup(Buffer buffer);
 extern bool ConditionalLockBufferForCleanup(Buffer buffer);

+extern void StartCleanupDelayStats(void);
+extern void EndCleanupDelayStats(void);
+extern void ReportCleanupDelayStats(void);
+
 extern void AbortBufferIO(void);

 extern void BufmgrCommit(void);
diff --git a/src/include/storage/pmsignal.h b/src/include/storage/pmsignal.h
index 21b1e90..4f0432e 100644
--- a/src/include/storage/pmsignal.h
+++ b/src/include/storage/pmsignal.h
@@ -24,7 +24,6 @@ typedef enum
 {
     PMSIGNAL_RECOVERY_STARTED,    /* recovery has started */
     PMSIGNAL_RECOVERY_CONSISTENT, /* recovery has reached consistent state */
-    PMSIGNAL_RECOVERY_COMPLETED, /* recovery has completed */
     PMSIGNAL_PASSWORD_CHANGE,    /* pg_auth file has changed */
     PMSIGNAL_WAKEN_ARCHIVER,    /* send a NOTIFY signal to xlog archiver */
     PMSIGNAL_ROTATE_LOGFILE,    /* send SIGUSR1 to syslogger to rotate logfile */
diff --git a/src/include/storage/proc.h b/src/include/storage/proc.h
index 53a5c05..5c2cb2d 100644
--- a/src/include/storage/proc.h
+++ b/src/include/storage/proc.h
@@ -14,6 +14,7 @@
 #ifndef _PROC_H_
 #define _PROC_H_

+#include "access/xlog.h"
 #include "storage/lock.h"
 #include "storage/pg_sema.h"

@@ -38,6 +39,34 @@ struct XidCache
     TransactionId xids[PGPROC_MAX_CACHED_SUBXIDS];
 };

+/*
+ * Recovery conflict cache
+ */
+#define PGPROC_MAX_CACHED_CONFLICT_RELS 8
+
+struct ConflictCache
+{
+    /*
+     * nConflicts == 0 if no conflicts have been set, which must only
+     *                     ever occur during recovery.
+     * nConflicts > PGPROC_MAX_CACHED_CONFLICT_RELS means cache has overflowed
+     *                     and the entries can be ignored.
+     */
+    int            nConflicts;
+
+    /* Array of relNode Oids to confirm which rels are in conflict */
+    Oid            rels[PGPROC_MAX_CACHED_CONFLICT_RELS];
+
+    /*
+     * LSN of the first conflict (only). Any block with changes after this
+     * LSN after this must be canceled.
+     */
+    XLogRecPtr     lsn;
+
+    /* CancelMode is only used for non-buffer recovery conflicts */
+    int            cancelMode;
+};
+
 /* Flags for PGPROC->vacuumFlags */
 #define        PROC_IS_AUTOVACUUM    0x01    /* is it an autovac worker? */
 #define        PROC_IN_VACUUM        0x02    /* currently running lazy vacuum */
@@ -93,6 +122,14 @@ struct PGPROC

     uint8        vacuumFlags;    /* vacuum-related flags, see above */

+    /*
+     * The LSN field exists to allow procs to be used during recovery
+     * for managing snapshot data for standby servers. The lsn allows
+     * us to disambiguate any incoming information so we always respect
+     * the latest info.
+     */
+    XLogRecPtr    lsn;    /* Last LSN which maintained state of Recovery Proc */
+
     /* Info about LWLock the process is currently waiting for, if any. */
     bool        lwWaiting;        /* true if waiting for an LW lock */
     bool        lwExclusive;    /* true if waiting for exclusive access */
@@ -113,7 +150,8 @@ struct PGPROC
      */
     SHM_QUEUE    myProcLocks[NUM_LOCK_PARTITIONS];

-    struct XidCache subxids;    /* cache for subtransaction XIDs */
+    struct XidCache         subxids;    /* cache for subtransaction XIDs */
+    struct ConflictCache     rconflicts;    /* cache for recovery conflicts */
 };

 /* NOTE: "typedef struct PGPROC PGPROC" appears in storage/lock.h. */
@@ -131,8 +169,13 @@ typedef struct PROC_HDR
     PGPROC       *freeProcs;
     /* Head of list of autovacuum's free PGPROC structures */
     PGPROC       *autovacFreeProcs;
+    /* Head of list of free recovery PGPROC structures */
+    PGPROC       *freeRecoveryProcs;
     /* Current shared estimate of appropriate spins_per_delay value */
     int            spins_per_delay;
+    /* The proc of the Startup process, since not in ProcArray */
+    PGPROC       *startupProc;
+    int       startupProcPid;
 } PROC_HDR;

 /*
@@ -163,6 +206,11 @@ extern void InitProcGlobal(void);
 extern void InitProcess(void);
 extern void InitProcessPhase2(void);
 extern void InitAuxiliaryProcess(void);
+
+extern void PublishStartupProcessInformation(void);
+extern void ProcSetRecoveryConflict(PGPROC *proc, XLogRecPtr conflict_LSN, int cancel_mode);
+extern XLogRecPtr ProcGetRecoveryConflict(int *cancel_mode);
+
 extern bool HaveNFreeProcs(int n);
 extern void ProcReleaseLocks(bool isCommit);

diff --git a/src/include/storage/procarray.h b/src/include/storage/procarray.h
index 065a9b9..95f26ea 100644
--- a/src/include/storage/procarray.h
+++ b/src/include/storage/procarray.h
@@ -14,18 +14,30 @@
 #ifndef PROCARRAY_H
 #define PROCARRAY_H

+#include "access/xact.h"
 #include "storage/lock.h"
 #include "utils/snapshot.h"


 extern Size ProcArrayShmemSize(void);
 extern void CreateSharedProcArray(void);
-extern void ProcArrayAdd(PGPROC *proc);
-extern void ProcArrayRemove(PGPROC *proc, TransactionId latestXid);
+extern void ProcArrayAdd(PGPROC *proc, bool need_lock);
+extern void ProcArrayRemove(PGPROC *proc, TransactionId latestXid,
+                                        int nsubxids, TransactionId *subxids);
+
+extern void ProcArrayInitRecoveryEnvironment(void);
+extern void ProcArrayEndTransaction(PGPROC *proc, TransactionId latestXid);

-extern void ProcArrayEndTransaction(PGPROC *proc, TransactionId latestXid);
 extern void ProcArrayClearTransaction(PGPROC *proc);
+extern void ProcArrayUpdateRecoveryTransactions(XLogRecPtr lsn,
+                                                xl_xact_running_xacts *xlrec);
+extern bool RecordKnownAssignedTransactionIds(XLogRecPtr lsn, TransactionId xid);
+extern void RecordKnownAssignedSubTransactionIds(TransactionId latestXid,
+                            int nsubxacts, TransactionId *sub_xids);
+extern bool IsRunningXactDataValid(void);
+extern PGPROC *CreateRecoveryProcessForTransactionId(TransactionId xid);

+extern RunningTransactions GetRunningTransactionData(void);
 extern Snapshot GetSnapshotData(Snapshot snapshot);

 extern bool TransactionIdIsInProgress(TransactionId xid);
@@ -36,19 +48,40 @@ extern int    GetTransactionsInCommit(TransactionId **xids_p);
 extern bool HaveTransactionsInCommit(TransactionId *xids, int nxids);

 extern PGPROC *BackendPidGetProc(int pid);
+extern PGPROC *BackendXidGetProc(TransactionId xid);
 extern int    BackendXidGetPid(TransactionId xid);
 extern bool IsBackendPid(int pid);

-extern VirtualTransactionId *GetCurrentVirtualXIDs(TransactionId limitXmin,
-                      bool allDbs, int excludeVacuum);
+extern VirtualTransactionId *GetCurrentVirtualXIDs(TransactionId limitXmin,
+                    Oid    dbOid, int excludeVacuum);
+extern VirtualTransactionId *GetConflictingVirtualXIDs(TransactionId limitXmin,
+                    Oid dbOid, Oid roleId);
+extern void SetDeferredRecoveryConflicts(TransactionId latestRemovedXid, RelFileNode node,
+                             XLogRecPtr conflict_lsn);
+extern PGPROC *VirtualTransactionIdGetProc(VirtualTransactionId vxid);
+
 extern int    CountActiveBackends(void);
 extern int    CountDBBackends(Oid databaseid);
 extern int    CountUserBackends(Oid roleid);
 extern bool CountOtherDBBackends(Oid databaseId,
                                  int *nbackends, int *nprepared);

-extern void XidCacheRemoveRunningXids(TransactionId xid,
+extern void XidCacheRemoveRunningXids(PGPROC *proc, TransactionId xid,
                           int nxids, const TransactionId *xids,
                           TransactionId latestXid);

+/* Primitives for UnobservedXids array handling for standby */
+extern void UnobservedTransactionsAddRange(TransactionId firstXid,
+                                           TransactionId lastXid);
+extern void UnobservedTransactionsAddXids(TransactionId xid, int nsubxids,
+                                          TransactionId *subxid);
+extern void UnobservedTransactionsRemoveXids(TransactionId xid,
+                                             int nsubxacts,
+                                             TransactionId *subxids,
+                                             bool missing_is_error);
+extern void UnobservedTransactionsPruneXids(TransactionId limitXid);
+extern void UnobservedTransactionsClearXids(void);
+extern void UnobservedTransactionsDisplay(int trace_level);
+extern void AdvanceLastOverflowedUnobservedXid(TransactionId xid);
+
 #endif   /* PROCARRAY_H */
diff --git a/src/include/storage/sinval.h b/src/include/storage/sinval.h
index 4a07e0c..5bf1be1 100644
--- a/src/include/storage/sinval.h
+++ b/src/include/storage/sinval.h
@@ -89,6 +89,44 @@ extern void ReceiveSharedInvalidMessages(
                       void (*invalFunction) (SharedInvalidationMessage *msg),
                              void (*resetFunction) (void));

+extern int xactGetCommittedInvalidationMessages(SharedInvalidationMessage **msgs,
+                                        bool *RelcacheInitFileInval);
+
+/*
+ * Relation Rmgr (RM_RELATION_ID)
+ *
+ * Relation recovery manager exists to allow locks and certain kinds of
+ * invalidation message to be passed across to a standby server.
+ */
+extern void RelationReleaseRecoveryLockTree(TransactionId xid,
+                                        int nsubxids, TransactionId *subxids);
+extern void RelationClearRecoveryLocks(void);
+
+/* Recovery handlers for the Relation Rmgr (RM_RELATION_ID) */
+extern void relation_redo(XLogRecPtr lsn, XLogRecord *record);
+extern void relation_desc(StringInfo buf, uint8 xl_info, char *rec);
+
+/*
+ * XLOG message types
+ */
+#define XLOG_RELATION_INVAL            0x00
+#define XLOG_RELATION_LOCK            0x10
+
+typedef struct xl_rel_inval
+{
+    int                            nmsgs;        /* number of shared inval msgs */
+    SharedInvalidationMessage    msgs[1];    /* VARIABLE LENGTH ARRAY */
+} xl_rel_inval;
+
+#define MinSizeOfRelationInval offsetof(xl_rel_inval, msgs)
+
+typedef struct xl_rel_lock
+{
+    TransactionId    xid;    /* xid of holder of AccessExclusiveLock */
+    Oid        dbOid;
+    Oid        relOid;
+} xl_rel_lock;
+
 /* signal handler for catchup events (SIGUSR1) */
 extern void CatchupInterruptHandler(SIGNAL_ARGS);

diff --git a/src/include/storage/sinvaladt.h b/src/include/storage/sinvaladt.h
index 3c4c030..c612ee7 100644
--- a/src/include/storage/sinvaladt.h
+++ b/src/include/storage/sinvaladt.h
@@ -29,7 +29,7 @@
  */
 extern Size SInvalShmemSize(void);
 extern void CreateSharedInvalidationState(void);
-extern void SharedInvalBackendInit(void);
+extern void SharedInvalBackendInit(bool sendOnly);
 extern bool BackendIdIsActive(int backendID);

 extern void SIInsertDataEntries(const SharedInvalidationMessage *data, int n);
diff --git a/src/include/utils/flatfiles.h b/src/include/utils/flatfiles.h
index 36f47b8..f9569a2 100644
--- a/src/include/utils/flatfiles.h
+++ b/src/include/utils/flatfiles.h
@@ -27,6 +27,13 @@ extern void AtEOSubXact_UpdateFlatFiles(bool isCommit,
                             SubTransactionId mySubid,
                             SubTransactionId parentSubid);

+/*
+ * Called by RecordTransactionCommit to allow it to set xinfo flags
+ * on the commit record. Used for standby invalidation of flat files.
+ */
+extern bool AtEOXact_Database_FlatFile_Update_Needed(void);
+extern bool AtEOXact_Auth_FlatFile_Update_Needed(void);
+
 extern Datum flatfile_update_trigger(PG_FUNCTION_ARGS);

 extern void flatfile_twophase_postcommit(TransactionId xid, uint16 info,
diff --git a/src/include/utils/inval.h b/src/include/utils/inval.h
index 42fd8ba..9c6eca4 100644
--- a/src/include/utils/inval.h
+++ b/src/include/utils/inval.h
@@ -15,6 +15,8 @@
 #define INVAL_H

 #include "access/htup.h"
+#include "access/xact.h"
+#include "storage/lock.h"
 #include "utils/relcache.h"


@@ -60,4 +62,8 @@ extern void CacheRegisterRelcacheCallback(RelcacheCallbackFunction func,
 extern void inval_twophase_postcommit(TransactionId xid, uint16 info,
                           void *recdata, uint32 len);

+extern void ResolveRecoveryConflictWithVirtualXIDs(VirtualTransactionId *waitlist,
+                                        char *reason, int cancel_mode,
+                                        XLogRecPtr conflict_LSN);
+
 #endif   /* INVAL_H */
diff --git a/src/include/utils/snapshot.h b/src/include/utils/snapshot.h
index 0af1f6f..3cb0a83 100644
--- a/src/include/utils/snapshot.h
+++ b/src/include/utils/snapshot.h
@@ -63,6 +63,73 @@ typedef struct SnapshotData
 } SnapshotData;

 /*
+ * Declarations for GetRunningTransactionData(). Similar to Snapshots, but
+ * not quite. This has nothing at all to do with visibility on this server,
+ * so this is completely separate from snapmgr.c and snapmgr.h
+ * This data is important for creating the initial snapshot state on a
+ * standby server. We need lots more information than a normal snapshot,
+ * hence we use a specific data structure for our needs. This data
+ * is written to WAL as a separate record immediately after each
+ * checkpoint. That means that wherever we start a standby from we will
+ * almost immediately see the data we need to begin executing queries.
+ */
+typedef struct RunningXact
+{
+    /* Items matching PGPROC entries */
+    TransactionId    xid;            /* xact ID in progress */
+
+    /* Items matching XidCache */
+    bool            overflowed;
+    int                nsubxids;        /* # of subxact ids for this xact only */
+
+    /* Additional info */
+    uint32             subx_offset;    /* array offset of start of subxip,
+                                     * zero if nsubxids == 0
+                                     */
+} RunningXact;
+
+typedef struct RunningXactsData
+{
+    uint32            xcnt;                /* # of xact ids in xrun[] */
+    uint32            subxcnt;            /* # of xact ids in subxip[] */
+    TransactionId     latestRunningXid;    /* Initial setting of LatestObservedXid */
+    TransactionId    oldestRunningXid;    /* *not* oldestXmin */
+    TransactionId     latestCompletedXid;
+
+    RunningXact    *xrun;            /* array of RunningXact structs */
+
+    /*
+     * subxip is held as a single contiguous array, so no space is wasted,
+     * plus it helps it fit into one XLogRecord.  We continue to keep track
+     * of which subxids go with each top-level xid by tracking the start
+     * offset, held on each RunningXact struct.
+     */
+    TransactionId *subxip;        /* array of subxact IDs in progress */
+
+} RunningXactsData;
+
+typedef RunningXactsData *RunningTransactions;
+
+/*
+ * When we write running xact data to WAL, we use this structure.
+ */
+typedef struct xl_xact_running_xacts
+{
+    int                xcnt;                /* # of xact ids in xrun[] */
+    int                subxcnt;            /* # of xact ids in subxip[] */
+    TransactionId    latestRunningXid;    /* Initial setting of LatestObservedXid */
+    TransactionId    oldestRunningXid;    /* *not* oldestXmin */
+    TransactionId    latestCompletedXid;
+
+    /* Array of RunningXact(s)  */
+    RunningXact    xrun[1];        /* VARIABLE LENGTH ARRAY */
+
+    /* ARRAY OF RUNNING SUBTRANSACTION XIDs FOLLOWS */
+} xl_xact_running_xacts;
+
+#define MinSizeOfXactRunningXacts offsetof(xl_xact_running_xacts, xrun)
+
+/*
  * Result codes for HeapTupleSatisfiesUpdate.  This should really be in
  * tqual.h, but we want to avoid including that file elsewhere.
  */

Re: Hot standby, recovery procs

From
Heikki Linnakangas
Date:
Simon Riggs wrote:
> On Tue, 2009-02-24 at 21:59 +0200, Heikki Linnakangas wrote:
> 
>>> I think if I had not made those into procs you would have said that they
>>> are so similar it would aid code readability to have them be the same.
>> And in fact I suggested earlier that we get rid of the unobserved xids 
>> array, and only use recovery procs.
> 
> Last week, I think. Why are these tweaks so important?

Heh, actually, I went searching my mail for when I had suggested that, 
and found that in fact I proposed this exact same method of using the 
unobserved xids array only back in October:

http://archives.postgresql.org/message-id/48F76342.5070407@enterprisedb.com

I had since forgotten all about, but now came up with the same idea 
again during review.

In the first reply in that thread you said that "The main problem is 
fatal errors that don't write abort records. By reusing the PROC entries 
we can keep those to a manageable limit". We're not worried about that 
anymore.

--   Heikki Linnakangas  EnterpriseDB   http://www.enterprisedb.com


Re: Hot standby, recovery procs

From
Simon Riggs
Date:
On Thu, 2009-02-26 at 10:04 +0200, Heikki Linnakangas wrote:

> we keep track of which xids 
> have already been "reported" in the WAL (similar to what you had in an
> earlier version of the patch)

You objected to doing exactly that earlier. Why is it OK to do it now
that you are proposing it?

You haven't even given a good reason to make these changes.

We don't have time to make this change and then shake out everything
else that will break as a result. Are you suggesting that you will make
these changes and then follow up on all other breakages? Forcing this
request seems like a great way to cancel this patch, since it will be
marked as "author refused to make change".

You have spotted a problem elsewhere and I am working to fix that now.

-- Simon Riggs           www.2ndQuadrant.comPostgreSQL Training, Services and Support



Re: Hot standby, recovery procs

From
Heikki Linnakangas
Date:
Simon Riggs wrote:
> On Thu, 2009-02-26 at 10:04 +0200, Heikki Linnakangas wrote:
> 
>> we keep track of which xids 
>> have already been "reported" in the WAL (similar to what you had in an
>> earlier version of the patch)
> 
> You objected to doing exactly that earlier.

I'm talking about the "xidMarkedInWAL" and "hasUnMarkedSubXids" fields 
you had in TransactionState, at least still in version 
hs.v7.20090112_1.tar.bz2 of the patch. I objected to adding the 
corresponding flags in the WAL header, and that made tracking the status 
in TransactionState obsolete in the patch too, since it wasn't used for 
anything anymore. There's nothing wrong per se about tracking the 
"marked" or "reported" status in master.

> You haven't even given a good reason to make these changes.

Simplicity.

> We don't have time to make this change and then shake out everything
> else that will break as a result. Are you suggesting that you will make
> these changes and then follow up on all other breakages? Forcing this
> request seems like a great way to cancel this patch, since it will be
> marked as "author refused to make change".

I'm not suggesting anything to be canceled. I simply think these are 
changes that should be made. I wish you could make them, because that 
means less work for me. But if you're not willing to, I can pick it up 
myself.

--   Heikki Linnakangas  EnterpriseDB   http://www.enterprisedb.com


Re: Hot standby, recovery procs

From
Simon Riggs
Date:
On Thu, 2009-02-26 at 11:36 +0200, Heikki Linnakangas wrote:

> > You haven't even given a good reason to make these changes.
> 
> Simplicity.

You used that argument in January to explain why the coupling should be
reduced and now the same argument to put it back again.

> > We don't have time to make this change and then shake out everything
> > else that will break as a result. Are you suggesting that you will make
> > these changes and then follow up on all other breakages? Forcing this
> > request seems like a great way to cancel this patch, since it will be
> > marked as "author refused to make change".
> 
> I'm not suggesting anything to be canceled. I simply think these are 
> changes that should be made. I wish you could make them, because that 
> means less work for me. But if you're not willing to, I can pick it up 
> myself.

When you review my code, you make many useful suggestions and I am very
thankful. Testing can't find out some of those things. My feeling is
that you are now concentrating on things that are optional, yet will
have a huge potential for negative impact. If I could please draw your
review efforts to other parts of the patch, I would be happy to return
to these parts later.

-- Simon Riggs           www.2ndQuadrant.comPostgreSQL Training, Services and Support



Re: Hot standby, recovery procs

From
Heikki Linnakangas
Date:
Simon Riggs wrote:
> On Thu, 2009-02-26 at 11:36 +0200, Heikki Linnakangas wrote:
> 
>>> You haven't even given a good reason to make these changes.
>> Simplicity.
> 
> You used that argument in January to explain why the coupling should be
> reduced and now the same argument to put it back again.

That was in reference to the slot ids, I'm not suggesting to put that 
back. If anything, removing the need for the the xl_topxid field in WAL 
record will further reduce the coupling between master and standby.

--   Heikki Linnakangas  EnterpriseDB   http://www.enterprisedb.com


Re: Hot standby, recovery procs

From
Simon Riggs
Date:
On Thu, 2009-02-26 at 12:19 +0200, Heikki Linnakangas wrote:
> Simon Riggs wrote:
> > On Thu, 2009-02-26 at 11:36 +0200, Heikki Linnakangas wrote:
> > 
> >>> You haven't even given a good reason to make these changes.
> >> Simplicity.
> > 
> > You used that argument in January to explain why the coupling should be
> > reduced and now the same argument to put it back again.
> 
> That was in reference to the slot ids, I'm not suggesting to put that 
> back. If anything, removing the need for the the xl_topxid field in WAL 
> record will further reduce the coupling between master and standby.

OK, well, if you feel those changes are necessary prior to commit then I
would ask you do that in your public repo and we'll test and provide
helpful comments on it from there as quickly as we can. Too many cooks
spoil the git.

-- Simon Riggs           www.2ndQuadrant.comPostgreSQL Training, Services and Support