Thread: Hot Standby 0.2.1

Hot Standby 0.2.1

From
Simon Riggs
Date:
OK, here is the latest version of the Hot Standby patchset. This is
about version 30+ by now, but we should regard this as 0.2.1
Patch against CVS HEAD (now): clean apply, compile, no known bugs.

OVERVIEW

You can download PDF versions of the fine manual is here
http://wiki.postgresql.org/images/0/01/Hot_Standby_main.pdf
http://wiki.postgresql.org/images/0/01/Hot_Standby_Recovery_Functions.pdf

Also available via the project Wiki, which is here
http://wiki.postgresql.org/wiki/Hot_Standby

Patch should be attached to this email. If problems, get this and other
versions from Wiki please. No offlist comments, questions etc please.

PATCH VERSIONING & STATUS

0 meaning its not fully released yet
2 meaning this is a major new re-write
1 meaning this is the first release

Patch is still in testing and will be for next few days at least.
Released now only because I promised to do so.

Is this ready for commit? Yes, it is in the shape I want it to be in,
but also, No, I can't say it's been through a wide enough range of tests
as yet to be considered immediately ready for commit.

Further bug fixing and minor cosmetic development will take place via my
GIT repo, uploaded soon. Patch included here to meet deadlines and code
inclusion. BSD.

PATCH SUMMARY

* 76 files changed, 5160 insertions(+), 59 deletions(-), 1251 mods(!)
* 7 files with more than 100 lines changed
  procarray.c (1200+ additions)
  xlog.c (600+ additions)
  xact.c (500+ additions)
  inval.c (650 additions)
  lock.c
  heapam.c
  nbtxlog.c
* 29 files with 10 or fewer lines changed
* Applies cleanly to CVS HEAD as of now


CHANGES

The rough changes since version 1 series of patches.

* Full documentation included. Many, but not all nuances of SGML tagging
followed, but sufficient aspects there to allow for proofreading before
we do final changes. Some undocumented functions now correctly
documented. Recovery functions now split into user and control functions
in docs to make it clearer.

* Starting conditions in GetRunningTransactionData() are now much
stricter and holds more lwlocks. There are few cases where any not-found
xids are allowed during xid processing, so code is more robust. Please
check for race conditions.

* GetRunningTransactionData() now handles initialisation of
AccessExclusiveLocks correctly. Locks are counted in a low-contention
approach that avoids taking holding lock partition locks, if possible.

* Start-up conditions now recoded to allow faster start in cases where
many subtransactions are present. Recovery connections are only enabled
when the snapshot is valid.

* max_connections needs to be correctly set or HS will not allow
connections. Once snapshots are enabled they will continue to be
available always.

* RecordKnownAssignedTransactions() now contains a test for xid
wraparound threat which invokes conflict processing should that occur.

* Boolean states now clarified and corrected. Hot Standby can be turned
off completely if not required or if problems effect production. That
causes many changes but there is no change in the intention of those
sections of code.

* UnobservedXids processing follows Heikki's proposal, but has been
renamed to KnownAssignedXids. It has also been modularised and
completely re-written using a hash table approach. So far it has been
much more stable than the previous sorted array coding, which I am happy
to see in the shredder for all the problems it caused. Fully detailed
comments all through.

* All record types now respect max_standby_delay.

* All deferred conflict processing has been removed - conflict
processing itself is still enabled.

* A few other functions have been renamed and/or moved around to
rationalise their exact purpose/position within their modules.

* Prepared transactions holding AccessExclusiveLocks at the end of
recovery are now handled.

* Hash indexes are now safely handled. That was removed at request, but
we need it to avoid silent data loss for queries near those types of
index.

* Hint bits are now set, in appropriate circumstances.

* Flat file logic removed

* Large swathes of unused code removed.

* All code comments addressed and/or re-explained in more detail

* CHECKPOINT now works during recovery but performs restartpoint
instead.

* Tweaked max_standby_delay code to avoid long duration waits. Added
dynamic function to control delay during recovery. Added code for stats
collection and ps display. Set default to sensible production values.


RECENT BUGS

* Found and fixed missing relcache init file invalidation

* Found and fixed more serious VACUUM FULL-related weirdness <sigh>

* Recently discovered bug has resulted in changes in
XidInMVCCSnapshot(). Heikki's earlier approach did not correctly allow
for the maximum size of a snapshot. The simplicity of Heikki's proposal
is good, but hid a flaw in where snapshots would put their xids. I've
fully solved the problem though I expect further discussion.

I've looked through every change and verified it, but fixing all the
bugs means there's areas of new code added in last few days. I accept
that any bugs herein are my responsibility.

--
 Simon Riggs           www.2ndQuadrant.com

Attachment

Re: Hot Standby 0.2.1

From
David Fetter
Date:
On Tue, Sep 15, 2009 at 10:41:59PM +0100, Simon Riggs wrote:
> 
> OK, here is the latest version of the Hot Standby patchset.  This is
> about version 30+ by now, but we should regard this as 0.2.1 Patch
> against CVS HEAD (now): clean apply, compile, no known bugs.

Kudos!!!!

Cheers,
David.
-- 
David Fetter <david@fetter.org> http://fetter.org/
Phone: +1 415 235 3778  AIM: dfetter666  Yahoo!: dfetter
Skype: davidfetter      XMPP: david.fetter@gmail.com

Remember to vote!
Consider donating to Postgres: http://www.postgresql.org/about/donate


Re: Hot Standby 0.2.1

From
Bernd Helmle
Date:

--On 15. September 2009 22:41:59 +0100 Simon Riggs <simon@2ndQuadrant.com> 
wrote:

> http://wiki.postgresql.org/images/0/01/Hot_Standby_Recovery_Functions.pdf

This doesn't work for me, it seems the correct link is

<http://wiki.postgresql.org/images/1/10/Hot_Standby_Recovery_Functions.pdf> 
?

-- 
Thanks
Bernd


Re: Hot Standby 0.2.1

From
Josh Berkus
Date:
All,

Now that Simon has submitted this, can some of the heavy-hitters here
review it?  Heikki?

Nobody's name is next to it.

-- 
Josh Berkus
PostgreSQL Experts Inc.
www.pgexperts.com


Re: Hot Standby 0.2.1

From
Robert Haas
Date:
On Wed, Sep 16, 2009 at 6:05 PM, Josh Berkus <josh@agliodbs.com> wrote:
> Now that Simon has submitted this, can some of the heavy-hitters here
> review it?  Heikki?
>
> Nobody's name is next to it.

I don't think anyone is planning to ignore this patch, but it wasn't
included in the first round of round-robin reviewing assignments
because it wasn't submitted until the following day, after the
announced deadline for submissions had already passed.  So most people
are probably busy with with some other patch at the moment, but that's
a temporary phenomenon.  This is a pretty small CommitFest, so there
shouldn't be any shortage of reviewers, though Heikki's time may be
stretched a little thin, since Streaming Replication is also in the
queue, and he is working on index-only scans.  That's really for him
to comment on, though.

...Robert


Re: Hot Standby 0.2.1

From
Heikki Linnakangas
Date:
Robert Haas wrote:
> On Wed, Sep 16, 2009 at 6:05 PM, Josh Berkus <josh@agliodbs.com> wrote:
>> Now that Simon has submitted this, can some of the heavy-hitters here
>> review it?  Heikki?
>>
>> Nobody's name is next to it.
> 
> I don't think anyone is planning to ignore this patch, but it wasn't
> included in the first round of round-robin reviewing assignments
> because it wasn't submitted until the following day, after the
> announced deadline for submissions had already passed.  So most people
> are probably busy with with some other patch at the moment, but that's
> a temporary phenomenon.

Right, I've added myself as reviewer now.

>  This is a pretty small CommitFest, so there
> shouldn't be any shortage of reviewers, though Heikki's time may be
> stretched a little thin, since Streaming Replication is also in the
> queue, and he is working on index-only scans.  That's really for him
> to comment on, though.

I'm going to put the index-only scans aside for now to focus on hot
standby and streaming replication. Both are big patches, so there's
plenty of work in those two alone, and not only for me.

--  Heikki Linnakangas EnterpriseDB   http://www.enterprisedb.com


Re: Hot Standby 0.2.1

From
Simon Riggs
Date:
On Thu, 2009-09-17 at 09:54 +0300, Heikki Linnakangas wrote:

> >  This is a pretty small CommitFest, so there
> > shouldn't be any shortage of reviewers, though Heikki's time may be
> > stretched a little thin, since Streaming Replication is also in the
> > queue, and he is working on index-only scans.  That's really for him
> > to comment on, though.
> 
> I'm going to put the index-only scans aside for now to focus on hot
> standby and streaming replication. Both are big patches, so there's
> plenty of work in those two alone, and not only for me.

That's very good of you, thanks.

It was already clear to a few people that your time would bottleneck
trying to review both at the same time. I personally wasn't expecting
you to jump into immediate action on HS. We have time.

-- Simon Riggs           www.2ndQuadrant.com



Re: Hot Standby 0.2.1

From
Robert Haas
Date:
On Thu, Sep 17, 2009 at 2:54 AM, Heikki Linnakangas
<heikki.linnakangas@enterprisedb.com> wrote:
> Robert Haas wrote:
>> On Wed, Sep 16, 2009 at 6:05 PM, Josh Berkus <josh@agliodbs.com> wrote:
>>> Now that Simon has submitted this, can some of the heavy-hitters here
>>> review it?  Heikki?
>>>
>>> Nobody's name is next to it.
>>
>> I don't think anyone is planning to ignore this patch, but it wasn't
>> included in the first round of round-robin reviewing assignments
>> because it wasn't submitted until the following day, after the
>> announced deadline for submissions had already passed.  So most people
>> are probably busy with with some other patch at the moment, but that's
>> a temporary phenomenon.
>
> Right, I've added myself as reviewer now.
>
>>  This is a pretty small CommitFest, so there
>> shouldn't be any shortage of reviewers, though Heikki's time may be
>> stretched a little thin, since Streaming Replication is also in the
>> queue, and he is working on index-only scans.  That's really for him
>> to comment on, though.
>
> I'm going to put the index-only scans aside for now to focus on hot
> standby and streaming replication. Both are big patches, so there's
> plenty of work in those two alone, and not only for me.

What is the best way to attack this?  Should I keep reviewing
index-only scans so that you have feedback for when you get back to
it, or should I move on to something else?  If something else, does it
make more sense for me to look at HS since I did a bit of work with a
previous version, or would it be better for me to just pick one of the
other patches from the CommitFest and work on that?

Also, stepping back from me personally, should we try to assign some
additional reviewers to these patches?  Is there some way we can
divide up review tasks among multiple people so that we're not
repeating each others work?

Thoughts appreciated, from Heikki, Simon, or others.

...Robert


Re: Hot Standby 0.2.1

From
Simon Riggs
Date:
On Thu, 2009-09-17 at 19:01 -0400, Robert Haas wrote:
> >
> > I'm going to put the index-only scans aside for now to focus on hot
> > standby and streaming replication. Both are big patches, so there's
> > plenty of work in those two alone, and not only for me.
> 
> What is the best way to attack this?  Should I keep reviewing
> index-only scans so that you have feedback for when you get back to
> it, or should I move on to something else?  If something else, does it
> make more sense for me to look at HS since I did a bit of work with a
> previous version, or would it be better for me to just pick one of the
> other patches from the CommitFest and work on that?
> 
> Also, stepping back from me personally, should we try to assign some
> additional reviewers to these patches?  Is there some way we can
> divide up review tasks among multiple people so that we're not
> repeating each others work?
> 
> Thoughts appreciated, from Heikki, Simon, or others.

I think this is a great opportunity to widen the pool of people
contributing to reviews.

I suggest the creation of a second group of people, performing
round-robin testing of patches. These people would be able to verify
* documentation matches implemented features (does it do what it says on
the tin?)
* usability of explicit features (do the features work well?)
* usability gap of unimplemented features (what else do we need?)
* are there any bugs?

These questions are often quickly answered for smaller patches, but HS's
scope mean that such a task properly executed could take a full week, if
not longer.

Second group of people are just as skilled Postgres people as reviewers,
in some cases more so, apart from they have less detailed knowledge of
the codebase. We have many such people and it would be good to encourage
them to perform thorough reviews rather than "tire kicking".

I'm not sure that Heikki needs additional reviewers. He now has
significant knowledge of the patch and is good at focusing on key
aspects of the internals. Other code reviewers are welcome, of course.

-- Simon Riggs           www.2ndQuadrant.com



Re: Hot Standby 0.2.1

From
Dimitri Fontaine
Date:
Robert Haas <robertmhaas@gmail.com> writes:
> Also, stepping back from me personally, should we try to assign some
> additional reviewers to these patches?  Is there some way we can
> divide up review tasks among multiple people so that we're not
> repeating each others work?
>
> Thoughts appreciated, from Heikki, Simon, or others.

How about this proposal: http://archives.postgresql.org/pgsql-hackers/2009-08/msg00764.php

Regards,
-- 
dim


Re: Hot Standby 0.2.1

From
Jeff Janes
Date:
On Tue, Sep 15, 2009 at 2:41 PM, Simon Riggs <simon@2ndquadrant.com> wrote:

OK, here is the latest version of the Hot Standby patchset. This is
about version 30+ by now, but we should regard this as 0.2.1
Patch against CVS HEAD (now): clean apply, compile, no known bugs.


Hi Simon,

Is there a reason that you remove the WAL_DEBUG shown below?


*************** begin:;
*** 899,923 ****
        FIN_CRC32(rdata_crc);
        record->xl_crc = rdata_crc;

- #ifdef WAL_DEBUG
-       if (XLOG_DEBUG)
-       {
-               StringInfoData buf;
-
-               initStringInfo(&buf);
-               appendStringInfo(&buf, "INSERT @ %X/%X: ",
-                                                RecPtr.xlogid, RecPtr.xrecoff);
-               xlog_outrec(&buf, record);
-               if (rdata->data != NULL)
-               {
-                       appendStringInfo(&buf, " - ");
-                       RmgrTable[record->xl_rmid].rm_desc(&buf, record->xl_info, rdata->data);
-               }
-               elog(LOG, "%s", buf.data);
-               pfree(buf.data);
-       }
- #endif
-
        /* Record begin of record in appropriate places */
        ProcLastRecPtr = RecPtr;
        Insert->PrevRecord = RecPtr;
--- 947,952 ----



Thanks,

Jeff

Re: Hot Standby 0.2.1

From
Simon Riggs
Date:
On Fri, 2009-09-18 at 07:23 -0700, Jeff Janes wrote:
> On Tue, Sep 15, 2009 at 2:41 PM, Simon Riggs <simon@2ndquadrant.com>
> wrote:
>         
>         OK, here is the latest version of the Hot Standby patchset.
>         This is
>         about version 30+ by now, but we should regard this as 0.2.1
>         Patch against CVS HEAD (now): clean apply, compile, no known
>         bugs.

> Is there a reason that you remove the WAL_DEBUG shown below?

WAL_DEBUG is not removed by the patch, though that section of code is
removed, as you observe. I recall an earlier bug report by
me/conversation on hackers about how that section of code was
irrecoverably broken, since it's calling an rmgr routine while not in
recovery and also assuming the data is fully accessible at that point,
which it is not.

-- Simon Riggs           www.2ndQuadrant.com



Re: Hot Standby 0.2.1

From
Tom Lane
Date:
Simon Riggs <simon@2ndQuadrant.com> writes:
> On Fri, 2009-09-18 at 07:23 -0700, Jeff Janes wrote:
>> Is there a reason that you remove the WAL_DEBUG shown below?

> WAL_DEBUG is not removed by the patch, though that section of code is
> removed, as you observe. I recall an earlier bug report by
> me/conversation on hackers about how that section of code was
> irrecoverably broken, since it's calling an rmgr routine while not in
> recovery and also assuming the data is fully accessible at that point,
> which it is not.

Wouldn't it be sufficient to remove the rm_desc() call?  I agree
that that's broken, but the rest doesn't seem to be.
        regards, tom lane


Re: Hot Standby 0.2.1

From
Simon Riggs
Date:
On Fri, 2009-09-18 at 11:14 -0400, Tom Lane wrote:
> Simon Riggs <simon@2ndQuadrant.com> writes:
> > On Fri, 2009-09-18 at 07:23 -0700, Jeff Janes wrote:
> >> Is there a reason that you remove the WAL_DEBUG shown below?
> 
> > WAL_DEBUG is not removed by the patch, though that section of code is
> > removed, as you observe. I recall an earlier bug report by
> > me/conversation on hackers about how that section of code was
> > irrecoverably broken, since it's calling an rmgr routine while not in
> > recovery and also assuming the data is fully accessible at that point,
> > which it is not.
> 
> Wouldn't it be sufficient to remove the rm_desc() call?  I agree
> that that's broken, but the rest doesn't seem to be.

That would make sense also. Previous action just because that was
earlier consensus. Will change.

-- Simon Riggs           www.2ndQuadrant.com



Re: Hot Standby 0.2.1

From
Marcos Luis Ortiz Valmaseda
Date:
I want to help on this area, but I need a mentor for this.
For example, Heikki will be a excellent mentor for me.

Following the theme, I think that we have to wide all questions for the process of the acceptance of a patch on the
sameway that you Simon. 

We could write new requirements with all these ideas. Don´t you think?

Regards

"The hurry is enemy of the success: for that reason.......Be patient"

Ing. Marcos L. Ortiz Valmaseda
Línea Soporte y Despliegue
Centro de Tecnologías de Almacenamiento y Análisis de Datos (CENTALAD)

Linux User # 418229
PostgreSQL User
http://www.postgresql.org
http://www.planetpostgresql.org/
http://www.postgresql-es.org/


----- Mensaje original -----
De: "Simon Riggs" <simon@2ndQuadrant.com>
Para: "Robert Haas" <robertmhaas@gmail.com>
CC: "Heikki Linnakangas" <heikki.linnakangas@enterprisedb.com>, "Josh Berkus" <josh@agliodbs.com>,
pgsql-hackers@postgresql.org
Enviados: Jueves, 17 de Septiembre 2009 20:53:24 GMT -10:00 Hawai
Asunto: Re: [HACKERS] Hot Standby 0.2.1


On Thu, 2009-09-17 at 19:01 -0400, Robert Haas wrote:
> >
> > I'm going to put the index-only scans aside for now to focus on hot
> > standby and streaming replication. Both are big patches, so there's
> > plenty of work in those two alone, and not only for me.
>
> What is the best way to attack this?  Should I keep reviewing
> index-only scans so that you have feedback for when you get back to
> it, or should I move on to something else?  If something else, does it
> make more sense for me to look at HS since I did a bit of work with a
> previous version, or would it be better for me to just pick one of the
> other patches from the CommitFest and work on that?
>
> Also, stepping back from me personally, should we try to assign some
> additional reviewers to these patches?  Is there some way we can
> divide up review tasks among multiple people so that we're not
> repeating each others work?
>
> Thoughts appreciated, from Heikki, Simon, or others.

I think this is a great opportunity to widen the pool of people
contributing to reviews.

I suggest the creation of a second group of people, performing
round-robin testing of patches. These people would be able to verify
* documentation matches implemented features (does it do what it says on
the tin?)
* usability of explicit features (do the features work well?)
* usability gap of unimplemented features (what else do we need?)
* are there any bugs?

These questions are often quickly answered for smaller patches, but HS's
scope mean that such a task properly executed could take a full week, if
not longer.

Second group of people are just as skilled Postgres people as reviewers,
in some cases more so, apart from they have less detailed knowledge of
the codebase. We have many such people and it would be good to encourage
them to perform thorough reviews rather than "tire kicking".

I'm not sure that Heikki needs additional reviewers. He now has
significant knowledge of the patch and is good at focusing on key
aspects of the internals. Other code reviewers are welcome, of course.

-- Simon Riggs           www.2ndQuadrant.com


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: Hot Standby 0.2.1

From
Heikki Linnakangas
Date:
Simon Riggs wrote:
> OK, here is the latest version of the Hot Standby patchset. This is
> about version 30+ by now, but we should regard this as 0.2.1
> Patch against CVS HEAD (now): clean apply, compile, no known bugs.

Thanks! Attached is some minor comment and fixes, and some dead code
removal. Also available in my git repository, branch 'hs-riggs'.

The documentation talks about setting and checking
default_transaction_read_only, but I think it doesn't say anything about
transaction_read_only, which I find odd. This in particular:

> Users will be able to tell whether their session is read-only by
> +   issuing SHOW default_transaction_read_only

seems misleading, as you might have default_transaction_read_only=on,
but still be able to do "SET transaction_read_only", so the *session*
isn't necessarily read-only.

The only bug I've found is this that we seem to be missing conflict
resolution for GiST index tuples deleted by the kill_prior_tuples
mechanism. Unless I'm missing something, we need similar handling there
that we have in b-tree.

--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com
diff --git a/doc/src/sgml/backup.sgml b/doc/src/sgml/backup.sgml
index 91917cf..2257ec6 100644
--- a/doc/src/sgml/backup.sgml
+++ b/doc/src/sgml/backup.sgml
@@ -2099,9 +2099,9 @@ if (!triggered)

    <para>
     In recovery, transactions will not be permitted to take any lock higher
-    other than AccessShareLock or AccessExclusiveLock. In addition,
-    transactions may never assign a TransactionId and may never write WAL.
-    The LOCK TABLE command by default applies an AccessExclusiveLock.
+    than AccessShareLock. In addition, transactions may never assign a
+        TransactionId and may never write WAL.
+    The LOCK TABLE command by default applies an AccessExclusiveLock.
     Any LOCK TABLE command that runs on the standby and requests a specific
     lock type other than AccessShareLock will be rejected.
    </para>
@@ -2168,8 +2168,8 @@ if (!triggered)

    <para>
     An example of the above would be an Administrator on Primary server
-    runs a DROP TABLE command that refers to a table currently in use by
-    a User query on the standby server.
+    runs a DROP TABLE command on a table that's currently being queried
+    in the standby server.
    </para>

    <para>
@@ -2198,9 +2198,9 @@ if (!triggered)
    <para>
     We have a number of choices for resolving query conflicts.  The default
     is that we wait and hope the query completes. If the recovery is not paused,
-    then the server will wait automatically until the server the lag between
+    then the server will wait automatically until the lag between
     primary and standby is at most max_standby_delay seconds. Once that grace
-    period expires, we then take one of the following actions:
+    period expires, we take one of the following actions:

       <itemizedlist>
        <listitem>
@@ -2213,7 +2213,7 @@ if (!triggered)
         <para>
          If the conflict is caused by cleanup records we tell the standby query
           that a conflict has occurred and that it must cancel itself to avoid the
-          risk that it attempts to silently fails to read relevant data because
+         risk that it silently fails to read relevant data because
           that data has been removed. (This is very similar to the much feared
          error message "snapshot too old").
         </para>
@@ -2222,7 +2222,7 @@ if (!triggered)
          Note also that this means that idle-in-transaction sessions are never
          canceled except by locks. Users should be clear that tables that are
          regularly and heavily updated on primary server will quickly cause
-         cancellation of any longer running queries made against those tables.
+         cancellation of any longer running queries in the standby.
         </para>

         <para>
@@ -2235,7 +2235,7 @@ if (!triggered)
    </para>

    <para>
-    Other remdial actions exist if the number of cancelations is unacceptable.
+    Other remedial actions exist if the number of cancelations is unacceptable.
     The first option is to connect to primary server and keep a query active
     for as long as we need to run queries on the standby. This guarantees that
     a WAL cleanup record is never generated and we don't ever get query
@@ -2283,7 +2283,7 @@ if (!triggered)
    <title>Administrator's Overview</title>

    <para>
-    If there is a recovery.conf file present then the will start in Hot Standby
+    If there is a recovery.conf file present the server will start in Hot Standby
     mode by default, though this can be disabled by setting
     "recovery_connections = off" in recovery.conf. The server may take some
     time to enable recovery connections since the server must first complete
@@ -2329,7 +2329,7 @@ LOG:  database system is ready to accept read only connections
     A set of functions allow superusers to control the flow of recovery
     are described in <xref linkend="functions-recovery-control-table">.
     These functions allow you to pause and continue recovery, as well
-    as dynamically set new recovery targets wile recovery progresses.
+    as dynamically set new recovery targets while recovery progresses.
     Note that when a server is paused the apparent delay between primary
     and standby will continue to increase.
    </para>
@@ -2354,7 +2354,7 @@ LOG:  database system is ready to accept read only connections
    </para>

    <para>
-    The following types of administrator command will not be accepted
+    The following types of administrator command are not accepted
     during recovery mode

       <itemizedlist>
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 06243c0..b38b344 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -372,9 +372,9 @@ SET ENABLE_SEQSCAN TO OFF;
        </para>

        <para>
-        When running a standby server, it is strongly recommended that you
-        set this parameter to be the same or higher than the master server.
-        Otherwise, queries on the standby server may fail.
+        When running a standby server, you must set this parameter to the
+        same or higher value than on the master server. Otherwise, queries on
+        will not be allowed in the standby server.
        </para>
       </listitem>
      </varlistentry>
diff --git a/src/backend/access/gin/ginxlog.c b/src/backend/access/gin/ginxlog.c
index 94b7202..c6612b8 100644
--- a/src/backend/access/gin/ginxlog.c
+++ b/src/backend/access/gin/ginxlog.c
@@ -622,7 +622,7 @@ gin_redo(XLogRecPtr lsn, XLogRecord *record)
     uint8        info = record->xl_info & ~XLR_INFO_MASK;

     /*
-     * GIN indexes do not require any conflict processing. XXX really?
+     * GIN indexes do not require any conflict processing.
      */
     if (InHotStandby)
         RecordKnownAssignedTransactionIds(record->xl_xid);
diff --git a/src/backend/access/gist/gistxlog.c b/src/backend/access/gist/gistxlog.c
index 3e5f3b6..8ed49e0 100644
--- a/src/backend/access/gist/gistxlog.c
+++ b/src/backend/access/gist/gistxlog.c
@@ -397,7 +397,8 @@ gist_redo(XLogRecPtr lsn, XLogRecord *record)
     MemoryContext oldCxt;

     /*
-     * GIST indexes do not require any conflict processing. XXX really?
+     * GIST indexes do not require any conflict processing. XXX what about
+     * killed tuples?
      */
     if (InHotStandby)
         RecordKnownAssignedTransactionIds(record->xl_xid);
diff --git a/src/backend/access/nbtree/README b/src/backend/access/nbtree/README
index c58d2fc..d7dfbdd 100644
--- a/src/backend/access/nbtree/README
+++ b/src/backend/access/nbtree/README
@@ -410,7 +410,7 @@ situation the locking requirements can be relaxed and we do not need
 double locking during block splits. Each WAL record makes changes to a
 single level of the btree using the correct locking sequence and so
 is safe for concurrent readers. Some readers may observe a block split
-in progress as they descend the tree, but they will simple move right
+in progress as they descend the tree, but they will simpye move right
 onto the correct page.

 During recovery all index scans start with ignore_killed_tuples = false
@@ -419,8 +419,8 @@ on the standby server can be older than the oldest xmin on the master
 server, which means tuples can be marked as killed even when they are
 still visible on the standby. We don't WAL log tuple killed bits, but
 they can still appear in the standby because of full page writes. So
-we must always ignore them and that means it's not worth setting them
-either.
+we must always ignore them in standby, and that means it's not worth
+setting them either.

 Other Things That Are Handy to Know
 -----------------------------------
diff --git a/src/backend/access/nbtree/nbtpage.c b/src/backend/access/nbtree/nbtpage.c
index eefa888..7871524 100644
--- a/src/backend/access/nbtree/nbtpage.c
+++ b/src/backend/access/nbtree/nbtpage.c
@@ -721,6 +721,7 @@ _bt_delitems(Relation rel, Buffer buf,
              * We would like to set an accurate latestRemovedXid, but there
              * is no easy way of obtaining a useful value. So we use the
              * probably far too conservative value of RecentGlobalXmin instead.
+             * XXX: this comment is bogus? We don't use RecentGlobalXmin
              */
             xlrec_delete.latestRemovedXid = InvalidTransactionId;
             rdata[0].data = (char *) &xlrec_delete;
diff --git a/src/backend/access/nbtree/nbtxlog.c b/src/backend/access/nbtree/nbtxlog.c
index 864c41c..6b962b6 100644
--- a/src/backend/access/nbtree/nbtxlog.c
+++ b/src/backend/access/nbtree/nbtxlog.c
@@ -23,7 +23,7 @@
 #include "miscadmin.h"

 /*
- * We must keep track of expected insertions due to page spl    its, and apply
+ * We must keep track of expected insertions due to page splits, and apply
  * them manually if they are not seen in the WAL log during replay.  This
  * makes it safe for page insertion to be a multiple-WAL-action process.
  *
@@ -503,8 +503,8 @@ btree_xlog_vacuum(XLogRecPtr lsn, XLogRecord *record)
     }

     /*
-     * We need to take a cleanup lock to apply these changes.
-     * See nbtree/README for details.
+     * Like in btvacuumpage(), we need to take a cleanup lock on every leaf
+     * page. See nbtree/README for details.
      */
     buffer = XLogReadBufferExtended(xlrec->node, MAIN_FORKNUM, xlrec->block, RBM_NORMAL);
     if (!BufferIsValid(buffer))
@@ -805,11 +805,11 @@ btree_redo(XLogRecPtr lsn, XLogRecord *record)

     /*
      * Btree delete records can conflict with standby queries. You might
-     * think that vacuum records would conflict as well, but they don't.
-     * XLOG_HEAP2_CLEANUP_INFO records provide the highest xid cleaned
-     * by the vacuum of the heap and so we can resolve any conflicts just
-     * once when that arrives. After that any we know that no conflicts exist
-     * from individual btree vacuum records on that index.
+     * think that vacuum records would conflict as well, but we've handled
+     * that already. XLOG_HEAP2_CLEANUP_INFO records provide the highest xid
+     * cleaned by the vacuum of the heap and so we can resolve any conflicts
+     * just once when that arrives. After that any we know that no conflicts
+     * exist from individual btree vacuum records on that index.
      */
     if (InHotStandby)
     {
diff --git a/src/backend/access/transam/README b/src/backend/access/transam/README
index fc7ecfd..46b48f0 100644
--- a/src/backend/access/transam/README
+++ b/src/backend/access/transam/README
@@ -195,8 +195,7 @@ they first do something that requires one --- typically, insert/update/delete
 a tuple, though there are a few other places that need an XID assigned.
 If a subtransaction requires an XID, we always first assign one to its
 parent.  This maintains the invariant that child transactions have XIDs later
-than their parents, which is assumed in a number of places. In 8.5 onwards,
-some corner cases exist that require XID assignment to be WAL logged.
+than their parents, which is assumed in a number of places.

 The subsidiary actions of obtaining a lock on the XID and and entering it into
 pg_subtrans and PG_PROC are done at the time it is assigned.
diff --git a/src/backend/storage/ipc/procarray.c b/src/backend/storage/ipc/procarray.c
index 28d1cf0..84c0d54 100644
--- a/src/backend/storage/ipc/procarray.c
+++ b/src/backend/storage/ipc/procarray.c
@@ -376,7 +376,6 @@ ProcArrayClearTransaction(PGPROC *proc)
     proc->lxid = InvalidLocalTransactionId;
     proc->xmin = InvalidTransactionId;
     proc->recoveryConflictMode = 0;
-    proc->recoveryConflictLSN = InvalidXLogRecPtr;

     /* redundant, but just in case */
     proc->vacuumFlags &= ~PROC_VACUUM_STATE_MASK;
@@ -1171,7 +1170,7 @@ void
 GetRunningTransactionData(void)
 {
     ProcArrayStruct *arrayP = procArray;
-    static RunningTransactions CurrentRunningXacts = (RunningTransactions) &CurrentRunningXactsData;
+    RunningTransactions CurrentRunningXacts = (RunningTransactions) &CurrentRunningXactsData;
     RunningXact    *rxact;
     TransactionId *subxip;
     TransactionId latestRunningXid = InvalidTransactionId;
@@ -1274,14 +1273,8 @@ GetRunningTransactionData(void)
         numHeldLocks += proc->numHeldLocks;

         /*
-         * Save subtransaction XIDs.
-         *
-         * The other backend can add more subxids concurrently, but cannot
-         * remove any.    Hence it's important to fetch nxids just once. Should
-         * be safe to use memcpy, though.  (We needn't worry about missing any
-         * xids added concurrently, because they must postdate xmax.)
-         *
-         * Again, our own XIDs *are* included in the snapshot.
+         * Save subtransaction XIDs. Other backends can't add or remove entries
+         * while we're holding XidGenLock.
          */
         nxids = proc->subxids.nxids;

@@ -1536,41 +1529,6 @@ BackendPidGetProc(int pid)
 }

 /*
- * BackendXidGetProc -- get a backend's PGPROC given its XID
- *
- * Returns NULL if not found.  Note that it is up to the caller to be
- * sure that the question remains meaningful for long enough for the
- * answer to be used ...
- */
-PGPROC *
-BackendXidGetProc(TransactionId xid)
-{
-    PGPROC       *result = NULL;
-    ProcArrayStruct *arrayP = procArray;
-    int            index;
-
-    if (xid == InvalidTransactionId)    /* never match invalid xid */
-        return 0;
-
-    LWLockAcquire(ProcArrayLock, LW_SHARED);
-
-    for (index = 0; index < arrayP->numProcs; index++)
-    {
-        PGPROC       *proc = arrayP->procs[index];
-
-        if (proc->xid == xid)
-        {
-            result = proc;
-            break;
-        }
-    }
-
-    LWLockRelease(ProcArrayLock);
-
-    return result;
-}
-
-/*
  * BackendXidGetPid -- get a backend's pid given its XID
  *
  * Returns 0 if not found or it's a prepared transaction.  Note that
diff --git a/src/backend/utils/cache/inval.c b/src/backend/utils/cache/inval.c
index d05e7a2..48d6553 100644
--- a/src/backend/utils/cache/inval.c
+++ b/src/backend/utils/cache/inval.c
@@ -1586,15 +1586,11 @@ ResolveRecoveryConflictWithVirtualXIDs(VirtualTransactionId *waitlist,
                 /*
                  * Issue orders for the proc to read next time it receives SIGINT
                  * Note that this is an atomic change and requires no locking.
+                 *
+                 * XXX: Huh?
                  */
                 if (proc->recoveryConflictMode < cancel_mode)
-                {
-                    if (cancel_mode == ERROR &&
-                        XLByteLT(proc->recoveryConflictLSN, conflict_lsn))
-                        proc->recoveryConflictLSN = conflict_lsn;
-
                     proc->recoveryConflictMode = cancel_mode;
-                }

                 /*
                  * Do we expect it to talk? No, Mr. Bond, we expect it to die.
diff --git a/src/include/access/nbtree.h b/src/include/access/nbtree.h
index bfcb75c..91b4e23 100644
--- a/src/include/access/nbtree.h
+++ b/src/include/access/nbtree.h
@@ -536,10 +536,6 @@ typedef BTScanOpaqueData *BTScanOpaque;
 #define SK_BT_DESC            (INDOPTION_DESC << SK_BT_INDOPTION_SHIFT)
 #define SK_BT_NULLS_FIRST    (INDOPTION_NULLS_FIRST << SK_BT_INDOPTION_SHIFT)

-/* XXX probably needs new RMgr call to do this cleanly */
-extern bool btree_is_cleanup_record(uint8 info);
-extern bool btree_needs_cleanup_lock(uint8 info);
-
 /*
  * prototypes for functions in nbtree.c (external entry points for btree)
  */
diff --git a/src/include/storage/lock.h b/src/include/storage/lock.h
index f674547..895532d 100644
--- a/src/include/storage/lock.h
+++ b/src/include/storage/lock.h
@@ -294,7 +294,6 @@ typedef struct LOCKTAG
  * nRequested -- total requested locks of all types.
  * granted -- count of each lock type currently granted on the lock.
  * nGranted -- total granted locks of all types.
- * xid -- xid of current/only lock holder for use by GetLockStatusData()
  *
  * Note: these counts count 1 for each backend.  Internally to a backend,
  * there may be multiple grabs on a particular lock, but this is not reflected
@@ -375,6 +374,11 @@ typedef struct PROCLOCK
 #define PROCLOCK_LOCKMETHOD(proclock) \
     LOCK_LOCKMETHOD(*((proclock).tag.myLock))

+/*
+ * Does acquisition of this lock need to be replayed in a standby server?
+ * Only AccessExclusiveLocks can conflict with lock types that read-only
+ * transactions can acquire in a standby server.
+ */
 #define IsProcLockLoggable(proclock)    \
             ((proclock->holdMask & LOCKBIT_ON(AccessExclusiveLock)) && \
              (proclock->tag.myLock)->tag.locktag_type == LOCKTAG_RELATION)
diff --git a/src/include/storage/proc.h b/src/include/storage/proc.h
index 5729482..0df2529 100644
--- a/src/include/storage/proc.h
+++ b/src/include/storage/proc.h
@@ -99,12 +99,9 @@ struct PGPROC
     int            numHeldLocks;    /* Number of AccessExclusiveLocks held by
                                  * current backend. */
     /*
-     * InHotStandby mode, the lsn of the first conflict, if any.
-     * Any block seen with changes made after this lsn will trigger
-     * query cancelation. Always set recoveryConflictCancelMode after
-     * setting conflictLSN so we can check this without spin locking.
+     * While in hot standby mode, setting recoveryConflictMode instructs
+     * the backend to commit suicide.
      */
-    XLogRecPtr     recoveryConflictLSN;
     int            recoveryConflictMode;

     /* Info about LWLock the process is currently waiting for, if any. */
diff --git a/src/include/storage/procarray.h b/src/include/storage/procarray.h
index b882795..36d5d52 100644
--- a/src/include/storage/procarray.h
+++ b/src/include/storage/procarray.h
@@ -54,7 +54,6 @@ extern int    GetTransactionsInCommit(TransactionId **xids_p);
 extern bool HaveTransactionsInCommit(TransactionId *xids, int nxids);

 extern PGPROC *BackendPidGetProc(int pid);
-extern PGPROC *BackendXidGetProc(TransactionId xid);
 extern int    BackendXidGetPid(TransactionId xid);
 extern bool IsBackendPid(int pid);

diff --git a/src/include/utils/snapshot.h b/src/include/utils/snapshot.h
index 92feaf8..67e8112 100644
--- a/src/include/utils/snapshot.h
+++ b/src/include/utils/snapshot.h
@@ -17,6 +17,7 @@
 #include "storage/buf.h"
 #include "storage/sinval.h"

+
 typedef struct SnapshotData *Snapshot;

 #define InvalidSnapshot        ((Snapshot) NULL)
@@ -51,7 +52,8 @@ typedef struct SnapshotData
     /* note: all ids in xip[] satisfy xmin <= xip[i] < xmax */
     int32        subxcnt;        /* # of xact ids in subxip[], -1 if overflow */
     TransactionId *subxip;        /* array of subxact IDs in progress */
-    bool        takenDuringRecovery;    /* Recovery-shaped snapshot? */
+    bool        takenDuringRecovery;    /* recovery-shaped snapshot? */
+
     /*
      * note: all ids in subxip[] are >= xmin, but we don't bother filtering
      * out any that are >= xmax

Re: Hot Standby 0.2.1

From
Simon Riggs
Date:
On Mon, 2009-09-21 at 13:50 +0300, Heikki Linnakangas wrote:

> The only bug I've found 

!

> is this that we seem to be missing conflict
> resolution for GiST index tuples deleted by the kill_prior_tuples
> mechanism. Unless I'm missing something, we need similar handling there
> that we have in b-tree.

OK, I agree with that. Straightforward change. Thanks very much.

I marked the comment to indicate that the handling for GIST and GIN
indexes looked dubious to me also. I had the earlier "it is safe"
comments but that was before we looked at the kill prior tuples issue.

Re-reading code for GIN also, I note that there isn't any further work
because we don't kill prior tuples ever. Also, there is no special
handling of the GIN b-tree posting tree because VACUUM scans that in
logical sequence, rather than the physical sequence in nbtree.

-- Simon Riggs           www.2ndQuadrant.com



Re: Hot Standby 0.2.1

From
Simon Riggs
Date:
On Mon, 2009-09-21 at 13:50 +0300, Heikki Linnakangas wrote:
> The documentation talks about setting and checking
> default_transaction_read_only, but I think it doesn't say anything
> about
> transaction_read_only, which I find odd. This in particular:
> 
> > Users will be able to tell whether their session is read-only by
> > +   issuing SHOW default_transaction_read_only
> 
> seems misleading, as you might have default_transaction_read_only=on,
> but still be able to do "SET transaction_read_only", so the *session*
> isn't necessarily read-only.

Yes, clearly missing a check there. Those two operations should be
blocked at higher level, using PreventCommandDuringRecovery() and I
confess that I thought they already were. Doesn't crash because of the
other checks in place, but gives wrong error message.

Thanks for penetration testing the patch.

-- Simon Riggs           www.2ndQuadrant.com



Re: Hot Standby 0.2.1

From
Simon Riggs
Date:
On Mon, 2009-09-21 at 14:01 +0100, Simon Riggs wrote:
> On Mon, 2009-09-21 at 13:50 +0300, Heikki Linnakangas wrote:

> > is this that we seem to be missing conflict
> > resolution for GiST index tuples deleted by the kill_prior_tuples
> > mechanism. Unless I'm missing something, we need similar handling there
> > that we have in b-tree.
> 
> OK, I agree with that. Straightforward change. Thanks very much.
> 
> I marked the comment to indicate that the handling for GIST and GIN
> indexes looked dubious to me also. I had the earlier "it is safe"
> comments but that was before we looked at the kill prior tuples issue.

ISTM I looked at this too quickly.

kill_prior_tuple is only ever set by these lines, after scan starts:
   if (!scan->xactStartedInRecovery)       scan->kill_prior_tuple = scan->xs_hot_dead;

which is set in indexam.c, not within any particular am. So the coding,
as submitted, covers all index types, current and future.

AFAICS there is no bug, unless you have a test case or can explain
further?

Worth raising as a query because it forced me to re-check how GIST and
GIN work and am happy again now.

-- Simon Riggs           www.2ndQuadrant.com



Re: Hot Standby 0.2.1

From
Robert Haas
Date:
On Mon, Sep 21, 2009 at 9:01 AM, Simon Riggs <simon@2ndquadrant.com> wrote:
>
> On Mon, 2009-09-21 at 13:50 +0300, Heikki Linnakangas wrote:
>
>> The only bug I've found
>
> !

Yeah, wow.

...Robert


Re: Hot Standby 0.2.1

From
Bruce Momjian
Date:
Simon Riggs wrote:
> 
> OK, here is the latest version of the Hot Standby patchset. This is
> about version 30+ by now, but we should regard this as 0.2.1
> Patch against CVS HEAD (now): clean apply, compile, no known bugs.

Wow, great!  Simon has allowed us to pass a great milestone in Postgres
development.

--  Bruce Momjian  <bruce@momjian.us>        http://momjian.us EnterpriseDB
http://enterprisedb.com
 + If your life is a hard drive, Christ can be your backup. +


Re: Hot Standby 0.2.1

From
Jeff Janes
Date:
On Tue, Sep 15, 2009 at 2:41 PM, Simon Riggs <simon@2ndquadrant.com> wrote:
>
> OK, here is the latest version of the Hot Standby patchset. This is
> about version 30+ by now, but we should regard this as 0.2.1
> Patch against CVS HEAD (now): clean apply, compile, no known bugs.
>
> OVERVIEW
>
> You can download PDF versions of the fine manual is here
> http://wiki.postgresql.org/images/0/01/Hot_Standby_main.pdf


From this doc:

"In recovery, transactions will not be permitted to take any lock
higher other than
AccessShareLock or AccessExclusiveLock. In addition, transactions may never
assign a TransactionId and may never write WAL. The LOCK TABLE command by
default applies an AccessExclusiveLock. Any LOCK TABLE command that runs on
the standby and requests a specific lock type other than AccessShareLock will be
rejected."

The first sentence seems to say that clients on the stand-by can take
ACCESS EXCLUSIVE, while the last sentence seems to say that they
cannot do so.

I did a little experiment on a hot standby instance.  I expected that
either I would be denied the lock altogether, or the lock would cause
WAL replay to be paused until either I committed or was forcibly
canceled.  But neither happened, I was granted the lock but WAL replay
continued anyway.

jjanes=# begin;
BEGIN
jjanes=# lock table pgbench_history in access exclusive mode;
LOCK TABLE
jjanes=# select count(*) from pgbench_history;
 count
--------
 519104
(1 row)

jjanes=# select count(*) from pgbench_history;
 count
--------
 527814
(1 row)

Is this the expected behavior?

Thanks,

Jeff


Re: Hot Standby 0.2.1

From
Heikki Linnakangas
Date:
Simon Riggs wrote:
> On Mon, 2009-09-21 at 14:01 +0100, Simon Riggs wrote:
>> On Mon, 2009-09-21 at 13:50 +0300, Heikki Linnakangas wrote:
> 
>>> is this that we seem to be missing conflict
>>> resolution for GiST index tuples deleted by the kill_prior_tuples
>>> mechanism. Unless I'm missing something, we need similar handling there
>>> that we have in b-tree.
>> OK, I agree with that. Straightforward change. Thanks very much.
>>
>> I marked the comment to indicate that the handling for GIST and GIN
>> indexes looked dubious to me also. I had the earlier "it is safe"
>> comments but that was before we looked at the kill prior tuples issue.
> 
> ISTM I looked at this too quickly.
> 
> kill_prior_tuple is only ever set by these lines, after scan starts:
> 
>     if (!scan->xactStartedInRecovery)
>         scan->kill_prior_tuple = scan->xs_hot_dead;
> 
> which is set in indexam.c, not within any particular am. So the coding,
> as submitted, covers all index types, current and future.

That just stops more tuples from being killed in the standby. I was
thinking that we need similar conflict resolution in GiST that we do
with b-tree delete records, to stop killed tuples from being deleted
while they might still be needed in the standby.

But looking closer at GiST, it seems that GiST doesn't actually do that;
killed tuples are not removed at page splits, but only by VACUUM. So
that's not an issue after all.

--  Heikki Linnakangas EnterpriseDB   http://www.enterprisedb.com


Re: Hot Standby 0.2.1

From
Simon Riggs
Date:
On Mon, 2009-09-21 at 19:42 -0700, Jeff Janes wrote:
> On Tue, Sep 15, 2009 at 2:41 PM, Simon Riggs <simon@2ndquadrant.com> wrote:
> >
> > OK, here is the latest version of the Hot Standby patchset. This is
> > about version 30+ by now, but we should regard this as 0.2.1
> > Patch against CVS HEAD (now): clean apply, compile, no known bugs.
> >
> > OVERVIEW
> >
> > You can download PDF versions of the fine manual is here
> > http://wiki.postgresql.org/images/0/01/Hot_Standby_main.pdf
> 
> 
> >From this doc:
> 
> "In recovery, transactions will not be permitted to take any lock
> higher other than
> AccessShareLock or AccessExclusiveLock. In addition, transactions may never
> assign a TransactionId and may never write WAL. The LOCK TABLE command by
> default applies an AccessExclusiveLock. Any LOCK TABLE command that runs on
> the standby and requests a specific lock type other than AccessShareLock will be
> rejected."
> 
> The first sentence seems to say that clients on the stand-by can take
> ACCESS EXCLUSIVE, while the last sentence seems to say that they
> cannot do so.

You are right to pick up that discrepancy, as Heikki did also.

The root cause of that discrepancy is a change in patch behaviour
between January and now that I will use your post to highlight and
discuss, if you don't mind. (and yes, the docs need to be corrected)

Initially, it seemed that it was certain that a read-only backend could
not take an AccessExclusiveLock. On further thought, there is no
particular reason to block AccessExclusiveLocks themselves, just that
most things you might do while holding one are banned. But the lock
itself is fine. (Any challenge on that?)

AccessExclusiveLocks can be used to serialize the actions of other
backends. That is a very common use case, so my concern was that LOCK
TABLE would be a no-op unless we allowed AccessExclusiveLock, so the
patch does allow it.

> I did a little experiment on a hot standby instance.  I expected that
> either I would be denied the lock altogether, or the lock would cause
> WAL replay to be paused until either I committed or was forcibly
> canceled.  But neither happened, I was granted the lock but WAL replay
> continued anyway.
> 
> jjanes=# begin;
> BEGIN
> jjanes=# lock table pgbench_history in access exclusive mode;
> LOCK TABLE
> jjanes=# select count(*) from pgbench_history;
>  count
> --------
>  519104
> (1 row)
> 
> jjanes=# select count(*) from pgbench_history;
>  count
> --------
>  527814
> (1 row)
> 
> Is this the expected behavior?

By me, yes. WAL replay does not require a table lock to progress. Any
changes are protected with block-level locks. It does acquire a table
lock and cancel conflicting queries when it is about to replay something
that would cause a query to explode, such as dropping a table, as
explained in docs.

So this is not a bug.

The explanation of how the above sequence of events occurs is that the
backend acquires AccessExclusiveLock - please check on other session in
pg_locks. WAL replay continues by the Startup process, inserting further
rows into the pgbench_history table as a series of transactions. The
second select takes a later snapshot than the first and so sees more
data than the first select, hence a larger count. (And I am pleased to
see that recovery is progressing quickly even while your queries run).

So not a bug, but just one of many possible behaviours we could enforce.
1. Allow AccessExclusiveLocks yet they do not interrupt WAL replay
2. Allow AccessExclusiveLocks but have them pause WAL replay
3. Disallow AccessExclusiveLocks (and so LOCK TABLE is effectively a
no-op because it will not be able to serialize anything)

So the patch originally implemented (3) but now implements (1).

I would say that (2) is very undesirable because it puts WAL replay in
the control of non-superusers. That could mean LOCK TABLE implicitly
alters the high availability of the standby, and might even be used
maliciously to do that.

I'm open to views on whether we should use (1) or (3). Comments?

Implementing either is no problem and we have a straight choice. We may
even wish to review that again later from additional feedback.

(Jeff, you have also helped me understand that there is a bug in the way
serializable transactions are cancelled, which is easily corrected.
Thanks for that unexpected windfall, but it is otherwise unrelated to
your comments.)

-- Simon Riggs           www.2ndQuadrant.com



Re: Hot Standby 0.2.1

From
Heikki Linnakangas
Date:
Simon Riggs wrote:
> On Mon, 2009-09-21 at 19:42 -0700, Jeff Janes wrote:
>> jjanes=# begin;
>> BEGIN
>> jjanes=# lock table pgbench_history in access exclusive mode;
>> LOCK TABLE
>> jjanes=# select count(*) from pgbench_history;
>>  count
>> --------
>>  519104
>> (1 row)
>>
>> jjanes=# select count(*) from pgbench_history;
>>  count
>> --------
>>  527814
>> (1 row)
>>
>> Is this the expected behavior?
> 
> By me, yes. WAL replay does not require a table lock to progress. Any
> changes are protected with block-level locks. It does acquire a table
> lock and cancel conflicting queries when it is about to replay something
> that would cause a query to explode, such as dropping a table, as
> explained in docs.

That is rather surprising. You can't get that result in a normal server,
which I think is enough of a reason to disallow it. If you do LOCK TABLE
ACCESS EXCLUSIVE, you wouldn't expect the contents to change under your
nose.

> So not a bug, but just one of many possible behaviours we could enforce.
> 1. Allow AccessExclusiveLocks yet they do not interrupt WAL replay
> 2. Allow AccessExclusiveLocks but have them pause WAL replay
> 3. Disallow AccessExclusiveLocks (and so LOCK TABLE is effectively a
> no-op because it will not be able to serialize anything)
> 
> So the patch originally implemented (3) but now implements (1).
> 
> I would say that (2) is very undesirable because it puts WAL replay in
> the control of non-superusers. That could mean LOCK TABLE implicitly
> alters the high availability of the standby, and might even be used
> maliciously to do that.

I don't see a problem with (2) as long as the locker is kicked out after
max_standby_delay, like a long-running query. That's what I would
expect. I'm fine with (3) as well, but (1) does seem rather suprising
behavior.

--  Heikki Linnakangas EnterpriseDB   http://www.enterprisedb.com


Re: Hot Standby 0.2.1

From
Simon Riggs
Date:
On Tue, 2009-09-22 at 11:04 +0300, Heikki Linnakangas wrote:
> > 
> > By me, yes. WAL replay does not require a table lock to progress. Any
> > changes are protected with block-level locks. It does acquire a table
> > lock and cancel conflicting queries when it is about to replay something
> > that would cause a query to explode, such as dropping a table, as
> > explained in docs.
> 
> That is rather surprising. You can't get that result in a normal server,
> which I think is enough of a reason to disallow it. If you do LOCK TABLE
> ACCESS EXCLUSIVE, you wouldn't expect the contents to change under your
> nose.

OK, "normality" is a reasonable argument against. So (1) is only a
partial implementation of serializing the table.

> > So not a bug, but just one of many possible behaviours we could enforce.
> > 1. Allow AccessExclusiveLocks yet they do not interrupt WAL replay
> > 2. Allow AccessExclusiveLocks but have them pause WAL replay
> > 3. Disallow AccessExclusiveLocks (and so LOCK TABLE is effectively a
> > no-op because it will not be able to serialize anything)
> > 
> > So the patch originally implemented (3) but now implements (1).
> > 
> > I would say that (2) is very undesirable because it puts WAL replay in
> > the control of non-superusers. That could mean LOCK TABLE implicitly
> > alters the high availability of the standby, and might even be used
> > maliciously to do that.
> 
> I don't see a problem with (2) as long as the locker is kicked out after
> max_standby_delay, like a long-running query. That's what I would
> expect. I'm fine with (3) as well, but (1) does seem rather suprising
> behavior.

(2) gives other problems because it would force us to check for
conflicting locks for each heap & index WAL record to ensure that the
lock was honoured. We could optimize that but it's still going to cost.

I'd rather leave things at (3) for now and wait for further feedback.
"Start strict, relax later".

-- Simon Riggs           www.2ndQuadrant.com



Re: Hot Standby 0.2.1

From
Heikki Linnakangas
Date:
In testing, it looks like there's still something wrong with the
subtransaction handling. I created a test function to create a large
number of subtransactions:

CREATE LANGUAGE plpgsql;
CREATE TABLE bar (id int4);

CREATE OR REPLACE FUNCTION subxids (n integer) RETURNS void LANGUAGE
plpgsql AS $$   BEGIN     IF n <= 0 THEN RETURN; END IF;     INSERT INTO bar VALUES (n);     PERFORM subxids(n - 1);
RETURN;   EXCEPTION WHEN raise_exception THEN NULL; END;
 
$$;

And used that to created 100 nested subtransactions in the primary server:

SELECT subxids(100);


I got this in the standby:

...
LOG:  REDO @ 0/A000EA8; LSN 0/A000FCC: prev 0/A000E6C; xid 2662; len
264: Transaction - xid assignment xtop 2599 nsubxids 64; 2600 2601 2602
2603 2604 2605 2606 2607 2608 2609 2610 2611 2612 2613 2614 2615 2616
2617 2618 2619 2620 2621 2622 2623 2624 2625 2626 2627 2628 2629 2630
2631 2632 2633 2634 2635 2636 2637 2638 2639 2640 2641 2642 2643 2644
2645 2646 2647 2648 2649 2650 2651 2652 2653 2654 2655 2656 2657 2658
2659 2660 2661 2662 0
LOG:  extend subtrans  xid 2600 page 1 last_page 0
CONTEXT:  rmgr 1 redo xid assignment xtop 2599 nsubxids 64; 2600 2601
2602 2603 2604 2605 2606 2607 2608 2609 2610 2611 2612 2613 2614 2615
2616 2617 2618 2619 2620 2621 2622 2623 2624 2625 2626 2627 2628 2629
2630 2631 2632 2633 2634 2635 2636 2637 2638 2639 2640 2641 2642 2643
2644 2645 2646 2647 2648 2649 2650 2651 2652 2653 2654 2655 2656 2657
2658 2659 2660 2661 2662 0
TRAP: FailedAssertion("!(((newestXact) >= ((TransactionId) 3)))", File:
"subtrans.c", Line: 299)
LOG:  startup process (PID 28594) was terminated by signal 6: Aborted
LOG:  terminating any other active server processes

Apparently an InvalidXid is sneaking into the unreported xid array
collected in the primary.

I actually bumped into this while I was testing a simpler implementation
of the logic to collect the unreported xids in the primary. As the patch
stands, we keep track of how many of the childXids at each
subtransaction level have already been reported, but it occurs to me
that it would be a lot simpler to populate a separate array of
unreported xids on-the-fly, as xids are assigned. With the simpler
implementation I bumped into another issue (see below), which is why I
wanted to test if it worked without the simplification.

So with the simpler logic, I had this problem. When I do this in the
primary:

postgres=# SELECT subxids(10000);
ERROR:  stack depth limit exceeded
HINT:  Increase the configuration parameter "max_stack_depth", after
ensuring the platform's stack depth limit is adequate.
CONTEXT:  SQL statement "SELECT  subxids( $1  - 1)"
PL/pgSQL function "subxids" line 4 at PERFORM
SQL statement "SELECT  subxids( $1  - 1)"
...

I get this in the standby:

...
LOG:  REDO @ 0/1000B6C4; LSN 0/1000B6F0: prev 0/1000B698; xid 4325; len
16: Transaction - abort: 2009-09-22 12:45:00.938243+03
LOG:  record known xact 4325 latestObservedXid 4334
CONTEXT:  rmgr 1 redo abort: 2009-09-22 12:45:00.938243+03
LOG:  remove KnownAssignedXid 4325
CONTEXT:  rmgr 1 redo abort: 2009-09-22 12:45:00.938243+03
LOG:  REDO @ 0/1000B6F0; LSN 0/1000B71C: prev 0/1000B6C4; xid 4324; len
16: Transaction - abort: 2009-09-22 12:45:00.938438+03
LOG:  record known xact 4324 latestObservedXid 4334
CONTEXT:  rmgr 1 redo abort: 2009-09-22 12:45:00.938438+03
LOG:  remove KnownAssignedXid 4324
CONTEXT:  rmgr 1 redo abort: 2009-09-22 12:45:00.938438+03
LOG:  REDO @ 0/1000B71C; LSN 0/1000B748: prev 0/1000B6F0; xid 4323; len
16: Transaction - abort: 2009-09-22 12:45:00.938632+03
LOG:  record known xact 4323 latestObservedXid 4334
CONTEXT:  rmgr 1 redo abort: 2009-09-22 12:45:00.938632+03
LOG:  remove KnownAssignedXid 4323
CONTEXT:  rmgr 1 redo abort: 2009-09-22 12:45:00.938632+03
LOG:  1 KnownAssignedXids 3619
CONTEXT:  rmgr 1 redo abort: 2009-09-22 12:45:00.938632+03
FATAL:  cannot remove KnownAssignedXid 4323
CONTEXT:  rmgr 1 redo abort: 2009-09-22 12:45:00.938632+03
LOG:  startup process (PID 31284) exited with exit code 1
LOG:  terminating any other active server processes

It looks like the standby tries to remove XID 4323 from the
known-assigned hash table, but it's not there because it was removed and
set in pg_subtrans by an XLOG_XACT_ASSIGNMENT record earlier. I guess we
should just not throw an error in that case, but is there a way we could
catch that narrow case and still keep the check in
KnownAssignedXidsRemove()? It seems like the check might help catch
other bugs, so I'd rather keep it if possible.

I've pushed the simplified code to my git repository if you want to take
a look.

--  Heikki Linnakangas EnterpriseDB   http://www.enterprisedb.com


Re: Hot Standby 0.2.1

From
Simon Riggs
Date:
On Tue, 2009-09-22 at 12:53 +0300, Heikki Linnakangas wrote:

> In testing, it looks like there's still something wrong with the
> subtransaction handling. I created a test function to create a large
> number of subtransactions:

OK, looking at this now. Thanks for the report.

-- Simon Riggs           www.2ndQuadrant.com



Re: Hot Standby 0.2.1

From
Simon Riggs
Date:
On Tue, 2009-09-22 at 12:53 +0300, Heikki Linnakangas wrote:

> It looks like the standby tries to remove XID 4323 from the
> known-assigned hash table, but it's not there because it was removed
> and set in pg_subtrans by an XLOG_XACT_ASSIGNMENT record earlier. I
> guess we should just not throw an error in that case, but is there a
> way we could catch that narrow case and still keep the check in
> KnownAssignedXidsRemove()? It seems like the check might help catch
> other bugs, so I'd rather keep it if possible.

Yes, a certain test is important and I'm glad we've (almost) achieved
it.

ISTM we can say if not in KnownAssignedXids then check pg_subtrans. If
not in either then report error. May need to add info to the abort
record to check xtop also.

-- Simon Riggs           www.2ndQuadrant.com



Re: Hot Standby 0.2.1

From
Simon Riggs
Date:
On Tue, 2009-09-22 at 12:53 +0300, Heikki Linnakangas wrote:
> In testing, it looks like there's still something wrong with the
> subtransaction handling. I created a test function to create a large
> number of subtransactions:
> 
> CREATE LANGUAGE plpgsql;
> CREATE TABLE bar (id int4);
> 
> CREATE OR REPLACE FUNCTION subxids (n integer) RETURNS void LANGUAGE
> plpgsql AS $$
>     BEGIN
>       IF n <= 0 THEN RETURN; END IF;
>       INSERT INTO bar VALUES (n);
>       PERFORM subxids(n - 1);
>       RETURN;
>     EXCEPTION WHEN raise_exception THEN NULL; END;
> $$;
> 
> And used that to created 100 nested subtransactions in the primary server:
> 
> SELECT subxids(100);
> 
> 
> I got this in the standby:
> 
> ...
> LOG:  REDO @ 0/A000EA8; LSN 0/A000FCC: prev 0/A000E6C; xid 2662; len
> 264: Transaction - xid assignment xtop 2599 nsubxids 64; 2600 2601 2602
> 2603 2604 2605 2606 2607 2608 2609 2610 2611 2612 2613 2614 2615 2616
> 2617 2618 2619 2620 2621 2622 2623 2624 2625 2626 2627 2628 2629 2630
> 2631 2632 2633 2634 2635 2636 2637 2638 2639 2640 2641 2642 2643 2644
> 2645 2646 2647 2648 2649 2650 2651 2652 2653 2654 2655 2656 2657 2658
> 2659 2660 2661 2662 0
> LOG:  extend subtrans  xid 2600 page 1 last_page 0
> CONTEXT:  rmgr 1 redo xid assignment xtop 2599 nsubxids 64; 2600 2601
> 2602 2603 2604 2605 2606 2607 2608 2609 2610 2611 2612 2613 2614 2615
> 2616 2617 2618 2619 2620 2621 2622 2623 2624 2625 2626 2627 2628 2629
> 2630 2631 2632 2633 2634 2635 2636 2637 2638 2639 2640 2641 2642 2643
> 2644 2645 2646 2647 2648 2649 2650 2651 2652 2653 2654 2655 2656 2657
> 2658 2659 2660 2661 2662 0
> TRAP: FailedAssertion("!(((newestXact) >= ((TransactionId) 3)))", File:
> "subtrans.c", Line: 299)
> LOG:  startup process (PID 28594) was terminated by signal 6: Aborted
> LOG:  terminating any other active server processes
> 
> Apparently an InvalidXid is sneaking into the unreported xid array
> collected in the primary.
> 
> I actually bumped into this while I was testing a simpler implementation
> of the logic to collect the unreported xids in the primary. As the patch
> stands, we keep track of how many of the childXids at each
> subtransaction level have already been reported, but it occurs to me
> that it would be a lot simpler to populate a separate array of
> unreported xids on-the-fly, as xids are assigned. With the simpler
> implementation I bumped into another issue (see below), which is why I
> wanted to test if it worked without the simplification.

The bug seems an off-by-one error and would seem easily corrected in any
case; there is no fundamental problem revealed.

I prefer your suggested approach, since it avoids that rather complex
looking code that did the grovelling thru the nested xact states.

I'll get back to you with more on this tomorrow.

-- Simon Riggs           www.2ndQuadrant.com



Re: Hot Standby 0.2.1

From
Alvaro Herrera
Date:
Heikki Linnakangas escribió:
> Simon Riggs wrote:
> > On Mon, 2009-09-21 at 19:42 -0700, Jeff Janes wrote:
> >> jjanes=# begin;
> >> BEGIN
> >> jjanes=# lock table pgbench_history in access exclusive mode;
> >> LOCK TABLE
> >> jjanes=# select count(*) from pgbench_history;
> >>  count
> >> --------
> >>  519104
> >> (1 row)
> >>
> >> jjanes=# select count(*) from pgbench_history;
> >>  count
> >> --------
> >>  527814
> >> (1 row)
> >>
> >> Is this the expected behavior?
> > 
> > By me, yes. WAL replay does not require a table lock to progress. Any
> > changes are protected with block-level locks. It does acquire a table
> > lock and cancel conflicting queries when it is about to replay something
> > that would cause a query to explode, such as dropping a table, as
> > explained in docs.
> 
> That is rather surprising. You can't get that result in a normal server,
> which I think is enough of a reason to disallow it. If you do LOCK TABLE
> ACCESS EXCLUSIVE, you wouldn't expect the contents to change under your
> nose.

I think the fallout from that argument is that WAL replay should hold
table-level locks (otherwise the workaround to this problem is too
special-casey).  I'm not sure about that.  If I really wanted to get
consistent results, I'd use a serializable transaction.

-- 
Alvaro Herrera                                http://www.CommandPrompt.com/
The PostgreSQL Company - Command Prompt, Inc.


Re: Hot Standby 0.2.1

From
Bruce Momjian
Date:
Simon Riggs wrote:
> 
> OK, here is the latest version of the Hot Standby patchset. This is
> about version 30+ by now, but we should regard this as 0.2.1
> Patch against CVS HEAD (now): clean apply, compile, no known bugs.
> 
> OVERVIEW

Anyone who is interested in how the hot standby behaves should read the
following excellent PDF Simon produced.  It goes into great detail of
the slave's read-only transactions and how the standby behaves during
continuous slave recovery:

> You can download PDF versions of the fine manual is here
> http://wiki.postgresql.org/images/0/01/Hot_Standby_main.pdf

The function call docs are at:

> http://wiki.postgresql.org/images/1/10/Hot_Standby_Recovery_Functions.pdf

--  Bruce Momjian  <bruce@momjian.us>        http://momjian.us EnterpriseDB
http://enterprisedb.com
 + If your life is a hard drive, Christ can be your backup. +


Re: Hot Standby 0.2.1

From
Heikki Linnakangas
Date:
Looking at the way cache invalidations are handled in two-phase
transactions, it would be simpler if we store the shared cache
invalidation messages in the twophase state file header, like we store
deleted relations and subxids. This allows them to be copied to the
COMMIT_PREPARED WAL record, so that we don't need treat twophase commits
differently than regular ones in xact_redo_commit. As the patch stands,
the new
xactGetCommittedInvalidationMessages/MakeSharedInvalidMessagesArray
mechanism is duplicated functionality with
AtPrepare_Inval/-PersistInvalidationMessage - both materialize the
pending shared invalidation messages so that they can be written to
disk. I did that in my git branch.

I wonder if we might have alignment issues with the
SharedInvalidationMessages being stored in WAL records, following the
subxids. All the data stored in that record have 4-byte alignment at the
moment, but if SharedInvalidationMessage ever needs 8-byte alignment, we
would have trouble. Probably not worth changing code, it's highly
unlikely that SharedInvalidationMessage will ever need 8-byte alignment,
but perhaps a comment somewhere would be in order.

I note that we don't emit RunningXacts after a shutdown checkpoint. So
if recovery starts at a shutdown checkpoint, we don't let read-only
backends in until the first online checkpoint. Could we treat a shutdown
checkpoint as a snapshot with no transactions running? Or do prepared
transactions screw that up?

--  Heikki Linnakangas EnterpriseDB   http://www.enterprisedb.com


Re: Hot Standby 0.2.1

From
Heikki Linnakangas
Date:
The logic in the lock manager to track the number of held
AccessExclusiveLocks (with ProcArrayIncrementNumHeldLocks and
ProcArrayDecrementNumHeldLocks) seems to be broken. I added an Assertion
into ProcArrayDecrementNumHeldLocks:

--- a/src/backend/storage/ipc/procarray.c
+++ b/src/backend/storage/ipc/procarray.c
@@ -1401,6 +1401,7 @@ ProcArrayIncrementNumHeldLocks(PGPROC *proc)voidProcArrayDecrementNumHeldLocks(PGPROC *proc){
+   Assert(proc->numHeldLocks > 0);   proc->numHeldLocks--;}

This tripped the assertion:

postgres=# CREATE TABLE foo (id int4 primary key);
NOTICE:  CREATE TABLE / PRIMARY KEY will create implicit index
"foo_pkey" for table "foo"
server closed the connection unexpectedlyThis probably means the server terminated abnormallybefore or while processing
therequest.
 

Making matters worse, the primary server refuses to startup up after
that, tripping the assertion again in crash recovery:

$ bin/postmaster -D data
LOG:  database system was interrupted while in recovery at 2009-09-23
11:56:15 EEST
HINT:  This probably means that some data is corrupted and you will have
to use the last backup for recovery.
LOG:  database system was not properly shut down; automatic recovery in
progress
LOG:  redo starts at 0/32000070
LOG:  REDO @ 0/32000070; LSN 0/320000AC: prev 0/32000020; xid 0; len 32:
Heap2 - clean: rel 1663/11562/1249; blk 32 remxid 4352
LOG:  consistent recovery state reached
LOG:  REDO @ 0/320000AC; LSN 0/320000CC: prev 0/32000070; xid 0; len 4:
XLOG - nextOid: 24600
LOG:  REDO @ 0/320000CC; LSN 0/320000F4: prev 0/320000AC; xid 0; len 12:
Storage - file create: base/11562/16408
LOG:  REDO @ 0/320000F4; LSN 0/3200011C: prev 0/320000CC; xid 4364; len
12: Relation - exclusive relation lock: xid 4364 db 11562 rel 16408
LOG:  REDO @ 0/3200011C; LSN 0/320001D8: prev 0/320000F4; xid 4364; len
159: Heap - insert: rel 1663/11562/1259; tid 5/4
...
LOG:  REDO @ 0/32004754; LSN 0/32004878: prev 0/320046A8; xid 4364; len
264: Transaction - commit: 2009-09-23 11:55:51.888398+03; 15 inval
msgs:catcache id38 catcache id37 catcache id38 catcache id37 catcache
id38 catcache id37 catcache id7 catcache id6 catcache id26 smgr relcache
smgr relcache smgr relcache
TRAP: FailedAssertion("!(proc->numHeldLocks > 0)", File: "procarray.c",
Line: 1404)
LOG:  startup process (PID 27430) was terminated by signal 6: Aborted
LOG:  aborting startup due to startup process failure

I'm sure that's just a simple bug somewhere, but it highlights that we
need be careful to avoid putting any extra work into the normal recovery
path. Otherwise bugs in hot standby related code can cause crash
recovery to fail.

--  Heikki Linnakangas EnterpriseDB   http://www.enterprisedb.com


Re: Hot Standby 0.2.1

From
Simon Riggs
Date:
On Wed, 2009-09-23 at 11:13 +0300, Heikki Linnakangas wrote:

> Looking at the way cache invalidations are handled in two-phase
> transactions, it would be simpler if we store the shared cache
> invalidation messages in the twophase state file header, like we store
> deleted relations and subxids. This allows them to be copied to the
> COMMIT_PREPARED WAL record, so that we don't need treat twophase commits
> differently than regular ones in xact_redo_commit. As the patch stands,
> the new
> xactGetCommittedInvalidationMessages/MakeSharedInvalidMessagesArray
> mechanism is duplicated functionality with
> AtPrepare_Inval/-PersistInvalidationMessage - both materialize the
> pending shared invalidation messages so that they can be written to
> disk. I did that in my git branch.

We could, but the prepared transaction path already contains special
case code anyway, so we aren't reducing number of test cases required.
This looks like a possible area for refactoring, but I don't see the
need for pre-factoring. i.e. don't rework before commit, rework after.

> I wonder if we might have alignment issues with the
> SharedInvalidationMessages being stored in WAL records, following the
> subxids. All the data stored in that record have 4-byte alignment at the
> moment, but if SharedInvalidationMessage ever needs 8-byte alignment, we
> would have trouble. Probably not worth changing code, it's highly
> unlikely that SharedInvalidationMessage will ever need 8-byte alignment,
> but perhaps a comment somewhere would be in order.

It's a possible source of bugs, but there are no issues there AFAICS.
The technique of multiple arrays on a WAL record wasn't invented by this
patch.

> I note that we don't emit RunningXacts after a shutdown checkpoint. So
> if recovery starts at a shutdown checkpoint, we don't let read-only
> backends in until the first online checkpoint. Could we treat a shutdown
> checkpoint as a snapshot with no transactions running? Or do prepared
> transactions screw that up?

We could, but I see no requirement for starting HS from a backup taken
on a shutdown database. It's just another special case to test and since
we already have significant number of important test cases I'd say add
this later.


That seems to have reflected all of your points on this post, though
thanks for the comments. I'm keen to reduce complexity in areas that
have caused bugs, but for stuff that is working I am tempted to leave
alone on such a big patch. Anything we can do to avoid re-testing
sections of code and/or reduce the number of tests required is going to
increase stability. 

-- Simon Riggs           www.2ndQuadrant.com



Re: Hot Standby 0.2.1

From
Simon Riggs
Date:
On Wed, 2009-09-23 at 12:07 +0300, Heikki Linnakangas wrote:
> it highlights that we
> need be careful to avoid putting any extra work into the normal
> recovery
> path. Otherwise bugs in hot standby related code can cause crash
> recovery to fail.

Excellent point. I will put in additional protective code there.

-- Simon Riggs           www.2ndQuadrant.com



Re: Hot Standby 0.2.1

From
Simon Riggs
Date:
On Wed, 2009-09-23 at 12:07 +0300, Heikki Linnakangas wrote:
> seems to be broken

Agreed. 

Patch withdrawn for correction and rework. Nothing serious, but not much
point doing further testing to all current issues resolved.

Tracking of issues raised and later solved via Wiki.

-- Simon Riggs           www.2ndQuadrant.com



Re: Hot Standby 0.2.1

From
Heikki Linnakangas
Date:
Simon Riggs wrote:
> On Wed, 2009-09-23 at 11:13 +0300, Heikki Linnakangas wrote:
>> I note that we don't emit RunningXacts after a shutdown checkpoint. So
>> if recovery starts at a shutdown checkpoint, we don't let read-only
>> backends in until the first online checkpoint. Could we treat a shutdown
>> checkpoint as a snapshot with no transactions running? Or do prepared
>> transactions screw that up?
> 
> We could, but I see no requirement for starting HS from a backup taken
> on a shutdown database. It's just another special case to test and since
> we already have significant number of important test cases I'd say add
> this later.

There's also a related issue that if a backend holding
AccessExclusiveLock crashes without writing an abort WAL record, the
lock is never released in the standby. We handle the expiration of xids
at replay of running-xacts records, but AFAICS we don't do that for locks.

It shouldn't be much code to clear those states at shutdown checkpoint,
just a few lines to call the right functions methinks.

--  Heikki Linnakangas EnterpriseDB   http://www.enterprisedb.com


Re: Hot Standby 0.2.1

From
Heikki Linnakangas
Date:
Heikki Linnakangas wrote:
> Simon Riggs wrote:
>> On Wed, 2009-09-23 at 11:13 +0300, Heikki Linnakangas wrote:
>>> I note that we don't emit RunningXacts after a shutdown checkpoint. So
>>> if recovery starts at a shutdown checkpoint, we don't let read-only
>>> backends in until the first online checkpoint. Could we treat a shutdown
>>> checkpoint as a snapshot with no transactions running? Or do prepared
>>> transactions screw that up?
>> We could, but I see no requirement for starting HS from a backup taken
>> on a shutdown database. It's just another special case to test and since
>> we already have significant number of important test cases I'd say add
>> this later.
> 
> There's also a related issue that if a backend holding
> AccessExclusiveLock crashes without writing an abort WAL record, the
> lock is never released in the standby. We handle the expiration of xids
> at replay of running-xacts records, but AFAICS we don't do that for locks.

Ah, scratch that, I now see that we do call
XactClearRecoveryTransactions() when we see a shutdown checkpoint, which
clears all recovery locks. But doesn't that prematurely clear all locks
belonging to prepared transactions as well?

--  Heikki Linnakangas EnterpriseDB   http://www.enterprisedb.com


Re: Hot Standby 0.2.1

From
Jeff Janes
Date:
On Tue, Sep 22, 2009 at 10:02 AM, Alvaro Herrera
<alvherre@commandprompt.com> wrote:
> Heikki Linnakangas escribió:
>> Simon Riggs wrote:
>> > On Mon, 2009-09-21 at 19:42 -0700, Jeff Janes wrote:
>> >> jjanes=# begin;
>> >> BEGIN
>> >> jjanes=# lock table pgbench_history in access exclusive mode;
>> >> LOCK TABLE
>> >> jjanes=# select count(*) from pgbench_history;
>> >>  count
>> >> --------
>> >>  519104
>> >> (1 row)
>> >>
>> >> jjanes=# select count(*) from pgbench_history;
>> >>  count
>> >> --------
>> >>  527814
>> >> (1 row)
>> >>
>> >> Is this the expected behavior?
>> >
>> > By me, yes. WAL replay does not require a table lock to progress. Any
>> > changes are protected with block-level locks. It does acquire a table
>> > lock and cancel conflicting queries when it is about to replay something
>> > that would cause a query to explode, such as dropping a table, as
>> > explained in docs.
>>
>> That is rather surprising. You can't get that result in a normal server,
>> which I think is enough of a reason to disallow it. If you do LOCK TABLE
>> ACCESS EXCLUSIVE, you wouldn't expect the contents to change under your
>> nose.

Right.  I'd rather be denied the lock than have it granted to me but
not do the same thing it does on a primary---prevent all changes to
the locked table.

> I think the fallout from that argument is that WAL replay should hold
> table-level locks (otherwise the workaround to this problem is too
> special-casey).  I'm not sure about that.  If I really wanted to get
> consistent results, I'd use a serializable transaction.


Unfortunately, isolation level "serializable" is not truly
serializable.  Usually it is good enough, but when it isn't good
enough and you need an explicit table lock (a very rare but not
nonexistent situation), I think it should either lock the table in the
manner it would do on the primary, or throw an error.  I think that
silently changing the behavior between primary and standby is not a
good thing.

Cheers,

Jeff


Re: Hot Standby 0.2.1

From
Tom Lane
Date:
Jeff Janes <jeff.janes@gmail.com> writes:
> Unfortunately, isolation level "serializable" is not truly
> serializable.  Usually it is good enough, but when it isn't good
> enough and you need an explicit table lock (a very rare but not
> nonexistent situation), I think it should either lock the table in the
> manner it would do on the primary, or throw an error.  I think that
> silently changing the behavior between primary and standby is not a
> good thing.

+1 --- this proposal made me acutely uncomfortable, too.
        regards, tom lane


Re: Hot Standby 0.2.1

From
Heikki Linnakangas
Date:
Simon Riggs wrote:
> On Wed, 2009-09-23 at 12:07 +0300, Heikki Linnakangas wrote:
>> seems to be broken
> 
> Agreed. 

Looking at the relation lock stuff a bit more...

When someone tries to acquire an AccessExclusiveLock, but can't get it
immediately, we sleep while holding RecoveryInfoLock. That doesn't
affect normal queries, but it will in turn block any attempt to get a
running-xacts snapshot, and will thus block checkpoint from finishing.

It could take a very long time until you get an AccessExclusiveLock on a
busy table, so that seems pretty bad. I think we need to think a bit
harder about that locking.

The problem becomes a lot easier if we accept that it's OK to have a
lock included in the running-xacts snapshot and also appear in a
XLOG_RELATION_LOCK record later. The standby should handle that
gracefully already. If we just remove RecoveryInfoLock, that can happen,
but it still won't be possible for a lock to be missed out which is what
we really care about.


Rather than keep the numHeldLocks counters per-proc in proc array, I
think it would be simpler to have a single (or one per lock partition)
counter in shared memory in lock.c. It's just an optimization to make it
faster to find out that there is no loggable AccessExclusiveLocks in the
system, so it really rather belongs into the lock manager.

--  Heikki Linnakangas EnterpriseDB   http://www.enterprisedb.com


Re: Hot Standby 0.2.1

From
Josh Berkus
Date:
Simon,

> Patch withdrawn for correction and rework. Nothing serious, but not much
> point doing further testing to all current issues resolved.

:-(

Good thing we went for 4 CFs.

Is there a GIT branch of Simon's current working version up somewhere?

-- 
Josh Berkus
PostgreSQL Experts Inc.
www.pgexperts.com


Re: Hot Standby 0.2.1

From
Heikki Linnakangas
Date:
Josh Berkus wrote:
>> Patch withdrawn for correction and rework. Nothing serious, but not much
>> point doing further testing to all current issues resolved.
> 
> :-(
> 
> Good thing we went for 4 CFs.

I think we should try to hammer this in in this commitfest. None of the
issues found this far are too serious, nothing that requires major rewrites.

> Is there a GIT branch of Simon's current working version up somewhere?

I have my git repository at git.postgresql.org, branch 'hs-riggs'. I've
pushed a number of small changes there over Simon's patch.

--  Heikki Linnakangas EnterpriseDB   http://www.enterprisedb.com


Re: Hot Standby 0.2.1

From
Josh Berkus
Date:
Heikki,

> I think we should try to hammer this in in this commitfest. None of the
> issues found this far are too serious, nothing that requires major rewrites.

It would certainly be valuable to get users testing it starting with
Alpha2 instead of waiting 2 months.

-- 
Josh Berkus
PostgreSQL Experts Inc.
www.pgexperts.com


Re: Hot Standby 0.2.1

From
Heikki Linnakangas
Date:
Heikki Linnakangas wrote:
> The problem becomes a lot easier if we accept that it's OK to have a
> lock included in the running-xacts snapshot and also appear in a
> XLOG_RELATION_LOCK record later. The standby should handle that
> gracefully already. If we just remove RecoveryInfoLock, that can happen,
> but it still won't be possible for a lock to be missed out which is what
> we really care about.

I see the problem with that now. Without the lock, it's possible that
the XLOG_RELATION_LOCK WAL record is written before the
XLOG_RUNNING_XACTS record. If the lock is not included in the snapshot,
we want the lock WAL record to be after the snapshot record.

So i guess we'll need the RecoveryInfoLock. But we don't need to hold itacross the wait. I think it's enough to acquire
itjust before writing
 
the WAL record in LockAcquire. That will ensure that the WAL record
isn't written until the snapshot is completely finished.

--  Heikki Linnakangas EnterpriseDB   http://www.enterprisedb.com


Re: Hot Standby 0.2.1

From
Heikki Linnakangas
Date:
Looking at the startup sequence now.

I see that you modified ExtendSUBTRANS so that it doesn't wipe out
previously set values if it's called with out-of-order xids. I guess
that works, although I think it can leave pages unzeroed if it's called
with a large enough gap between xids, so that the previous one was on
e.g page 10 and the next one on page 12. Page 11 would not be zeroed,
AFAICS. Not sure if such big gaps in the xid sequence are possible, but
seems so if you have a very large max_connections setting and a lot of
subtransactions.

Nevertheless, I'd feel better if we kept ExtendSUBTRANS unmodified, and
instead arranged things so that ExtendSUBTRANS is always called in
xid-order. We can call it from RecordKnownAssignedTransactionIds, which
handles the out-of-order problem for other purposes already.

I think we have similar problem with clog. ExtendCLOG is currently not
called during recovery, so isn't it possible for the problem with
subtransaction commits that's described in the comments in StartupCLOG
to arise during hot standby? Again, I think we should call ExtendCLOG()
in RecordKnownAssignedTransactionIds. RecordKnownAssignedTransactionIds
is the hot standby version of GetNewTransactionId(), so to speak.

If we go with that, I think we'll need to call StartupSUBTRANS/CLOG
earlier in the startup sequence too, when we establish the startup
snapshot and let backends in.

Thoughts?

Other stuff:

- I removed the new DBConsistentUpToLSN() function and moved its
functionality into XLogNeedsFlush(). Since XLogFlush() updates
minRecoveryPoint instead of flushing WAL during recovery, it makes sense
for XLogNeedsFlush() to correspondingly check if minRecoveryPoint needs
to be updated when in recovery. That's a better definition for the call
in bufmgr.c too.

- I went ahead with the changes with RecoveryInfoLock and tracking the
number of held AccessExclusive locks in lmgr.c instead of proc array.

Can we find a better name for "loggable locks"? It means locks that need
to be WAL logged when acquired, for hot standby, and "loggable lock"
doesn't sound right for that. "Loggable" implies that it can be logged,
but it really means that it *must* be logged.

Keep an eye on my git repository for latest changes.

--  Heikki Linnakangas EnterpriseDB   http://www.enterprisedb.com


Re: Hot Standby 0.2.1

From
Simon Riggs
Date:
On Thu, 2009-09-24 at 13:33 +0300, Heikki Linnakangas wrote:
> Heikki Linnakangas wrote:
> > The problem becomes a lot easier if we accept that it's OK to have a
> > lock included in the running-xacts snapshot and also appear in a
> > XLOG_RELATION_LOCK record later. The standby should handle that
> > gracefully already. If we just remove RecoveryInfoLock, that can happen,
> > but it still won't be possible for a lock to be missed out which is what
> > we really care about.
> 
> I see the problem with that now. Without the lock, it's possible that
> the XLOG_RELATION_LOCK WAL record is written before the
> XLOG_RUNNING_XACTS record. If the lock is not included in the snapshot,
> we want the lock WAL record to be after the snapshot record.
> 
> So i guess we'll need the RecoveryInfoLock. But we don't need to hold it
>  across the wait. I think it's enough to acquire it just before writing
> the WAL record in LockAcquire. That will ensure that the WAL record
> isn't written until the snapshot is completely finished.

I will think some more on that. I remember thinking there was a reason
why that wasn't enough, possibly to do with no-wait locks which I
remember caused me to change that code.

-- Simon Riggs           www.2ndQuadrant.com



Re: Hot Standby 0.2.1

From
Simon Riggs
Date:
On Wed, 2009-09-23 at 19:07 +0300, Heikki Linnakangas wrote:

> Rather than keep the numHeldLocks counters per-proc in proc array, I
> think it would be simpler to have a single (or one per lock partition)
> counter in shared memory in lock.c. It's just an optimization to make it
> faster to find out that there is no loggable AccessExclusiveLocks in the
> system, so it really rather belongs into the lock manager.

What lock would protect that value? The whole purpose is to avoid taking
the LockMgrLocks and to give something that is accessible by the locks
already held by GetRunningTransactionData().

-- Simon Riggs           www.2ndQuadrant.com



Re: Hot Standby 0.2.1

From
Simon Riggs
Date:
On Wed, 2009-09-23 at 17:45 +0300, Heikki Linnakangas wrote:
> Heikki Linnakangas wrote:
> > Simon Riggs wrote:
> >> On Wed, 2009-09-23 at 11:13 +0300, Heikki Linnakangas wrote:
> >>> I note that we don't emit RunningXacts after a shutdown checkpoint. So
> >>> if recovery starts at a shutdown checkpoint, we don't let read-only
> >>> backends in until the first online checkpoint. Could we treat a shutdown
> >>> checkpoint as a snapshot with no transactions running? Or do prepared
> >>> transactions screw that up?
> >> We could, but I see no requirement for starting HS from a backup taken
> >> on a shutdown database. It's just another special case to test and since
> >> we already have significant number of important test cases I'd say add
> >> this later.
> > 
> > There's also a related issue that if a backend holding
> > AccessExclusiveLock crashes without writing an abort WAL record, the
> > lock is never released in the standby. We handle the expiration of xids
> > at replay of running-xacts records, but AFAICS we don't do that for locks.
> 
> Ah, scratch that, I now see that we do call
> XactClearRecoveryTransactions() when we see a shutdown checkpoint, which
> clears all recovery locks. But doesn't that prematurely clear all locks
> belonging to prepared transactions as well?

Much better to read your second post(s). :-)

Yes, you have found a(nother) issue. This was the first one that gave me
pause to think of the answer. The locks currently aren't tracked as to
whether they are 2PC or not, so we would need to store that info also so
that we can selectively release locks later.

Question: is it possible to do a fast shutdown when we have a prepared
transaction? Would it be better to take a different approach there for
prepared transactions? It seems strange to write a shutdown checkpoint
when the system isn't yet "clean".

-- Simon Riggs           www.2ndQuadrant.com



Re: Hot Standby 0.2.1

From
Heikki Linnakangas
Date:
Simon Riggs wrote:
> On Wed, 2009-09-23 at 19:07 +0300, Heikki Linnakangas wrote:
> 
>> Rather than keep the numHeldLocks counters per-proc in proc array, I
>> think it would be simpler to have a single (or one per lock partition)
>> counter in shared memory in lock.c. It's just an optimization to make it
>> faster to find out that there is no loggable AccessExclusiveLocks in the
>> system, so it really rather belongs into the lock manager.
> 
> What lock would protect that value? The whole purpose is to avoid taking
> the LockMgrLocks and to give something that is accessible by the locks
> already held by GetRunningTransactionData().

The lock partition lock (so we really need one counter per partition, a
single counter would need additional locking). We're already holding
that in LockAcquire/LockRelease when we need to increment/decrement the
counter.

--  Heikki Linnakangas EnterpriseDB   http://www.enterprisedb.com


Re: Hot Standby 0.2.1

From
Simon Riggs
Date:
On Fri, 2009-09-25 at 11:49 +0300, Heikki Linnakangas wrote:
> Looking at the startup sequence now.
> 
> I see that you modified ExtendSUBTRANS so that it doesn't wipe out
> previously set values if it's called with out-of-order xids. I guess
> that works, although I think it can leave pages unzeroed if it's called
> with a large enough gap between xids, so that the previous one was on
> e.g page 10 and the next one on page 12. Page 11 would not be zeroed,
> AFAICS. Not sure if such big gaps in the xid sequence are possible, but
> seems so if you have a very large max_connections setting and a lot of
> subtransactions.

Yeh, it happens and I've seen it - which is why the code is there.

> Nevertheless, I'd feel better if we kept ExtendSUBTRANS unmodified, and
> instead arranged things so that ExtendSUBTRANS is always called in
> xid-order. We can call it from RecordKnownAssignedTransactionIds, which
> handles the out-of-order problem for other purposes already.
> 
> I think we have similar problem with clog. ExtendCLOG is currently not
> called during recovery, so isn't it possible for the problem with
> subtransaction commits that's described in the comments in StartupCLOG
> to arise during hot standby? Again, I think we should call ExtendCLOG()
> in RecordKnownAssignedTransactionIds. RecordKnownAssignedTransactionIds
> is the hot standby version of GetNewTransactionId(), so to speak.

OK. We record xids in sequence, so it is now a much more natural place
to do this. Zeroing them makes them dirty, unfortunately, but that is
another issue.

RecordKnownAssignedTransactionIds() only works once the snapshot has
been initialised. That could leave a gap, so we will need to run it
continuously if InHotStandby. Which may have knock-on changes also.

> If we go with that, I think we'll need to call StartupSUBTRANS/CLOG
> earlier in the startup sequence too, when we establish the startup
> snapshot and let backends in.

Yes, I think an earlier patch had that, but it turns out that there
really isn't anything for those functions to do. Really we could rename
those functions EndOfRecoverySUBTRANS and EndOfRecoveryCLOG to
illustrate their role better.

> - I removed the new DBConsistentUpToLSN() function and moved its
> functionality into XLogNeedsFlush(). Since XLogFlush() updates
> minRecoveryPoint instead of flushing WAL during recovery, it makes sense
> for XLogNeedsFlush() to correspondingly check if minRecoveryPoint needs
> to be updated when in recovery. That's a better definition for the call
> in bufmgr.c too.
> 
> - I went ahead with the changes with RecoveryInfoLock and tracking the
> number of held AccessExclusive locks in lmgr.c instead of proc array.
> 
> Can we find a better name for "loggable locks"? It means locks that need
> to be WAL logged when acquired, for hot standby, and "loggable lock"
> doesn't sound right for that. "Loggable" implies that it can be logged,
> but it really means that it *must* be logged.

The distinction of "loggable" is somewhat false since we rely on the
fact that only one person is holding it. So we may as well just call em
what they are: AccessExclusiveLocks.

> Keep an eye on my git repository for latest changes.

No, I'm not doing that. If you want to submit changes, please do so to
the list or just mention what needs work and I will do it. Having
multiple versions of a patch isn't helpful, as I have already said. I
have already been burned multiple times by accepting trial code and you
yourself have re-written particular parts multiple times. I am very,
very grateful for your reviews and detailed suggestions, so this comment
is only about coherency and not tripping each other up. (If you want to
"editorialize", it needs to be just prior to commit, but making changes
to a patch just prior to commit has historically shown to introduce bugs
where there weren't any before).

There's enough changes already to demand a full re-test, so everything
discovered so far plus testing is about 2 weeks work. I will commit
things onto git as agreed as I complete coding but that won't imply full
testing.

-- Simon Riggs           www.2ndQuadrant.com



Re: Hot Standby 0.2.1

From
Heikki Linnakangas
Date:
Simon Riggs wrote:
> On Wed, 2009-09-23 at 17:45 +0300, Heikki Linnakangas wrote:
>> XactClearRecoveryTransactions() when we see a shutdown checkpoint, which
>> clears all recovery locks. But doesn't that prematurely clear all locks
>> belonging to prepared transactions as well?
> 
> Much better to read your second post(s). :-)
> 
> Yes, you have found a(nother) issue. This was the first one that gave me
> pause to think of the answer. The locks currently aren't tracked as to
> whether they are 2PC or not, so we would need to store that info also so
> that we can selectively release locks later.
> 
> Question: is it possible to do a fast shutdown when we have a prepared
> transaction?

Yes.

> Would it be better to take a different approach there for
> prepared transactions? It seems strange to write a shutdown checkpoint
> when the system isn't yet "clean".

Hmm, I guess you could define prepared transactions as active backends
from the shutdown point of view, and wait for them to finish. I can see
one problem, though: Once you issue shutdown, fast or smart, we no
longer accept new connections. So you can't connect to issue the
ROLLBACK/COMMIT PREPARED anymore. Anyway, it would be a change from the
current behavior, so it would be better to cope with prepared
transactions in the standby.

--  Heikki Linnakangas EnterpriseDB   http://www.enterprisedb.com


Re: Hot Standby 0.2.1

From
Simon Riggs
Date:
On Fri, 2009-09-25 at 13:14 +0300, Heikki Linnakangas wrote:
> Simon Riggs wrote:
> > On Wed, 2009-09-23 at 19:07 +0300, Heikki Linnakangas wrote:
> > 
> >> Rather than keep the numHeldLocks counters per-proc in proc array, I
> >> think it would be simpler to have a single (or one per lock partition)
> >> counter in shared memory in lock.c. It's just an optimization to make it
> >> faster to find out that there is no loggable AccessExclusiveLocks in the
> >> system, so it really rather belongs into the lock manager.
> > 
> > What lock would protect that value? The whole purpose is to avoid taking
> > the LockMgrLocks and to give something that is accessible by the locks
> > already held by GetRunningTransactionData().
> 
> The lock partition lock (so we really need one counter per partition, a
> single counter would need additional locking). We're already holding
> that in LockAcquire/LockRelease when we need to increment/decrement the
> counter.

Again: The whole purpose is to avoid taking those locks. Why would we
put something behind a lock we are trying to avoid taking?

-- Simon Riggs           www.2ndQuadrant.com



Re: Hot Standby 0.2.1

From
Simon Riggs
Date:
On Fri, 2009-09-25 at 13:23 +0300, Heikki Linnakangas wrote:
> Simon Riggs wrote:
> > On Wed, 2009-09-23 at 17:45 +0300, Heikki Linnakangas wrote:
> >> XactClearRecoveryTransactions() when we see a shutdown checkpoint, which
> >> clears all recovery locks. But doesn't that prematurely clear all locks
> >> belonging to prepared transactions as well?
> > 
> > Much better to read your second post(s). :-)
> > 
> > Yes, you have found a(nother) issue. This was the first one that gave me
> > pause to think of the answer. The locks currently aren't tracked as to
> > whether they are 2PC or not, so we would need to store that info also so
> > that we can selectively release locks later.
> > 
> > Question: is it possible to do a fast shutdown when we have a prepared
> > transaction?
> 
> Yes.
> 
> > Would it be better to take a different approach there for
> > prepared transactions? It seems strange to write a shutdown checkpoint
> > when the system isn't yet "clean".
> 
> Hmm, I guess you could define prepared transactions as active backends
> from the shutdown point of view, and wait for them to finish. I can see
> one problem, though: Once you issue shutdown, fast or smart, we no
> longer accept new connections. So you can't connect to issue the
> ROLLBACK/COMMIT PREPARED anymore. Anyway, it would be a change from the
> current behavior, so it would be better to cope with prepared
> transactions in the standby.

Definitely need to cope with them for Hot Standby. My point was general
one to say that behaviour is very non-useful for users with prepared
transactions. It just causes manual effort by a DBA each time the system
is shutdown.

-- Simon Riggs           www.2ndQuadrant.com



Re: Hot Standby 0.2.1

From
Heikki Linnakangas
Date:
Simon Riggs wrote:
> Definitely need to cope with them for Hot Standby. My point was general
> one to say that behaviour is very non-useful for users with prepared
> transactions. It just causes manual effort by a DBA each time the system
> is shutdown.

The transaction manager is supposed to reconnect and finish any in-doubt
transactions, eventually.

--  Heikki Linnakangas EnterpriseDB   http://www.enterprisedb.com


Re: Hot Standby 0.2.1

From
Heikki Linnakangas
Date:
In XidInMVCCSnapshot:
>     if (snapshot->takenDuringRecovery)
>     {
>         /*
>          * If the snapshot contains full subxact data, the fastest way to check
>          * things is just to compare the given XID against both subxact XIDs and
>          * top-level XIDs.    If the snapshot overflowed, we have to use pg_subtrans
>          * to convert a subxact XID to its parent XID, but then we need only look
>          * at top-level XIDs not subxacts.
>          */
...
>     }
>     else
>     {
>         int32        j;
> 
>         /*
>          * 
>          * In recovery we store all xids in the subxact array because this
>          * is by far the bigger array and we mostly don't know which xids
>          * are top-level and which are subxacts. The xip array is empty.
>          *
>          * We start by searching subtrans, if we overflowed.
>          */
...
>     }

Hang on, isn't this 180 degrees backwards?

--  Heikki Linnakangas EnterpriseDB   http://www.enterprisedb.com


Re: Hot Standby 0.2.1

From
Simon Riggs
Date:
On Fri, 2009-09-25 at 15:50 +0300, Heikki Linnakangas wrote:

> Hang on, isn't this 180 degrees backwards?

A witty riposte escapes me; yes, the if test is !correct.

What I find amazing is that it passed the test where I put it doing make
installcheck in an infinite loop for a long time. I guess that means the
regression tests hardly touch the concurrency code at all, which now I
think about it makes sense but I still find that very worrying. :-(

-- Simon Riggs           www.2ndQuadrant.com



Re: Hot Standby 0.2.1

From
Alvaro Herrera
Date:
Simon Riggs wrote:

> What I find amazing is that it passed the test where I put it doing make
> installcheck in an infinite loop for a long time. I guess that means the
> regression tests hardly touch the concurrency code at all, which now I
> think about it makes sense but I still find that very worrying. :-(

Try installcheck-parallel, which should be a bit better.  It's probably
not yet good enough though because it always runs the same tests
concurrently.

-- 
Alvaro Herrera                                http://www.CommandPrompt.com/
The PostgreSQL Company - Command Prompt, Inc.


Re: Hot Standby 0.2.1

From
Simon Riggs
Date:
On Fri, 2009-09-25 at 16:30 -0400, Alvaro Herrera wrote:
> Simon Riggs wrote:
> 
> > What I find amazing is that it passed the test where I put it doing make
> > installcheck in an infinite loop for a long time. I guess that means the
> > regression tests hardly touch the concurrency code at all, which now I
> > think about it makes sense but I still find that very worrying. :-(
> 
> Try installcheck-parallel, which should be a bit better.  It's probably
> not yet good enough though because it always runs the same tests
> concurrently.

Good tip, thanks.

-- Simon Riggs           www.2ndQuadrant.com



Re: Hot Standby 0.2.1

From
Andrew Dunstan
Date:

Alvaro Herrera wrote:
> Simon Riggs wrote:
>
>   
>> What I find amazing is that it passed the test where I put it doing make
>> installcheck in an infinite loop for a long time. I guess that means the
>> regression tests hardly touch the concurrency code at all, which now I
>> think about it makes sense but I still find that very worrying. :-(
>>     
>
> Try installcheck-parallel, which should be a bit better.  It's probably
> not yet good enough though because it always runs the same tests
> concurrently.
>
>   

It is also quite easy to set up your own schedule.

cheers

andrew


Re: Hot Standby 0.2.1

From
Alvaro Herrera
Date:
Andrew Dunstan wrote:
> 
> Alvaro Herrera wrote:
>
> >Try installcheck-parallel, which should be a bit better.  It's probably
> >not yet good enough though because it always runs the same tests
> >concurrently.
> 
> It is also quite easy to set up your own schedule.

... except you have to be careful with hidden test dependencies :-(

-- 
Alvaro Herrera                                http://www.CommandPrompt.com/
PostgreSQL Replication, Consulting, Custom Development, 24x7 support


Re: Hot Standby 0.2.1

From
Heikki Linnakangas
Date:
The locking in smgr_redo_commit and smgr_redo_abort doesn't look right.
First of all, smgr_redo_abort is not holding XidGenLock and
ProcArrayLock while modifying ShmemVariableCache->nextXid and
ShmemVariableCache->latestCompletedXid, respectively, like
smgr_redo_commit is. Attached patch fixes that.

But I think there's a little race condition in there anyway, as they
remove the finished xids from known-assigned-xids list by calling
ExpireTreeKnownAssignedTransactionIds, and
ShmemVariableCache->latestCompletedXid is updated only after that. We're
not holding ProcArrayLock between those two steps, like we do during
normal transaction commit. I'm not sure what kind of issues that can
lead to, but it seems like it can lead to broken snapshots if someone
calls GetSnapshotData() between those steps.

--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com
>From 9d6ef0a3eb901b1d2182b3f2efd5c75d52e88cba Mon Sep 17 00:00:00 2001
From: Heikki Linnakangas <heikki@enterprisedb.com>
Date: Sun, 27 Sep 2009 14:51:43 +0300
Subject: [PATCH] Make locking in smgr_redo_abort reflect that in smgr_redo_commit.

---
 src/backend/access/transam/xact.c |   22 ++++++++++++++++------
 1 files changed, 16 insertions(+), 6 deletions(-)

diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index a696857..92a7c43 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -4502,10 +4502,7 @@ xact_redo_commit(xl_xact_commit *xlrec, TransactionId xid,
         /*
          * Release locks, if any. We do this for both two phase and normal
          * one phase transactions. In effect we are ignoring the prepare
-         * phase and just going straight to lock release. This explains
-         * why the twophase_postcommit_standby_callbacks[] do not invoke
-         * a special routine to handle locks - that is performed here
-         * instead.
+         * phase and just going straight to lock release.
          */
         RelationReleaseRecoveryLockTree(xid, xlrec->nsubxacts, sub_xids);
     }
@@ -4593,12 +4590,25 @@ xact_redo_abort(xl_xact_abort *xlrec, TransactionId xid)
     }

     /* Make sure nextXid is beyond any XID mentioned in the record */
+    /* We don't expect anyone else to modify nextXid, hence we
+     * don't need to hold a lock while checking this. We still acquire
+     * the lock to modify it, though.
+     */
     if (TransactionIdFollowsOrEquals(max_xid,
                                      ShmemVariableCache->nextXid))
     {
+        LWLockAcquire(XidGenLock, LW_EXCLUSIVE);
         ShmemVariableCache->nextXid = max_xid;
-        ShmemVariableCache->latestCompletedXid = ShmemVariableCache->nextXid;
-        TransactionIdAdvance(ShmemVariableCache->nextXid);
+        LWLockRelease(XidGenLock);
+    }
+
+    /* Same here, don't use lock to test, but need one to modify */
+    if (TransactionIdFollowsOrEquals(max_xid,
+                                     ShmemVariableCache->latestCompletedXid))
+    {
+        LWLockAcquire(ProcArrayLock, LW_EXCLUSIVE);
+        ShmemVariableCache->latestCompletedXid = max_xid;
+        LWLockRelease(ProcArrayLock);
     }

     /* Make sure files supposed to be dropped are dropped */
--
1.6.3.3


Re: Hot Standby 0.2.1

From
Heikki Linnakangas
Date:
TransactionIdIsInProgress() doesn't consult the known-assigned-xids
structure. That's a problem: in the standby, TransactionIdIsInProgress()
can return false for a transaction that is still running in the master.
HeapTupleSatisfies* functions can incorrectly set HEAP_XMIN/XMAX_INVALID
hint bits for xids that commit later on.

--  Heikki Linnakangas EnterpriseDB   http://www.enterprisedb.com


Re: Hot Standby 0.2.1

From
Simon Riggs
Date:
On Sun, 2009-09-27 at 15:35 +0300, Heikki Linnakangas wrote:
> TransactionIdIsInProgress() doesn't consult the known-assigned-xids
> structure. That's a problem: in the standby, TransactionIdIsInProgress()
> can return false for a transaction that is still running in the master.
> HeapTupleSatisfies* functions can incorrectly set HEAP_XMIN/XMAX_INVALID
> hint bits for xids that commit later on.

Agreed. Will fix.

-- Simon Riggs           www.2ndQuadrant.com



Re: Hot Standby 0.2.1

From
Simon Riggs
Date:
On Sun, 2009-09-27 at 14:59 +0300, Heikki Linnakangas wrote:
> The locking in smgr_redo_commit and smgr_redo_abort doesn't look right.
> First of all, smgr_redo_abort is not holding XidGenLock and
> ProcArrayLock while modifying ShmemVariableCache->nextXid and
> ShmemVariableCache->latestCompletedXid, respectively, like
> smgr_redo_commit is. Attached patch fixes that.
> 
> But I think there's a little race condition in there anyway, as they
> remove the finished xids from known-assigned-xids list by calling
> ExpireTreeKnownAssignedTransactionIds, and
> ShmemVariableCache->latestCompletedXid is updated only after that. We're
> not holding ProcArrayLock between those two steps, like we do during
> normal transaction commit. I'm not sure what kind of issues that can
> lead to, but it seems like it can lead to broken snapshots if someone
> calls GetSnapshotData() between those steps.

I confess I didn't know what you were talking about when you wrote this
and was expecting you to have a coffee and retract. I realise now you
meant xact_redo_commit() rather than smgr_ and it makes sense at last.

I've just committed about half of your patch exactly, but not all.

I've avoided making the changes to
ShmemVariableCache->latestCompletedXid directly and moved the update to
ExpireTreeKnownAssignedTransactionIds() which already had access to
max_xid while holding ProcArrayLock. Hopefully that resolves your
comment about race condition.

Also, I noticed that you removed the line TransactionIdAdvance(ShmemVariableCache->nextXid);
in xact_redo_abort(). That looks like an error to me, since this follows
neither the comment nor how it is coded in xact_redo_commit(). Let me
know if there was some other reason for that line removal and I'll make
the change again.

-- Simon Riggs           www.2ndQuadrant.com



Re: Hot Standby 0.2.1

From
Simon Riggs
Date:
On Tue, 2009-09-22 at 12:53 +0300, Heikki Linnakangas wrote:
> It looks like the standby tries to remove XID 4323 from the
> known-assigned hash table, but it's not there because it was removed
> and set in pg_subtrans by an XLOG_XACT_ASSIGNMENT record earlier. I
> guess we should just not throw an error in that case, but is there a
> way we could catch that narrow case and still keep the check in
> KnownAssignedXidsRemove()? It seems like the check might help catch
> other bugs, so I'd rather keep it if possible.

Fix attached.

--
 Simon Riggs           www.2ndQuadrant.com

Attachment

Re: Hot Standby 0.2.1

From
Simon Riggs
Date:
On Wed, 2009-09-23 at 12:07 +0300, Heikki Linnakangas wrote:
> we need be careful to avoid putting any extra work into the normal
> recovery path. Otherwise bugs in hot standby related code can cause
> crash recovery to fail.

Re-checked code and found a couple of additional places that needed
tests
    if (InHotStandby)

--
 Simon Riggs           www.2ndQuadrant.com

Attachment

Re: Hot Standby 0.2.1

From
Heikki Linnakangas
Date:
Simon Riggs wrote:
> On Tue, 2009-09-22 at 12:53 +0300, Heikki Linnakangas wrote:
>> It looks like the standby tries to remove XID 4323 from the
>> known-assigned hash table, but it's not there because it was removed
>> and set in pg_subtrans by an XLOG_XACT_ASSIGNMENT record earlier. I
>> guess we should just not throw an error in that case, but is there a
>> way we could catch that narrow case and still keep the check in
>> KnownAssignedXidsRemove()? It seems like the check might help catch
>> other bugs, so I'd rather keep it if possible.
> 
> Fix attached.

I fixed that on Friday already, in a slightly different manner. Do you
see a problem with that approach?

I've made public the version I'm working on. That's the version I'm
ultimately going to commit. It would be a lot more helpful if you
provided these patches over that version. Otherwise I have to refactor
them over that codebase, possibly introducing new bugs.

--  Heikki Linnakangas EnterpriseDB   http://www.enterprisedb.com


Re: Hot Standby 0.2.1

From
Heikki Linnakangas
Date:
Heikki Linnakangas wrote:
> I've made public the version I'm working on. That's the version I'm
> ultimately going to commit. It would be a lot more helpful if you
> provided these patches over that version. Otherwise I have to refactor
> them over that codebase, possibly introducing new bugs.

Actually, it makes most sense if you push your changes directly to my
git repository. I just granted you write access to it. Please do these
changes against the version currently there, and push any fixes there
directly.

--  Heikki Linnakangas EnterpriseDB   http://www.enterprisedb.com


Re: Hot Standby 0.2.1

From
Simon Riggs
Date:
On Wed, 2009-10-07 at 23:19 -0400, Heikki Linnakangas wrote:
> Heikki Linnakangas wrote:
> > I've made public the version I'm working on. That's the version I'm
> > ultimately going to commit. It would be a lot more helpful if you
> > provided these patches over that version. Otherwise I have to refactor
> > them over that codebase, possibly introducing new bugs.
> 
> Actually, it makes most sense if you push your changes directly to my
> git repository. I just granted you write access to it. Please do these
> changes against the version currently there, and push any fixes there
> directly.

OK, that makes sense. Thank you.

I would still like you to make a clear statement that the contents of
that repository are BSD licenced open source contributions. I have a
contractual responsibility to ensure the code I write is licenced in
that way. I have already mentioned I'm not looking at it yet for that
reason, so I am unaware of any changes not posted to the list.

You have posted patches that I have said I don't agree with. My name is
going to be on this when it goes in, so I don't think it makes any sense
to force that commit to include changes I don't agree with. I cannot
prevent you making changes afterwards, nor would I wish to. I'd like you
to respond sensibly to comments on those. We should work together on a
consensus basis, especially since I know you have not fully tested your
changes (either). Your error rate might be lower than mine, but it is
non-zero.

-- Simon Riggs           www.2ndQuadrant.com



Re: Hot Standby 0.2.1

From
Heikki Linnakangas
Date:
Simon Riggs wrote:
> I would still like you to make a clear statement that the contents of
> that repository are BSD licenced open source contributions.

Ok. All the content in the repository at
git://git.postgresql.org/git/users/heikki/postgres.git is released under
the same BSD license as PostgreSQL.

--  Heikki Linnakangas EnterpriseDB   http://www.enterprisedb.com


Re: Hot Standby 0.2.1

From
Simon Riggs
Date:
On Thu, 2009-10-08 at 07:33 -0400, Heikki Linnakangas wrote:
> Simon Riggs wrote:
> > I would still like you to make a clear statement that the contents of
> > that repository are BSD licenced open source contributions.
> 
> Ok. All the content in the repository at
> git://git.postgresql.org/git/users/heikki/postgres.git is released under
> the same BSD license as PostgreSQL.

Thank you. I presume that the copyright is novated also.

I'll get cracking on some changes.

-- Simon Riggs           www.2ndQuadrant.com



Re: Hot Standby 0.2.1

From
Simon Riggs
Date:
On Thu, 2009-10-08 at 12:49 +0100, Simon Riggs wrote:

> I'll get cracking on some changes.

This will probably be next week now, just in case you're wondering when
I'll start adding patches.

-- Simon Riggs           www.2ndQuadrant.com



Re: Hot Standby 0.2.1

From
Bruce Momjian
Date:
Simon Riggs wrote:
> You have posted patches that I have said I don't agree with. My name is
> going to be on this when it goes in, so I don't think it makes any sense
> to force that commit to include changes I don't agree with. I cannot
> prevent you making changes afterwards, nor would I wish to. I'd like you
> to respond sensibly to comments on those. We should work together on a
> consensus basis, especially since I know you have not fully tested your
> changes (either). Your error rate might be lower than mine, but it is
> non-zero.

The commit message and release notes mention might have just Simon's
name, or multiple people.

The hot patch commit is going to have multiple people involved before it
is committed, so if Simon is worried that the patch will have ideas in
it he does not agree with, perhaps we can make sure the commit and
release note items include Heikki's name as well.  Normally if a
committer makes signficant changes to a patch, the committer's name is
also added to the commmit message, and I suggest we do the same thing
here with hot standby.

--  Bruce Momjian  <bruce@momjian.us>        http://momjian.us EnterpriseDB
http://enterprisedb.com
 + If your life is a hard drive, Christ can be your backup. +


Re: Hot Standby 0.2.1

From
Robert Haas
Date:
On Oct 9, 2009, at 1:21 PM, Bruce Momjian <bruce@momjian.us> wrote:

> Simon Riggs wrote:
>> You have posted patches that I have said I don't agree with. My  
>> name is
>> going to be on this when it goes in, so I don't think it makes any  
>> sense
>> to force that commit to include changes I don't agree with. I cannot
>> prevent you making changes afterwards, nor would I wish to. I'd  
>> like you
>> to respond sensibly to comments on those. We should work together  
>> on a
>> consensus basis, especially since I know you have not fully tested  
>> your
>> changes (either). Your error rate might be lower than mine, but it is
>> non-zero.
>
> The commit message and release notes mention might have just Simon's
> name, or multiple people.
>
> The hot patch commit is going to have multiple people involved  
> before it
> is committed, so if Simon is worried that the patch will have ideas in
> it he does not agree with, perhaps we can make sure the commit and
> release note items include Heikki's name as well.  Normally if a
> committer makes signficant changes to a patch, the committer's name is
> also added to the commmit message, and I suggest we do the same thing
> here with hot standby.

I think this is a weakness of our current style of heavy-weight  
commits.  I don't have a great suggestion for fixing it, though.  Even  
if we move to git, a major feature like this has such a complex  
development history that I'm queasy about slurping it in unsquashed.  
But at least for simple features I think that there would be a value  
in separating the patch author's work from the committer's adjustments.

I realize (now) that this would complicate the release note generation  
process somewhat, based on our current process, and there might be  
other downsides as well. All the same, I think it has enough value to  
make it worth thinking about whether there's some way to make it work.

...Robert


Re: Hot Standby 0.2.1

From
Andrew Dunstan
Date:

Robert Haas wrote:
> But at least for simple features I think that there would be a value 
> in separating the patch author's work from the committer's adjustments.
>
>

That is just going to make life harder for committers.

There are plenty of things with my name on them that are not exactly 
what I submitted. I think that's true of just about everybody. Mostly 
things changed hae improved, but not always. I don't think we should be 
too proprietary about patches. As far as I'm concerned, credit goes to 
the submitter and blame if any to the committer.

cheers

andrew


Re: Hot Standby 0.2.1

From
"Joshua D. Drake"
Date:
On Fri, 2009-10-09 at 14:05 -0400, Andrew Dunstan wrote:
>
> Robert Haas wrote:
> > But at least for simple features I think that there would be a value
> > in separating the patch author's work from the committer's adjustments.
> >
> >
>
> That is just going to make life harder for committers.
>
> There are plenty of things with my name on them that are not exactly
> what I submitted. I think that's true of just about everybody. Mostly
> things changed hae improved, but not always. I don't think we should be
> too proprietary about patches. As far as I'm concerned, credit goes to
> the submitter and blame if any to the committer.

+1

>
> cheers
>
> andrew
>
--
PostgreSQL.org Major Contributor
Command Prompt, Inc: http://www.commandprompt.com/ - 503.667.4564
Consulting, Training, Support, Custom Development, Engineering
If the world pushes look it in the eye and GRR. Then push back harder. - Salamander

Re: Hot Standby 0.2.1

From
Bruce Momjian
Date:
Andrew Dunstan wrote:
> 
> 
> Robert Haas wrote:
> > But at least for simple features I think that there would be a value 
> > in separating the patch author's work from the committer's adjustments.
> >
> >
> 
> That is just going to make life harder for committers.
> 
> There are plenty of things with my name on them that are not exactly 
> what I submitted. I think that's true of just about everybody. Mostly 
> things changed hae improved, but not always. I don't think we should be 
> too proprietary about patches. As far as I'm concerned, credit goes to 
> the submitter and blame if any to the committer.

Agreed.  

Simon is right that if only his name is on the commit, there is an
assumption that the committer made no changes, or only cosmetic ones.
For hot standby, I think the committer is making significant changes
(that could lead to bugs) and hence the committer's name should be in
the commit.  Sometimes we say "adjusted by" committer, but in this case
I think Heikki is doing more than adjustmants --- nothing wrong with
that --- it should just be documented in the commit message.

--  Bruce Momjian  <bruce@momjian.us>        http://momjian.us EnterpriseDB
http://enterprisedb.com
 + If your life is a hard drive, Christ can be your backup. +


Re: Hot Standby 0.2.1

From
"Joshua D. Drake"
Date:
On Fri, 2009-10-09 at 14:05 -0400, Andrew Dunstan wrote:
> 
> Robert Haas wrote:
> > But at least for simple features I think that there would be a value 
> > in separating the patch author's work from the committer's adjustments.
> >
> >
> 
> That is just going to make life harder for committers.
> 
> There are plenty of things with my name on them that are not exactly 
> what I submitted. I think that's true of just about everybody. Mostly 
> things changed hae improved, but not always. I don't think we should be 
> too proprietary about patches. As far as I'm concerned, credit goes to 
> the submitter and blame if any to the committer.

+1

> 
> cheers
> 
> andrew
> 
-- 
PostgreSQL.org Major Contributor
Command Prompt, Inc: http://www.commandprompt.com/ - 503.667.4564
Consulting, Training, Support, Custom Development, Engineering
If the world pushes look it in the eye and GRR. Then push back harder. - Salamander