Thread: Hot Standby (v9d)

Hot Standby (v9d)

From
Simon Riggs
Date:
Latest version of Hot Standby includes new deferred cancellation for
conflict resolution and further refactoring.

http://wiki.postgresql.org/wiki/Hot_Standby#Usage

[Please note that the max_standby_delay parameter has been set to default to
zero (0), to allow investigation of usability; i.e. queries will be
cancelled fairly quickly if they conflict. Please set this in
recovery.conf to any value you consider reasonable and report findings.]
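For example, a minimal recovery.conf line for testing (purely illustrative; the
discussion later in this thread treats the value as a number of seconds):

  max_standby_delay = '60'	# illustrative: wait up to ~60s before cancelling conflicting queries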

GENERAL PLANS

* Bug fix v9 over next few days
* Put corrected version into GIT
* Produce outstanding items as patch-on-patch via GIT
* Work with reviewers to nibble away until fully committed

OUTSTANDING ITEMS (as of Version 9d)

Reported problems (Identified by) (Target release)

* None (as yet!) - please post all bugs to list
All v9 bugs have been fixed. This version has passed initial bash and
function tests and we will continue testing this version tomorrow.

Main work/required additional items (Proposed by) (Target release)

      * Complete Prepared Transaction support (Simon)

Performance tuning opportunities (Proposed by) (Target release)

      * tuning of RecordKnownAssignedTransactionIds() using hash table
        (Heikki)

Usability enhancements

* don't cancel all currently connected users every time we drop a
tablespace that has temp files being used within that tablespace -
should be able to parse the filenames to get pids to cancel

* don't kill all currently connected users every time we drop a user -
should be able to match list of current users against connected roles

* more work possible on btree deletes to reduce impact - should be able
to work out a reasonable latestRemovedXid


Testing of any kind welcome, the heavier the better. Save the database
and logs if it crashes, please. Performance data especially valuable.

--
 Simon Riggs           www.2ndQuadrant.com
 PostgreSQL Training, Services and Support

Attachment

Re: Hot Standby (v9d)

From
Heikki Linnakangas
Date:
Simon Riggs wrote:
> * Put corrected version into GIT
> * Produce outstanding items as patch-on-patch via GIT

I've applied the hot standby patch and recovery infra v9 patch to
branches in my git repository, and pushed those to:

git://git.postgresql.org/git/~hlinnaka/pgsql.git

You can clone that to get started.

I've set those branches up so that the hot standby branch is branched
off from the recovery infra branch. I'd like to keep the two separate,
as the recovery infra patch is useful on its own, and can be reviewed
separately.

As a teaser, I made a couple of minor changes after importing your
patches. For the sake of the archives, I've also included those changes
as patches here.

--
   Heikki Linnakangas
   EnterpriseDB   http://www.enterprisedb.com
From 6d583356063c9d3c8d0e69233a40065bc5d7bde1 Mon Sep 17 00:00:00 2001
From: Heikki Linnakangas <heikki@enterprisedb.com>
Date: Fri, 23 Jan 2009 14:37:05 +0200
Subject: [PATCH] Remove padding in XLogCtl; might be a good idea, but let's keep it simple
 for now.

Also remove the XLogCtl->mode_lck spinlock, which is pretty pointless for
a single boolean that's only written by one process.
---
 src/backend/access/transam/xlog.c |   24 +-----------------------
 1 files changed, 1 insertions(+), 23 deletions(-)

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index fcf5657..42c4552 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -340,18 +340,11 @@ typedef struct XLogCtlWrite

 /*
  * Total shared-memory state for XLOG.
- *
- * This small structure is accessed by many backends, so we take care to
- * pad out the parts of the structure so they can be accessed by separate
- * CPUs without causing false sharing cache flushes. Padding is generous
- * to allow for a wide variety of CPU architectures.
  */
-#define    XLOGCTL_BUFFER_SPACING    128
 typedef struct XLogCtlData
 {
     /* Protected by WALInsertLock: */
     XLogCtlInsert Insert;
-    char    InsertPadding[XLOGCTL_BUFFER_SPACING - sizeof(XLogCtlInsert)];

     /* Protected by info_lck: */
     XLogwrtRqst LogwrtRqst;
@@ -359,16 +352,9 @@ typedef struct XLogCtlData
     uint32        ckptXidEpoch;    /* nextXID & epoch of latest checkpoint */
     TransactionId ckptXid;
     XLogRecPtr    asyncCommitLSN; /* LSN of newest async commit */
-    /* add data structure padding for above info_lck declarations */
-    char    InfoPadding[XLOGCTL_BUFFER_SPACING - sizeof(XLogwrtRqst)
-                                            - sizeof(XLogwrtResult)
-                                            - sizeof(uint32)
-                                            - sizeof(TransactionId)
-                                            - sizeof(XLogRecPtr)];

     /* Protected by WALWriteLock: */
     XLogCtlWrite Write;
-    char    WritePadding[XLOGCTL_BUFFER_SPACING - sizeof(XLogCtlWrite)];

     /*
      * These values do not change after startup, although the pointed-to pages
@@ -390,11 +376,8 @@ typedef struct XLogCtlData
      * always during Recovery Processing Mode. This allows us to identify
      * code executed *during* Recovery Processing Mode but not necessarily
      * by Startup process itself.
-     *
-     * Protected by mode_lck
      */
     bool        SharedRecoveryProcessingMode;
-    slock_t        mode_lck;

     /*
      * recovery target control information
@@ -410,8 +393,6 @@ typedef struct XLogCtlData
     TransactionId     recoveryLastXid;
     XLogRecPtr        recoveryLastRecPtr;

-    char        InfoLockPadding[XLOGCTL_BUFFER_SPACING];
-
     slock_t        info_lck;        /* locks shared variables shown above */
 } XLogCtlData;

@@ -4399,7 +4380,6 @@ XLOGShmemInit(void)
     XLogCtl->XLogCacheBlck = XLOGbuffers - 1;
     XLogCtl->Insert.currpage = (XLogPageHeader) (XLogCtl->pages);
     SpinLockInit(&XLogCtl->info_lck);
-    SpinLockInit(&XLogCtl->mode_lck);

     /*
      * If we are not in bootstrap mode, pg_control should already exist. Read
@@ -6183,9 +6163,7 @@ IsRecoveryProcessingMode(void)
         if (xlogctl == NULL)
             return false;

-        SpinLockAcquire(&xlogctl->mode_lck);
-        LocalRecoveryProcessingMode = XLogCtl->SharedRecoveryProcessingMode;
-        SpinLockRelease(&xlogctl->mode_lck);
+        LocalRecoveryProcessingMode = xlogctl->SharedRecoveryProcessingMode;
     }

     knownProcessingMode = true;
--
1.5.6.5

From 4061ac8f84cc699bf0f417689f853791089ed472 Mon Sep 17 00:00:00 2001
From: Heikki Linnakangas <heikki@enterprisedb.com>
Date: Fri, 23 Jan 2009 15:55:33 +0200
Subject: [PATCH] Remove knownProcessingMode variable.

---
 src/backend/access/transam/xlog.c |   19 ++++++++-----------
 1 files changed, 8 insertions(+), 11 deletions(-)

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 7e480e2..e64fb48 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -126,9 +126,11 @@ bool        InRecovery = false;
 /* Are we recovering using offline XLOG archives? */
 static bool InArchiveRecovery = false;

-/* Local copy of shared RecoveryProcessingMode state */
+/*
+ * Local copy of shared RecoveryProcessingMode state. True actually
+ * means "not known, need to check the shared state"
+ */
 static bool LocalRecoveryProcessingMode = true;
-static bool knownProcessingMode = false;

 /* Was the last xlog file restored from archive, or local? */
 static bool restoredFromArchive = false;
@@ -5608,21 +5610,16 @@ StartupXLOG(void)
 bool
 IsRecoveryProcessingMode(void)
 {
-    if (knownProcessingMode && !LocalRecoveryProcessingMode)
+    if (!LocalRecoveryProcessingMode)
         return false;
-
+    else
     {
         /* use volatile pointer to prevent code rearrangement */
         volatile XLogCtlData *xlogctl = XLogCtl;

-        SpinLockAcquire(&xlogctl->mode_lck);
-        LocalRecoveryProcessingMode = XLogCtl->SharedRecoveryProcessingMode;
-        SpinLockRelease(&xlogctl->mode_lck);
+        LocalRecoveryProcessingMode = xlogctl->SharedRecoveryProcessingMode;
+        return LocalRecoveryProcessingMode;
     }
-
-    knownProcessingMode = true;
-
-    return LocalRecoveryProcessingMode;
 }

 /*
--
1.5.6.5


Re: Hot Standby (v9d)

From
Simon Riggs
Date:
On Fri, 2009-01-23 at 16:14 +0200, Heikki Linnakangas wrote:
> Simon Riggs wrote:
> > * Put corrected version into GIT
> > * Produce outstanding items as patch-on-patch via GIT
> 
> I've applied the hot standby patch and recovery infra v9 patch to 
> branches in my git repository, and pushed those to:
> 
> git://git.postgresql.org/git/~hlinnaka/pgsql.git
> 
> You can clone that to get started.
> 
> I've set those branches up so that the hot standby branch is branched 
> off from the recovery infra branch. I'd like to keep the two separate, 
> as the recovery infra patch is useful on its own, and can be reviewed 
> separately.
> 
> As a teaser, I made a couple of minor changes after importing your 
> patches. For the sake of the archives, I've also included those changes 
> as patches here.

OK, I'll look at those. I've fixed 6 bugs today (!), so I'd rather go
for the new version coming out in an hour. That's too many to unpick
unfortunately.

--
 Simon Riggs           www.2ndQuadrant.com
 PostgreSQL Training, Services and Support



Re: Hot Standby (v9d)

From
Simon Riggs
Date:
On Fri, 2009-01-23 at 16:14 +0200, Heikki Linnakangas wrote:

> I made a couple of minor changes after importing your
> patches.

I've applied them both to v9g, attached here.

If you wouldn't mind redoing the initial step, I will promise not to do
anything else to the code, except via patch on GIT.

--
 Simon Riggs           www.2ndQuadrant.com
 PostgreSQL Training, Services and Support

Attachment

Re: Hot Standby (v9d)

From
Heikki Linnakangas
Date:
Simon Riggs wrote:
> On Fri, 2009-01-23 at 16:14 +0200, Heikki Linnakangas wrote:
> 
>> I made a couple of minor changes after importing your 
>> patches. 
> 
> I've applied them both to v9g, attached here.
> 
> If you wouldn't mind redoing the initial step, I will promise not to do
> anything else to the code, except via patch on GIT.

No problem. I don't need to do it from scratch, I'll just apply the 
changes from that patch as an incremental commit. Done, you can see it 
in my git repo now too.

--   Heikki Linnakangas  EnterpriseDB   http://www.enterprisedb.com


Re: Hot Standby (v9d)

From
Heikki Linnakangas
Date:
> @@ -1601,6 +1602,24 @@ BufferProcessRecoveryConflictsIfAny(volatile BufferDesc *bufHdr)
>         {
>                 XLogRecPtr      bufLSN = BufferGetLSN(bufHdr);
>  
> +               /*
> +                * If the buffer is recent we may need to cancel ourselves
> +                * rather than risk returning a wrong answer. This test is
> +                * too conservative, but it is correct.
> +                *
>>> +                * We only need to cancel the current subtransaction.
> +                * Once we've handled the error then other subtransactions can
> +                * continue processing. Note that we do *not* reset the
> +                * BufferRecoveryConflictLSN at subcommit/abort, but we do
> +                * reset it if we release our last remaining snapshot.
> +                * see SnapshotResetXmin()
> +                *

Is it really enough to cancel just the current subtransaction? What if 
it's a serializable transaction?

--   Heikki Linnakangas  EnterpriseDB   http://www.enterprisedb.com


Re: Hot Standby (v9d)

From
Simon Riggs
Date:
On Fri, 2009-01-23 at 18:22 +0200, Heikki Linnakangas wrote:
> > @@ -1601,6 +1602,24 @@ BufferProcessRecoveryConflictsIfAny(volatile BufferDesc *bufHdr)
> >         {
> >                 XLogRecPtr      bufLSN = BufferGetLSN(bufHdr);
> >  
> > +               /*
> > +                * If the buffer is recent we may need to cancel ourselves
> > +                * rather than risk returning a wrong answer. This test is
> > +                * too conservative, but it is correct.
> > +                *
> >>> +                * We only need to cancel the current subtransaction.
> > +                * Once we've handled the error then other subtransactions can
> > +                * continue processing. Note that we do *not* reset the
> > +                * BufferRecoveryConflictLSN at subcommit/abort, but we do
> > +                * reset it if we release our last remaining snapshot.
> > +                * see SnapshotResetXmin()
> > +                *
> 
> Is it really enough to cancel just the current subtransaction? What if 
> it's a serializable transaction?

I did originally think that when I first looked at the problem. I'm
sorry if I say that a lot. 

If you have a serializable transaction with subtransactions that suffers
a serializability error, it only cancels the current subtransaction. That
means its snapshot is still valid and can be used again. By analogy, as
long as a transaction does not see any data that is inconsistent with
its snapshot it seems OK for it to continue. So I think it is correct.

(Bizarrely, this might mean that if we did this programmatically in a
loop we might keep the system busy for some time while it continually
re-reads data and fails. But that's another story.)

You remind me that we can now do what Kevin has requested and throw an
errcode(ERRCODE_T_R_SERIALIZATION_FAILURE) at this point, which I agree
is the most easily understood way of describing this error.

(I was sorely tempted to make it "snapshot too old", as a joke).

--
 Simon Riggs           www.2ndQuadrant.com
 PostgreSQL Training, Services and Support



Re: Hot Standby (v9d)

From
Heikki Linnakangas
Date:
Simon Riggs wrote:
> If you have a serializable transaction with subtransactions that suffers
> a serializability error, it only cancels the current subtransaction. That
> means its snapshot is still valid and can be used again. By analogy, as
> long as a transaction does not see any data that is inconsistent with
> its snapshot it seems OK for it to continue. So I think it is correct.

Yeah, you're right. How bizarre.

> (I was sorely tempted to make it "snapshot too old", as a joke).

Yeah, that is a very descriptive message, actually. If there wasn't any 
precedent for it, I believe we might have come up with exactly that 
message ourselves.

--   Heikki Linnakangas  EnterpriseDB   http://www.enterprisedb.com


Re: Hot Standby (v9d)

From
Alvaro Herrera
Date:
Heikki Linnakangas wrote:
> Simon Riggs wrote:
>> If you have a serializable transaction with subtransactions that suffers
>> a serializability error, it only cancels the current subtransaction. That
>> means its snapshot is still valid and can be used again. By analogy, as
>> long as a transaction does not see any data that is inconsistent with
>> its snapshot it seems OK for it to continue. So I think it is correct.
>
> Yeah, you're right. How bizarre.

It was argued this way to me way back when subtransactions were written.
Originally I had written it in such a way as to abort the whole
transaction, on the rationale that if you're blindly restarting the
subtransaction after a serialization error, it would result in a (maybe
infinite) loop.  I think the reason it only aborts the subxact is that
that's what all other errors do, so why would this one act differently.

Hmm, now that I think about it, I think it was deadlock errors not
serialization errors ...

-- 
Alvaro Herrera                                http://www.CommandPrompt.com/
PostgreSQL Replication, Consulting, Custom Development, 24x7 support


Re: Hot Standby (v9d)

From
Greg Stark
Date:
> If you have a serializable transaction with subtransactions that  
> suffers
> a serializability error it only cancels the current subtransaction.

This is a complete tangent to your work, but I wonder if this is
really right. I mean it's not as if we could have serialized the
transaction as a whole with respect to whatever other transaction we
failed.


Re: Hot Standby (v9d)

From
Simon Riggs
Date:
On Thu, 2009-01-22 at 22:35 +0000, Simon Riggs wrote:

> * Bug fix v9 over next few days

version 9g - please use this for testing now

--
 Simon Riggs           www.2ndQuadrant.com
 PostgreSQL Training, Services and Support

Attachment

Re: Hot Standby (v9d)

From
Jeff Davis
Date:
On Fri, 2009-01-23 at 20:13 +0000, Greg Stark wrote:
> > If you have a serializable transaction with subtransactions that  
> > suffers
> > a serializability error it only cancels the current subtransaction.
> 
> This is a complete tangent to your work, but I wonder if this is  
> really right. I mean it's not as if we could have serialized the
> transaction as a whole with respect to whatever other transaction we  
> failed.

Isn't this back to the discussion about PostgreSQL serializability
versus true serializability?

I agree that's not true serializability.

Regards,
Jeff Davis



Re: Hot Standby (v9d)

From
Mark Kirkwood
Date:
Simon Riggs wrote:
> On Thu, 2009-01-22 at 22:35 +0000, Simon Riggs wrote:
>
>   
>> * Bug fix v9 over next few days
>>     
>
> version 9g - please use this for testing now
>
>   

I'm doing some test runs with this now. I notice an old flatfiles 
related bug has reappeared:

master:

=# create database test;

slave:

=# select datname from pg_database where datname like 'test';
 datname
---------
 test
(1 row)

postgres=# \c test
FATAL:  database "test" does not exist
Previous connection kept



Re: Hot Standby (v9d)

From
Simon Riggs
Date:
On Sat, 2009-01-24 at 17:24 +1300, Mark Kirkwood wrote:

> > version 9g - please use this for testing now

> I'm doing some test runs with this now. I notice an old flatfiles 
> related bug has reappeared:

I'm seeing an off-by-one error on xmax, in some cases. That then causes
the flat file update to not pick up correct info, even though it
executed in other ways as intended. If you run two create databases and
then test only the first, it appears to have worked as intended.

These bugs are a result of recent refactoring and it will take a few days
to shake some of them out. We've had more than 20 already so we're
beating them back, but we're not done yet. 

Thanks for the report, will publish this and other fixes on Monday.

--
 Simon Riggs           www.2ndQuadrant.com
 PostgreSQL Training, Services and Support



Re: Hot Standby (v9d)

From
Simon Riggs
Date:
On Sat, 2009-01-24 at 11:20 +0000, Simon Riggs wrote:
> On Sat, 2009-01-24 at 17:24 +1300, Mark Kirkwood wrote:
> 
> > > version 9g - please use this for testing now
> 
> > I'm doing some test runs with this now. I notice an old flatfiles 
> > related bug has reappeared:
> 
> I'm seeing an off-by-one error on xmax, in some cases. That then causes
> the flat file update to not pick up correct info, even though it
> executed in other ways as intended. If you run two create databases and
> then test only the first, it appears to have worked as intended.
> 
> These bugs are a result of recent refactoring and it will take a few days
> to shake some of them out. We've had more than 20 already so we're
> beating them back, but we're not done yet. 

I was at a loss to explain how this could have slipped through our
tests. It appears that the error was corrected following each checkpoint
as a result of ProcArrayUpdateRunningXacts(). Our tests were performed
after a short delay, which typically would be greater than the
deliberately short setting of checkpoint_timeout/archive_timeout and so
by the time we looked the error was gone and masked the problem. We're
setting checkpoint_timeout to 30 mins now to avoid the delay...

--
 Simon Riggs           www.2ndQuadrant.com
 PostgreSQL Training, Services and Support



Re: Hot Standby (v9d)

From
Mark Kirkwood
Date:
Simon Riggs wrote:
> On Sat, 2009-01-24 at 11:20 +0000, Simon Riggs wrote:
>   
>> On Sat, 2009-01-24 at 17:24 +1300, Mark Kirkwood wrote:
>>
>>     
>>>> version 9g - please use this for testing now
>>>>         
>>> I'm doing some test runs with this now. I notice an old flatfiles 
>>> related bug has reappeared:
>>>       
>> I'm seeing an off-by-one error on xmax, in some cases. That then causes
>> the flat file update to not pick up correct info, even though it
>> executed in other ways as intended. If you run two create databases and
>> then test only the first, it appears to have worked as intended.
>>
>> These bugs are a result of recent refactoring and it will take a few days
>> to shake some of them out. We've had more than 20 already so we're
>> beating them back, but we're not done yet. 
>>     
>
> I was at a loss to explain how this could have slipped through our
> tests. It appears that the error was corrected following each checkpoint
> as a result of ProcArrayUpdateRunningXacts(). Our tests were performed
> after a short delay, which typically would be greater than the
> deliberately short setting of checkpoint_timeout/archive_timeout and so
> by the time we looked the error was gone and masked the problem. We're
> setting checkpoint_timeout to 30 mins now to avoid the delay...
>
>   
That makes sense - I had archive_timeout set at 5 minutes, and I would 
have checked before 5 minutes were up.

Cheers

Mark


Re: Hot Standby (v9d)

From
Simon Riggs
Date:
On Sat, 2009-01-24 at 17:24 +1300, Mark Kirkwood wrote:

> I'm doing some test runs with this now. I notice an old flatfiles
> related bug has reappeared:

Bug fix patch against git repo.

--
 Simon Riggs           www.2ndQuadrant.com
 PostgreSQL Training, Services and Support

Attachment

Re: Hot Standby (v9d)

From
Gregory Stark
Date:
I skimmed through the Hot Standby patch for a preliminary review. I noted the
following things, some minor tweaks, some just questions. None of the things I
noted are big issues unless some of the questions uncover issues.


1) This code is obviously a cut-pasto:

+       else if (strcmp(tok1, "max_standby_delay") == 0)
+       {
+           errno = 0;
+           maxStandbyDelay = (TransactionId) strtoul(tok2, NULL, 0);
+           if (errno == EINVAL || errno == ERANGE)
+               ereport(FATAL,
+                (errmsg("max_standby_delay is not a valid number: \"%s\"",
+                        tok2)));

Have you ever tested the behaviour with max_standby_delay = -1 ?

Also, the default max_standby_delay is currently 0 -- ie, kill queries
immediately upon detecting any conflicts at all -- which I don't really think
anyone would be happy with. I still *strongly* feel the default has to be the
non-destructive conservative -1.
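For what it's worth, a hedged sketch of a signed parse so that -1 is
representable (the helper name and the int result type are inventions for
illustration, and it assumes the surrounding file's existing includes for
errno, strtol and ereport; this is not what the patch does):

static int
parse_max_standby_delay(const char *tok2)
{
    char       *endptr;
    long        val;

    errno = 0;
    val = strtol(tok2, &endptr, 0);     /* signed, so -1 parses cleanly */
    if (errno == EINVAL || errno == ERANGE || endptr == tok2 || *endptr != '\0')
        ereport(FATAL,
                (errmsg("max_standby_delay is not a valid number: \"%s\"",
                        tok2)));
    return (int) val;
}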


2) This hard coded constant of 500ms seems arbitrary to me. If your use case
is a heavily-used reporting engine you might get this message all the time. I
think this either has to be configurable (warn_standby_delay?) or based on
max_standby_delay or some other parameter somehow.

+       /*
+        * log that we have been waiting for a while now...
+        */
+       if (!logged && standbyWait_ms > 500)


3) These two blocks of code seem unsatisfactory:

!           /*
!            * Keep track of the block number of the lastBlockVacuumed, so
!            * we can scan those blocks as well during WAL replay. This then
!            * provides concurrency protection and allows btrees to be used
!            * while in recovery.
!            */
!           if (lastBlockVacuumed > vstate->lastBlockVacuumed)
!               vstate->lastBlockVacuumed = lastBlockVacuumed;
! 


+           /*
+            * XXXHS we don't actually need to read the block, we
+            * just need to confirm it is unpinned. If we had a special call
+            * into the buffer manager we could optimise this so that
+            * if the block is not in shared_buffers we confirm it as unpinned.
+            */
+           buffer = XLogReadBufferExtended(xlrec->node, MAIN_FORKNUM, blkno, RBM_NORMAL);
+           if (BufferIsValid(buffer))
+           {
+               LockBufferForCleanup(buffer);           
+               UnlockReleaseBuffer(buffer);
+           }

Aside from the XXX comment (I thought we actually had such a call now, but if
not shouldn't we just add one instead of carping?) I'm not convinced this
handles all the cases that can arise. Notably, what happens if a previous
vacuum died in the middle of the scan?

I think we have a vacuum id which we already use with btrees for the same
issue. It seems to me this would be more robust if you stamped the xlog record
with that id number and ensured it matched the id of the lastBlockVacuumed?
Then you could start from 0 if the id changes.


4) Why is this necessary?

+   if (IsRecoveryProcessingMode() && 
+       locktag->locktag_type == LOCKTAG_OBJECT &&
+       lockmode > AccessShareLock)
+       ereport(ERROR,
+               (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+                errmsg("cannot acquire lockmode %s on database objects while recovery is in progress", 
+                                   lockMethodTable->lockModeNames[lockmode]),
+                errhint("Only AccessShareLock can be acquired on database objects during recovery.")));
+ 

Obviously we can't lock records (SELECT FOR UPDATE) since that requires write
access. But shouldn't we be able to do manual LOCK TABLE calls?

I hope this isn't the only interlock we have against trying to do write i/o or
DDL against tables...?


5) The code for killing off conflicting transactions looks odd to me but I
haven't really traced through it to see what it's doing. It seems to kill off
just one process? What if there are several holding the lock in question?
Also, it doesn't seem to take into account how long the various transactions
have held the lock?

Is there a risk of, for example, if you have a long report running against a
table and lots of OLTP requests against the same table it seems the conflict
resolution code would fire randomly into the crowd and hit mostly innocent
OLTP transactions until eventually it found the culprit report?


Also, it kills off backends using SIGINT which I *think* Tom went through and
made safe earlier this release, right?

In any case if the backend doesn't die off promptly we wait forever with no
further warnings or log messages. I would think we should at least print some
sort of message here, even if it's a "can't happen" elog. The doubling thing
is probably unnecessary too in this case.

+               if (!XLogRecPtrIsValid(conflict_LSN))
+               {
+                   /* wait awhile for it to die */
+                   pg_usleep(wontDieWait * 5000L);
+                   wontDieWait *= 2;
+               }
+           }


6) I still don't understand why you need unobserved_xids. We don't need this
in normal running; an xid we don't know for certain is committed is exactly
the same as a transaction we know is currently running or aborted. So why do
you need it during HS?

The comment says:

+  * This is very important because if we leave 
+  * those xids out of the snapshot then they will appear to be already complete. 
+  * Later, when they have actually completed this could lead to confusion as to 
+  * whether those xids are visible or not, blowing a huge hole in MVCC. 
+  * We need 'em.

But that doesn't sound rational to me. I'm not sure what "confusion" this
would cause. If they later actually complete then any existing snapshots would
still not see them. And any later snapshots wouldn't be confused by the
earlier conclusions.

--  Gregory Stark EnterpriseDB          http://www.enterprisedb.com Ask me about EnterpriseDB's 24x7 Postgres support!


Re: Hot Standby (v9d)

From
Simon Riggs
Date:
On Wed, 2009-01-28 at 18:55 +0000, Gregory Stark wrote:

> I skimmed through the Hot Standby patch for a preliminary review. I noted the
> following things, some minor tweaks, some just questions. None of the things I
> noted are big issues unless some of the questions uncover issues.

Thanks for your attention.

> 1) This code is obviously a cut-pasto:
> 
> +       else if (strcmp(tok1, "max_standby_delay") == 0)
> +       {
> +           errno = 0;
> +           maxStandbyDelay = (TransactionId) strtoul(tok2, NULL, 0);
> +           if (errno == EINVAL || errno == ERANGE)
> +               ereport(FATAL,
> +                (errmsg("max_standby_delay is not a valid number: \"%s\"",
> +                        tok2)));
> 
> Have you ever tested the behaviour with max_standby_delay = -1 ?

> Also, the default max_standby_delay is currently 0 -- ie, kill queries
> immediately upon detecting any conflicts at all -- which I don't really think
> anyone would be happy with. 

Agreed. As explained when I published that patch it is deliberately
severe to allow testing of conflict resolution and feedback on it.

> I still *strongly* feel the default has to be the
> non-destructive conservative -1.

I don't. Primarily, we must support high availability. It is much better
if we get people saying "I get my queries cancelled" and we say RTFM and
change parameter X, than if people say "my failover was 12 hours behind
when I needed it to be 10 seconds behind and I lost a $1 million because
of downtime of Postgres" and we say RTFM and change parameter X.

> 2) This hard coded constant of 500ms seems arbitrary to me. If your use case
> is a heavily-used reporting engine you might get this message all the time. I
> think this either has to be configurable (warn_standby_delay?) or based on
> max_standby_delay or some other parameter somehow.
> 
> +       /*
> +        * log that we have been waiting for a while now...
> +        */
> +       if (!logged && standbyWait_ms > 500)

Greg, that's a DEBUG5 message.

> 3) These two blocks of code seem unsatisfactory:
> 
> !           /*
> !            * Keep track of the block number of the lastBlockVacuumed, so
> !            * we can scan those blocks as well during WAL replay. This then
> !            * provides concurrency protection and allows btrees to be used
> !            * while in recovery.
> !            */
> !           if (lastBlockVacuumed > vstate->lastBlockVacuumed)
> !               vstate->lastBlockVacuumed = lastBlockVacuumed;
> ! 
> 
> 
> +           /*
> +            * XXXHS we don't actually need to read the block, we
> +            * just need to confirm it is unpinned. If we had a special call
> +            * into the buffer manager we could optimise this so that
> +            * if the block is not in shared_buffers we confirm it as unpinned.
> +            */
> +           buffer = XLogReadBufferExtended(xlrec->node, MAIN_FORKNUM, blkno, RBM_NORMAL);
> +           if (BufferIsValid(buffer))
> +           {
> +               LockBufferForCleanup(buffer);           
> +               UnlockReleaseBuffer(buffer);
> +           }
> 
> Aside from the XXX comment (I thought we actually had such a call now, but if
> not shouldn't we just add one instead of carping?)

We should add many things. The equivalent optimisation of VACUUM in
normal running has not been done either. Considering we have both HOT
and visibility map enhanced VACUUM, it's not a priority.

>  I'm not convinced this
> handles all the cases that can arise. Notably, what happens if a previous
> vacuum died in the middle of the scan?

Nothing, because it won't then have removed any heap rows.

> I think we have a vacuum id which we already use with btrees for the same
> issue. It seems to me this would be more robust if you stamped the xlog record
> with that id number and ensured it matched the id of the lastBlockVacuumed?
> Then you could start from 0 if the id changes.

> 4) Why is this necessary?
> 
> +   if (IsRecoveryProcessingMode() && 
> +       locktag->locktag_type == LOCKTAG_OBJECT &&
> +       lockmode > AccessShareLock)
> +       ereport(ERROR,
> +               (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
> +                errmsg("cannot acquire lockmode %s on database objects while recovery is in progress", 
> +                                   lockMethodTable->lockModeNames[lockmode]),
> +                errhint("Only AccessShareLock can be acquired on database objects during recovery.")));
> + 
> 
> Obviously we can't lock records (SELECT FOR UPDATE) since that requires write
> access. But shouldn't we be able to do manual LOCK TABLE calls?

Read the rest of the comments on the locking code section.

> I hope this isn't the only interlock we have against trying to do write i/o or
> DDL against tables...?

No it's not.

> 5) The code for killing off conflicting transactions looks odd to me but I
> haven't really traced through it to see what it's doing. It seems to kill off
> just one process? 

No, why do you think that?

> What if there are several holding the lock in question?

Covered.

> Also, it doesn't seem to take into account how long the various transactions
> have held the lock?

True. Why would that matter?

> Is there a risk of, for example, if you have a long report running against a
> table and lots of OLTP requests against the same table 

> it seems the conflict resolution code would fire randomly into the
> crowd 

OK, that's where I stop bothering to respond.

--
 Simon Riggs           www.2ndQuadrant.com
 PostgreSQL Training, Services and Support



Re: Hot Standby (v9d)

From
"Joshua D. Drake"
Date:
On Wed, 2009-01-28 at 19:27 +0000, Simon Riggs wrote:
> On Wed, 2009-01-28 at 18:55 +0000, Gregory Stark wrote:

> Agreed. As explained when I published that patch it is deliberately
> severe to allow testing of conflict resolution and feedback on it.
> 
> > I still *strongly* feel the default has to be the
> > non-destructive conservative -1.
> 
> I don't. Primarily, we must support high availability. It is much better
> if we get people saying "I get my queries cancelled" and we say RTFM and
> change parameter X, than if people say "my failover was 12 hours behind
> when I needed it to be 10 seconds behind and I lost a $1 million because
> of downtime of Postgres" and we say RTFM and change parameter X.

If the person was stupid enough to configure it for such a thing they
deserve to lose the money. Not to mention we have already lost them
as a user because they will blame postgresql regardless of reality, as
evidenced by their inability to RTFM (or have a vendor that RTFMs) in
the first place.

I got to vote with Greg on this one.


Sincerely,

Joshua D. Drake


-- 
PostgreSQL - XMPP: jdrake@jabber.postgresql.org  Consulting, Development, Support, Training  503-667-4564 -
http://www.commandprompt.com/ The PostgreSQL Company, serving since 1997
 



Re: Hot Standby (v9d)

From
Heikki Linnakangas
Date:
Gregory Stark wrote:
> 6) I still don't understand why you need unobserved_xids. We don't need this
> in normal running, an xid we don't know for certain is committed is exactly
> the same as a transaction we know is currently running or aborted. So why do
> you need it during HS?

In normal operation, any transaction that's in-progress has an entry in 
ProcArray. GetSnapshot() gathers the xids of all those in-progress 
transactions, so that they're seen as not-committed even when
they're later marked as committed in clog.

In HS, we might see the first WAL record of transaction 10 before we see 
the first WAL record of transaction 9. Without unobserved_xids, if you 
take a snapshot in the standby between those two WAL records, xid 10 is 
included in the snapshot, but 9 is not. If xact 9 later commits, it's 
marked in clog as committed, and it will suddenly appear as visible to 
the snapshot. To avoid that, when we replay the first WAL record of xact 
10, we also add 9 to the unobserved xid array, so that it's included in 
snapshots.

So, you can think of the unobserved xids array as an extension of 
ProcArray. The entries are like light-weight PGPROC entries. In fact I 
proposed earlier to simply create dummy PGPROC entries instead.
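A minimal sketch of that bookkeeping, just to illustrate the idea (the function
and helper names are made up for this sketch, not the patch's identifiers, and
locking is omitted):

#include "postgres.h"
#include "access/transam.h"

/* Illustrative only: highest xid for which we have seen a WAL record. */
static TransactionId latestObservedXid = InvalidTransactionId;

/* Hypothetical helper: mark an xid as in-progress for standby snapshots. */
extern void AddUnobservedXid(TransactionId xid);

static void
ObserveXid(TransactionId xid)
{
    TransactionId gap;

    if (!TransactionIdIsValid(latestObservedXid))
    {
        latestObservedXid = xid;
        return;
    }

    /*
     * Any xid between the last one observed and this one must already have
     * been assigned on the primary, even though we have not yet seen a WAL
     * record for it.  Record each such gap xid so that standby snapshots
     * treat it as still running.
     */
    gap = latestObservedXid;
    TransactionIdAdvance(gap);
    while (TransactionIdPrecedes(gap, xid))
    {
        AddUnobservedXid(gap);
        TransactionIdAdvance(gap);
    }

    if (TransactionIdFollows(xid, latestObservedXid))
        latestObservedXid = xid;
}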

--   Heikki Linnakangas  EnterpriseDB   http://www.enterprisedb.com


Re: Hot Standby (v9d)

From
Simon Riggs
Date:
On Wed, 2009-01-28 at 18:55 +0000, Gregory Stark wrote:
> I still don't understand why you need unobserved_xids. We don't need
> this in normal running, an xid we don't know for certain is committed
> is exactly the same as a transaction we know is currently running or
> aborted. So why do you need it during HS?

All running transactions need to be part of the snapshot. This is true
on both standby and primary.

On primary we allocate new xids from a single counter, so there are no
gaps. Newly assigned xids go straight into the procarray and then are
removed later when they commit/abort.

In standby we only know what is in WAL. Xid assignment is not currently
WAL logged, so either we choose to WAL log each new xid and face the
performance consequences (IMHO, unacceptable), or we choose a different
strategy. UnobservedXids is that different strategy. Florian Pflug had a
different strategy, but he did have a strategy.

> The comment says:
> 
> +  * This is very important because if we leave 
> +  * those xids out of the snapshot then they will appear to be
> already complete. 
> +  * Later, when they have actually completed this could lead to
> confusion as to 
> +  * whether those xids are visible or not, blowing a huge hole in
> MVCC. 
> +  * We need 'em.
> 
> But that doesn't sound rational to me. I'm not sure what "confusion"
> this would cause. If they later actually complete then any existing
> snapshots would still not see them. And any later snapshots wouldn't
> be confused by the earlier conclusions.

If a xact is running when we take a snapshot, yet we do not include it,
then bad things will happen, but not until later. If a xact is not in the
snapshot *and* less than xmax then we presume it completed prior to the
snapshot. If that xact did subsequently commit *after* we took the
snapshot, then its absence from the snapshot will make that xact's data
visible, making it look like it committed *before*
the snapshot was taken. So the "confusion" results in an MVCC violation
and so we must handle this case correctly, or die.
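A simplified sketch of the snapshot check that makes this so (not the actual
tqual.c code; the function name is invented and subtransaction handling is
omitted):

#include "postgres.h"
#include "access/transam.h"
#include "utils/snapshot.h"

/* Returns true if the snapshot should treat xid as still in progress. */
static bool
XidInSnapshotSketch(TransactionId xid, Snapshot snapshot)
{
    uint32      i;

    /* xids at or beyond xmax had not started, as far as this snapshot knows */
    if (TransactionIdFollowsOrEquals(xid, snapshot->xmax))
        return true;

    /* xids before xmin had certainly completed before the snapshot was taken */
    if (TransactionIdPrecedes(xid, snapshot->xmin))
        return false;

    /* in between, the xid is in progress only if it was captured in xip[] */
    for (i = 0; i < snapshot->xcnt; i++)
    {
        if (TransactionIdEquals(xid, snapshot->xip[i]))
            return true;
    }

    /*
     * xid < xmax but missing from xip[]: the snapshot presumes the xact
     * completed before the snapshot was taken.  If it was in fact still
     * running and later commits, its rows become visible retroactively;
     * that is the MVCC violation described above.
     */
    return false;
}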

--
 Simon Riggs           www.2ndQuadrant.com
 PostgreSQL Training, Services and Support



Re: Hot Standby (v9d)

From
Tom Lane
Date:
"Joshua D. Drake" <jd@commandprompt.com> writes:
> On Wed, 2009-01-28 at 19:27 +0000, Simon Riggs wrote:
>> On Wed, 2009-01-28 at 18:55 +0000, Gregory Stark wrote:
>>> I still *strongly* feel the default has to be the
>>> non-destructive conservative -1.
>> 
>> I don't. Primarily, we must support high availability. It is much better
>> if we get people saying "I get my queries cancelled" and we say RTFM and
>> change parameter X, than if people say "my failover was 12 hours behind
>> when I needed it to be 10 seconds behind and I lost a $1 million because
>> of downtime of Postgres" and we say RTFM and change parameter X.

> If the person was stupid enough to configure it for such a thing they
> deserve to lose the money.

Well, those unexpectedly cancelled queries could have represented
critical functionality too.  I think this argument calls the entire
approach into question.  If there is no safe setting for the parameter
then we need to find a way to not have the parameter.
        regards, tom lane


Re: Hot Standby (v9d)

From
Aidan Van Dyk
Date:
* Tom Lane <tgl@sss.pgh.pa.us> [090128 15:02]:
> Well, those unexpectedly cancelled queries could have represented
> critical functionality too.  I think this argument calls the entire
> approach into question.  If there is no safe setting for the parameter
> then we need to find a way to not have the parameter.

That's what we currently have without HS, but people aren't completely
satisfied with it either ;-)

a.

-- 
Aidan Van Dyk                                             Create like a god,
aidan@highrise.ca                                       command like a king,
http://www.highrise.ca/                                   work like a slave.

Re: Hot Standby (v9d)

From
Greg Stark
Date:
(sorry for top posting -- blame apple)

I don't see anything "dangerous" with either setting. For use cases  
where the backup is the primary purpose then killing queries is fine.  
For use cases where the machine is a reporting machine then saving
large amounts of archived logs is fine.

Engineering is about tradeoffs and these two use cases are  
intrinsically in conflict.

The alternative is postponing vacuuming on the master which is imho  
even worse than the disease.

-- 
Greg


On 28 Jan 2009, at 19:56, Tom Lane <tgl@sss.pgh.pa.us> wrote:

> "Joshua D. Drake" <jd@commandprompt.com> writes:
>> On Wed, 2009-01-28 at 19:27 +0000, Simon Riggs wrote:
>>> On Wed, 2009-01-28 at 18:55 +0000, Gregory Stark wrote:
>>>> I still *strongly* feel the default has to be the
>>>> non-destructive conservative -1.
>>>
>>> I don't. Primarily, we must support high availability. It is much  
>>> better
>>> if we get people saying "I get my queries cancelled" and we say  
>>> RTFM and
>>> change parameter X, than if people say "my failover was 12 hours  
>>> behind
>>> when I needed it to be 10 seconds behind and I lost a $1 million  
>>> because
>>> of downtime of Postgres" and we say RTFM and change parameter X.
>
>> If the person was stupid enough to configure it for such a thing
>> they
>> deserve to lose the money.
>
> Well, those unexpectedly cancelled queries could have represented
> critical functionality too.  I think this argument calls the entire
> approach into question.  If there is no safe setting for the parameter
> then we need to find a way to not have the parameter.
>
>            regards, tom lane


Re: Hot Standby (v9d)

From
Greg Stark
Date:
Put another way: your characterization is no more true than claiming  
there's no "safe" setting for statement_timeout since a large value  
means clog could overflow your disk and your tables could bloat.

(I note we default statement_timeout to off though)

-- 
Greg


On 28 Jan 2009, at 19:56, Tom Lane <tgl@sss.pgh.pa.us> wrote:

> "Joshua D. Drake" <jd@commandprompt.com> writes:
>> On Wed, 2009-01-28 at 19:27 +0000, Simon Riggs wrote:
>>> On Wed, 2009-01-28 at 18:55 +0000, Gregory Stark wrote:
>>>> I still *strongly* feel the default has to be the
>>>> non-destructive conservative -1.
>>>
>>> I don't. Primarily, we must support high availability. It is much  
>>> better
>>> if we get people saying "I get my queries cancelled" and we say  
>>> RTFM and
>>> change parameter X, than if people say "my failover was 12 hours  
>>> behind
>>> when I needed it to be 10 seconds behind and I lost a $1 million  
>>> because
>>> of downtime of Postgres" and we say RTFM and change parameter X.
>
>> If the person was stupid enough to configure it for such a thing
>> they
>> deserve to lose the money.
>
> Well, those unexpectedly cancelled queries could have represented
> critical functionality too.  I think this argument calls the entire
> approach into question.  If there is no safe setting for the parameter
> then we need to find a way to not have the parameter.
>
>            regards, tom lane


Re: Hot Standby (v9d)

From
Simon Riggs
Date:
On Wed, 2009-01-28 at 21:41 +0200, Heikki Linnakangas wrote:

> So, you can think of the unobserved xids array as an extension of 
> ProcArray. The entries are like light-weight PGPROC entries. In fact I
> proposed earlier to simply create dummy PGPROC entries instead.

Which we don't do because we don't know whether we are dealing with
top-level xids or subtransactions of already observed top-level xids.

Either way we have to rearrange things when we move from unobserved to
observed. A major difference is that what we have now works and what we
might have instead may not, which is being illustrated by recent
testing.

--
 Simon Riggs           www.2ndQuadrant.com
 PostgreSQL Training, Services and Support



Re: Hot Standby (v9d)

From
Jeff Davis
Date:
On Wed, 2009-01-28 at 19:27 +0000, Simon Riggs wrote:
> "my failover was 12 hours behind when I needed it to be 10 seconds
>  behind and I lost a $1 million because of downtime of Postgres"

The same could be said for warm standby right now. Or Slony-I, for that
matter. I think that we can reasonably expect anyone implementing
asynchronous replication for HA to properly monitor the lag time.

There are many sources of latency in the process, so I don't think
anyone can expect 10 seconds without actually monitoring to verify what
the actual lag time is.

I apologize if my post is based on ignorance, I haven't followed your
patch as closely as others involved in this discussion.

Regards,
Jeff Davis



Re: Hot Standby (v9d)

From
Heikki Linnakangas
Date:
Tom Lane wrote:
> "Joshua D. Drake" <jd@commandprompt.com> writes:
>> On Wed, 2009-01-28 at 19:27 +0000, Simon Riggs wrote:
>>> On Wed, 2009-01-28 at 18:55 +0000, Gregory Stark wrote:
>>>> I still *strongly* feel the default has to be the
>>>> non-destructive conservative -1.
>>> I don't. Primarily, we must support high availability. It is much better
>>> if we get people saying "I get my queries cancelled" and we say RTFM and
>>> change parameter X, than if people say "my failover was 12 hours behind
>>> when I needed it to be 10 seconds behind and I lost a $1 million because
>>> of downtime of Postgres" and we say RTFM and change parameter X.
> 
>> If the person was stupid enough to configure it for such a thing they
>> deserve to lose the money.
> 
> Well, those unexpectedly cancelled queries could have represented
> critical functionality too.  I think this argument calls the entire
> approach into question.  If there is no safe setting for the parameter
> then we need to find a way to not have the parameter.

We've gone through that already. Different ideas were hashed out around 
September. There's four basic feasible approaches to what to do when an 
incoming WAL record conflicts with a running read-only query:

1. Kill the query. (max_standby_delay=0)
2. Wait for the query to finish before continuing (max_standby_delay=-1)
3. Have a feedback loop from standby to master, feeding an OldestXmin to 
the master, preventing it from removing tuples that are still needed in 
the standby.
4. Allow the query to continue, knowing that it will return wrong results.

I don't consider 4 to be an option. Option 3 has its own set of 
drawbacks, as a standby can then cause bloat in the master, and in any 
case we're not going to have it in this release. And then there's some 
middle ground, like wait a while and then kill the query 
(max_standby_delay > 0).

I don't see any way around the fact that when a tuple is removed, it's 
gone and can't be accessed by queries. Either you don't remove it, or 
you kill the query.

I think the max_standby_delay setting is fairly easy to explain. It 
shouldn't be too hard for a DBA to set it correctly.

--   Heikki Linnakangas  EnterpriseDB   http://www.enterprisedb.com


Re: Hot Standby (v9d)

From
Simon Riggs
Date:
On Wed, 2009-01-28 at 11:33 -0800, Joshua D. Drake wrote:
> On Wed, 2009-01-28 at 19:27 +0000, Simon Riggs wrote:
> > On Wed, 2009-01-28 at 18:55 +0000, Gregory Stark wrote:
> 
> > Agreed. As explained when I published that patch it is deliberately
> > severe to allow testing of conflict resolution and feedback on it.
> > 
> > > I still *strongly* feel the default has to be the
> > > non-destructive conservative -1.
> > 
> > I don't. Primarily, we must support high availability. It is much better
> > if we get people saying "I get my queries cancelled" and we say RTFM and
> > change parameter X, than if people say "my failover was 12 hours behind
> > when I needed it to be 10 seconds behind and I lost a $1 million because
> > of downtime of Postgres" and we say RTFM and change parameter X.
> 
> If the person was stupid enough to configure it for such a thing they
> deserve to lose the money.

It was never intended to be 0; that was just for testing, as I said. But
a smallish integer number of seconds, say 10, 60, 300 or at most 600, is
reasonable.

Crash barriers can be moved, falling off a cliff is permanent. It's easy
to change the parameter if you don't like it, and too late to change it
if we set the default wrong. 

My name is on the patch and I take responsibility for such failures. I'm
not going to turn round and say "but Josh said", kinda like Stan Laurel,
if it fails. I've never called any user stupid and never will, not while
they use Postgres, at least... If we get the default wrong then it's down
to us.

Never cancelling queries is definitely a wrong default choice for an HA
server.

--
 Simon Riggs           www.2ndQuadrant.com
 PostgreSQL Training, Services and Support



Re: Hot Standby (v9d)

From
Simon Riggs
Date:
On Wed, 2009-01-28 at 14:56 -0500, Tom Lane wrote:

> Well, those unexpectedly cancelled queries could have represented
> critical functionality too.  I think this argument calls the entire
> approach into question.  If there is no safe setting for the parameter
> then we need to find a way to not have the parameter.

I see the opposite: We don't know what tradeoffs, if any, the user is
willing to put up with, so we need input. It is about resource
prioritisation and not for us to decide, since these reflect business
objectives not internal twangy things like join_collapse_limit.

The essential choice is "What would you like the max failover time to
be?". Some users want one server with max 5 mins behind, some want two
servers, one with 0 seconds behind, one with 12 hours behind

--
 Simon Riggs           www.2ndQuadrant.com
 PostgreSQL Training, Services and Support



Re: Hot Standby (v9d)

From
Heikki Linnakangas
Date:
Simon Riggs wrote:
> The essential choice is "What would you like the max failover time to
> be?". Some users want one server with max 5 mins behind, some want two
> servers, one with 0 seconds behind, one with 12 hours behind

It's not quite that simple. Setting max_standby_delay=5mins means that 
you're willing to wait 5 minutes for each query to die. Which means that 
in worst case you have to stop for 5 minutes at every single vacuum 
record, and fall behind much more than 5 minutes.

You could make it more like that by tracking the timestamps in commit 
records, and/or having some sort of a moving average logic in the 
timeout, where you allow more waiting if you haven't waited for a long 
time, and kill queries more aggressively if you've had to wait a lot 
recently.
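Purely to illustrate that idea (nothing like this is in the patch; the names
and the budget arithmetic are invented for this sketch):

/*
 * Sketch of a "wait budget": credit accumulates while replay proceeds
 * without blocking and is spent while waiting on conflicting queries, so
 * an occasional long wait is tolerated but sustained waiting cancels
 * queries more aggressively.
 */
static int  standbyWaitBudgetMs = 0;

static bool
MayKeepWaiting(int maxStandbyDelayMs, int msReplayedWithoutWaiting, int msJustWaited)
{
    /* earn credit for WAL applied without having to wait */
    standbyWaitBudgetMs += msReplayedWithoutWaiting;
    if (standbyWaitBudgetMs > maxStandbyDelayMs)
        standbyWaitBudgetMs = maxStandbyDelayMs;

    /* spend credit for the time we just spent waiting */
    standbyWaitBudgetMs -= msJustWaited;

    return standbyWaitBudgetMs > 0;     /* false: cancel the conflicting query */
}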

It should also be noted that the control functions allow you to connect 
to the database and manually pause/resume the replay. So you can for 
example set max_standby_delay=0 during the day, but pause the replay 
manually before starting a nightly report.

--   Heikki Linnakangas  EnterpriseDB   http://www.enterprisedb.com


Re: Hot Standby (v9d)

From
Simon Riggs
Date:
On Wed, 2009-01-28 at 22:47 +0200, Heikki Linnakangas wrote:
> Simon Riggs wrote:
> > The essential choice is "What would you like the max failover time to
> > be?". Some users want one server with max 5 mins behind, some want two
> > servers, one with 0 seconds behind, one with 12 hours behind
> 
> It's not quite that simple. 

In this case, yes it is.

> Setting max_standby_delay=5mins means that 
> you're willing to wait 5 minutes for each query to die. Which means that 
> in worst case you have to stop for 5 minutes at every single vacuum 
> record, and fall behind much more than 5 minutes.

That's not how this patch works.

> You could make it more like that by tracking the timestamps in commit 
> records

Which is what we do.

> It should also be noted that the control functions allow you to connect 
> to the database and manually pause/resume the replay. So you can for 
> example set max_standby_delay=0 during the day, but pause the replay 
> manually before starting a nightly report.

Yes, thank you for bringing balance to the discussions.

Please everybody read this before commenting further.

http://wiki.postgresql.org/wiki/Hot_Standby#Usage

--
 Simon Riggs           www.2ndQuadrant.com
 PostgreSQL Training, Services and Support



Re: Hot Standby (v9d)

From
Jeff Davis
Date:
On Wed, 2009-01-28 at 22:47 +0200, Heikki Linnakangas wrote:
> It's not quite that simple. Setting max_standby_delay=5mins means that 
> you're willing to wait 5 minutes for each query to die. Which means that 
> in worst case you have to stop for 5 minutes at every single vacuum 
> record, and fall behind much more than 5 minutes.

Just trying to follow along: are you talking about the situation where
there is (for example) a continuous stream of "select pg_sleep(600)" on
the standby, and a series of rapid VACUUMs on the primary?

This situation might be more likely now that we have partial VACUUMs.

> It should also be noted that the control functions allow you to connect 
> to the database and manually pause/resume the replay. So you can for 
> example set max_standby_delay=0 during the day, but pause the replay 
> manually before starting a nightly report.
> 

That's a very cool feature.

Regards,
Jeff Davis



Re: Hot Standby (v9d)

From
Robert Haas
Date:
>> I don't. Primarily, we must support high availability. It is much better
>> if we get people saying "I get my queries cancelled" and we say RTFM and
>> change parameter X, than if people say "my failover was 12 hours behind
>> when I needed it to be 10 seconds behind and I lost a $1 million because
>> of downtime of Postgres" and we say RTFM and change parameter X.
>
> If the person was stupid enough to configure it for such a thing they
> deserve to lose the money. Not to mention we have already lost them
> as a user because they will blame postgresql regardless of reality as
> evidenced by their inability to RTFM (or have a vendor that RTFMs) in
> the first place.
>
> I got to vote with Greg on this one.

I vote with Simon.  The thing is that if you get some queries
cancelled, you'll realize you have a problem.  Now you have several
options for what to do to fix it.   Having your failover be 12 hours
behind (or 12 months behind) is something that it would be much easier
to not realize.

...Robert


Re: Hot Standby (v9d)

From
Gregory Stark
Date:
Simon Riggs <simon@2ndQuadrant.com> writes:

> On Wed, 2009-01-28 at 14:56 -0500, Tom Lane wrote:
>
>> Well, those unexpectedly cancelled queries could have represented
>> critical functionality too.  I think this argument calls the entire
>> approach into question.  If there is no safe setting for the parameter
>> then we need to find a way to not have the parameter.
>
> I see the opposite: We don't know what tradeoffs, if any, the user is
> willing to put up with, so we need input. 

Well, if you see it that way then it seems to me you should be arguing for
making max_standby_delay a mandatory parameter. Without it, don't start allowing
connections. I hadn't considered that and am not exactly sure where I would
stand on it.

> The essential choice is "What would you like the max failover time to
> be?". Some users want one server with max 5 mins behind, some want two
> servers, one with 0 seconds behind, one with 12 hours behind

Sure. But if they don't configure one then we shouldn't impose one. You're
thinking of precisely one use case and taking positive action to interrupt the
user's requests on the basis of it. But there are plenty of other use cases. I
claim the default has to be to do as the user instructed without intervention.

--  Gregory Stark EnterpriseDB          http://www.enterprisedb.com Ask me about EnterpriseDB's PostGIS support!


Re: Hot Standby (v9d)

From
Tom Lane
Date:
Robert Haas <robertmhaas@gmail.com> writes:
> I vote with Simon.  The thing is that if you get some queries
> cancelled, you'll realize you have a problem.

... or if you don't, they couldn't have been all that critical.

> Having your failover be 12 hours
> behind (or 12 months behind) is something that it would be much easier
> to not realize.

Okay, I'm sold, positive max_standby_delay should be the default.
        regards, tom lane


Re: Hot Standby (v9d)

From
Hannu Krosing
Date:
On Wed, 2009-01-28 at 22:19 +0200, Heikki Linnakangas wrote:
> Tom Lane wrote:
...
> > Well, those unexpectedly cancelled queries could have represented
> > critical functionality too.  I think this argument calls the entire
> > approach into question.  If there is no safe setting for the parameter
> > then we need to find a way to not have the parameter.
> 
> We've gone through that already. Different ideas were hashed out around 
> September. There's four basic feasible approaches to what to do when an 
> incoming WAL record conflicts with a running read-only query:
> 
> 1. Kill the query. (max_standby_delay=0)
> 2. Wait for the query to finish before continuing (max_standby_delay=-1)
> 3. Have a feedback loop from standby to master, feeding an OldestXmin to 
> the master, preventing it from removing tuples that are still needed in 
> the standby.
> 4. Allow the query to continue, knowing that it will return wrong results.
> 
> I don't consider 4 to be an option. Option 3 has its own set of 
> drawbacks, as a standby can then cause bloat in the master, and in any 
> case we're not going to have it in this release. And then there's some 
> middle ground, like wait a while and then kill the query 
> (max_standby_delay > 0).
> 
> I don't see any way around the fact that when a tuple is removed, it's 
> gone and can't be accessed by queries. Either you don't remove it, or 
> you kill the query.

Actually we came up with a solution to this - use filesystem level
snapshots (like LVM2+XFS or ZFS), and redirect backends with
long-running queries to use fs snapshot mounted to a different
mountpoint.

I don't think Simon has yet put full support for it in code, but it is
clearly _the_ solution for those who want to eat the cake and have it
too.

> 
> I think the max_standby_delay setting is fairly easy to explain. It 
> shouldn't be too hard for a DBA to set it correctly.
> 
> -- 
>    Heikki Linnakangas
>    EnterpriseDB   http://www.enterprisedb.com
> 



Re: Hot Standby (v9d)

From
Andrew Dunstan
Date:

Hannu Krosing wrote:
> Actually we came up with a solution to this - use filesystem level
> snapshots (like LVM2+XFS or ZFS), and redirect backends with
> long-running queries to use fs snapshot mounted to a different
> mountpoint.
>
> I don't think Simon has yet put full support for it in code, but it is
> clearly _the_ solution for those who want to eat the cake and have it
> too.
>
>   


How does that work if you're using multiple file systems via tablespaces 
(e.g. indexes in a different TS)?

cheers

andrew


Re: Hot Standby (v9d)

From
Gregory Stark
Date:
Hannu Krosing <hannu@krosing.net> writes:

> Actually we came up with a solution to this - use filesystem level
> snapshots (like LVM2+XFS or ZFS), and redirect backends with
> long-running queries to use fs snapshot mounted to a different
> mountpoint.

Uhm, how do you determine which snapshot to direct the backend to? There could
have been several generations of tuples in that tid since your query started.
Do you take a snapshot every time there's a vacuum-snapshot conflict and
record which snapshot goes with that snapshot?

--
  Gregory Stark
  EnterpriseDB          http://www.enterprisedb.com
  Get trained by Bruce Momjian - ask me about EnterpriseDB's PostgreSQL training!


Re: Hot Standby (v9d)

From
Robert Haas
Date:
>> I don't see any way around the fact that when a tuple is removed, it's
>> gone and can't be accessed by queries. Either you don't remove it, or
>> you kill the query.
>
> Actually we came up with a solution to this - use filesystem level
> snapshots (like LVM2+XFS or ZFS), and redirect backends with
> long-running queries to use fs snapshot mounted to a different
> mountpoint.
>
> I don't think Simon has yet put full support for it in code, but it is
> clearly _the_ solution for those who want to eat the cake and have it
> too.

I think _the_ solution is to notice when you're about to vacuum a page
that is still visible to a running backend on the standby, and save
that page off to a separate cache of old page versions (perhaps using
the relation fork mechanism).  I suspect filesystem-level snapshots
can be made to work, but it's never going to be a portable solution,
and I suspect you're going to need a lot of the same bookkeeping
anyway (like, when you have page X in shared buffers, which version of
page X is it, anyway?).
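
Very roughly, I'm picturing something along the lines of the toy sketch
below - every name in it is made up and it ignores all the hard parts
(eviction, crash safety, where the fork actually lives):

    #include <stdio.h>
    #include <string.h>
    #include <stdint.h>

    #define BLCKSZ   8192           /* page size, as in PostgreSQL */
    #define MAX_OLD  16             /* toy fixed-size cache */

    /* One stashed page image, remembered with the LSN that obsoleted it. */
    typedef struct
    {
        uint32_t blkno;
        uint64_t replaced_at_lsn;
        char     image[BLCKSZ];
    } OldPageVersion;

    static OldPageVersion old_cache[MAX_OLD];
    static int            n_old;

    /* Called just before redo would prune/overwrite a page that some standby
     * query can still see: keep a copy of the pre-image. */
    static void
    stash_old_version(uint32_t blkno, uint64_t redo_lsn, const char *page)
    {
        if (n_old < MAX_OLD)
        {
            old_cache[n_old].blkno = blkno;
            old_cache[n_old].replaced_at_lsn = redo_lsn;
            memcpy(old_cache[n_old].image, page, BLCKSZ);
            n_old++;
        }
    }

    /* A standby query whose snapshot was taken at 'snapshot_lsn' checks here
     * first; NULL means the live page is still good enough for it. */
    static const char *
    lookup_old_version(uint32_t blkno, uint64_t snapshot_lsn)
    {
        for (int i = 0; i < n_old; i++)
            if (old_cache[i].blkno == blkno &&
                old_cache[i].replaced_at_lsn > snapshot_lsn)
                return old_cache[i].image;
        return NULL;
    }

    int
    main(void)
    {
        char page[BLCKSZ] = "page contents as of LSN 100";

        stash_old_version(7, 200, page);    /* cleanup record replayed at LSN 200 */
        printf("query started at LSN 150: %s\n",
               lookup_old_version(7, 150) ? "reads stashed image" : "reads live page");
        printf("query started at LSN 250: %s\n",
               lookup_old_version(7, 250) ? "reads stashed image" : "reads live page");
        return 0;
    }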

...Robert


Re: Hot Standby (v9d)

From
Simon Riggs
Date:
On Tue, 2009-02-03 at 08:40 -0500, Andrew Dunstan wrote:
> 
> Hannu Krosing wrote:
> > Actually we came up with a solution to this - use filesystem level
> > snapshots (like LVM2+XFS or ZFS), and redirect backends with
> > long-running queries to use fs snapshot mounted to a different
> > mountpoint.
> >
> > I don't think Simon has yet put full support for it in code, but it is
> > clearly _the_ solution for those who want to eat the cake and have it
> > too.

> How does that work if you're using multiple file systems via tablespaces 
> (e.g. indexes in a different TS)?

It's a great idea and easy to do, but I can't do everything in one go.

The main reasons not to are multiple file system difficulties and lack
of a mainstream Linux solution, and more simply lack of time and test
resources.

So not now, maybe later.

--
 Simon Riggs           www.2ndQuadrant.com
 PostgreSQL Training, Services and Support



Re: Hot Standby (v9d)

From
Hannu Krosing
Date:
On Tue, 2009-02-03 at 08:40 -0500, Andrew Dunstan wrote:
> 
> Hannu Krosing wrote:
> > Actually we came up with a solution to this - use filesystem level
> > snapshots (like LVM2+XFS or ZFS), and redirect backends with
> > long-running queries to use fs snapshot mounted to a different
> > mountpoint.
> >
> > I don't think Simon has yet put full support for it in code, but it is
> > clearly _the_ solution for those who want to eat the cake and have it
> > too.
> >
> >   
> 
> 
> How does that work if you're using multiple file systems via tablespaces 
> (e.g. indexes in a different TS)?

Basically the same way we do WAL shipping up to 8.3. 

That is, using external scripts. Once we have enough experience, we could
try to move the actual snapshot-mount-switch inside the core.

-- 
------------------------------------------
Hannu Krosing   http://www.2ndQuadrant.com
PostgreSQL Scalability and Availability
   Services, Consulting and Training



Re: Hot Standby (v9d)

From
Hannu Krosing
Date:
On Tue, 2009-02-03 at 09:14 -0500, Robert Haas wrote:
> >> I don't see any way around the fact that when a tuple is removed, it's
> >> gone and can't be accessed by queries. Either you don't remove it, or
> >> you kill the query.
> >
> > Actually we came up with a solution to this - use filesystem level
> > snapshots (like LVM2+XFS or ZFS), and redirect backends with
> > long-running queries to use fs snapshot mounted to a different
> > mountpoint.
> >
> > I don't think Simon has yet put full support for it in code, but it is
> > clearly _the_ solution for those who want to eat the cake and have it
> > too.
> 
> I think _the_ solution is to notice when you're about to vacuum a page
> that is still visible to a running backend on the standby, and save
> that page off to a separate cache of old page versions (perhaps using
> the relation fork mechanism).  I suspect filesystem-level snapshots
> can be made to work, but it's never going to be a portable solution,
> and I suspect you're going to need a lot of the same bookkeeping
> anyway (like, when you have page X in shared buffers, which version of
> page X is it, anyway?).

The idea was to switch file pointers inside the backend needing old
versions (and then flush the cache if needed), so the only bookkeeping you
need is which fs snapshots you need to keep and which can be released.
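
To illustrate (everything here is invented, none of it exists in the
patch): the backend side is little more than remembering which directory
relation files are opened from, plus throwing away whatever it has cached:

    #include <stdio.h>
    #include <string.h>

    /* Where this backend resolves relation file names; starts at the live
     * data directory and can be pointed at a mounted fs snapshot instead. */
    static char data_root[256] = "/var/lib/pgsql/data";

    static void
    switch_to_fs_snapshot(const char *snapshot_mountpoint)
    {
        strncpy(data_root, snapshot_mountpoint, sizeof(data_root) - 1);
        data_root[sizeof(data_root) - 1] = '\0';
        /* here we would also throw away this backend's cached pages */
    }

    static void
    relation_file_path(char *buf, size_t len, const char *relfilenode)
    {
        snprintf(buf, len, "%s/base/16384/%s", data_root, relfilenode);
    }

    int
    main(void)
    {
        char path[512];

        relation_file_path(path, sizeof(path), "24576");
        printf("before switch: %s\n", path);

        switch_to_fs_snapshot("/mnt/pgsnap-20090203");
        relation_file_path(path, sizeof(path), "24576");
        printf("after switch:  %s\n", path);
        return 0;
    }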



-- 
------------------------------------------
Hannu Krosing   http://www.2ndQuadrant.com
PostgreSQL Scalability and Availability
   Services, Consulting and Training



Re: Hot Standby (v9d)

From
Simon Riggs
Date:
On Tue, 2009-02-03 at 09:14 -0500, Robert Haas wrote:

> I think _the_ solution is to notice when you're about to vacuum a page
> that is still visible to a running backend on the standby, and save
> that page off to a separate cache of old page versions (perhaps using
> the relation fork mechanism).

I'll let you write that, for the next release...

The problem with all of this is we've been talking about it for 8 months
now and various opinions are held by people. What is being presented is
a broad set of options (summarised from Wiki)

1. Wait then cancel
a) always for databases, tablespaces and locks
b) deferred cancelation for buffer changes

2. We connect to Primary server from Standby server and keep a
transaction open using contrib/dblink functions, then commit as soon as
we are done.

3. We pause recovery by issuing a pg_recovery_pause() function, or start
server in pause mode using recovery_started_paused = on.

Yes, it's a critical area to the success of the feature. But this is
enough for first release and for us to get user feedback.
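
Just so we're all looking at the same thing: for (1), the decision reduces
to something like the toy fragment below. This is not the patch code and
every name in it is invented; it just spells out the max_standby_delay
behaviour discussed in this thread.

    #include <stdio.h>
    #include <stdbool.h>

    /*
     * Invented illustration of the max_standby_delay semantics:
     *   max_standby_delay = -1  never cancel, recovery waits for the query
     *   max_standby_delay =  0  cancel the conflicting query immediately
     *   max_standby_delay >  0  let recovery fall behind up to that many
     *                           seconds, then cancel
     */
    static bool
    should_cancel_query(int max_standby_delay, int standby_seconds_behind)
    {
        if (max_standby_delay < 0)
            return false;               /* wait forever */
        return standby_seconds_behind >= max_standby_delay;
    }

    int
    main(void)
    {
        printf("delay=0,   behind=1   -> %s\n", should_cancel_query(0, 1)    ? "cancel" : "wait");
        printf("delay=-1,  behind=900 -> %s\n", should_cancel_query(-1, 900) ? "cancel" : "wait");
        printf("delay=300, behind=60  -> %s\n", should_cancel_query(300, 60) ? "cancel" : "wait");
        printf("delay=300, behind=400 -> %s\n", should_cancel_query(300, 400)? "cancel" : "wait");
        return 0;
    }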

--
 Simon Riggs           www.2ndQuadrant.com
 PostgreSQL Training, Services and Support



Re: Hot Standby (v9d)

From
Hannu Krosing
Date:
On Tue, 2009-02-03 at 13:50 +0000, Gregory Stark wrote:
> Hannu Krosing <hannu@krosing.net> writes:
> 
> > Actually we came up with a solution to this - use filesystem level
> > snapshots (like LVM2+XFS or ZFS), and redirect backends with
> > long-running queries to use fs snapshot mounted to a different
> > mountpoint.
> 
> Uhm, how do you determine which snapshot to direct the backend to? There could
> have been several generations of tuples in that tid since your query started.
> Do you take a snapshot every time there's a vacuum-snapshot conflict and
> record which snapshot goes with that snapshot?

The most sensible thing to do seems to be to wait for some configurable
period (say a few seconds or a few minutes), delaying WAL apply, and then
to do the snapshot, mount it and redirect all running transactions to use
the _current_ filesystem snapshot, and then resume WAL application.

As each of the transactions running on saved fs snapshots completes, it is
switched back to the main/live fs view.

When the last such transaction ends, the snapshot is unmounted and
released.
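
In invented pseudo-C the recovery-side sequence described above would look
roughly like this (helper names made up; the snapshot itself would be taken
by an external LVM/ZFS script):

    #include <stdio.h>

    static int grace_period = 30;           /* seconds to delay WAL apply, configurable */
    static int queries_on_snapshot = 0;     /* transactions still using the fs snapshot */

    static void delay_wal_apply(int secs)       { printf("delaying WAL apply for %ds\n", secs); }
    static void take_and_mount_fs_snapshot(void){ printf("external script: lvcreate/zfs snapshot + mount\n"); }
    static void redirect_running_queries(void)  { queries_on_snapshot = 3; printf("redirected 3 running queries\n"); }
    static void resume_wal_apply(void)          { printf("resuming WAL apply\n"); }
    static void release_fs_snapshot(void)       { printf("unmounting and releasing snapshot\n"); }

    /* On a cleanup conflict: wait a bit, snapshot, redirect, keep going. */
    static void
    on_conflict(void)
    {
        delay_wal_apply(grace_period);
        take_and_mount_fs_snapshot();
        redirect_running_queries();
        resume_wal_apply();
    }

    /* When a redirected transaction finishes, switch it back; drop the
     * snapshot once the last one is gone. */
    static void
    on_redirected_query_done(void)
    {
        if (--queries_on_snapshot == 0)
            release_fs_snapshot();
    }

    int
    main(void)
    {
        on_conflict();
        on_redirected_query_done();
        on_redirected_query_done();
        on_redirected_query_done();
        return 0;
    }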

-- 
------------------------------------------
Hannu Krosing   http://www.2ndQuadrant.com
PostgreSQL Scalability and Availability
   Services, Consulting and Training



Re: Hot Standby (v9d)

From
Andres Freund
Date:
Hi,

On 02/03/2009 02:26 PM, Hannu Krosing wrote:
>> I don't see any way around the fact that when a tuple is removed, it's
>> gone and can't be accessed by queries. Either you don't remove it, or
>> you kill the query.
> Actually we came up with a solution to this - use filesystem level
> snapshots (like LVM2+XFS or ZFS), and redirect backends with
> long-running queries to use fs snapshot mounted to a different
> mountpoint.
Isn't that really, really expensive?

A single write on the master logical volume yields writes of PE size
for _every_ single snapshot (the first time the block is touched) -
considering that there could be quite many such snapshots, I don't think
that this is really feasible - I/O quite possibly might be saturated.

The default PE size is 4MB - but on most bigger systems it is set to a
bigger size, so it's just getting worse for bigger systems.
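
(To put numbers on it: with the default 4MB PE and, say, ten open
snapshots, the first 8kB write into an untouched extent turns into
10 x 4MB = 40MB of copy-out I/O.)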

Sure, one might say that this is an LVM deficiency - but I do not know
of any snapshot-able block layer doing it that way.

Andres


Re: Hot Standby (v9d)

From
Robert Haas
Date:
On Tue, Feb 3, 2009 at 9:40 AM, Simon Riggs <simon@2ndquadrant.com> wrote:
> On Tue, 2009-02-03 at 09:14 -0500, Robert Haas wrote:
>> I think _the_ solution is to notice when you're about to vacuum a page
>> that is still visible to a running backend on the standby, and save
>> that page off to a separate cache of old page versions (perhaps using
>> the relation fork mechanism).
>
> I'll let you write that, for the next release...

LOL.  How many sponsorship dollars are available for that project?

> The problem with all of this is we've been talking about it for 8 months
> now and various opinions are held by people. What is being presented is
> a broad set of options (summarised from Wiki)

I think everyone understands that these are things we might want to do
down the line, not things we need to have now.  For this release, I
was under the impression that we'd pretty much settled on implementing
(1) and maybe (3) but not (2) from the below list.

> 1. Wait then cancel
> a) always for databases, tablespaces and locks
> b) deferred cancelation for buffer changes
>
> 2. We connect to Primary server from Standby server and keep a
> transaction open using contrib/dblink functions, then commit as soon as
> we are done.
>
> 3. We pause recovery by issuing a pg_recovery_pause() function, or start
> server in pause mode using recovery_started_paused = on.
>
> Yes, it's a critical area to the success of the feature. But this is
> enough for first release and for us to get user feedback.

I completely agree.

...Robert


Re: Hot Standby (v9d)

From
Simon Riggs
Date:
On Tue, 2009-02-03 at 15:55 +0100, Andres Freund wrote:
> Hi,
> 
> On 02/03/2009 02:26 PM, Hannu Krosing wrote:
> >> I don't see any way around the fact that when a tuple is removed, it's
> >> gone and can't be accessed by queries. Either you don't remove it, or
> >> you kill the query.
> > Actually we came up with a solution to this - use filesystem level
> > snapshots (like LVM2+XFS or ZFS), and redirect backends with
> > long-running queries to use fs snapshot mounted to a different
> > mountpoint.
> Isn't that really, really expensive?
> 
> A single write on the master logical volume yields writes of PE size 
> for _every_ single snapshot (the first time the block is touched) - 
> considering that there could quite many such snapshots I don't think 
> that this is really feasible - io quite possible might be saturated.

If we did that we would provide an option to select the MVCC snapshot
that went with the filesystem snapshot. There need not be one per user.

--
 Simon Riggs           www.2ndQuadrant.com
 PostgreSQL Training, Services and Support



Re: Hot Standby (v9d)

From
Hannu Krosing
Date:
On Tue, 2009-02-03 at 10:19 -0500, Robert Haas wrote:
> On Tue, Feb 3, 2009 at 9:40 AM, Simon Riggs <simon@2ndquadrant.com> wrote:
> > On Tue, 2009-02-03 at 09:14 -0500, Robert Haas wrote:
> >> I think _the_ solution is to notice when you're about to vacuum a page
> >> that is still visible to a running backend on the standby, and save
> >> that page off to a separate cache of old page versions (perhaps using
> >> the relation fork mechanism).
> >
> > I'll let you write that, for the next release...
> 
> LOL.  How many sponsorship dollars are available for that project?
> 
> > The problem with all of this is we've been talking about it for 8 months
> > now and various opinions are held by people. What is being presented is
> > a broad set of options (summarised from Wiki)
> 
> I think everyone understands that these are things we might want to do
> down the line, not things we need to have now.  For this release, I
> was under the impression that we'd pretty much settled on implementing
> (1) and maybe (3) but not (2) from the below list.

(2) is something that you can always do manually if you need it. So no
reason to support it in HS code explicitly.

Once you keep a trx open on the master, 1 and 3 should not happen anymore
until you close that trx.

> > 1. Wait then cancel
> > a) always for databases, tablespaces and locks
> > b) deferred cancelation for buffer changes
> >
> > 2. We connect to Primary server from Standby server and keep a
> > transaction open using contrib/dblink functions, then commit as soon as
> > we are done.
> >
> > 3. We pause recovery by issuing a pg_recovery_pause() function, or start
> > server in pause mode using recovery_started_paused = on.
> >
> > Yes, it's a critical area to the success of the feature. But this is
> > enough for first release and for us to get user feedback.
> 
> I completely agree.
> 
> ...Robert
> 



Re: Hot Standby (v9d)

From
Hannu Krosing
Date:
On Tue, 2009-02-03 at 14:28 +0000, Simon Riggs wrote:
> On Tue, 2009-02-03 at 08:40 -0500, Andrew Dunstan wrote:
> > 
> > Hannu Krosing wrote:
> > > Actually we came up with a solution to this - use filesystem level
> > > snapshots (like LVM2+XFS or ZFS), and redirect backends with
> > > long-running queries to use fs snapshot mounted to a different
> > > mountpoint.
> > >
> > > I don't think Simon has yet put full support for it in code, but it is
> > > clearly _the_ solution for those who want to eat the cake and have it
> > > too.
> 
> > How does that work if you're using multiple file systems via tablespaces 
> > (e.g. indexes in a different TS)?
> 
> It's a great idea and easy to do, but I can't do everything in one go.
> 
> The main reasons not to are multiple file system difficulties and lack
> of a mainstream Linux solution, and more simply lack of time and test
> resources.

A more general, but also a lot harder, solution would be going back to the
roots and implementing what the original Postgres 4.2 and earlier versions
were meant to do - namely, VACUUM was not meant to just discard older
versions, but rather move them to WORM storage (write once read many was
all the rage back then :).

If we did that in a way that each relation, at least on the warm standby,
has its own "archive" fork, possibly in a separate tablespace for
cheaper storage, then we could basically apply WALs as fast as we want and
just move the old versions to the "archive". It would be slower(ish),
especially for HOT updates, but may be a good solution for lots of
use cases.

And the decision to do the archiving on the master and WAL-copy it to the
slave, or just do it on the slave only, could probably be left to the user.

Reintroducing keeping old tuples "forever" would also allow us to bring
back the time travel feature, that is:

SELECT .... AS OF 'yesterday afternoon'::timestamp;

This was thrown out around the time we got WAL-logging.

To be really useful we should also have some way to know trx timestamps,
but that can easily be done using the ticker feature from Slony or
SkyTools/pgQ, which could be run as a separate server thread similar to
what we do with the background writer, autovacuum, etc. now.
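
As a toy illustration of that last point (all made up): the ticker only has
to record (xid, timestamp) pairs, and AS OF becomes a lookup of the newest
xid ticked at or before the requested time:

    #include <stdio.h>
    #include <stdint.h>
    #include <time.h>

    /* One tick, written periodically: "transactions up to this xid had
     * committed by this time". */
    typedef struct
    {
        uint32_t xid;
        time_t   ticked_at;
    } Tick;

    /* Toy, in-memory tick log, ordered by time. */
    static const Tick ticks[] = {
        { 1000, 1233600000 },
        { 1500, 1233603600 },
        { 2100, 1233607200 },
    };

    /* AS OF 'some timestamp' -> the xid horizon a time-travel query would
     * use: everything newer than this is treated as not yet committed. */
    static uint32_t
    xid_horizon_for(time_t as_of)
    {
        uint32_t horizon = 0;

        for (size_t i = 0; i < sizeof(ticks) / sizeof(ticks[0]); i++)
            if (ticks[i].ticked_at <= as_of)
                horizon = ticks[i].xid;
        return horizon;
    }

    int
    main(void)
    {
        printf("AS OF 1233605000 -> visible up to xid %u\n",
               (unsigned) xid_horizon_for(1233605000));
        return 0;
    }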

> 
> So not now, maybe later.
> 



Re: Hot Standby (v9d)

From
Simon Riggs
Date:
On Tue, 2009-02-03 at 18:09 +0200, Hannu Krosing wrote:
> On Tue, 2009-02-03 at 14:28 +0000, Simon Riggs wrote:
> > On Tue, 2009-02-03 at 08:40 -0500, Andrew Dunstan wrote:
> > > 
> > > Hannu Krosing wrote:
> > > > Actually we came up with a solution to this - use filesystem level
> > > > snapshots (like LVM2+XFS or ZFS), and redirect backends with
> > > > long-running queries to use fs snapshot mounted to a different
> > > > mountpoint.
> > > >
> > > > I don't think Simon has yet put full support for it in code, but it is
> > > > clearly _the_ solution for those who want to eat the cake and have it
> > > > too.
> > 
> > > How does that work if you're using multiple file systems via tablespaces 
> > > (e.g. indexes in a different TS)?
> > 
> > It's a great idea and easy to do, but I can't do everything in one go.
> > 
> > The main reasons not to are multiple file system difficulties and lack
> > of a mainstream Linux solution, and more simply lack of time and test
> > resources.
> 
> More general, but also lot harder, solution would be going back to roots
> and implement what original postgres 4.2 and earlier versions were meant
> to do - namely VACUUM was not meant to just discard older versions , but
> rather move it to WORM storage (write once read many was all the rage
> back then :) .
> 
> If we did that in a way that each relation, at least on warm standby ,
> has its own "archive" fork, possibly in a separate tablespace for
> cheaper storage, then we could basically apply WAL's as fast we want and
> just move the old versions to "archive". It will be slower(ish),
> especially for HOT updates, but may be a good solution for lots of
> usecases.
> 
> And the decision to do the archiving on master and WAL-copy to slave, or
> just do it on slave only could probably be left to user.
> 
> Reintroducing keeping old tuples "forever" would also allow us to bring
> back time travel feature, that is 
> 
> SELECT .... AS OF 'yesterday afternoon'::timestamp;
> 
> Which was thrown out at the times we got WAL-logging.
> 
> To be really useful we should also have some way to know trx timestamps,
> but that can be easily done using ticker feature from Slony -
> SkyTools/pgQ, which could be run a a separate server thread similar to
> what we do with background writer, autovacuum  etc. now.

That might be the way to do the "Named Restore Points" that are
frequently requested.

--
 Simon Riggs           www.2ndQuadrant.com
 PostgreSQL Training, Services and Support