Thread: Hot Standby Design

Hot Standby Design

From
Simon Riggs
Date:
Hot Standby design has been growing on the PostgreSQL Wiki for some
weeks now.

I've updated the design to reflect all feedback received so far on
various topics.

http://wiki.postgresql.org/wiki/Hot_Standby

It's not hugely detailed in every area, but it gives a flavour of the
topics and issues related to them.

Comments or questions welcome here, or I will discuss specific areas in
more detail as I tackle those topics.

-- Simon Riggs           www.2ndQuadrant.comPostgreSQL Training, Services and Support



Re: Hot Standby Design

From
Robert Treat
Date:
On Tuesday 23 September 2008 14:15:34 Simon Riggs wrote:
> Hot Standby design has been growing on the PostgreSQL Wiki for some
> weeks now.
>
> I've updated the design to reflect all feedback received so far on
> various topics.
>
> http://wiki.postgresql.org/wiki/Hot_Standby
>
> It's not hugely detailed in every area, but it gives a flavour of the
> topics and issues related to them.
>
> Comments or questions welcome here, or I will discuss specific areas in
> more detail as I tackle those topics.
>

very nice work. very in depth. unfortunatly, this means it is long and the 
hour is late... so here are some scattered thoughts i had while reading it :

* "However, some WAL redo actions will be for DDL actions. These DDL actions 
are repeating actions that have already committed on the primary node, so 
they must not fail on the standby node. These DDL locks take priority and 
will automatically cancel any read-only transactions that get in their way."

Some people will want the option to stack-up ddl transactions behind 
long-running queries (one of the main use cases of a hot slave is reporting 
stye and other long running queries)


* Actions not allowed on Standby are:
DML - Insert, Update, Delete, COPY FROM, Truncate

copy from suprised me a bit, since it is something i could see people wanting 
to do... did you mean COPY TO in this case?

* Statspack functions should work OK, so tools such as pgAdminIII should work. 
pgAgent will not.

I presume this means the backend kill function would work?  Also, can you go 
into why pgAgent would not work? (I presume it's because you can't update 
information that needs to be changed when jobs run, if thats the case it 
might be worth thinking about making pgAgent work across different clusters)

* Looking for suggestions about what monitoring capability will be required.

one place to start might be to think about which checks in check_nagios might 
still apply.  Possibly also looking to systems like slony for some 
guidence... for example everyone will want some way to check how far the lag 
is on a stnadby machine. 

* The following commands will not be accepted during recovery mode GRANT, REVOKE, REASSIGN 

How is user management done on a standby? can you have users that dont exist 
on the primary (it would seem not). 

... more to come i'm sure, but fading out... thanks again for the work so far 
Simon. 

-- 
Robert Treat
Build A Brighter LAMP :: Linux Apache {middleware} PostgreSQL


Re: Hot Standby Design

From
Simon Riggs
Date:
On Wed, 2008-09-24 at 00:30 -0400, Robert Treat wrote:

> here are some scattered thoughts i had while reading it :

Thanks for your comments.

It is very detailed, so further feedback is going to be very beneficial
in making this all work in the way we hope and expect.

> * "However, some WAL redo actions will be for DDL actions. These DDL actions 
> are repeating actions that have already committed on the primary node, so 
> they must not fail on the standby node. These DDL locks take priority and 
> will automatically cancel any read-only transactions that get in their way."
> 
> Some people will want the option to stack-up ddl transactions behind 
> long-running queries (one of the main use cases of a hot slave is reporting 
> stye and other long running queries)

Scheduling tools can help here. Run set of queries on Standby, then when
complete run DDL on Primary and have its changes filter through.

> * Actions not allowed on Standby are:
> DML - Insert, Update, Delete, COPY FROM, Truncate
> 
> copy from suprised me a bit, since it is something i could see people wanting 
> to do... did you mean COPY TO in this case?

I checked...

COPY TO is allowed, since it extracts data. So pg_dump will run. 
COPY FROM will not be allowed since it loads data. So pg_restore will
not run.

> * Statspack functions should work OK, so tools such as pgAdminIII should work. 
> pgAgent will not.
> 
> I presume this means the backend kill function would work?  

Yes it will.

> Also, can you go 
> into why pgAgent would not work? (I presume it's because you can't update 
> information that needs to be changed when jobs run, if thats the case it 
> might be worth thinking about making pgAgent work across different clusters)

Yes

> * Looking for suggestions about what monitoring capability will be required.
> 
> one place to start might be to think about which checks in check_nagios might 
> still apply.  Possibly also looking to systems like slony for some 
> guidence... for example 

check_pgsql will work, but its very simple.

check_postgres will also work, though many some actions could give
different or confusing results. e.g. last vacuum time will not be
maintained for example, since no vacuum occurs on the standby.

slony won't run on the standby either, so those checks won't work
either. 

> everyone will want some way to check how far the lag 
> is on a stnadby machine. 

Agreed

> * The following commands will not be accepted during recovery mode 
>  GRANT, REVOKE, REASSIGN 
> 
> How is user management done on a standby? can you have users that dont
> exist 
> on the primary (it would seem not). 

No you can't have different users. (In time, I see this as a good thing
because it will allow us to move queries around to different nodes for
execution so that a single database acts like a "hive mind".)

Nearly everything must be set via the Primary. Users, passwords.
Changes to .conf files will be possible. There is no mechanism to auto
synchronise the .conf files between nodes.

The Standby is a Clone and not a Slave. A Slave is a separate database
that is forced to duplicate the actions of the Master. The Standby is an
exact copy, in every detail that matters.

I can see it might be desirable to have it work that way, but that's not
going to happen in the first release.

-- Simon Riggs           www.2ndQuadrant.comPostgreSQL Training, Services and Support



Re: Hot Standby Design

From
"Dave Page"
Date:
On Wed, Sep 24, 2008 at 5:30 AM, Robert Treat
<xzilla@users.sourceforge.net> wrote:

> I presume this means the backend kill function would work?  Also, can you go
> into why pgAgent would not work? (I presume it's because you can't update
> information that needs to be changed when jobs run, if thats the case it
> might be worth thinking about making pgAgent work across different clusters)

Probably - it needs to be able to update its catalogs to record job
results and ensure that only a single node runs any given job
instance.

We could probably update the code to optionally accept a connection
string instead of a database name, which would allow us to operate
against servers other than the one with the catalogs installed.

-- 
Dave Page
EnterpriseDB UK: http://www.enterprisedb.com


Re: Hot Standby Design

From
Simon Riggs
Date:
On Wed, 2008-09-24 at 08:28 +0100, Dave Page wrote:
> On Wed, Sep 24, 2008 at 5:30 AM, Robert Treat
> <xzilla@users.sourceforge.net> wrote:
> 
> > I presume this means the backend kill function would work?  Also, can you go
> > into why pgAgent would not work? (I presume it's because you can't update
> > information that needs to be changed when jobs run, if thats the case it
> > might be worth thinking about making pgAgent work across different clusters)
> 
> Probably - it needs to be able to update its catalogs to record job
> results and ensure that only a single node runs any given job
> instance.
> 
> We could probably update the code to optionally accept a connection
> string instead of a database name, which would allow us to operate
> against servers other than the one with the catalogs installed.

It would be useful to be able to schedule jobs against multiple nodes.

Not much need for maintenance actions on a Standby node, but I think if
I say "there aren't any needed at all" that would likely be wrong.

-- Simon Riggs           www.2ndQuadrant.comPostgreSQL Training, Services and Support



Re: Hot Standby Design

From
Robert Treat
Date:
On Wednesday 24 September 2008 03:27:44 Simon Riggs wrote:
> On Wed, 2008-09-24 at 00:30 -0400, Robert Treat wrote:
> > here are some scattered thoughts i had while reading it :
>
> Thanks for your comments.
>
> It is very detailed, so further feedback is going to be very beneficial
> in making this all work in the way we hope and expect.
>
> > * "However, some WAL redo actions will be for DDL actions. These DDL
> > actions are repeating actions that have already committed on the primary
> > node, so they must not fail on the standby node. These DDL locks take
> > priority and will automatically cancel any read-only transactions that
> > get in their way."
> >
> > Some people will want the option to stack-up ddl transactions behind
> > long-running queries (one of the main use cases of a hot slave is
> > reporting stye and other long running queries)
>
> Scheduling tools can help here. Run set of queries on Standby, then when
> complete run DDL on Primary and have its changes filter through.
>

true... i am just reminded of Oracle 8's log replay, where you had to stop 
replay to run any queries... given that was a usefull feature, I suspect 
we'll hear complaints about it not going that way. I think one could argue 
that we might be able to provide such an option in the future, if not in the 
first release. 

> > * Actions not allowed on Standby are:
> > DML - Insert, Update, Delete, COPY FROM, Truncate
> >
> > copy from suprised me a bit, since it is something i could see people
> > wanting to do... did you mean COPY TO in this case?
>
> I checked...
>
> COPY TO is allowed, since it extracts data. So pg_dump will run.
> COPY FROM will not be allowed since it loads data. So pg_restore will
> not run.
>

ah, of course... the late hour had me thinking backwards.  pg_dump on the 
clone will be a big plus. 

<snip>
> > * Looking for suggestions about what monitoring capability will be
> > required.
> >
> > one place to start might be to think about which checks in check_nagios
> > might still apply.  Possibly also looking to systems like slony for some
> > guidence... for example
>
> check_pgsql will work, but its very simple.
>
> check_postgres will also work, though many some actions could give
> different or confusing results. e.g. last vacuum time will not be
> maintained for example, since no vacuum occurs on the standby.
>

yep.

> slony won't run on the standby either, so those checks won't work
> either.
>
> > everyone will want some way to check how far the lag
> > is on a stnadby machine.
>
> Agreed
>

right... my thought with slony was, what do people monitor on thier slony 
slaves?  (there is actually a nagios script for that iirc)

<snip>
> The Standby is a Clone and not a Slave. A Slave is a separate database
> that is forced to duplicate the actions of the Master. The Standby is an
> exact copy, in every detail that matters.
>

This is an interesting clarification. 

-- 
Robert Treat
Build A Brighter LAMP :: Linux Apache {middleware} PostgreSQL


Re: Hot Standby Design

From
Simon Riggs
Date:
On Wed, 2008-09-24 at 12:35 -0400, Robert Treat wrote:
> On Wednesday 24 September 2008 03:27:44 Simon Riggs wrote:
> > On Wed, 2008-09-24 at 00:30 -0400, Robert Treat wrote:
> > > * "However, some WAL redo actions will be for DDL actions. These DDL
> > > actions are repeating actions that have already committed on the primary
> > > node, so they must not fail on the standby node. These DDL locks take
> > > priority and will automatically cancel any read-only transactions that
> > > get in their way."
> > >
> > > Some people will want the option to stack-up ddl transactions behind
> > > long-running queries (one of the main use cases of a hot slave is
> > > reporting stye and other long running queries)
> >
> > Scheduling tools can help here. Run set of queries on Standby, then when
> > complete run DDL on Primary and have its changes filter through.
> >
> 
> true... i am just reminded of Oracle 8's log replay, where you had to stop 
> replay to run any queries... given that was a usefull feature, I suspect 
> we'll hear complaints about it not going that way. I think one could argue 
> that we might be able to provide such an option in the future, if not in the 
> first release. 

It probably sounds worse than it is. If you drop a table on the Primary,
then somebody running a query against it on the Standby is in for a
shock. But on the other hand, why are you dropping a table that people
still consider worth reading? DROP TABLE should be carefully researched
before use, so I don't think its a big problem. 

We could make it wait for a while before cancelling, as an option, if
you think its important?

-- Simon Riggs           www.2ndQuadrant.comPostgreSQL Training, Services and Support