Thread: Hot Standby Design
Hot Standby design has been growing on the PostgreSQL Wiki for some weeks now. I've updated the design to reflect all feedback received so far on various topics. http://wiki.postgresql.org/wiki/Hot_Standby It's not hugely detailed in every area, but it gives a flavour of the topics and issues related to them. Comments or questions welcome here, or I will discuss specific areas in more detail as I tackle those topics. -- Simon Riggs www.2ndQuadrant.comPostgreSQL Training, Services and Support
On Tuesday 23 September 2008 14:15:34 Simon Riggs wrote: > Hot Standby design has been growing on the PostgreSQL Wiki for some > weeks now. > > I've updated the design to reflect all feedback received so far on > various topics. > > http://wiki.postgresql.org/wiki/Hot_Standby > > It's not hugely detailed in every area, but it gives a flavour of the > topics and issues related to them. > > Comments or questions welcome here, or I will discuss specific areas in > more detail as I tackle those topics. > very nice work. very in depth. unfortunatly, this means it is long and the hour is late... so here are some scattered thoughts i had while reading it : * "However, some WAL redo actions will be for DDL actions. These DDL actions are repeating actions that have already committed on the primary node, so they must not fail on the standby node. These DDL locks take priority and will automatically cancel any read-only transactions that get in their way." Some people will want the option to stack-up ddl transactions behind long-running queries (one of the main use cases of a hot slave is reporting stye and other long running queries) * Actions not allowed on Standby are: DML - Insert, Update, Delete, COPY FROM, Truncate copy from suprised me a bit, since it is something i could see people wanting to do... did you mean COPY TO in this case? * Statspack functions should work OK, so tools such as pgAdminIII should work. pgAgent will not. I presume this means the backend kill function would work? Also, can you go into why pgAgent would not work? (I presume it's because you can't update information that needs to be changed when jobs run, if thats the case it might be worth thinking about making pgAgent work across different clusters) * Looking for suggestions about what monitoring capability will be required. one place to start might be to think about which checks in check_nagios might still apply. Possibly also looking to systems like slony for some guidence... for example everyone will want some way to check how far the lag is on a stnadby machine. * The following commands will not be accepted during recovery mode GRANT, REVOKE, REASSIGN How is user management done on a standby? can you have users that dont exist on the primary (it would seem not). ... more to come i'm sure, but fading out... thanks again for the work so far Simon. -- Robert Treat Build A Brighter LAMP :: Linux Apache {middleware} PostgreSQL
On Wed, 2008-09-24 at 00:30 -0400, Robert Treat wrote: > here are some scattered thoughts i had while reading it : Thanks for your comments. It is very detailed, so further feedback is going to be very beneficial in making this all work in the way we hope and expect. > * "However, some WAL redo actions will be for DDL actions. These DDL actions > are repeating actions that have already committed on the primary node, so > they must not fail on the standby node. These DDL locks take priority and > will automatically cancel any read-only transactions that get in their way." > > Some people will want the option to stack-up ddl transactions behind > long-running queries (one of the main use cases of a hot slave is reporting > stye and other long running queries) Scheduling tools can help here. Run set of queries on Standby, then when complete run DDL on Primary and have its changes filter through. > * Actions not allowed on Standby are: > DML - Insert, Update, Delete, COPY FROM, Truncate > > copy from suprised me a bit, since it is something i could see people wanting > to do... did you mean COPY TO in this case? I checked... COPY TO is allowed, since it extracts data. So pg_dump will run. COPY FROM will not be allowed since it loads data. So pg_restore will not run. > * Statspack functions should work OK, so tools such as pgAdminIII should work. > pgAgent will not. > > I presume this means the backend kill function would work? Yes it will. > Also, can you go > into why pgAgent would not work? (I presume it's because you can't update > information that needs to be changed when jobs run, if thats the case it > might be worth thinking about making pgAgent work across different clusters) Yes > * Looking for suggestions about what monitoring capability will be required. > > one place to start might be to think about which checks in check_nagios might > still apply. Possibly also looking to systems like slony for some > guidence... for example check_pgsql will work, but its very simple. check_postgres will also work, though many some actions could give different or confusing results. e.g. last vacuum time will not be maintained for example, since no vacuum occurs on the standby. slony won't run on the standby either, so those checks won't work either. > everyone will want some way to check how far the lag > is on a stnadby machine. Agreed > * The following commands will not be accepted during recovery mode > GRANT, REVOKE, REASSIGN > > How is user management done on a standby? can you have users that dont > exist > on the primary (it would seem not). No you can't have different users. (In time, I see this as a good thing because it will allow us to move queries around to different nodes for execution so that a single database acts like a "hive mind".) Nearly everything must be set via the Primary. Users, passwords. Changes to .conf files will be possible. There is no mechanism to auto synchronise the .conf files between nodes. The Standby is a Clone and not a Slave. A Slave is a separate database that is forced to duplicate the actions of the Master. The Standby is an exact copy, in every detail that matters. I can see it might be desirable to have it work that way, but that's not going to happen in the first release. -- Simon Riggs www.2ndQuadrant.comPostgreSQL Training, Services and Support
On Wed, Sep 24, 2008 at 5:30 AM, Robert Treat <xzilla@users.sourceforge.net> wrote: > I presume this means the backend kill function would work? Also, can you go > into why pgAgent would not work? (I presume it's because you can't update > information that needs to be changed when jobs run, if thats the case it > might be worth thinking about making pgAgent work across different clusters) Probably - it needs to be able to update its catalogs to record job results and ensure that only a single node runs any given job instance. We could probably update the code to optionally accept a connection string instead of a database name, which would allow us to operate against servers other than the one with the catalogs installed. -- Dave Page EnterpriseDB UK: http://www.enterprisedb.com
On Wed, 2008-09-24 at 08:28 +0100, Dave Page wrote: > On Wed, Sep 24, 2008 at 5:30 AM, Robert Treat > <xzilla@users.sourceforge.net> wrote: > > > I presume this means the backend kill function would work? Also, can you go > > into why pgAgent would not work? (I presume it's because you can't update > > information that needs to be changed when jobs run, if thats the case it > > might be worth thinking about making pgAgent work across different clusters) > > Probably - it needs to be able to update its catalogs to record job > results and ensure that only a single node runs any given job > instance. > > We could probably update the code to optionally accept a connection > string instead of a database name, which would allow us to operate > against servers other than the one with the catalogs installed. It would be useful to be able to schedule jobs against multiple nodes. Not much need for maintenance actions on a Standby node, but I think if I say "there aren't any needed at all" that would likely be wrong. -- Simon Riggs www.2ndQuadrant.comPostgreSQL Training, Services and Support
On Wednesday 24 September 2008 03:27:44 Simon Riggs wrote: > On Wed, 2008-09-24 at 00:30 -0400, Robert Treat wrote: > > here are some scattered thoughts i had while reading it : > > Thanks for your comments. > > It is very detailed, so further feedback is going to be very beneficial > in making this all work in the way we hope and expect. > > > * "However, some WAL redo actions will be for DDL actions. These DDL > > actions are repeating actions that have already committed on the primary > > node, so they must not fail on the standby node. These DDL locks take > > priority and will automatically cancel any read-only transactions that > > get in their way." > > > > Some people will want the option to stack-up ddl transactions behind > > long-running queries (one of the main use cases of a hot slave is > > reporting stye and other long running queries) > > Scheduling tools can help here. Run set of queries on Standby, then when > complete run DDL on Primary and have its changes filter through. > true... i am just reminded of Oracle 8's log replay, where you had to stop replay to run any queries... given that was a usefull feature, I suspect we'll hear complaints about it not going that way. I think one could argue that we might be able to provide such an option in the future, if not in the first release. > > * Actions not allowed on Standby are: > > DML - Insert, Update, Delete, COPY FROM, Truncate > > > > copy from suprised me a bit, since it is something i could see people > > wanting to do... did you mean COPY TO in this case? > > I checked... > > COPY TO is allowed, since it extracts data. So pg_dump will run. > COPY FROM will not be allowed since it loads data. So pg_restore will > not run. > ah, of course... the late hour had me thinking backwards. pg_dump on the clone will be a big plus. <snip> > > * Looking for suggestions about what monitoring capability will be > > required. > > > > one place to start might be to think about which checks in check_nagios > > might still apply. Possibly also looking to systems like slony for some > > guidence... for example > > check_pgsql will work, but its very simple. > > check_postgres will also work, though many some actions could give > different or confusing results. e.g. last vacuum time will not be > maintained for example, since no vacuum occurs on the standby. > yep. > slony won't run on the standby either, so those checks won't work > either. > > > everyone will want some way to check how far the lag > > is on a stnadby machine. > > Agreed > right... my thought with slony was, what do people monitor on thier slony slaves? (there is actually a nagios script for that iirc) <snip> > The Standby is a Clone and not a Slave. A Slave is a separate database > that is forced to duplicate the actions of the Master. The Standby is an > exact copy, in every detail that matters. > This is an interesting clarification. -- Robert Treat Build A Brighter LAMP :: Linux Apache {middleware} PostgreSQL
On Wed, 2008-09-24 at 12:35 -0400, Robert Treat wrote: > On Wednesday 24 September 2008 03:27:44 Simon Riggs wrote: > > On Wed, 2008-09-24 at 00:30 -0400, Robert Treat wrote: > > > * "However, some WAL redo actions will be for DDL actions. These DDL > > > actions are repeating actions that have already committed on the primary > > > node, so they must not fail on the standby node. These DDL locks take > > > priority and will automatically cancel any read-only transactions that > > > get in their way." > > > > > > Some people will want the option to stack-up ddl transactions behind > > > long-running queries (one of the main use cases of a hot slave is > > > reporting stye and other long running queries) > > > > Scheduling tools can help here. Run set of queries on Standby, then when > > complete run DDL on Primary and have its changes filter through. > > > > true... i am just reminded of Oracle 8's log replay, where you had to stop > replay to run any queries... given that was a usefull feature, I suspect > we'll hear complaints about it not going that way. I think one could argue > that we might be able to provide such an option in the future, if not in the > first release. It probably sounds worse than it is. If you drop a table on the Primary, then somebody running a query against it on the Standby is in for a shock. But on the other hand, why are you dropping a table that people still consider worth reading? DROP TABLE should be carefully researched before use, so I don't think its a big problem. We could make it wait for a while before cancelling, as an option, if you think its important? -- Simon Riggs www.2ndQuadrant.comPostgreSQL Training, Services and Support