*** a/doc/src/sgml/backup.sgml
--- b/doc/src/sgml/backup.sgml
***************
*** 1492,2879 **** archive_command = 'local_backup_script.sh'
-
- Warm Standby Servers for High Availability
-
-
- warm standby
-
-
-
- PITR standby
-
-
-
- standby server
-
-
-
- log shipping
-
-
-
- witness server
-
-
-
- STONITH
-
-
-
- high availability
-
-
-
- Continuous archiving can be used to create a high
- availability> (HA) cluster configuration with one or more
- standby servers> ready to take over operations if the
- primary server fails. This capability is widely referred to as
- warm standby> or log shipping>.
-
-
-
- The primary and standby server work together to provide this capability,
- though the servers are only loosely coupled. The primary server operates
- in continuous archiving mode, while each standby server operates in
- continuous recovery mode, reading the WAL files from the primary. No
- changes to the database tables are required to enable this capability,
- so it offers low administration overhead compared to some other
- replication approaches. This configuration also has relatively low
- performance impact on the primary server.
-
-
-
- Directly moving WAL records from one database server to another
- is typically described as log shipping. PostgreSQL>
- implements file-based log shipping, which means that WAL records are
- transferred one file (WAL segment) at a time. WAL files (16MB) can be
- shipped easily and cheaply over any distance, whether it be to an
- adjacent system, another system at the same site, or another system on
- the far side of the globe. The bandwidth required for this technique
- varies according to the transaction rate of the primary server.
- Record-based log shipping is also possible with custom-developed
- procedures, as discussed in .
-
-
-
- It should be noted that log shipping is asynchronous, i.e., the WAL
- records are shipped after transaction commit. As a result there is a
- window for data loss should the primary server suffer a catastrophic
- failure: transactions not yet shipped will be lost. The length of the
- window of data loss can be limited by use of the
- archive_timeout parameter, which can be set as low
- as a few seconds if required. However such a low setting will
- substantially increase the bandwidth required for file shipping.
- If you need a window of less than a minute or so, it's probably better
- to consider record-based log shipping.
-
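-
- For example, to bound the data-loss window to roughly 30 seconds, the
- primary's postgresql.conf> might contain (the value is
- illustrative):
-
- archive_timeout = 30    # in seconds; forces a segment switch this often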
-
-
- The standby server is not available for access, since it is continually
- performing recovery processing. Recovery performance is sufficiently
- good that the standby will typically be only moments away from full
- availability once it has been activated. As a result, we refer to this
- capability as a warm standby configuration that offers high
- availability. Restoring a server from an archived base backup and
- rollforward will take considerably longer, so that technique only
- offers a solution for disaster recovery, not high availability.
-
-
-
- Planning
-
-
- It is usually wise to create the primary and standby servers
- so that they are as similar as possible, at least from the
- perspective of the database server. In particular, the path names
- associated with tablespaces will be passed across unmodified, so both
- primary and standby servers must have the same mount paths for
- tablespaces if that feature is used. Keep in mind that if
-
- is executed on the primary, any new mount point needed for it must
- be created on the primary and all standby servers before the command
- is executed. Hardware need not be exactly the same, but experience shows
- that maintaining two identical systems is easier than maintaining two
- dissimilar ones over the lifetime of the application and system.
- In any case the hardware architecture must be the same — shipping
- from, say, a 32-bit to a 64-bit system will not work.
-
-
-
- In general, log shipping between servers running different major
- PostgreSQL> release
- levels is not possible. It is the policy of the PostgreSQL Global
- Development Group not to make changes to disk formats during minor release
- upgrades, so it is likely that running different minor release levels
- on primary and standby servers will work successfully. However, no
- formal support for that is offered and you are advised to keep primary
- and standby servers at the same release level as much as possible.
- When updating to a new minor release, the safest policy is to update
- the standby servers first — a new minor release is more likely
- to be able to read WAL files from a previous minor release than vice
- versa.
-
-
-
- There is no special mode required to enable a standby server. The
- operations that occur on both primary and standby servers are
- normal continuous archiving and recovery tasks. The only point of
- contact between the two database servers is the archive of WAL files
- that both share: primary writing to the archive, standby reading from
- the archive. Care must be taken to ensure that WAL archives from separate
- primary servers do not become mixed together or confused. The archive
- need not be large if it is only required for standby operation.
-
-
-
- The magic that makes the two loosely coupled servers work together is
- simply a restore_command> used on the standby that,
- when asked for the next WAL file, waits for it to become available from
- the primary. The restore_command> is specified in the
- recovery.conf> file on the standby server. Normal recovery
- processing would request a file from the WAL archive, reporting failure
- if the file was unavailable. For standby processing it is normal for
- the next WAL file to be unavailable, so we must be patient and wait for
- it to appear. For files ending in .backup> or
- .history> there is no need to wait, and a non-zero return
- code must be returned. A waiting restore_command> can be
- written as a custom script that loops after polling for the existence of
- the next WAL file. There must also be some way to trigger failover, which
- should interrupt the restore_command>, break the loop and
- return a file-not-found error to the standby server. This ends recovery
- and the standby will then come up as a normal server.
-
-
-
- Pseudocode for a suitable restore_command> is:
-
- triggered = false;
- while (!NextWALFileReady() && !triggered)
- {
- usleep(100000L); /* wait for ~0.1 sec */
- if (CheckForExternalTrigger())
- triggered = true;
- }
- if (!triggered)
- CopyWALFileForRecovery();
-
-
-
-
- A working example of a waiting restore_command> is provided
- as a contrib> module named pg_standby>. It
- should be used as a reference on how to correctly implement the logic
- described above. It can also be extended as needed to support specific
- configurations and environments.
-
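-
- Purely as an illustration (pg_standby> remains the reference
- implementation), a minimal waiting restore_command> could be
- sketched in shell as follows; the archive and trigger file paths are
- hypothetical:
-
- #!/bin/sh
- # waiting_restore.sh: invoked as restore_command = 'waiting_restore.sh %f %p'
- ARCHIVE=/mnt/standby/archive    # assumed location of the shared WAL archive
- TRIGGER=/tmp/pgsql.trigger      # assumed failover trigger file
- FILE="$1"                       # %f: requested WAL file name
- DEST="$2"                       # %p: path to copy the file to
-
- # Never wait for .backup or .history files; fail at once if absent.
- case "$FILE" in
-     *.backup|*.history)
-         cp "$ARCHIVE/$FILE" "$DEST"
-         exit $? ;;
- esac
-
- while true; do
-     if [ -f "$ARCHIVE/$FILE" ]; then
-         cp "$ARCHIVE/$FILE" "$DEST"    # file has arrived; hand it to recovery
-         exit $?
-     fi
-     if [ -f "$TRIGGER" ]; then
-         exit 1                         # trigger seen: end recovery, come up
-     fi
-     sleep 1                            # poll again shortly
- done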
-
-
- PostgreSQL does not provide the system
- software required to identify a failure on the primary and notify
- the standby database server. Many such tools exist and are well
- integrated with the operating system facilities required for
- successful failover, such as IP address migration.
-
-
-
- The method for triggering failover is an important part of planning
- and design. One potential option is the restore_command>
- command. It is executed once for each WAL file, but because the process
- running the restore_command> is created and exits for
- each file, there is no daemon or server process, and we cannot
- use signals or a signal handler. Therefore, the
- restore_command> is not suitable to trigger failover.
- It is possible to use a simple timeout facility, especially if
- used in conjunction with a known archive_timeout>
- setting on the primary. However, this is somewhat error prone
- since a network problem or busy primary server might be sufficient
- to initiate failover. A notification mechanism such as the explicit
- creation of a trigger file is ideal, if this can be arranged.
-
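-
- With a script like the hypothetical waiting_restore.sh> shown
- above, failover would be triggered simply by creating the agreed-upon
- file on the standby:
-
- $ touch /tmp/pgsql.trigger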
-
-
- The size of the WAL archive can be minimized by using the %r>
- option of the restore_command>. This option specifies the
- last archive file name that needs to be kept to allow the recovery to
- restart correctly. This can be used to truncate the archive once
- files are no longer required, assuming the archive is writable from the
- standby server.
-
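-
- For example, pg_standby> accepts the %r> value as an
- optional final argument and removes archive files older than the one it
- names; the paths shown are illustrative:
-
- restore_command = 'pg_standby -t /tmp/pgsql.trigger /mnt/standby/archive %f %p %r'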
-
-
-
- Implementation
-
-
- The short procedure for configuring a standby server is as follows. For
- full details of each step, refer to previous sections as noted.
-
-
-
- Set up primary and standby systems as nearly identical as
- possible, including two identical copies of
- PostgreSQL> at the same release level.
-
-
-
-
- Set up continuous archiving from the primary to a WAL archive
- directory on the standby server. Ensure that
- ,
- and
-
- are set appropriately on the primary
- (see ).
-
-
-
-
- Make a base backup of the primary server (see ), and load this data onto the standby.
-
-
-
-
- Begin recovery on the standby server from the local WAL
- archive, using a recovery.conf> that specifies a
- restore_command> that waits as described
- previously (see ).
-
-
-
-
-
-
- Recovery treats the WAL archive as read-only, so once a WAL file has
- been copied to the standby system it can be copied to tape at the same
- time as it is being read by the standby database server.
- Thus, running a standby server for high availability can be performed at
- the same time as files are stored for longer term disaster recovery
- purposes.
-
-
-
- For testing purposes, it is possible to run both primary and standby
- servers on the same system. This does not provide any worthwhile
- improvement in server robustness, nor would it be described as HA.
-
-
-
-
- Failover
-
-
- If the primary server fails then the standby server should begin
- failover procedures.
-
-
-
- If the standby server fails then no failover need take place. If the
- standby server can be restarted, even some time later, then the recovery
- process can also be immediately restarted, taking advantage of
- restartable recovery. If the standby server cannot be restarted, then a
- full new standby server instance should be created.
-
-
-
- If the primary server fails and the standby server becomes the
- new primary, and then the old primary restarts, you must have
- a mechanism for informing the old primary that it is no longer the primary. This is
- sometimes known as STONITH (Shoot The Other Node In The Head), which is
- necessary to avoid situations where both systems think they are the
- primary, which will lead to confusion and ultimately data loss.
-
-
-
- Many failover systems use just two systems, the primary and the standby,
- connected by some kind of heartbeat mechanism to continually verify the
- connectivity between the two and the viability of the primary. It is
- also possible to use a third system (called a witness server) to prevent
- some cases of inappropriate failover, but the additional complexity
- might not be worthwhile unless it is set up with sufficient care and
- rigorous testing.
-
-
-
- Once failover to the standby occurs, we have only a
- single server in operation. This is known as a degenerate state.
- The former standby is now the primary, but the former primary is down
- and might stay down. To return to normal operation we must
- fully recreate a standby server,
- either on the former primary system when it comes up, or on a third,
- possibly new, system. Once complete the primary and standby can be
- considered to have switched roles. Some people choose to use a third
- server to provide backup for the new primary until the new standby
- server is recreated,
- though clearly this complicates the system configuration and
- operational processes.
-
-
-
- So, switching from primary to standby server can be fast but requires
- some time to re-prepare the failover cluster. Regular switching from
- primary to standby is useful, since it allows regular downtime on
- each system for maintenance. This also serves as a test of the
- failover mechanism to ensure that it will really work when you need it.
- Written administration procedures are advised.
-
-
-
-
- Record-based Log Shipping
-
-
- PostgreSQL directly supports file-based
- log shipping as described above. It is also possible to implement
- record-based log shipping, though this requires custom development.
-
-
-
- An external program can call the pg_xlogfile_name_offset()>
- function (see )
- to find out the file name and the exact byte offset within it of
- the current end of WAL. It can then access the WAL file directly
- and copy the data from the last known end of WAL through the current end
- over to the standby servers. With this approach, the window for data
- loss is the polling cycle time of the copying program, which can be very
- small, and there is no wasted bandwidth from forcing partially-used
- segment files to be archived. Note that the standby servers'
- restore_command> scripts can only deal with whole WAL files,
- so the incrementally copied data is not ordinarily made available to
- the standby servers. It is of use only when the primary dies —
- then the last partial WAL file is fed to the standby before allowing
- it to come up. The correct implementation of this process requires
- cooperation of the restore_command> script with the data
- copying program.
-
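-
- For example, the copying program might poll the primary with a query
- such as:
-
- SELECT * FROM pg_xlogfile_name_offset(pg_current_xlog_location());
-
- which returns the file name and the byte offset of the current end of WAL.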
-
-
- Starting with PostgreSQL> version 8.5, you can use
- streaming replication (see ) to
- achieve the same with less effort.
-
-
-
-
- Streaming Replication
-
-
- PostgreSQL> includes a simple streaming replication
- mechanism, which lets the standby server stay more up-to-date than
- file-based replication allows. The standby connects to the primary
- and the primary starts streaming WAL records from where the standby
- left off, and continues streaming them as they are generated, without
- waiting for the WAL file to be filled. So with streaming replication,
- archive_timeout> does not need to be configured.
-
-
-
- Streaming replication relies on file-based continuous archiving for
- making the base backup and for allowing a standby to catch up if it's
- disconnected from the primary for long enough for the primary to
- delete old WAL files still required by the standby.
-
-
-
- Setup
-
- The short procedure for configuring streaming replication is as follows.
- For full details of each step, refer to other sections as noted.
-
-
-
- Set up primary and standby systems as nearly identical as possible,
- including two identical copies of PostgreSQL> at the
- same release level.
-
-
-
-
- Set up continuous archiving from the primary to a WAL archive located
- in a directory on the standby server. Ensure that
- ,
- and
-
- are set appropriately on the primary
- (see ).
-
-
-
-
-
- Set up connections and authentication so that the standby server can
- successfully connect to the pseudo replication> database of
- the primary server (see
- ). Ensure that
- and pg_hba.conf> are
- configured appropriately on the primary.
-
-
- On systems that support the keepalive socket option, setting
- ,
- and
- helps you detect
- problems with replication (e.g., a network outage or the failure of
- the standby server) as soon as possible.
-
-
-
-
- Set the maximum number of concurrent connections from the standby servers
- (see for details).
-
-
-
-
- Enable WAL archiving in the primary server because we need to make a base
- backup of it later (see and
- for details).
-
-
-
-
- Start the PostgreSQL> server on the primary.
-
-
-
-
- Make a base backup of the primary server (see
- ), and load this data onto the
- standby. Note that all files present in pg_xlog>
- and pg_xlog/archive_status> on the standby>
- server should be removed because they might be obsolete.
-
-
-
-
- Set up WAL archiving, connections and authentication like the primary
- server, because the standby server might work as a primary server after
- failover. Ensure that your settings are consistent with the
- future> environment after the primary and the standby
- server are interchanged by failover. If you're setting up the standby
- server for, e.g., reporting purposes, with no plans to fail over to it,
- configure the standby accordingly.
-
-
-
-
- Create a recovery command file recovery.conf> in the data
- directory on the standby server (a combined example is shown
- after this procedure).
-
-
-
-
- standby_mode (boolean)
-
-
- Specifies whether to start the PostgreSQL> server as
- a standby. If this parameter is on>, streaming
- replication is enabled and the standby server will try to connect
- to the primary to receive and apply WAL records continuously. The
- default is off>, which allows only an archive recovery
- without replication. So, streaming replication requires this
- parameter to be explicitly set to on>.
-
-
-
-
- primary_conninfo (string)
-
-
- Specifies a connection string which is used for the standby server
- to connect with the primary. This string is in the same format as
- described in . If any option is
- unspecified in this string, then the corresponding environment
- variable (see ) is checked. If the
- environment variable is not set either, then the indicated built-in
- defaults are used.
-
-
- The built-in replication requires that a host name (or host address)
- and the port number on which the primary server listens be
- specified in this string. Also ensure that a role with
- the SUPERUSER> and LOGIN> privileges on the
- primary is set (see
- ). Note that
- the password needs to be set if the primary demands password
- authentication.
-
-
-
-
- trigger_file (string)
-
-
- Specifies a trigger file whose presence activates the standby.
- If no trigger file is specified, the standby never exits
- recovery.
-
-
-
-
-
-
-
- Start the PostgreSQL> server on the standby. The standby
- server will go into recovery mode and proceed to receive WAL records
- from the primary and apply them continuously.
-
-
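-
- Putting the pieces together, a minimal recovery.conf> for a
- streaming-replication standby might read as follows, reusing the
- illustrative values from the authentication example below and the
- hypothetical paths used earlier in this chapter:
-
- standby_mode = 'on'
- primary_conninfo = 'host=192.168.1.50 port=5432 user=foo password=foopass'
- trigger_file = '/tmp/pgsql.trigger'
- restore_command = 'cp /mnt/standby/archive/%f %p'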
-
-
-
-
- Authentication
-
- It's very important that the access privileges for replication are set
- properly so that only trusted users can read the WAL stream, because it's
- easy to extract serious information from it.
-
-
- Only a superuser is allowed to connect to the primary as a replication
- standby, so a role with the SUPERUSER> and LOGIN>
- privileges needs to be created on the primary.
-
-
- Client authentication for replication is controlled by the
- pg_hba.conf> record specifying replication> in the
- database> field. For example, if the standby is running on
- host IP 192.168.1.100> and the superuser's name for replication
- is foo>, the administrator can add the following line to the
- pg_hba.conf> file on the primary.
-
-
- # Allow the user "foo" from host 192.168.1.100 to connect to the primary
- # as a replication standby if the user's password is correctly supplied.
- #
- # TYPE DATABASE USER CIDR-ADDRESS METHOD
- host replication foo 192.168.1.100/32 md5
-
-
-
- The host name and port number of the primary, user name to connect as,
- and password are specified in the recovery.conf> file or
- the corresponding environment variable on the standby.
- For example, if the primary is running on host IP 192.168.1.50>,
- port 5432, the superuser's name for replication is
- foo>, and the password is foopass>, the administrator
- can add the following line to the recovery.conf> file on the
- standby.
-
-
- # The standby connects to the primary that is running on host 192.168.1.50
- # and port 5432 as the user "foo" whose password is "foopass".
- primary_conninfo = 'host=192.168.1.50 port=5432 user=foo password=foopass'
-
-
-
-
-
-
- Incrementally Updated Backups
-
-
- incrementally updated backups
-
-
-
- change accumulation
-
-
-
- In a warm standby configuration, it is possible to offload the expense of
- taking periodic base backups from the primary server; instead base backups
- can be made by backing
- up a standby server's files. This concept is generally known as
- incrementally updated backups, log change accumulation, or more simply,
- change accumulation.
-
-
-
- If we take a file system backup of the standby server's data
- directory while it is processing
- logs shipped from the primary, we will be able to reload that backup and
- restart the standby's recovery process from the last restart point.
- We no longer need to keep WAL files from before the standby's restart point.
- If we need to recover, it will be faster to recover from the incrementally
- updated backup than from the original base backup.
-
-
-
- Since the standby server is not live>, it is not possible to
- use pg_start_backup()> and pg_stop_backup()>
- to manage the backup process; it will be up to you to determine how
- far back you need to keep WAL segment files to have a recoverable
- backup. You can do this by running pg_controldata>
- on the standby server to inspect the control file and determine the
- current checkpoint WAL location, or by using the
- log_checkpoints> option to print values to the standby's
- server log.
-
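-
- For example, on the standby (the data directory path is illustrative
- and the output is abridged):
-
- $ pg_controldata /var/lib/pgsql/standby_data | grep 'REDO location'
- Latest checkpoint's REDO location:    0/3C000020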
-
-
-
-
- Hot Standby
-
-
- Hot Standby
-
-
-
- Hot Standby is the term used to describe the ability to connect to
- the server and run queries while the server is in archive recovery. This
- is useful for both log shipping replication and for restoring a backup
- to an exact state with great precision.
- The term Hot Standby also refers to the ability of the server to move
- from recovery through to normal running while users continue running
- queries and/or continue their connections.
-
-
-
- Running queries in recovery is in many ways the same as normal running,
- though there are a large number of usage and administrative points
- to note.
-
-
-
- User's Overview
-
-
- Users can connect to the database while the server is in recovery
- and perform read-only queries. Read-only access to catalogs and views
- also works as normal.
-
-
-
- The data on the standby takes some time to arrive from the primary server
- so there will be a measurable delay between primary and standby. Running the
- same query nearly simultaneously on both primary and standby might therefore
- return differing results. We say that data on the standby is eventually
- consistent with the primary.
- Queries executed on the standby will be correct with regard to the transactions
- that had been recovered at the start of the query, or start of first statement,
- in the case of serializable transactions. In comparison with the primary,
- the standby returns query results that could have been obtained on the primary
- at some exact moment in the past.
-
-
-
- When a transaction is started in recovery, the parameter
- transaction_read_only> will be forced to be true, regardless of the
- default_transaction_read_only> setting in postgresql.conf>.
- It can't be manually set to false either. As a result, all transactions
- started during recovery will be limited to read-only actions only. In all
- other ways, connected sessions will appear identical to sessions
- initiated during normal processing mode. There are no special commands
- required to initiate a connection at this time, so all interfaces
- work normally without change. After recovery finishes, the session
- will allow normal read-write transactions at the start of the next
- transaction, if these are requested.
-
-
-
- Read-only here means "no writes to the permanent database tables".
- There are no problems with queries that make use of transient sort and
- work files.
-
-
-
- The following actions are allowed
-
-
-
-
- Query access - SELECT, COPY TO including views and SELECT RULEs
-
-
-
-
- Cursor commands - DECLARE, FETCH, CLOSE
-
-
-
-
- Parameters - SHOW, SET, RESET
-
-
-
-
- Transaction management commands
-
-
-
- BEGIN, END, ABORT, START TRANSACTION
-
-
-
-
- SAVEPOINT, RELEASE, ROLLBACK TO SAVEPOINT
-
-
-
-
- EXCEPTION blocks and other internal subtransactions
-
-
-
-
-
-
-
- LOCK TABLE, though only when explicitly in one of these modes:
- ACCESS SHARE, ROW SHARE or ROW EXCLUSIVE.
-
-
-
-
- Plans and resources - PREPARE, EXECUTE, DEALLOCATE, DISCARD
-
-
-
-
- Plugins and extensions - LOAD
-
-
-
-
-
-
- These actions produce error messages (an example appears after this list)
-
-
-
-
- Data Manipulation Language (DML) - INSERT, UPDATE, DELETE, COPY FROM, TRUNCATE.
- Note that there are no allowed actions that result in a trigger
- being executed during recovery.
-
-
-
-
- Data Definition Language (DDL) - CREATE, DROP, ALTER, COMMENT.
- This currently also applies to temporary tables, because their
- definition causes writes to catalog tables.
-
-
-
-
- SELECT ... FOR SHARE | UPDATE, which cause row locks to be written
-
-
-
-
- RULEs on SELECT statements that generate DML commands.
-
-
-
-
- LOCK TABLE, in short default form, since it requests ACCESS EXCLUSIVE MODE.
- LOCK TABLE that explicitly requests a mode higher than ROW EXCLUSIVE MODE.
-
-
-
-
- Transaction management commands that explicitly set non-read only state
-
-
-
- BEGIN READ WRITE,
- START TRANSACTION READ WRITE
-
-
-
-
- SET TRANSACTION READ WRITE,
- SET SESSION CHARACTERISTICS AS TRANSACTION READ WRITE
-
-
-
-
- SET transaction_read_only = off
-
-
-
-
-
-
-
- Two-phase commit commands - PREPARE TRANSACTION, COMMIT PREPARED,
- ROLLBACK PREPARED because even read-only transactions need to write
- WAL in the prepare phase (the first phase of two phase commit).
-
-
-
-
- Sequence update - nextval()
-
-
-
-
- LISTEN, UNLISTEN, NOTIFY since they currently write to system tables
-
-
-
-
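-
- For example, attempting DML during recovery fails immediately (the
- table name is illustrative):
-
- postgres=# INSERT INTO test_table VALUES (1);
- ERROR:  cannot execute INSERT in a read-only transaction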
-
-
- Note that the current behaviour of read-only transactions when not in
- recovery is to allow the last two actions, so there are small and
- subtle differences in behaviour between read-only transactions
- run on a standby and those run during normal operation.
- It is possible that the restrictions on LISTEN, UNLISTEN, NOTIFY and
- temporary tables may be lifted in a future release, if their internal
- implementation is altered to make this possible.
-
-
-
- If failover or switchover occurs the database will switch to normal
- processing mode. Sessions will remain connected while the server
- changes mode. Current transactions will continue, though will remain
- read-only. After recovery is complete, it will be possible to initiate
- read-write transactions.
-
-
-
- Users will be able to tell whether their session is read-only by
- issuing SHOW transaction_read_only. In addition a set of
- functions allow users to
- access information about Hot Standby. These allow you to write
- functions that are aware of the current state of the database. These
- can be used to monitor the progress of recovery, or to allow you to
- write complex programs that restore the database to particular states.
-
-
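-
- For example, from psql>:
-
- postgres=# SHOW transaction_read_only;
-  transaction_read_only
- -----------------------
-  on
- (1 row)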
-
- In recovery, transactions will not be permitted to take any table lock
- higher than RowExclusiveLock. In addition, transactions may never assign
- a TransactionId and may never write WAL.
- Any LOCK TABLE> command that runs on the standby and requests
- a specific lock mode higher than ROW EXCLUSIVE MODE will be rejected.
-
-
-
- In general queries will not experience lock conflicts with the database
- changes made by recovery. This is because recovery follows normal
- concurrency control mechanisms, known as MVCC>. There are
- some types of change that will cause conflicts, covered in the following
- section.
-
-
-
-
- Handling query conflicts
-
-
- The primary and standby nodes are in many ways loosely connected. Actions
- on the primary will have an effect on the standby. As a result, there is
- potential for negative interactions or conflicts between them. The easiest
- conflict to understand is performance: if a huge data load is taking place
- on the primary then this will generate a similar stream of WAL records on the
- standby, so standby queries may contend for system resources, such as I/O.
-
-
-
- There are also additional types of conflict that can occur with Hot Standby.
- These conflicts are hard conflicts> in the sense that we may
- need to cancel queries and in some cases disconnect sessions to resolve them.
- The user is provided with a number of optional ways to handle these
- conflicts, though we must first understand the possible reasons behind a conflict.
-
-
-
-
- Access Exclusive Locks from the primary node, including both explicit
- LOCK commands and various kinds of DDL action
-
-
-
-
- Dropping tablespaces on the primary while standby queries are using
- those tablespaces for temporary work files (work_mem overflow)
-
-
-
-
- Dropping databases on the primary while users are connected to that
- database on the standby.
-
-
-
-
- Waiting to acquire buffer cleanup locks
-
-
-
-
- Early cleanup of data still visible to the current query's snapshot
-
-
-
-
-
-
- Some WAL redo actions will be for DDL actions. These DDL actions are
- repeating actions that have already committed on the primary node, so
- they must not fail on the standby node. These DDL locks take priority
- and will automatically *cancel* any read-only transactions that get in
- their way, after a grace period. This is similar to the possibility of
- being canceled by the deadlock detector, but in this case the standby
- process always wins, since the replayed actions must not fail. This
- also ensures that replication doesn't fall behind while we wait for a
- query to complete. Again, we assume that the standby is there for high
- availability purposes primarily.
-
-
-
- An example of the above would be an administrator on the primary server
- running DROP TABLE> on a table that is currently being queried
- on the standby server.
- Clearly the query cannot continue if we let the DROP TABLE>
- proceed. If this situation occurred on the primary, the DROP TABLE>
- would wait until the query has finished. When the query is on the standby
- and the DROP TABLE> is on the primary, the primary doesn't have
- information about which queries are running on the standby and so the query
- does not wait on the primary. The WAL change records come through to the
- standby while the standby query is still running, causing a conflict.
-
-
-
- The most common reason for conflict between standby queries and WAL redo is
- "early cleanup". Normally, PostgreSQL> allows cleanup of old
- row versions when there are no users who may need to see them to ensure correct
- visibility of data (the heart of MVCC). If there is a standby query that has
- been running for longer than any query on the primary then it is possible
- for old row versions to be removed by either a vacuum or HOT. This will
- then generate WAL records that, if applied, would remove data on the
- standby that might *potentially* be required by the standby query.
- In more technical language, the primary's xmin horizon is later than
- the standby's xmin horizon, allowing dead rows to be removed.
-
-
-
- Experienced users should note that both row version cleanup and row version
- freezing will potentially conflict with recovery queries. Running a
- manual VACUUM FREEZE> is likely to cause conflicts even on tables
- with no updated or deleted rows.
-
-
-
- We have a number of choices for resolving query conflicts. The default
- is that we wait and hope the query completes. The server will wait
- automatically until the lag between primary and standby is at most
- max_standby_delay> seconds. Once that grace period expires,
- we take one of the following actions:
-
-
-
-
- If the conflict is caused by a lock, we cancel the conflicting standby
- transaction immediately. If the transaction is idle-in-transaction
- then currently we abort the session instead, though this may change
- in the future.
-
-
-
-
-
- If the conflict is caused by cleanup records we tell the standby query
- that a conflict has occurred and that it must cancel itself to avoid the
- risk that it silently fails to read relevant data because
- that data has been removed. (This is regrettably very similar to the
- much feared and iconic error message "snapshot too old"). Some cleanup
- records only cause conflict with older queries, though some types of
- cleanup record affect all queries.
-
-
-
- If cancellation does occur, the query and/or transaction can always
- be re-executed. The error is dynamic and will not necessarily occur
- the same way if the query is executed again.
-
-
-
-
-
-
- max_standby_delay> is set in postgresql.conf>.
- The parameter applies to the server as a whole so if the delay is all used
- up by a single query then there may be little or no waiting for queries that
- follow immediately, though they will have benefited equally from the initial
- waiting period. The server may take time to catch up again before the grace
- period is available again, though if there is a heavy and constant stream
- of conflicts it may seldom catch up fully.
-
-
-
- Users should be clear that tables that are regularly and heavily updated on
- the primary server will quickly cause cancellation of longer running queries on
- the standby. In those cases max_standby_delay> can be
- considered somewhat but not exactly the same as setting
- statement_timeout>.
-
-
-
- Other remedial actions exist if the number of cancellations is unacceptable.
- The first option is to connect to primary server and keep a query active
- for as long as we need to run queries on the standby. This guarantees that
- a WAL cleanup record is never generated and we don't ever get query
- conflicts as described above. This could be done using contrib/dblink
- and pg_sleep(), or via other mechanisms. If you do this, you should note
- that this will delay cleanup of dead rows by vacuum or HOT and many
- people may find this undesirable. However, we should remember that
- primary and standby nodes are linked via the WAL, so this situation is no
- different from the case where we ran the query on the primary node itself,
- except that we have the benefit of off-loading the execution onto the standby.
-
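-
- A sketch of that approach using contrib/dblink> and
- pg_sleep()> (the connection string and duration are
- illustrative):
-
- -- Hold a query open on the primary for ten minutes while standby
- -- queries run; dead-row cleanup on the primary is held back meanwhile.
- SELECT * FROM dblink('host=192.168.1.50 dbname=postgres',
-                      'SELECT 1 FROM pg_sleep(600)') AS t(x int);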
-
-
- It is also possible to set vacuum_defer_cleanup_age> on the primary
- to defer the cleanup of records by autovacuum, vacuum and HOT. This may allow
- more time for queries to execute before they are cancelled on the standby,
- without the need for setting a high max_standby_delay>.
-
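-
- For example, in postgresql.conf> on the primary (the value,
- measured in transactions, is purely illustrative):
-
- vacuum_defer_cleanup_age = 10000    # defer cleanup by ~10000 transactions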
-
-
- Three-way deadlocks are possible between AccessExclusiveLocks arriving from
- the primary, cleanup WAL records that require buffer cleanup locks and
- user requests that are waiting behind replayed AccessExclusiveLocks. Deadlocks
- are resolved by time-out when we exceed max_standby_delay>.
-
-
-
- Dropping tablespaces or databases is discussed in the administrator's
- section since they are not typical user situations.
-
-
-
-
- Administrator's Overview
-
-
- If there is a recovery.conf> file present the server will start
- in Hot Standby mode by default, though recovery_connections> can
- be disabled via postgresql.conf>, if required. The server may take
- some time to enable recovery connections since the server must first complete
- sufficient recovery to provide a consistent state against which queries
- can run before enabling read only connections. Look for these messages
- in the server logs
-
-
- LOG: initializing recovery connections
-
- ... then some time later ...
-
- LOG: consistent recovery state reached
- LOG: database system is ready to accept read only connections
-
-
- Consistency information is recorded once per checkpoint on the primary, as long
- as recovery_connections> is enabled (on the primary). If this parameter
- is disabled, it will not be possible to enable recovery connections on the standby.
- The consistent state can also be delayed in the presence of both of these conditions
-
-
-
-
- a write transaction has more than 64 subtransactions
-
-
-
-
- very long-lived write transactions
-
-
-
-
- If you are running file-based log shipping ("warm standby"), you may need
- to wait until the next WAL file arrives, which could be as long as the
- archive_timeout> setting on the primary.
-
-
-
- The setting of some parameters on the standby will need reconfiguration
- if they have been changed on the primary. The value on the standby must
- be equal to or greater than the value on the primary. If these parameters
- are not set high enough then the standby will not be able to track work
- correctly from recovering transactions. If these values are set too low,
- the server will halt. Higher values can then be supplied and the server
- restarted to begin recovery again.
-
-
-
-
- max_connections>
-
-
-
-
- max_prepared_transactions>
-
-
-
-
- max_locks_per_transaction>
-
-
-
-
-
-
- It is important that the administrator consider the appropriate setting
- of max_standby_delay>, set in postgresql.conf>.
- There is no optimal setting; it should be chosen according to business
- priorities. For example, if the server is primarily tasked as a High
- Availability server, then you may wish to lower
- max_standby_delay> or even set it to zero, though that is a
- very aggressive setting. If the standby server is tasked as an additional
- server for decision support queries then it may be acceptable to set this
- to a value of many hours (in seconds).
-
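-
- For example, in postgresql.conf> on the standby (values are
- illustrative; the unit is seconds, as noted above):
-
- max_standby_delay = 30       # High Availability emphasis: cancel queries quickly
- #max_standby_delay = 14400   # decision-support emphasis: allow 4-hour reports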
-
-
- Transaction status "hint bits" written on the primary are not WAL-logged,
- so the standby will likely re-write the hints itself.
- Thus the main database blocks will produce write I/Os even though
- all users are read-only; no changes have occurred to the data values
- themselves. Users will be able to write large sort temp files and
- re-generate relcache info files, so there is no part of the database
- that is truly read-only during hot standby mode. There is no restriction
- on the use of set returning functions, or other users of tuplestore/tuplesort
- code. Note also that writes to remote databases will still be possible,
- even though the transaction is read-only locally.
-
-
-
- The following types of administrator command are not accepted
- during recovery mode
-
-
-
-
- Data Definition Language (DDL) - e.g. CREATE INDEX
-
-
-
-
- Privilege and Ownership - GRANT, REVOKE, REASSIGN
-
-
-
-
- Maintenance commands - ANALYZE, VACUUM, CLUSTER, REINDEX
-
-
-
-
-
-
- Note again that some of these commands are actually allowed during
- "read only" mode transactions on the primary.
-
-
-
- As a result, you cannot create additional indexes that exist solely
- on the standby, nor can you create statistics that exist solely on the standby.
- If these administrator commands are needed they should be executed
- on the primary so that the changes will propagate through to the
- standby.
-
-
-
- pg_cancel_backend()> will work on user backends, but not the
- Startup process, which performs recovery. pg_stat_activity does not
- show an entry for the Startup process, nor do recovering transactions
- show as active. As a result, pg_prepared_xacts is always empty during
- recovery. If you wish to resolve in-doubt prepared transactions
- then look at pg_prepared_xacts on the primary and issue commands to
- resolve those transactions there.
-
-
-
- pg_locks will show locks held by backends as normal. pg_locks also shows
- a virtual transaction managed by the Startup process that owns all
- AccessExclusiveLocks held by transactions being replayed by recovery.
- Note that the Startup process does not acquire locks to
- make database changes, and thus locks other than AccessExclusiveLocks
- do not show in pg_locks for the Startup process; they are just presumed
- to exist.
-
-
-
- check_pgsql> will work, but it is very simple.
- check_postgres> will also work, though some actions
- could give different or confusing results;
- for example, last vacuum time will not be maintained, since no
- vacuum occurs on the standby (though vacuums running on the primary do
- send their changes to the standby).
-
-
-
- WAL file control commands will not work during recovery,
- e.g., pg_start_backup>, pg_switch_xlog>, etc.
-
-
-
- Dynamically loadable modules work, including pg_stat_statements.
-
-
-
- Advisory locks work normally in recovery, including deadlock detection.
- Note that advisory locks are never WAL logged, so it is not possible for
- an advisory lock on either the primary or the standby to conflict with WAL
- replay. Nor is it possible to acquire an advisory lock on the primary
- and have it initiate a similar advisory lock on the standby. Advisory
- locks relate only to a single server on which they are acquired.
-
-
-
- Trigger-based replication systems such as Slony>,
- Londiste> and Bucardo> won't run on the
- standby at all, though they will run happily on the primary server as
- long as the changes are not sent to standby servers to be applied.
- WAL replay is not trigger-based so you cannot relay from the
- standby to any system that requires additional database writes or
- relies on the use of triggers.
-
-
-
- New OIDs cannot be assigned, though some UUID> generators may still
- work as long as they do not rely on writing new status to the database.
-
-
-
- Currently, temp table creation is not allowed during read only
- transactions, so in some cases existing scripts will not run correctly.
- It is possible we may relax that restriction in a later release. This is
- both a SQL Standard compliance issue and a technical issue.
-
-
-
- DROP TABLESPACE> can only succeed if the tablespace is empty.
- Some standby users may be actively using the tablespace via their
- temp_tablespaces> parameter. If there are temp files in the
- tablespace we currently cancel all active queries to ensure that temp
- files are removed, so that we can remove the tablespace and continue with
- WAL replay.
-
-
-
- Running DROP DATABASE>, ALTER DATABASE ... SET TABLESPACE>,
- or ALTER DATABASE ... RENAME> on primary will generate a log message
- that will cause all users connected to that database on the standby to be
- forcibly disconnected. This action occurs immediately, whatever the setting of
- max_standby_delay>.
-
-
-
- In normal running, if you issue DROP USER> or DROP ROLE>
- for a role with login capability while that user is still connected then
- nothing happens to the connected user - they remain connected. The user cannot
- reconnect however. This behaviour applies in recovery also, so a
- DROP USER> on the primary does not disconnect that user on the standby.
-
-
-
- The stats collector is active during recovery. All scans, reads, blocks,
- index usage, etc., will be recorded normally on the standby. Replayed
- actions will not duplicate their effects on primary, so replaying an
- insert will not increment the Inserts column of pg_stat_user_tables.
- The stats file is deleted at the start of recovery, so stats from primary
- and standby will differ; this is considered a feature not a bug.
-
-
-
- Autovacuum is not active during recovery, though it will start normally
- at the end of recovery.
-
-
-
- The background writer is active during recovery and will perform
- restartpoints (similar to checkpoints on the primary) and normal block
- cleaning activities. The CHECKPOINT> command is accepted during recovery,
- though it performs a restartpoint rather than a new checkpoint.
-
-
-
-
- Hot Standby Parameter Reference
-
-
- Various parameters have been mentioned above in the
- and sections.
-
-
-
- On the primary, parameters recovery_connections> and
- vacuum_defer_cleanup_age> can be used to enable and control the
- primary server to assist the successful configuration of Hot Standby servers.
- max_standby_delay> has no effect if set on the primary.
-
-
-
- On the standby, parameters recovery_connections> and
- max_standby_delay> can be used to enable and control Hot Standby.
- vacuum_defer_cleanup_age> has no effect during recovery.
-
-
-
-
- Caveats
-
-
- At this writing, there are several limitations of Hot Standby.
- These can and probably will be fixed in future releases:
-
-
-
-
- Operations on hash indexes are not presently WAL-logged, so
- replay will not update these indexes. Hash indexes will not be
- used for query plans during recovery.
-
-
-
-
- Full knowledge of running transactions is required before snapshots
- may be taken. Transactions that use large numbers of subtransactions
- (currently greater than 64) will delay the start of read-only
- connections until the completion of the longest running write transaction.
- If this situation occurs, explanatory messages will be sent to the server log.
-
-
-
-
- Valid starting points for recovery connections are generated at each
- checkpoint on the master. If the standby is shut down while the master
- is in a shutdown state, it may not be possible to re-enter Hot Standby
- until the primary is started up so that it generates further starting
- points in the WAL logs. This is not considered a serious issue
- because the standby is usually switched into the primary role while
- the first node is taken down.
-
-
-
-
- At the end of recovery, AccessExclusiveLocks held by prepared transactions
- will require twice the normal number of lock table entries. If you plan
- on running either a large number of concurrent prepared transactions
- that normally take AccessExclusiveLocks, or one large transaction that
- takes many AccessExclusiveLocks, then you are
- advised to select a larger value of max_locks_per_transaction>,
- in rare extremes up to, but never more than, twice the value of the
- parameter setting on the primary server. You need not consider this at all if
- your setting of max_prepared_transactions> is 0>.
-
-
-
-
-
-
-
-
-
Migration Between Releases
--- 1492,1497 ----
*** a/doc/src/sgml/high-availability.sgml
--- b/doc/src/sgml/high-availability.sgml
***************
*** 79,84 ****
--- 79,87 ----
also available.
+
+ Comparison of different solutions
+
***************
*** 450,453 **** protocol to make nodes agree on a serializable transactional order.
--- 453,1840 ----
+
+
+
+ File-based Log Shipping
+
+
+ warm standby
+
+
+
+ PITR standby
+
+
+
+ standby server
+
+
+
+ log shipping
+
+
+
+ witness server
+
+
+
+ STONITH
+
+
+
+ Continuous archiving can be used to create a high
+ availability> (HA) cluster configuration with one or more
+ standby servers> ready to take over operations if the
+ primary server fails. This capability is widely referred to as
+ warm standby> or log shipping>.
+
+
+
+ The primary and standby server work together to provide this capability,
+ though the servers are only loosely coupled. The primary server operates
+ in continuous archiving mode, while each standby server operates in
+ continuous recovery mode, reading the WAL files from the primary. No
+ changes to the database tables are required to enable this capability,
+ so it offers low administration overhead compared to some other
+ replication approaches. This configuration also has relatively low
+ performance impact on the primary server.
+
+
+
+ Directly moving WAL records from one database server to another
+ is typically described as log shipping. PostgreSQL>
+ implements file-based log shipping, which means that WAL records are
+ transferred one file (WAL segment) at a time. WAL files (16MB) can be
+ shipped easily and cheaply over any distance, whether it be to an
+ adjacent system, another system at the same site, or another system on
+ the far side of the globe. The bandwidth required for this technique
+ varies according to the transaction rate of the primary server.
+ Record-based log shipping is also possible with custom-developed
+ procedures, as discussed in .
+
+
+
+ It should be noted that log shipping is asynchronous, i.e., the WAL
+ records are shipped after transaction commit. As a result there is a
+ window for data loss should the primary server suffer a catastrophic
+ failure: transactions not yet shipped will be lost. The length of the
+ window of data loss can be limited by use of the
+ archive_timeout parameter, which can be set as low
+ as a few seconds if required. However such a low setting will
+ substantially increase the bandwidth required for file shipping.
+ If you need a window of less than a minute or so, it's probably better
+ to consider record-based log shipping.
+
+
+
+ The standby server is not available for access, since it is continually
+ performing recovery processing. Recovery performance is sufficiently
+ good that the standby will typically be only moments away from full
+ availability once it has been activated. As a result, we refer to this
+ capability as a warm standby configuration that offers high
+ availability. Restoring a server from an archived base backup and
+ rollforward will take considerably longer, so that technique only
+ offers a solution for disaster recovery, not high availability.
+
+
+
+ Planning
+
+
+ It is usually wise to create the primary and standby servers
+ so that they are as similar as possible, at least from the
+ perspective of the database server. In particular, the path names
+ associated with tablespaces will be passed across unmodified, so both
+ primary and standby servers must have the same mount paths for
+ tablespaces if that feature is used. Keep in mind that if
+
+ is executed on the primary, any new mount point needed for it must
+ be created on the primary and all standby servers before the command
+ is executed. Hardware need not be exactly the same, but experience shows
+ that maintaining two identical systems is easier than maintaining two
+ dissimilar ones over the lifetime of the application and system.
+ In any case the hardware architecture must be the same — shipping
+ from, say, a 32-bit to a 64-bit system will not work.
+
+
+
+ In general, log shipping between servers running different major
+ PostgreSQL> release
+ levels is not possible. It is the policy of the PostgreSQL Global
+ Development Group not to make changes to disk formats during minor release
+ upgrades, so it is likely that running different minor release levels
+ on primary and standby servers will work successfully. However, no
+ formal support for that is offered and you are advised to keep primary
+ and standby servers at the same release level as much as possible.
+ When updating to a new minor release, the safest policy is to update
+ the standby servers first — a new minor release is more likely
+ to be able to read WAL files from a previous minor release than vice
+ versa.
+
+
+
+ There is no special mode required to enable a standby server. The
+ operations that occur on both primary and standby servers are
+ normal continuous archiving and recovery tasks. The only point of
+ contact between the two database servers is the archive of WAL files
+ that both share: primary writing to the archive, standby reading from
+ the archive. Care must be taken to ensure that WAL archives from separate
+ primary servers do not become mixed together or confused. The archive
+ need not be large if it is only required for standby operation.
+
+
+
+ The magic that makes the two loosely coupled servers work together is
+ simply a restore_command> used on the standby that,
+ when asked for the next WAL file, waits for it to become available from
+ the primary. The restore_command> is specified in the
+ recovery.conf> file on the standby server. Normal recovery
+ processing would request a file from the WAL archive, reporting failure
+ if the file was unavailable. For standby processing it is normal for
+ the next WAL file to be unavailable, so we must be patient and wait for
+ it to appear. For files ending in .backup> or
+ .history> there is no need to wait, and a non-zero return
+ code must be returned. A waiting restore_command> can be
+ written as a custom script that loops after polling for the existence of
+ the next WAL file. There must also be some way to trigger failover, which
+ should interrupt the restore_command>, break the loop and
+ return a file-not-found error to the standby server. This ends recovery
+ and the standby will then come up as a normal server.
+
+
+
+ Pseudocode for a suitable restore_command> is:
+
+ triggered = false;
+ while (!NextWALFileReady() && !triggered)
+ {
+ usleep(100000L); /* wait for ~0.1 sec */
+ if (CheckForExternalTrigger())
+ triggered = true;
+ }
+ if (!triggered)
+ CopyWALFileForRecovery();
+
+
+
+
+ A working example of a waiting restore_command> is provided
+ as a contrib> module named pg_standby>. It
+ should be used as a reference on how to correctly implement the logic
+ described above. It can also be extended as needed to support specific
+ configurations and environments.
+
+
+
+ PostgreSQL does not provide the system
+ software required to identify a failure on the primary and notify
+ the standby database server. Many such tools exist and are well
+ integrated with the operating system facilities required for
+ successful failover, such as IP address migration.
+
+
+
+ The method for triggering failover is an important part of planning
+ and design. One potential option is the restore_command>
+ command. It is executed once for each WAL file, but because the process
+ running the restore_command> is created and exits for
+ each file, there is no daemon or server process, and we cannot
+ use signals or a signal handler. Therefore, the
+ restore_command> is not suitable to trigger failover.
+ It is possible to use a simple timeout facility, especially if
+ used in conjunction with a known archive_timeout>
+ setting on the primary. However, this is somewhat error prone
+ since a network problem or busy primary server might be sufficient
+ to initiate failover. A notification mechanism such as the explicit
+ creation of a trigger file is ideal, if this can be arranged.
+
+
+
+ The size of the WAL archive can be minimized by using the %r>
+ option of the restore_command>. This option specifies the
+ last archive file name that needs to be kept to allow the recovery to
+ restart correctly. This can be used to truncate the archive once
+ files are no longer required, assuming the archive is writable from the
+ standby server.
+
+
+
+
+ Implementation
+
+
+ The short procedure for configuring a standby server is as follows. For
+ full details of each step, refer to previous sections as noted.
+
+
+
+ Set up primary and standby systems as nearly identical as
+ possible, including two identical copies of
+ PostgreSQL> at the same release level.
+
+
+
+
+ Set up continuous archiving from the primary to a WAL archive
+ directory on the standby server. Ensure that
+ ,
+ and
+
+ are set appropriately on the primary
+ (see ).
+
+
+
+
+ Make a base backup of the primary server (see ), and load this data onto the standby.
+
+
+
+
+ Begin recovery on the standby server from the local WAL
+ archive, using a recovery.conf> that specifies a
+ restore_command> that waits as described
+ previously (see ).
+
+
+
+
+
+
+ Recovery treats the WAL archive as read-only, so once a WAL file has
+ been copied to the standby system it can be copied to tape at the same
+ time as it is being read by the standby database server.
+ Thus, running a standby server for high availability can be performed at
+ the same time as files are stored for longer term disaster recovery
+ purposes.
+
+
+
+ For testing purposes, it is possible to run both primary and standby
+ servers on the same system. This does not provide any worthwhile
+ improvement in server robustness, nor would it be described as HA.
+
+
+
+
+ Record-based Log Shipping
+
+
+ PostgreSQL directly supports file-based
+ log shipping as described above. It is also possible to implement
+ record-based log shipping, though this requires custom development.
+
+
+
+ An external program can call the pg_xlogfile_name_offset()>
+ function (see )
+ to find out the file name and the exact byte offset within it of
+ the current end of WAL. It can then access the WAL file directly
+ and copy the data from the last known end of WAL through the current end
+ over to the standby servers. With this approach, the window for data
+ loss is the polling cycle time of the copying program, which can be very
+ small, and there is no wasted bandwidth from forcing partially-used
+ segment files to be archived. Note that the standby servers'
+ restore_command> scripts can only deal with whole WAL files,
+ so the incrementally copied data is not ordinarily made available to
+ the standby servers. It is of use only when the primary dies —
+ then the last partial WAL file is fed to the standby before allowing
+ it to come up. The correct implementation of this process requires
+ cooperation of the restore_command> script with the data
+ copying program.
+
+
+
+ Starting with PostgreSQL> version 8.5, you can use
+ streaming replication (see ) to
+ achieve the same with less effort.
+
+
+
+
+
+ Streaming Replication
+
+
+ Streaming Replication
+
+
+
+ PostgreSQL> includes a simple streaming replication
+ mechanism, which lets the standby server stay more up-to-date than
+ file-based replication allows. The standby connects to the primary
+ and the primary starts streaming WAL records from where the standby
+ left off, and continues streaming them as they are generated, without
+ waiting for the WAL file to be filled. So with streaming replication,
+ archive_timeout> does not need to be configured.
+
+
+
+ Streaming replication relies on file-based continuous archiving for
+ making the base backup and for allowing a standby to catch up if it is
+ disconnected from the primary long enough for the primary to
+ delete old WAL files still required by the standby.
+
+
+
+ Setup
+
+ The short procedure for configuring streaming replication is as follows.
+ For full details of each step, refer to other sections as noted.
+
+
+
+ Set up primary and standby systems as nearly identically as possible,
+ including two identical copies of PostgreSQL> at the
+ same release level.
+
+
+
+
+ Set up continuous archiving from the primary to a WAL archive located
+ in a directory on the standby server. Ensure that
+ ,
+ and
+
+ are set appropriately on the primary
+ (see ).
+
+
+
+
+
+ Set up connections and authentication so that the standby server can
+ successfully connect to the pseudo replication> database of
+ the primary server (see
+ ). Ensure that
+ and pg_hba.conf> are
+ configured appropriately on the primary.
+
+
+ On systems that support the keepalive socket option, setting
+ ,
+ and
+ helps you to detect replication
+ problems (e.g., a network outage or a failure of
+ the standby server) as soon as possible.
+
+
+
+
+ Set the maximum number of concurrent connections from the standby servers
+ (see for details).
+
+
+
+
+ Enable WAL archiving in the primary server because we need to make a base
+ backup of it later (see and
+ for details).
+
+
+
+
+ Start the PostgreSQL> server on the primary.
+
+
+
+
+ Make a base backup of the primary server (see
+ ), and load this data onto the
+ standby. Note that all files present in pg_xlog>
+ and pg_xlog/archive_status> on the standby>
+ server should be removed because they might be obsolete.
+
+
+
+
+ Set up WAL archiving, connections and authentication like the primary
+ server, because the standby server might work as a primary server after
+ failover. Ensure that your settings are consistent with the
+ future> environment after the primary and the standby
+ server are interchanged by failover. If you are setting up the standby
+ server only for, e.g., reporting purposes, with no plans to fail over
+ to it, configure the standby accordingly.
+
+
+
+
+ Create a recovery command file recovery.conf> in the data
+ directory on the standby server (a minimal example appears after
+ this procedure).
+
+
+
+
+ standby_mode (boolean)
+
+
+ Specifies whether to start the PostgreSQL> server as
+ a standby. If this parameter is on>, the streaming
+ replication is enabled and the standby server will try to connect
+ to the primary to receive and apply WAL records continuously. The
+ default is off>, which allows only an archive recovery
+ without replication. So, streaming replication requires this
+ parameter to be explicitly set to on>.
+
+
+
+
+ primary_conninfo (string)
+
+
+ Specifies a connection string which is used for the standby server
+ to connect with the primary. This string is in the same format as
+ described in . If any option is
+ unspecified in this string, then the corresponding environment
+ variable (see ) is checked. If the
+ environment variable is not set either, then the indicated built-in
+ defaults are used.
+
+
+ The built-in replication requires that a host name (or host address)
+ and the port number which the primary server listens on be specified
+ in this string. Also ensure that a role with
+ the SUPERUSER> and LOGIN> privileges exists on the
+ primary (see
+ ). Note that
+ the password needs to be supplied if the primary demands password
+ authentication.
+
+
+
+
+ trigger_file (string)
+
+
+ Specifies a trigger file whose presence activates the standby.
+ If no trigger file is specified, the standby never exits
+ recovery.
+
+
+
+
+
+
+
+ Start the PostgreSQL> server on the standby. The standby
+ server will go into recovery mode and proceed to receive WAL records
+ from the primary and apply them continuously.
+
+
+
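+
+ As a minimal sketch, recovery.conf> on the standby might
+ contain just the following; the connection values and the trigger
+ file path are only illustrative:
+
+ standby_mode     = 'on'
+ primary_conninfo = 'host=192.168.1.50 port=5432 user=foo password=foopass'
+ trigger_file     = '/tmp/pgsql.trigger'
+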
+
+
+
+ Authentication
+
+ It is very important that the access privileges for replication be set
+ properly so that only trusted users can read the WAL stream, because it is
+ easy to extract sensitive information from it.
+
+
+ Only a superuser is allowed to connect to the primary as a replication
+ standby, so a role with the SUPERUSER> and LOGIN>
+ privileges needs to be created on the primary.
+
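+
+ For example, such a role could be created on the primary with a
+ command like the following; the role name and password are only
+ illustrative:
+
+ CREATE ROLE foo SUPERUSER LOGIN PASSWORD 'foopass';
+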
+
+ Client authentication for replication is controlled by the
+ pg_hba.conf> record specifying replication> in the
+ database> field. For example, if the standby is running on
+ host IP 192.168.1.100> and the superuser's name for replication
+ is foo>, the administrator can add the following line to the
+ pg_hba.conf> file on the primary.
+
+
+ # Allow the user "foo" from host 192.168.1.100 to connect to the primary
+ # as a replication standby if the user's password is correctly supplied.
+ #
+ # TYPE DATABASE USER CIDR-ADDRESS METHOD
+ host replication foo 192.168.1.100/32 md5
+
+
+
+ The host name and port number of the primary, user name to connect as,
+ and password are specified in the recovery.conf> file or
+ the corresponding environment variable on the standby.
+ For example, if the primary is running on host IP 192.168.1.50>,
+ port 5432, the superuser's name for replication is
+ foo>, and the password is foopass>, the administrator
+ can add the following line to the recovery.conf> file on the
+ standby.
+
+
+ # The standby connects to the primary that is running on host 192.168.1.50
+ # and port 5432 as the user "foo" whose password is "foopass".
+ primary_conninfo = 'host=192.168.1.50 port=5432 user=foo password=foopass'
+
+
+
+
+
+
+ Failover
+
+
+ If the primary server fails then the standby server should begin
+ failover procedures.
+
+
+
+ If the standby server fails then no failover need take place. If the
+ standby server can be restarted, even some time later, then the recovery
+ process can also be immediately restarted, taking advantage of
+ restartable recovery. If the standby server cannot be restarted, then a
+ full new standby server instance should be created.
+
+
+
+ If the primary server fails and the standby server becomes the
+ new primary, and then the old primary restarts, you must have
+ a mechanism for informing the old primary that it is no longer the primary. This is
+ sometimes known as STONITH (Shoot The Other Node In The Head), which is
+ necessary to avoid situations where both systems think they are the
+ primary, which will lead to confusion and ultimately data loss.
+
+
+
+ Many failover systems use just two systems, the primary and the standby,
+ connected by some kind of heartbeat mechanism to continually verify the
+ connectivity between the two and the viability of the primary. It is
+ also possible to use a third system (called a witness server) to prevent
+ some cases of inappropriate failover, but the additional complexity
+ might not be worthwhile unless it is set up with sufficient care and
+ rigorous testing.
+
+
+
+ Once failover to the standby occurs, we have only a
+ single server in operation. This is known as a degenerate state.
+ The former standby is now the primary, but the former primary is down
+ and might stay down. To return to normal operation we must
+ fully recreate a standby server,
+ either on the former primary system when it comes up, or on a third,
+ possibly new, system. Once complete the primary and standby can be
+ considered to have switched roles. Some people choose to use a third
+ server to provide backup for the new primary until the new standby
+ server is recreated,
+ though clearly this complicates the system configuration and
+ operational processes.
+
+
+
+ So, switching from primary to standby server can be fast but requires
+ some time to re-prepare the failover cluster. Regular switching from
+ primary to standby is useful, since it allows regular downtime on
+ each system for maintenance. This also serves as a test of the
+ failover mechanism to ensure that it will really work when you need it.
+ Written administration procedures are advised.
+
+
+
+
+ Hot Standby
+
+
+ Hot Standby
+
+
+
+ Hot Standby is the term used to describe the ability to connect to
+ the server and run queries while the server is in archive recovery. This
+ is useful both for log shipping replication and for restoring a backup
+ to a desired state with great precision.
+ The term Hot Standby also refers to the ability of the server to move
+ from recovery through to normal running while users continue running
+ queries and/or continue their connections.
+
+
+
+ Running queries in recovery is in many ways the same as normal running
+ though there are a large number of usage and administrative points
+ to note.
+
+
+
+ User's Overview
+
+
+ Users can connect to the database while the server is in recovery
+ and perform read-only queries. Read-only access to catalogs and views
+ will also work as normal.
+
+
+
+ The data on the standby takes some time to arrive from the primary server
+ so there will be a measurable delay between primary and standby. Running the
+ same query nearly simultaneously on both primary and standby might therefore
+ return differing results. We say that data on the standby is eventually
+ consistent with the primary.
+ Queries executed on the standby will be correct with regard to the transactions
+ that had been recovered at the start of the query (or, in the case of
+ serializable transactions, at the start of its first statement). In comparison
+ with the primary, the standby returns query results that could have been
+ obtained on the primary at some exact moment in the past.
+
+
+
+ When a transaction is started in recovery, the parameter
+ transaction_read_only> will be forced to be true, regardless of the
+ default_transaction_read_only> setting in postgresql.conf>.
+ It can't be manually set to false either. As a result, all transactions
+ started during recovery will be limited to read-only actions. In all
+ other ways, connected sessions will appear identical to sessions
+ initiated during normal processing mode. There are no special commands
+ required to initiate a connection at this time, so all interfaces
+ work normally without change. After recovery finishes, the session
+ will allow normal read-write transactions at the start of the next
+ transaction, if these are requested.
+
+
+
+ Read-only here means "no writes to the permanent database tables".
+ There are no problems with queries that make use of transient sort and
+ work files.
+
+
+
+ The following actions are allowed (an example session appears after this list)
+
+
+
+
+ Query access - SELECT, COPY TO including views and SELECT RULEs
+
+
+
+
+ Cursor commands - DECLARE, FETCH, CLOSE
+
+
+
+
+ Parameters - SHOW, SET, RESET
+
+
+
+
+ Transaction management commands
+
+
+
+ BEGIN, END, ABORT, START TRANSACTION
+
+
+
+
+ SAVEPOINT, RELEASE, ROLLBACK TO SAVEPOINT
+
+
+
+
+ EXCEPTION blocks and other internal subtransactions
+
+
+
+
+
+
+
+ LOCK TABLE, though only when explicitly in one of these modes:
+ ACCESS SHARE, ROW SHARE or ROW EXCLUSIVE.
+
+
+
+
+ Plans and resources - PREPARE, EXECUTE, DEALLOCATE, DISCARD
+
+
+
+
+ Plugins and extensions - LOAD
+
+
+
+
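+
+ As an illustrative sketch of the allowed actions, a typical
+ read-only session on the standby (the table name is hypothetical)
+ might run:
+
+ BEGIN;
+ DECLARE c CURSOR FOR SELECT * FROM accounts;
+ FETCH 10 FROM c;
+ CLOSE c;
+ COMMIT;
+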
+
+
+ These actions produce error messages (an example of the error appears after this list)
+
+
+
+
+ Data Manipulation Language (DML) - INSERT, UPDATE, DELETE, COPY FROM, TRUNCATE.
+ Note that there are no allowed actions that result in a trigger
+ being executed during recovery.
+
+
+
+
+ Data Definition Language (DDL) - CREATE, DROP, ALTER, COMMENT.
+ This currently also applies to temporary tables, because their
+ definition causes writes to catalog tables.
+
+
+
+
+ SELECT ... FOR SHARE | UPDATE, which cause row locks to be written
+
+
+
+
+ RULEs on SELECT statements that generate DML commands.
+
+
+
+
+ LOCK TABLE, in short default form, since it requests ACCESS EXCLUSIVE MODE.
+ LOCK TABLE that explicitly requests a mode higher than ROW EXCLUSIVE MODE.
+
+
+
+
+ Transaction management commands that explicitly set non-read only state
+
+
+
+ BEGIN READ WRITE,
+ START TRANSACTION READ WRITE
+
+
+
+
+ SET TRANSACTION READ WRITE,
+ SET SESSION CHARACTERISTICS AS TRANSACTION READ WRITE
+
+
+
+
+ SET transaction_read_only = off
+
+
+
+
+
+
+
+ Two-phase commit commands - PREPARE TRANSACTION, COMMIT PREPARED,
+ ROLLBACK PREPARED because even read-only transactions need to write
+ WAL in the prepare phase (the first phase of two phase commit).
+
+
+
+
+ sequence update - nextval()
+
+
+
+
+ LISTEN, UNLISTEN, NOTIFY since they currently write to system tables
+
+
+
+
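+
+ As a sketch of what to expect, attempting any such action on the
+ standby fails with a read-only-transaction error; the table name
+ here is hypothetical:
+
+ INSERT INTO accounts VALUES (1);
+ ERROR:  cannot execute INSERT in a read-only transaction
+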
+
+
+ Note that the current behaviour of read-only transactions when not in
+ recovery is to allow the last two actions, so there are small and
+ subtle differences in behaviour between read-only transactions
+ run on the standby and during normal running.
+ It is possible that the restrictions on LISTEN, UNLISTEN, NOTIFY and
+ temporary tables may be lifted in a future release, if their internal
+ implementation is altered to make this possible.
+
+
+
+ If failover or switchover occurs the database will switch to normal
+ processing mode. Sessions will remain connected while the server
+ changes mode. Current transactions will continue, though will remain
+ read-only. After recovery is complete, it will be possible to initiate
+ read-write transactions.
+
+
+
+ Users will be able to tell whether their session is read-only by
+ issuing SHOW transaction_read_only. In addition a set of
+ functions allow users to
+ access information about Hot Standby. These allow you to write
+ functions that are aware of the current state of the database. These
+ can be used to monitor the progress of recovery, or to allow you to
+ write complex programs that restore the database to particular states.
+
+
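+
+ For example, a session connected to the standby would report
+ (output shown as from psql>):
+
+ SHOW transaction_read_only;
+
+  transaction_read_only
+ -----------------------
+  on
+ (1 row)
+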
+
+ In recovery, transactions will not be permitted to take any table lock
+ higher than RowExclusiveLock. In addition, transactions may never assign
+ a TransactionId and may never write WAL.
+ Any LOCK TABLE> command that runs on the standby and requests
+ a specific lock mode higher than ROW EXCLUSIVE MODE will be rejected.
+
+
+
+ In general queries will not experience lock conflicts with the database
+ changes made by recovery. This is because recovery follows normal
+ concurrency control mechanisms, known as MVCC>. There are
+ some types of change that will cause conflicts, covered in the following
+ section.
+
+
+
+
+ Handling query conflicts
+
+
+ The primary and standby nodes are in many ways loosely connected. Actions
+ on the primary will have an effect on the standby. As a result, there is
+ potential for negative interactions or conflicts between them. The easiest
+ conflict to understand is performance: if a huge data load is taking place
+ on the primary then this will generate a similar stream of WAL records on the
+ standby, so standby queries may contend for system resources, such as I/O.
+
+
+
+ There are also additional types of conflict that can occur with Hot Standby.
+ These conflicts are hard conflicts> in the sense that we may
+ need to cancel queries and in some cases disconnect sessions to resolve them.
+ The user is provided with a number of optional ways to handle these
+ conflicts, though we must first understand the possible reasons behind a conflict.
+
+
+
+
+ Access Exclusive Locks from primary node, including both explicit
+ LOCK commands and various kinds of DDL action
+
+
+
+
+ Dropping tablespaces on the primary while standby queries are using
+ those tablespaces for temporary work files (work_mem overflow)
+
+
+
+
+ Dropping databases on the primary while users are connected to that
+ database on the standby.
+
+
+
+
+ Waiting to acquire buffer cleanup locks
+
+
+
+
+ Early cleanup of data still visible to the current query's snapshot
+
+
+
+
+
+
+ Some WAL redo actions will be for DDL actions. These DDL actions are
+ repeating actions that have already committed on the primary node, so
+ they must not fail on the standby node. These DDL locks take priority
+ and will automatically *cancel* any read-only transactions that get in
+ their way, after a grace period. This is similar to the possibility of
+ being canceled by the deadlock detector, but in this case the standby
+ process always wins, since the replayed actions must not fail. This
+ also ensures that replication doesn't fall behind while we wait for a
+ query to complete. Again, we assume that the standby is there for high
+ availability purposes primarily.
+
+
+
+ An example of the above would be an administrator on the primary server
+ running DROP TABLE> on a table that is currently being queried
+ on the standby server.
+ Clearly the query cannot continue if we let the DROP TABLE>
+ proceed. If this situation occurred on the primary, the DROP TABLE>
+ would wait until the query has finished. When the query is on the standby
+ and the DROP TABLE> is on the primary, the primary doesn't have
+ information about which queries are running on the standby and so the query
+ does not wait on the primary. The WAL change records come through to the
+ standby while the standby query is still running, causing a conflict.
+
+
+
+ The most common reason for conflict between standby queries and WAL redo is
+ "early cleanup". Normally, PostgreSQL> allows cleanup of old
+ row versions when there are no users who may need to see them to ensure correct
+ visibility of data (the heart of MVCC). If there is a standby query that has
+ been running for longer than any query on the primary then it is possible
+ for old row versions to be removed by either a vacuum or HOT. This will
+ then generate WAL records that, if applied, would remove data on the
+ standby that might *potentially* be required by the standby query.
+ In more technical language, the primary's xmin horizon is later than
+ the standby's xmin horizon, allowing dead rows to be removed.
+
+
+
+ Experienced users should note that both row version cleanup and row version
+ freezing will potentially conflict with recovery queries. Running a
+ manual VACUUM FREEZE> is likely to cause conflicts even on tables
+ with no updated or deleted rows.
+
+
+
+ We have a number of choices for resolving query conflicts. The default
+ is that we wait and hope the query completes. The server will wait
+ automatically until the lag between primary and standby is at most
+ max_standby_delay> seconds. Once that grace period expires,
+ we take one of the following actions:
+
+
+
+
+ If the conflict is caused by a lock, we cancel the conflicting standby
+ transaction immediately. If the transaction is idle-in-transaction
+ then currently we abort the session instead, though this may change
+ in the future.
+
+
+
+
+
+ If the conflict is caused by cleanup records we tell the standby query
+ that a conflict has occurred and that it must cancel itself to avoid the
+ risk that it silently fails to read relevant data because
+ that data has been removed. (This is regrettably very similar to the
+ much feared and iconic error message "snapshot too old"). Some cleanup
+ records only cause conflict with older queries, though some types of
+ cleanup record affect all queries.
+
+
+
+ If cancellation does occur, the query and/or transaction can always
+ be re-executed. The error is dynamic and will not necessarily occur
+ the same way if the query is executed again.
+
+
+
+
+
+
+ max_standby_delay> is set in postgresql.conf>.
+ The parameter applies to the server as a whole so if the delay is all used
+ up by a single query then there may be little or no waiting for queries that
+ follow immediately, though they will have benefited equally from the initial
+ waiting period. The server may take time to catch up again before the grace
+ period is available again, though if there is a heavy and constant stream
+ of conflicts it may seldom catch up fully.
+
+
+
+ Users should be clear that tables that are regularly and heavily updated on
+ the primary server will quickly cause cancellation of longer running queries on
+ the standby. In those cases max_standby_delay> can be
+ considered somewhat but not exactly the same as setting
+ statement_timeout>.
+
+
+
+ Other remedial actions exist if the number of cancellations is unacceptable.
+ The first option is to connect to the primary server and keep a query active
+ for as long as we need to run queries on the standby. This guarantees that
+ a WAL cleanup record is never generated and we don't ever get query
+ conflicts as described above. This could be done using contrib/dblink
+ and pg_sleep(), or via other mechanisms. If you do this, you should note
+ that this will delay cleanup of dead rows by vacuum or HOT and many
+ people may find this undesirable. However, we should remember that
+ primary and standby nodes are linked via the WAL, so this situation is no
+ different to the case where we ran the query on the primary node itself
+ except we have the benefit of off-loading the execution onto the standby.
+
+
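+
+ A minimal sketch of the dblink approach, assuming
+ contrib/dblink> is installed on the standby and using
+ illustrative connection values, holds a query open on the primary
+ for an hour:
+
+ -- run on the standby; keeps a snapshot open on the primary
+ SELECT * FROM dblink('host=192.168.1.50 user=foo dbname=postgres',
+                      'SELECT pg_sleep(3600)') AS t(sleep text);
+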
+
+ It is also possible to set vacuum_defer_cleanup_age> on the primary
+ to defer the cleanup of records by autovacuum, vacuum and HOT. This may allow
+ more time for queries to execute before they are cancelled on the standby,
+ without the need for setting a high max_standby_delay>.
+
+
+
+ Three-way deadlocks are possible between AccessExclusiveLocks arriving from
+ the primary, cleanup WAL records that require buffer cleanup locks and
+ user requests that are waiting behind replayed AccessExclusiveLocks. Deadlocks
+ are resolved by time-out when we exceed max_standby_delay>.
+
+
+
+ Dropping tablespaces or databases is discussed in the administrator's
+ section since they are not typical user situations.
+
+
+
+
+ Administrator's Overview
+
+
+ If there is a recovery.conf> file present the server will start
+ in Hot Standby mode by default, though recovery_connections> can
+ be disabled via postgresql.conf>, if required. The server may take
+ some time to enable recovery connections since the server must first complete
+ sufficient recovery to provide a consistent state against which queries
+ can run before enabling read only connections. Look for these messages
+ in the server logs
+
+
+ LOG: initializing recovery connections
+
+ ... then some time later ...
+
+ LOG: consistent recovery state reached
+ LOG: database system is ready to accept read only connections
+
+
+ Consistency information is recorded once per checkpoint on the primary, as long
+ as recovery_connections> is enabled (on the primary). If this parameter
+ is disabled, it will not be possible to enable recovery connections on the standby.
+ The consistent state can also be delayed in the presence of both of these conditions
+
+
+
+
+ a write transaction has more than 64 subtransactions
+
+
+
+
+ very long-lived write transactions
+
+
+
+
+ If you are running file-based log shipping ("warm standby"), you may need
+ to wait until the next WAL file arrives, which could be as long as the
+ archive_timeout> setting on the primary.
+
+
+
+ The setting of some parameters on the standby will need reconfiguration
+ if they have been changed on the primary. The value on the standby must
+ be equal to or greater than the value on the primary. If these parameters
+ are not set high enough then the standby will not be able to track work
+ correctly from recovering transactions. If these values are set too low,
+ the server will halt. Higher values can then be supplied and the server
+ restarted to begin recovery again.
+
+
+
+
+ max_connections>
+
+
+
+
+ max_prepared_transactions>
+
+
+
+
+ max_locks_per_transaction>
+
+
+
+
+
+
+ It is important that the administrator consider the appropriate setting
+ of max_standby_delay>, set in postgresql.conf>.
+ There is no optimal setting; it should be chosen according to business
+ priorities. For example, if the server is primarily tasked as a High
+ Availability server, then you may wish to lower
+ max_standby_delay> or even set it to zero, though that is a
+ very aggressive setting. If the standby server is tasked as an additional
+ server for decision support queries then it may be acceptable to set this
+ to a value of many hours (in seconds).
+
+
+
+ Transaction status "hint bits" written on the primary are not WAL-logged,
+ so the standby will likely re-write the hints itself.
+ Thus the main database blocks will produce write I/Os even though
+ all users are read-only; no changes have occurred to the data values
+ themselves. Users will be able to write large sort temp files and
+ re-generate relcache info files, so there is no part of the database
+ that is truly read-only during hot standby mode. There is no restriction
+ on the use of set returning functions, or other users of tuplestore/tuplesort
+ code. Note also that writes to remote databases will still be possible,
+ even though the transaction is read-only locally.
+
+
+
+ The following types of administrator command are not accepted
+ during recovery mode
+
+
+
+
+ Data Definition Language (DDL) - e.g. CREATE INDEX
+
+
+
+
+ Privilege and Ownership - GRANT, REVOKE, REASSIGN
+
+
+
+
+ Maintenance commands - ANALYZE, VACUUM, CLUSTER, REINDEX
+
+
+
+
+
+
+ Note again that some of these commands are actually allowed during
+ "read only" mode transactions on the primary.
+
+
+
+ As a result, you cannot create additional indexes that exist solely
+ on the standby, nor statistics that exist solely on the standby.
+ If these administrator commands are needed they should be executed
+ on the primary so that the changes will propagate through to the
+ standby.
+
+
+
+ pg_cancel_backend()> will work on user backends, but not the
+ Startup process, which performs recovery. pg_stat_activity does not
+ show an entry for the Startup process, nor do recovering transactions
+ show as active. As a result, pg_prepared_xacts is always empty during
+ recovery. If you wish to resolve in-doubt prepared transactions
+ then look at pg_prepared_xacts on the primary and issue commands to
+ resolve those transactions there.
+
+
+
+ pg_locks will show locks held by backends as normal. pg_locks also shows
+ a virtual transaction managed by the Startup process that owns all
+ AccessExclusiveLocks held by transactions being replayed by recovery.
+ Note that the Startup process does not acquire locks to
+ make database changes, and thus locks other than AccessExclusiveLocks
+ do not show in pg_locks for the Startup process; they are just presumed
+ to exist.
+
+
+
+ check_pgsql> will work, but it is very simple.
+ check_postgres> will also work, though some actions
+ could give different or confusing results;
+ for example, last vacuum time will not be maintained, since no
+ vacuum occurs on the standby (though vacuums running on the primary do
+ send their changes to the standby).
+
+
+
+ WAL file control commands such as pg_start_backup>
+ and pg_switch_xlog> will not work during recovery.
+
+
+
+ Dynamically loadable modules work, including pg_stat_statements.
+
+
+
+ Advisory locks work normally in recovery, including deadlock detection.
+ Note that advisory locks are never WAL logged, so it is not possible for
+ an advisory lock on either the primary or the standby to conflict with WAL
+ replay. Nor is it possible to acquire an advisory lock on the primary
+ and have it initiate a similar advisory lock on the standby. Advisory
+ locks relate only to a single server on which they are acquired.
+
+
+
+ Trigger-based replication systems such as Slony>,
+ Londiste> and Bucardo> won't run on the
+ standby at all, though they will run happily on the primary server as
+ long as the changes are not sent to standby servers to be applied.
+ WAL replay is not trigger-based so you cannot relay from the
+ standby to any system that requires additional database writes or
+ relies on the use of triggers.
+
+
+
+ New OIDs cannot be assigned, though some UUID> generators may still
+ work as long as they do not rely on writing new status to the database.
+
+
+
+ Currently, temp table creation is not allowed during read only
+ transactions, so in some cases existing scripts will not run correctly.
+ It is possible we may relax that restriction in a later release. This is
+ both a SQL Standard compliance issue and a technical issue.
+
+
+
+ DROP TABLESPACE> can only succeed if the tablespace is empty.
+ Some standby users may be actively using the tablespace via their
+ temp_tablespaces> parameter. If there are temp files in the
+ tablespace we currently cancel all active queries to ensure that temp
+ files are removed, so that we can remove the tablespace and continue with
+ WAL replay.
+
+
+
+ Running DROP DATABASE>, ALTER DATABASE ... SET TABLESPACE>,
+ or ALTER DATABASE ... RENAME> on primary will generate a log message
+ that will cause all users connected to that database on the standby to be
+ forcibly disconnected. This action occurs immediately, whatever the setting of
+ max_standby_delay>.
+
+
+
+ In normal running, if you issue DROP USER> or DROP ROLE>
+ for a role with login capability while that user is still connected then
+ nothing happens to the connected user - they remain connected. The user cannot
+ reconnect however. This behaviour applies in recovery also, so a
+ DROP USER> on the primary does not disconnect that user on the standby.
+
+
+
+ The stats collector is active during recovery. All scans, reads, blocks,
+ index usage, etc., will be recorded normally on the standby. Replayed
+ actions will not duplicate their effects on the primary, so replaying an
+ insert will not increment the Inserts column of pg_stat_user_tables.
+ The stats file is deleted at start of recovery, so stats from primary
+ and standby will differ; this is considered a feature not a bug.
+
+
+
+ Autovacuum is not active during recovery, though it will start normally
+ at the end of recovery.
+
+
+
+ The background writer is active during recovery and will perform
+ restartpoints (similar to checkpoints on the primary) and normal block
+ cleaning activities. The CHECKPOINT> command is accepted during recovery,
+ though it performs a restartpoint rather than a new checkpoint.
+
+
+
+
+ Hot Standby Parameter Reference
+
+
+ Various parameters have been mentioned above in the
+ and sections.
+
+
+
+ On the primary, the parameters recovery_connections> and
+ vacuum_defer_cleanup_age> can be set to assist the
+ successful configuration of Hot Standby servers.
+ max_standby_delay> has no effect if set on the primary.
+
+
+
+ On the standby, the parameters recovery_connections> and
+ max_standby_delay> can be used to enable and control Hot Standby.
+ vacuum_defer_cleanup_age> has no effect during recovery.
+
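+
+ As a sketch, the relevant postgresql.conf> entries on each
+ node might look like this; the values shown are only illustrative:
+
+ # on the primary
+ recovery_connections     = on
+ vacuum_defer_cleanup_age = 10000   # number of transactions
+
+ # on the standby
+ recovery_connections = on
+ max_standby_delay    = 30          # seconds
+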
+
+
+
+ Caveats
+
+
+ At this writing, there are several limitations of Hot Standby.
+ These can and probably will be fixed in future releases:
+
+
+
+
+ Operations on hash indexes are not presently WAL-logged, so
+ replay will not update these indexes. Hash indexes will not be
+ used for query plans during recovery.
+
+
+
+
+ Full knowledge of running transactions is required before snapshots
+ may be taken. Transactions that use large numbers of subtransactions
+ (currently more than 64) will delay the start of read-only
+ connections until the completion of the longest running write transaction.
+ If this situation occurs, explanatory messages will be sent to the server log.
+
+
+
+
+ Valid starting points for recovery connections are generated at each
+ checkpoint on the master. If the standby is shut down while the master
+ is in a shutdown state, it may not be possible to re-enter Hot Standby
+ until the primary is started up so that it generates further starting
+ points in the WAL logs. This is not considered a serious issue
+ because the standby is usually switched into the primary role while
+ the first node is taken down.
+
+
+
+
+ At the end of recovery, AccessExclusiveLocks held by prepared transactions
+ will require twice the normal number of lock table entries. If you plan
+ on running either a large number of concurrent prepared transactions
+ that normally take AccessExclusiveLocks, or you plan on having one
+ large transaction that takes many AccessExclusiveLocks, then you are
+ advised to select a larger value of max_locks_per_transaction>,
+ in rare extremes up to, but never more than, twice the value of the
+ parameter setting on the primary server. You need not consider this at all if
+ your setting of max_prepared_transactions> is 0>.
+
+
+
+
+
+
+
+
+
+
+ Incrementally Updated Backups
+
+
+ incrementally updated backups
+
+
+
+ change accumulation
+
+
+
+ In a warm standby configuration, it is possible to offload the expense of
+ taking periodic base backups from the primary server; instead base backups
+ can be made by backing
+ up a standby server's files. This concept is generally known as
+ incrementally updated backups, log change accumulation, or more simply,
+ change accumulation.
+
+
+
+ If we take a file system backup of the standby server's data
+ directory while it is processing
+ logs shipped from the primary, we will be able to reload that backup and
+ restart the standby's recovery process from the last restart point.
+ We no longer need to keep WAL files from before the standby's restart point.
+ If we need to recover, it will be faster to recover from the incrementally
+ updated backup than from the original base backup.
+
+
+
+ Since the standby server is not live>, it is not possible to
+ use pg_start_backup()> and pg_stop_backup()>
+ to manage the backup process; it will be up to you to determine how
+ far back you need to keep WAL segment files to have a recoverable
+ backup. You can do this by running pg_controldata>
+ on the standby server to inspect the control file and determine the
+ current checkpoint WAL location, or by using the
+ log_checkpoints> option to print values to the standby's
+ server log.
+
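+
+ For example, a sketch of inspecting the control file on the
+ standby; the data directory path and the locations shown are
+ illustrative:
+
+ $ pg_controldata /usr/local/pgsql/standby-data | grep 'checkpoint location'
+ Latest checkpoint location:           0/3C000020
+ Prior checkpoint location:            0/3B000060
+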
+
+