Thread: Sync Rep at Oct 5
This is an attempt to compile everybody's stated viewpoints and come to an understanding about where we are and where we want to go. The idea from here is that we discuss what we are trying to achieve (requirements) and then later come back to how (design).

To explain replication I use a simplified notation:
->  means "transferred to"
M   means master
S   means standby
e.g. M -> S describes the simple 2-node case of 1 master and 1 standby.

There are a number of "main" use cases, with many sub-cases.

= Main Use Cases =

== Simplest Case ==

The simplest case is the 2-node case: M -> S

For sync rep, we want to be able to request that M's changes are transferred safely to S. This case suffers from the problem that if S is unavailable we are not sure whether to wait or not. If we wait, we might wait forever, so we need some mechanisms to prevent waiting forever. A variety of mechanisms would be desirable:
* a timeout defined prior to the start of the wait
* an explicit "end waiting" message sent to one/all backends
* the ability to specify "don't wait" if the standby is currently absent

This problem is particularly clear when the server starts. In current streaming replication the standby can only connect when the master is active, so there may be a sequencing delay between master start and standby start. One possible solution here is to prevent connection until the required robustness levels are available, rather than forcing master transactions to wait until standbys connect and can ack receipt of data.

== Maximum Availability ==

A common architecture for avoiding the problems of waiting is to have multiple standbys. We allow for the possibility that one of the standbys is unavailable by allowing sync rep against just one of the standbys, rather than both of them.

M -> (S1, S2)

Thus, if S1 is down we still have M -> S2, or if S2 is down we still have M -> S1. This gives N+1 redundancy. Some people may wish to have N+k redundancy, like this: M -> (S1, ..., Sk), since the possibility of cluster unavailability increases geometrically with large N. (Would somebody like to work out the probability of cluster outage, capital P, when each server has an identical probability of individual outage, small p, and we have N nodes and k spares? ...Binomial series.)
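Taking up that last question: under one possible reading, the cluster is unavailable whenever more than k of the N+k servers are down at the same time, with failures assumed independent. A minimal sketch of that calculation follows; the function name and the exact availability criterion are my assumptions for illustration, not anything agreed in this thread.

from math import comb

def cluster_outage_probability(n, k, p):
    # Hypothetical model: n nodes are needed, k spares are available, and
    # each of the n + k servers is independently down with probability p.
    # The cluster counts as "out" when more than k servers are down at
    # once, i.e. fewer than n remain available.
    total = n + k
    return sum(comb(total, i) * p ** i * (1 - p) ** (total - i)
               for i in range(k + 1, total + 1))

# Example: 2 required nodes, 1 spare, each down 1% of the time
# -> P(outage) is roughly 0.0003, versus 0.01 for a single unprotected node.
print(cluster_outage_probability(2, 1, 0.01))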
== Quorum Commit ==

Some people have also proposed that they would like to hear back from more than one node, but not necessarily all standby nodes, before allowing commit, which has led to the phrase "quorum commit". Again, this seems a likely requirement for large N. Waiting for more than one response complicates the shmem data structures, since we have to book-keep the state of each standby and follow a more complex algorithm before releasing standbys from their wait.

== Load Balanced Server Farm ==

Some people have requested that we have a mode that requires confirmation of sync rep against *all* standbys before allowing commit. Even with this, data modifications would become visible on the standbys at different times, so this configuration is still only "eventually consistent", as are the other sync rep use cases. This config would get geometrically less robust as new servers are added.

If we require waiting for "all" servers we would need to keep a registered list of servers that ought to be connected, allowing us to wait when one of those servers is unavailable. Automatic and explicit registration have been proposed.

= Common Sub-Cases/Requirements =

== Standby Registration ==

Some people prefer an explicitly defined cluster config, with additional specifications of what happens if some parts go down. Other people prefer less than that. This section is a record of the various methods suggested. There are a few possibilities:
* path config: we specify what happens on the path to each specific standby
* standby config: we specify what a standby can do, but not the specific paths to that standby
* no standby-level config: we don't specify what happens on a per-standby basis

We can also specify registration as
* explicit registration
* automatic registration - some servers register automatically
* no registration records kept at all

== Failover Configuration Minimisation ==

An important aspect of robustness is the ability to specify a configuration that will remain in place even though 1 or more servers have gone down.

It is desirable to specify sync rep requirements such that we do not refer to individual servers, if possible. Each such rule necessarily requires an "else" condition, possibly multiple else conditions.

It is desirable to avoid both of these:
* the need to have different configuration files on each node
* the need to have configurations that only become active in case of failure. These are known to be hard to test and very likely to be misconfigured in the event of failover. [I know a bank that was down for a whole week when the standby server's config was wrong and had never been fully tested. The error was simple and obvious, but the fault showed itself as a sporadic error that was difficult to trace.]

== Sync Rep Performance ==

Sync rep is a potential performance hit, and that hit is known to increase as geographical distance increases.

We want to be able to specify the performance of some nodes so that we have 4 levels of robustness:
async - doesn't wait for sync
recv - syncs when messages are received by the standby
fsync - syncs when messages are written to disk by the standby
apply - syncs when messages are applied by the standby

== Transaction Controlled Robustness ==

It's desirable to be able to specify the robustness levels (in crude form, which remote servers need to confirm before we end waiting). This ability is an important tuning feature for sync replication.

Some people have mentioned simply a "level", so that once that level is achieved by one server that is sufficient sync. Others have said they want to confirm that their preferred failover target has received the replicated data. [I am unsure as yet why specifying an exact failover target is desirable, since it complicates configuration when that exact server is down.] Others have proposed that some combination of levels and specific servers would provide sufficient sync, or some combination based upon weighted values of different servers. It has also been suggested that any level of complexity beyond the very simple can be hidden within a "replication service", so there is indirection between the application and the sync rep configuration.
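To make the "end waiting" book-keeping from the Quorum Commit and Transaction Controlled Robustness sections concrete, here is a minimal sketch, not the proposed design: the level ordering, the quorum count and the per-standby state shown here are all assumptions for illustration only. The idea is simply that a committing backend is released once enough standbys have confirmed at least the level it asked for.

from enum import IntEnum

class SyncLevel(IntEnum):
    # Illustrative ordering only: a higher level implies the lower ones.
    ASYNC = 0   # don't wait
    RECV  = 1   # standby has received the WAL
    FSYNC = 2   # standby has written the WAL to disk
    APPLY = 3   # standby has applied the WAL

def commit_may_finish(requested_level, quorum, standby_states):
    # standby_states maps standby name -> highest level that standby has
    # confirmed for this commit's WAL position. The commit is released
    # from its wait once at least `quorum` standbys confirm the level.
    if requested_level == SyncLevel.ASYNC:
        return True
    confirmed = sum(1 for lvl in standby_states.values() if lvl >= requested_level)
    return confirmed >= quorum

# Example: wait for fsync on at least 2 of 3 standbys.
states = {"S1": SyncLevel.APPLY, "S2": SyncLevel.FSYNC, "S3": SyncLevel.RECV}
print(commit_may_finish(SyncLevel.FSYNC, 2, states))   # True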
== Path Minimization ==

We want to be able to minimize and control the path of data transfer,
* so that the current master doesn't have to initiate transfer to all dependent nodes, thereby reducing overhead on the master
* so that if the path from the current master to a descendant is expensive, we can minimize network costs.

This requirement is commonly known as "relaying".

In its most simply stated form, we want one standby to be able to get WAL data from another standby, e.g. M -> S -> S. Stating the problem in that way misses out on the actual requirement, since people would like the arrangement to be robust in case of failures of M or any S. If we specify the exact arrangement of paths then we need to respecify the arrangement of paths if a server goes down.

For sync rep, we may have a situation where the master sends data to a group of remote nodes. We would like to specify that the transfer is synchronous to *one* of S1, S2 or S3, depending upon which is directly connected to the master at the present moment. After that, data is transferred locally to the other nodes by relay. If S1 is down, then S2 would take the place of S1 seamlessly and would feed S1 if it comes back up. The master does not wish to specify which of (S1, S2, S3) it syncs with - as long as *one* of them confirms it. So from the master's perspective we have

M -> {S1, S2, S3}

though in practice the physical arrangement will be

M -> S1 -> (S2, S3)

unless S1 is down, in which case the arrangement would be M -> S2 -> S3, or M -> S2 -> (S3, S1) if S1 comes back up. Note the curly brackets, which denote a difference between this case and the M -> (S1, S2, S3) arrangement where all 3 paths are active.

== Lower Priority System Access ==

e.g. a development or test database server

The requirement here is for a lower priority server to be attached to the replication cluster. It receives data normally from the master, but we would like to explicitly specify that it is not part of the high availability framework and will never be a failover target. So we need to be able to specify either/both of
* explicitly ignore this server
* implicitly ignore this server because it is not on the list of explicitly included servers

So we must describe config at the server level, somewhere. This requirement differs from the ability to specify robustness level on the master.

There may be mistakes or omissions in the above. I've mentioned a few places where I note the request but don't yet understand the argument as to why that aspect is desirable. Please add any aspects I've missed. I'll dump the edited version to the Wiki when we've had some feedback on this.

-- 
Simon Riggs                 www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Training and Services
On 10-10-05 04:32 AM, Simon Riggs wrote:
>
> This is an attempt to compile everybody's stated viewpoints and come to
> an understanding about where we are and where we want to go. The idea
> from here is that we discuss what we are trying to achieve
> (requirements) and then later come back to how (design).

Great start on summarizing the discussions. Getting a summary of the requirements in one place will help people who haven't been diligent in following all the sync-rep email threads stay involved.

<snip>

> == Failover Configuration Minimisation ==
>
> An important aspect of robustness is the ability to specify a
> configuration that will remain in place even though 1 or more servers
> have gone down.
>
> It is desirable to specify sync rep requirements such that we do not
> refer to individual servers, if possible. Each such rule necessarily
> requires an "else" condition, possibly multiple else conditions.
>
> It is desirable to avoid both of these
> * the need to have different configuration files on each node
> * the need to have configurations that only become active in case of
> failure. These are known to be hard to test and very likely to be
> misconfigured in the event of failover [I know a bank that was down for
> a whole week when standby server's config was wrong and had never been
> fully tested. The error was simple and obvious, but the fault showed
> itself as a sporadic error that was difficult to trace]

Also on the topic of failover: how do we want to deal with the master failing over? Say M -> {S1, S2} and M fails, and we promote S1 to M1. Can M1 -> S2? What if S2 was further along in processing than S1 when M failed? I don't think we want to take on this complexity for 9.1, but this means that after M fails you won't have a synchronous replica until you rebuild or somehow reset S2.

> == Sync Rep Performance ==
>
> Sync Rep is a potential performance hit, and that hit is known to
> increase as geographical distance increases.
>
> We want to be able to specify the performance of some nodes so that we
> have 4 levels of robustness:
> async - doesn't wait for sync
> recv - syncs when messages received by standby
> fsync - syncs when messages written to disk by standby
> apply - sync when messages applied to standby

Will read-only queries running on a slave hold up transactions from being applied on that slave? I suspect that for most people running with 'apply' they would want the answer to be 'no'. Are we going to revisit the standby query cancellation discussion?

> == Path Minimization ==
>
> We want to be able to minimize and control the path of data transfer,
> * so that the current master doesn't have initiate transfer to all
> dependent nodes, thereby reducing overhead on master
> * so that if the path from current master to descendent is expensive we
> would minimize network costs.
>
> This requirement is commonly known as "relaying".
>
> In its most simply stated form, we want one standby to be able to get
> WAL data from another standby. e.g. M -> S -> S. Stating the problem in
> that way misses out on the actual requirement, since people would like
> the arrangement to be robust in case of failures of M or any S. If we
> specify the exact arrangement of paths then we need to respecify the
> arrangement of paths if a server goes down.

Are we going to allow these paths to be reconfigured on a live cluster? If we have M -> S1 -> S2 and we want to reconfigure S2 to read from M, then S2 needs to get the data that has already been committed on S1 from somewhere (either S1 or M). This has solutions but it adds to the complexity. Maybe not for 9.1.
On Tue, 2010-10-05 at 11:30 -0400, Steve Singer wrote:

> Will read-only queries running on a slave hold up transactions from
> being applied on that slave? I suspect that for most people running
> with 'apply' they would want the answer to be 'no'. Are we going to
> revisit the standby query cancellation discussion?

Once we have a feedback channel from standby to master it's a simple matter to add some feedback to avoid many query cancellations. That was the original plan for 9.0, but we changed from sync rep to streaming rep so late in the cycle that there was no time to do it that way.

-- 
Simon Riggs                 www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Training and Services
On Tue, 2010-10-05 at 11:30 -0400, Steve Singer wrote:

> Also on the topic of failover how do we want to deal with the master
> failing over. Say M->{S1,S2} and M fails and we promote S1 to M1. Can
> M1->S2? What if S2 was further along in processing than S1 when M
> failed? I don't think we want to take on this complexity for 9.1 but
> this means that after M fails you won't have a synchronous replica until
> you rebuild or somehow reset S2.

Those are problems that can be resolved, but that is the current state. The trick, I guess, is to promote the correct standby. Those are generic issues, not related to any specific patch. Thanks for keeping those issues in the limelight.

> > == Path Minimization ==
> >
> > We want to be able to minimize and control the path of data transfer,
> > * so that the current master doesn't have initiate transfer to all
> > dependent nodes, thereby reducing overhead on master
> > * so that if the path from current master to descendent is expensive we
> > would minimize network costs.
> >
> > This requirement is commonly known as "relaying".
> >
> > In its most simply stated form, we want one standby to be able to get
> > WAL data from another standby. e.g. M -> S -> S. Stating the problem in
> > that way misses out on the actual requirement, since people would like
> > the arrangement to be robust in case of failures of M or any S. If we
> > specify the exact arrangement of paths then we need to respecify the
> > arrangement of paths if a server goes down.
>
> Are we going to allow these paths to be reconfigured on a live cluster?
> If we have M->S1->S2 and we want to reconfigure S2 to read from M then
> S2 needs to get the data that has already been committed on S1 from
> somewhere (either S1 or M). This has solutions but it adds to the
> complexity. Maybe not for 9.1

If you switch from M -> S1 -> S2 to M -> (S1, S2) it should work fine. At the moment that needs a shutdown/restart, but that could easily be done with a disconnect/reconnect following a file reload.

The problem is how much WAL is stored on (any) node. Currently that is wal_keep_segments, which doesn't work very well, but I've seen no better ideas that cover all important cases.

-- 
Simon Riggs                 www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Training and Services
On Wed, Oct 6, 2010 at 4:06 PM, Simon Riggs <simon@2ndquadrant.com> wrote:
> The problem is how much WAL is stored on (any) node. Currently that is
> wal_keep_segments, which doesn't work very well, but I've seen no better
> ideas that cover all important cases.

What about allowing the master to read and send WAL from the archive?

Regards,

-- 
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center
On 10-10-07 05:52 AM, Fujii Masao wrote:
> On Wed, Oct 6, 2010 at 4:06 PM, Simon Riggs <simon@2ndquadrant.com> wrote:
>> The problem is how much WAL is stored on (any) node. Currently that is
>> wal_keep_segments, which doesn't work very well, but I've seen no better
>> ideas that cover all important cases.
>
> What about allowing the master to read and send WAL from the archive?

Then you have to deal with telling the archive how long it needs to keep WAL segments, because the master might ask for them back. If the archive is remote from the master then you have some extra network copying going on. It would be better to let the slave that is being reconfigured read the missing WAL from the archive.
On Thu, Oct 7, 2010 at 8:46 PM, Steve Singer <ssinger@ca.afilias.info> wrote:
> Then you have to deal with telling the archive how long it needs to keep WAL
> segments because the master might ask for them back.

Yeah, it's not easy to determine how long we should keep the archived WAL files. We need to calculate which WAL files are deletable according to the progress of each standby and the state of the stored base backups which might still be used for PITR.

> If the archive is
> remote from the master then you have some extra network copying going on.

Yep, so I think that the master (i.e., walsender) should read the archived WAL file by using the restore_command specified by users. If the archive is remote from the master, then we would need to specify something like scp in restore_command. Also, even if you compress the archived WAL file by using pg_compress, the master can decompress it by using pg_decompress in restore_command and transfer it.

> It would be better to let the slave being reconfigured to read the missing
> WAL from the archive.

That's one of the choices.

But I've heard that some people don't want to set up a shared archive area which can be accessed by both the master and the standby. For example, they feel that it's complex to configure an NFS server or automatic scp without a password for sharing the archived WAL files.

Currently we have to increase wal_keep_segments to work around that problem. But the pg_xlog disk space is usually small and not suitable for keeping many WAL files. So we might be unable to increase wal_keep_segments.

If we allow the master to stream WAL files from the archive, we don't need to increase wal_keep_segments or set up such a complex configuration. So this idea is one of the useful choices, I think.

Regards,

-- 
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center
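To illustrate the archive-retention calculation mentioned above, here is a minimal sketch; everything in it (the function name, the plain-integer segment representation, the inputs) is assumed purely for illustration and is not how PostgreSQL tracks WAL internally. The point is only that an archived segment can be removed once every standby and every base backup that might still be restored is already past it.

def removable_archived_wal(archived_segments, standby_segments, backup_start_segments):
    # Illustrative sketch only: segments are modelled as plain integers in
    # ascending WAL order. A file in the archive can be deleted only when
    # every standby has replayed past it AND every stored base backup that
    # might still be restored for PITR starts after it.
    oldest_still_needed = min(standby_segments + backup_start_segments)
    return [seg for seg in archived_segments if seg < oldest_still_needed]

# Example: standbys have reached segments 120 and 135, and the oldest base
# backup we still keep for PITR starts at segment 110 -> everything below
# 110 can be removed from the archive.
print(removable_archived_wal(list(range(100, 140)), [120, 135], [110]))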
On Thu, Oct 7, 2010 at 9:08 AM, Fujii Masao <masao.fujii@gmail.com> wrote:
> On Thu, Oct 7, 2010 at 8:46 PM, Steve Singer <ssinger@ca.afilias.info> wrote:
>> Then you have to deal with telling the archive how long it needs to keep WAL
>> segments because the master might ask for them back.
>
> Yeah, it's not easy to determine how long we should keep the archived WAL files.
> We need to calculate which WAL file is deletable according to the progress of
> each standby and the state of the stored base backups which still might be used
> for PITR.
>
>> If the archive is
>> remote from the master then you have some extra network copying going on.
>
> Yep, so I think that the master (i.e., walsender) should read the archived
> WAL file by using restore_command specified by users. If the archive is
> remote from the master, then we would need to specify something like scp in
> restore_command. Also, even if you compress the archived WAL file by using
> pg_compress, the master can decompress it by using pg_decompress in
> restore_command and transfer it.
>
>> It would be better to let the slave being reconfigured to read the missing
>> WAL from the archive.
>
> That's one of choices.
>
> But I've heard that some people don't want to set up the shared archive area
> which can be accessed by the master and the standby. For example, they feel
> that it's complex to configure NFS server or automatic-scp-without-password
> setting for sharing the archived WAL files.
>
> Currently we have to increase wal_keep_segments to work around that problem.
> But the pg_xlog disk space is usually small and not suitable to keep many
> WAL files. So we might be unable to increase wal_keep_segments.
>
> If we allow the master to stream WAL files from the archive, we don't need
> to increase wal_keep_segments and set up such a complex configuration. So
> this idea is one of useful choices, I think.

I'm not sure anyone other than yourself has endorsed this idea, but in any case it seems off the critical path for getting this feature committed.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise Postgres Company