Sync Rep at Oct 5 - Mailing list pgsql-hackers

From      | Simon Riggs
Subject   | Sync Rep at Oct 5
Msg-id    | 1286267566.2025.901.camel@ebony
Responses | Re: Sync Rep at Oct 5
List      | pgsql-hackers
This is an attempt to compile everybody's stated viewpoints and come to an understanding about where we are and where we want to go. The idea from here is that we discuss what we are trying to achieve (requirements) and then later come back to how (design).

To explain replication I use a simplified notation:

  ->  means "transferred to"
  M   means master
  S   means standby

e.g. M -> S describes the simple 2-node case of 1 master and 1 standby.

There are a number of "main" use cases, with many sub-cases.

= Main Use Cases =

== Simplest Case ==

The simplest case is the 2-node case: M -> S

For sync rep, we want to be able to request that M's changes are transferred safely to S.

This case suffers from the problem that if S is unavailable we are not sure whether to wait or not. If we wait, we might wait forever, so we need some mechanism to prevent waiting forever. A variety of mechanisms would be desirable:

* a timeout defined prior to the start of the wait
* an explicit "end waiting" message sent to one/all backends
* the ability to specify "don't wait" if the standby is currently absent

This problem is particularly clear when the server starts. With current streaming the standby can only connect when the master is active, so there may be a sequencing delay between master start and standby start. One possible solution here is to prevent connections until the required robustness levels are available, rather than forcing master transactions to wait until standbys connect and can ack receipt of data.

== Maximum Availability ==

A common architecture for avoiding the problems of waiting is to have multiple standbys. We allow for the possibility that one of the standbys is unavailable by allowing sync rep against just one of the standbys, rather than both of them.

  M -> (S1, S2)

Thus, if S1 is down we still have M -> S2, and if S2 is down we still have M -> S1. This gives N+1 redundancy.
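A short calculation helps quantify this kind of redundancy. The sketch below is one plausible reading of the setup: independent failures, identical per-server outage probability p, and a cluster that counts as "out" once more than its k spares are down simultaneously (names and model are illustrative, not part of any proposal).

```python
from math import comb

def cluster_outage_probability(n: int, k: int, p: float) -> float:
    """Probability that more than k of n servers are down at once,
    i.e. the cluster has exhausted its k spares.  Assumes independent
    failures, each server down with identical probability p."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i)
               for i in range(k + 1, n + 1))

# With no spares (k = 0), any single failure is an outage:
# P = 1 - (1 - p)**n, which grows as n grows.
# Each additional spare pushes P down by roughly another factor of p.
```

The binomial tail makes the trade-off explicit: adding servers without spares makes outages more likely, while each spare buys roughly another factor of p in robustness.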
Some people may wish to have N+k redundancy, like this: M -> (S1, ..., Sk), since the possibility of cluster unavailability increases geometrically with large N. (Would somebody like to work out the probability of cluster outage, capital P, when each server has identical probability of individual outage, small p, and we have N nodes and k spares? ..Binomial series.)

== Quorum Commit ==

Some people have also proposed that they would like to hear back from more than one node, but not necessarily all standby nodes, before allowing commit, which has led to the phrase "quorum commit". Again, this seems a likely requirement for large N. Waiting for more than one response complicates the shmem data structures, since we have to book-keep the state of each standby and follow a more complex algorithm before releasing standbys from their wait.

== Load Balanced Server Farm ==

Some people have requested a mode that requires confirmation of sync rep against *all* standbys before allowing commit. Even with this, data modifications would become visible on the standbys at different times, so this configuration is still only "eventually consistent", as are the other sync rep use cases. This config would get geometrically less robust as new servers are added.

If we require waiting for "all" servers, we would need to keep a registered list of servers that ought to be connected, allowing us to wait when one of those servers is unavailable. Both automatic and explicit registration have been proposed.

= Common Sub-Cases/Requirements =

== Standby Registration ==

Some people prefer an explicitly defined cluster config, with additional specifications of what happens if some parts go down. Other people prefer less than that. This section is a record of the various methods suggested.
There are a few possibilities:

* path config: we specify what happens on each specific path to a standby
* standby config: we specify what a standby can do, but not the specific paths to that standby
* no standby-level config: we don't specify anything on a per-standby basis

We can also specify registration as:

* explicit registration
* automatic registration - some servers register automatically
* no registration records kept at all

== Failover Configuration Minimisation ==

An important aspect of robustness is the ability to specify a configuration that will remain in place even though 1 or more servers have gone down. It is desirable to specify sync rep requirements such that we do not refer to individual servers, if possible. Each such rule necessarily requires an "else" condition, possibly multiple else conditions.

It is desirable to avoid both of these:

* the need to have different configuration files on each node
* the need to have configurations that only become active in case of failure; these are known to be hard to test and very likely to be misconfigured in the event of failover

[I know a bank that was down for a whole week when the standby server's config was wrong and had never been fully tested. The error was simple and obvious, but the fault showed itself as a sporadic error that was difficult to trace.]

== Sync Rep Performance ==

Sync Rep is a potential performance hit, and that hit is known to increase as geographical distance increases. We want to be able to specify the performance of some nodes so that we have 4 levels of robustness:

* async - doesn't wait for sync
* recv  - syncs when messages are received by the standby
* fsync - syncs when messages are written to disk by the standby
* apply - syncs when messages are applied on the standby

== Transaction Controlled Robustness ==

It's desirable to be able to specify the robustness levels (in crude form, which remote servers need to confirm before we end waiting). This ability is an important tuning feature for sync replication.
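The four levels above, together with the simplest "level" proposal (one server achieving the requested level is sufficient sync), can be modelled with a small sketch. This is a toy illustration, not a proposed interface; all names are invented for the example.

```python
from enum import IntEnum

class SyncLevel(IntEnum):
    """The four robustness levels, weakest to strongest."""
    ASYNC = 0   # don't wait for sync at all
    RECV  = 1   # standby has received the WAL
    FSYNC = 2   # standby has written the WAL to disk
    APPLY = 3   # standby has applied the WAL

def can_commit(requested: SyncLevel, standby_acks: list[SyncLevel]) -> bool:
    """Release the waiting backend once at least one standby reports a
    state at or beyond the level this transaction asked for."""
    if requested == SyncLevel.ASYNC:
        return True          # async never waits
    return any(ack >= requested for ack in standby_acks)
```

Because the levels are ordered, an `apply` acknowledgement also satisfies a transaction that only asked for `recv`; quorum or weighted variants would replace the `any(...)` test with a count or weighted sum over the acknowledgements.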
Some people have mentioned simply a "level", so that once that level is achieved by one server that is sufficient sync. Others have said they want to confirm that their preferred failover target has received the replicated data. [I am unsure as yet why specifying an exact failover target is desirable, since it complicates configuration when that exact server is down.] Others have proposed that some combination of levels and specific servers would provide sufficient sync, or some combination based upon weighted values of different servers.

It has also been suggested that any level of complexity beyond the very simple can be hidden within a "replication service", so there is indirection between the application and the sync rep configuration.

== Path Minimization ==

We want to be able to minimize and control the path of data transfer:

* so that the current master doesn't have to initiate transfer to all dependent nodes, thereby reducing overhead on the master
* so that if the path from the current master to a descendent is expensive, we can minimize network costs

This requirement is commonly known as "relaying". In its most simply stated form, we want one standby to be able to get WAL data from another standby, e.g. M -> S -> S. Stating the problem in that way misses the actual requirement, since people would like the arrangement to be robust in case of failures of M or any S. If we specify the exact arrangement of paths, then we need to respecify the arrangement of paths if a server goes down.

For sync rep, we may have a situation where the master sends data to a group of remote nodes. We would like to specify that the transfer is synchronous to *one* of S1, S2 or S3, depending upon which is directly connected to the master at the present moment. After that, data is transferred locally to other nodes by relay. If S1 is down, then S2 would take the place of S1 seamlessly and would feed S1 if it comes back up.
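The failover rule just described can be sketched as a small selection function: the first live standby becomes the relay head that the master syncs with, and every other standby (including one that is down and may later recover) hangs off it by relay. This is only an illustration of the requirement, with invented names, not a design.

```python
def arrange_relay(standbys, is_up):
    """Pick the first live standby as the relay head; it receives WAL
    directly from the master and feeds all remaining standbys by relay.
    Returns (head, followers), or (None, []) if every standby is down."""
    live = [s for s in standbys if is_up(s)]
    if not live:
        return None, []
    head = live[0]
    followers = [s for s in standbys if s != head]
    return head, followers

# Normally: M -> S1 -> (S2, S3).
# If S1 is down, S2 becomes the head and will feed S3,
# and also S1 once it comes back up.
```

The master only cares that *some* head acknowledges; which standby plays that role changes as servers fail and recover, without reconfiguring the master.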
The master does not wish to specify which of (S1, S2, S3) it syncs with, as long as *one* of them confirms it. So from the master's perspective we have

  M -> {S1, S2, S3}

though in practice the physical arrangement will be

  M -> S1 -> (S2, S3)

unless S1 is down, in which case the arrangement would be

  M -> S2 -> S3
  M -> S2 -> (S3, S1)   if S1 comes back up.

Note the curly brackets, which denote a difference between this case and the M -> (S1, S2, S3) arrangement where all 3 paths are active.

== Lower Priority System Access ==

e.g. a development or test database server.

The requirement here is for a lower priority server to be attached to the replication cluster. It receives data normally from the master, but we would like to explicitly specify that it is not part of the high availability framework and will never be a failover target. So we need to be able to specify either/all of:

* explicitly ignore this server
* implicitly ignore this server because it is not on the list of explicitly included servers

So we must describe config at the server level, somewhere. This requirement differs from the ability to specify robustness level on the master.

There may be mistakes or omissions in the above. I've mentioned a few places where I note the request but don't yet understand the argument as to why that aspect is desirable. Please add any aspects I've missed. I'll dump the edited version to the Wiki when we've had some feedback on this.

--
Simon Riggs
www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Training and Services