Thread: Sync Rep Design
I've mulled over the design for sync rep for a while now, and have come up with a feature set that includes the final detailed feedback from Fujii Masao, Aidan Van Dyk, Josh Berkus and others.

The design also draws from MySQL concepts to make the two interfaces as similar and as simple as possible. It should be noted that the design presented here has many features that the MySQL design does not.

I am currently finishing up my patch to offer these features, so it's time to begin final discussions.

As an interim step, I enclose a PDF version of relevant excerpts from the doc patch. The patch will follow on a later post in the near future.

I would like to separate discussions on "user interface" from that of internal design, to make it easier for more people to get involved. Please read the following and post your comments. Thank you.

--
Simon Riggs                   http://www.2ndQuadrant.com/books/
PostgreSQL Development, 24x7 Support, Training and Services
On 12/30/2010 06:26 PM, Simon Riggs wrote:
> I've mulled over the design for sync rep for awhile now, and have come
> up with a feature set that includes the final detailed feedback from
> Fujii Masao, Aidan Van Dyk, Josh Berkus and others.
>
> The design also draws from MySQL concepts to make the two interfaces as
> similar and as simple as possible. It should be noted that the design
> presented here has many features that the MySQL design does not.
>
> I am currently finishing up my patch to offer these features, so its
> time to begin final discussions.
>
> As an interim step, I enclose a PDF version of relevant excerpts from
> the doc patch. The patch will follow on a later post in the near future.
>
> I would like to separate discussions on "user interface" from that of
> internal design, to make it easier for more people to get involved.
> Please read the following and post your comments. Thank you.

it would help if this would just be a simple text-only description of the design that people can actually comment on inline. I don't think sending technical design proposals as a pdf (which seems to be written in doc-style as well) is a good idea to encourage discussion on -hackers :(

Stefan
On Thu, 2010-12-30 at 18:42 +0100, Stefan Kaltenbrunner wrote:
> it would help if this would just be a simple text-only description of
> the design that people can actually comment on inline. I don't think
> sending technical design proposals as a pdf (which seems to be written
> in doc-style as well) is a good idea to encourage discussion on -hackers :(

25.2.6. Synchronous Replication

Streaming replication is by default asynchronous. Transactions on the primary server write commit records to WAL, yet do not know whether or when a standby has received and processed those changes. So with asynchronous replication, if the primary crashes, transactions committed on the primary might not have been received by any standby. As a result, failover from primary to standby could cause data loss because transaction completions are absent, relative to the primary. The amount of data loss is proportional to the replication delay at the time of failover.

Synchronous replication offers the ability to guarantee that all changes made by a transaction have been transferred to at least one remote standby server. This is an extension to the standard level of durability offered by a transaction commit. This is referred to as semi-synchronous replication.

When synchronous replication is requested, the commit of a write transaction will wait until confirmation that the commit record has been transferred successfully to at least one standby server. Waiting for confirmation increases the user's confidence that the changes will not be lost in the event of server crashes but it also necessarily increases the response time for the requesting transaction. The minimum wait time is the roundtrip time from primary to standby.

Read only transactions and transaction rollbacks need not wait for replies from standby servers. Subtransaction commits do not wait for responses from standby servers, only final top-level commits.
Long running actions such as data loading or index building do not wait until the very final commit message.

25.2.6.1. Basic Configuration

Synchronous replication must be enabled on both the primary and at least one standby server. If synchronous replication is disabled on the master, or enabled on the primary but not enabled on any slaves, the primary will use asynchronous replication by default.

We use a single parameter to enable synchronous replication, set in postgresql.conf on both primary and standby servers:

synchronous_replication = off (default) | on

On the primary, synchronous_replication can be set for particular users or databases, or dynamically by application programs.

If more than one standby server specifies synchronous_replication, then whichever standby replies first will release waiting commits.

Turning this setting off for a standby allows the administrator to exclude certain standby servers from releasing waiting transactions. This is useful if not all standby servers are designated as potential future primary servers. On the standby, this parameter only takes effect at server start.

25.2.6.2. Planning for Performance

Synchronous replication usually requires carefully planned and placed standby servers to ensure applications perform acceptably. Waiting doesn't utilise system resources, but transaction locks continue to be held until the transfer is confirmed. As a result, incautious use of synchronous replication will reduce performance for database applications because of increased response times and higher contention.

PostgreSQL allows the application developer to specify the durability level required via replication. This can be specified for the system overall, though it can also be specified for specific users or connections, or even individual transactions.
For example, an application workload might consist of: 10% of changes are important customer details, while 90% of changes are less important data that the business can more easily survive if it is lost, such as chat messages between users.

With synchronous replication options specified at the application level (on the master) we can offer sync rep for the most important changes, without slowing down the bulk of the total workload. Application level options are an important and practical tool for allowing the benefits of synchronous replication for high performance applications. This feature is unique to PostgreSQL.

25.2.6.3. Planning for High Availability

The easiest and safest method of gaining High Availability using synchronous replication is to configure at least two standby servers. To understand why, we need to examine what can happen when you lose all standby servers.

Commits made when synchronous_replication is set will wait until at least one standby responds. The response may never occur if the last, or only, standby should crash or the network drops. What should we do in that situation?

Sitting and waiting will typically cause operational problems because it is an effective outage of the primary server. Allowing the primary server to continue processing in the absence of a standby puts those latest data changes at risk. How we handle this situation is controlled by allow_standalone_primary. The default setting is on, allowing processing to continue, though there is no recommended setting. Choosing the best setting for allow_standalone_primary is a difficult decision and best left to those with combined business responsibility for both data and applications. The difficulty of this choice is the reason why we recommend that you reduce the possibility of this situation occurring by using multiple standby servers.
When the primary is started with allow_standalone_primary enabled, the primary will not allow connections until a standby connects that also has synchronous_replication enabled. This is a convenience to ensure that we don't allow connections before write transactions will return successfully.

When allow_standalone_primary is set, a user will stop waiting once the replication_timeout has been reached for their specific session. Users are not waiting for a specific standby to reply, they are waiting for a reply from any standby, so the unavailability of any one standby is not significant to a user. It is possible for user sessions to hit timeout even though standbys are communicating normally. In that case, the setting of replication_timeout is probably too low.

The standby sends regular status messages to the primary. If no status messages have been received for replication_timeout the primary server will assume the connection is dead and terminate it. This happens whatever the setting of allow_standalone_primary.

If the primary crashes while commits are waiting for acknowledgement, those transactions will be marked fully committed if the primary database recovers, no matter how allow_standalone_primary is set. There is no way to be certain that all standbys have received all outstanding WAL data at the time of the crash of the primary. Some transactions may not show as committed on the standby, even though they show as committed on the primary. The guarantee we offer is that the application will not receive explicit acknowledgement of the successful commit of a transaction until the WAL data is known to be safely received by the standby. Hence this mechanism is technically "semi synchronous" rather than "fully synchronous" replication. Note that replication will still not be fully synchronous even if we wait for all standby servers, though this would reduce availability, as described previously.
If you need to re-create a standby server while transactions are waiting, make sure that the commands to run pg_start_backup() and pg_stop_backup() are run in a session with synchronous_replication = off, otherwise those requests will wait forever for the standby to appear.

18.5.5. Synchronous Replication

These settings control the behavior of the built-in synchronous replication feature. These parameters would be set on the primary server that is to send replication data to one or more standby servers.

synchronous_replication (boolean)

    Specifies whether transaction commit will wait for WAL records to be replicated before the command returns a "success" indication to the client. The default setting is off. When on, there will be a delay while the client waits for confirmation of successful replication. That delay will increase depending upon the physical distance and network activity between primary and standby. The commit wait will last until the first reply from any standby. Multiple standby servers allow increased availability and possibly increase performance as well.

    The parameter must be set on both primary and standby. On the primary, this parameter can be changed at any time; the behavior for any one transaction is determined by the setting in effect when it commits. It is therefore possible, and useful, to have some transactions replicate synchronously and others asynchronously. For example, to make a single multistatement transaction commit asynchronously when the default is synchronous replication, issue SET LOCAL synchronous_replication TO OFF within the transaction.

    On the standby, the parameter value is taken only at server start.

synchronous_replication_timeout (boolean)

    If the client has synchronous_replication set, and allow_standalone_primary is also set, then the commit will wait for up to synchronous_replication_timeout milliseconds before it returns a "success", or will wait forever if synchronous_replication_timeout is set to -1.
    If a standby server does not reply for synchronous_replication_timeout the primary will terminate the replication connection.

allow_standalone_primary (boolean)

    If allow_standalone_primary is not set, then the server will not allow connections until a standby connects that has synchronous_replication enabled.

    allow_standalone_primary also affects the behaviour when the synchronous_replication_timeout is reached.

25.5.2. Handling query conflicts

….

Remedial possibilities exist if the number of standby-query cancellations is found to be unacceptable. Typically the best option is to enable hot_standby_feedback. This prevents VACUUM from removing recently-dead rows and so cleanup conflicts do not occur. If you do this, you should note that this will delay cleanup of dead rows on the primary, which may result in undesirable table bloat. However, the cleanup situation will be no worse than if the standby queries were running directly on the primary server. You are still getting the benefit of off-loading execution onto the standby and the query may complete faster than it would have done on the primary server. max_standby_archive_delay must be kept large in this case, because delayed WAL files might already contain entries that conflict with the desired standby queries.

…

18.5.6. Standby Servers

These settings control the behavior of a standby server that is to receive replication data.

hot_standby (boolean)

    Specifies whether or not you can connect and run queries during recovery, as described in Section 25.5. The default value is off. This parameter can only be set at server start. It only has effect during archive recovery or in standby mode.

hot_standby_feedback (boolean)

    Specifies whether or not a hot standby will send feedback to the primary about queries currently executing on the standby. This parameter can be used to eliminate query cancels caused by cleanup records, though it can cause database bloat on the primary for some workloads. The default value is off.
    This parameter can only be set at server start. It only has effect if hot_standby is enabled.

….

--
Simon Riggs                   http://www.2ndQuadrant.com/books/
PostgreSQL Development, 24x7 Support, Training and Services
On Thu, Dec 30, 2010 at 2:04 PM, Simon Riggs <simon@2ndquadrant.com> wrote:
On Thu, 2010-12-30 at 18:42 +0100, Stefan Kaltenbrunner wrote:
> it would help if this would just be a simple text-only description of
> the design that people can actually comment on inline. I don't think
> sending technical design proposals as a pdf (which seems to be written
> in doc-style as well) is a good idea to encourage discussion on -hackers :(

25.2.6. Synchronous Replication
Streaming replication is by default asynchronous. Transactions on the
primary server write commit records to WAL, yet do not know whether or
when a standby has received and processed those changes. So with
asynchronous replication, if the primary crashes, transactions committed
on the primary might not have been received by any standby. As a result,
failover from primary to standby could cause data loss because
transaction completions are absent, relative to the primary. The amount
of data loss is proportional to the replication delay at the time of
failover.
Synchronous replication offers the ability to guarantee that all changes
made by a transaction have been transferred to at least one remote
standby server. This is an extension to the standard level of durability
offered by a transaction commit. This is referred to as semi-synchronous
replication.
When synchronous replication is requested, the commit of a write
transaction will wait until confirmation that the commit record has been
transferred successfully to at least one standby server. Waiting for
confirmation increases the user's confidence that the changes will not
be lost in the event of server crashes but it also necessarily increases
the response time for the requesting transaction. The minimum wait time
is the roundtrip time from primary to standby.
Read only transactions and transaction rollbacks need not wait for
replies from standby servers. Subtransaction commits do not wait for
responses from standby servers, only final top-level commits. Long
running actions such as data loading or index building do not wait until
the very final commit message.
25.2.6.1. Basic Configuration
Synchronous replication must be enabled on both the primary and at least
one standby server. If synchronous replication is disabled on the
master, or enabled on the primary but not enabled on any slaves, the
primary will use asynchronous replication by default.
We use a single parameter to enable synchronous replication, set in
postgresql.conf on both primary and standby servers:
synchronous_replication = off (default) | on
On the primary, synchronous_replication can be set for particular users
or databases, or dynamically by application programs.
This seems like a potential issue, where I start a server with this off, and then I start turning it on for specific transactions; it isn't exactly clear what happens, since there may or may not be a running synchronous rep slave available. (I love the idea though)
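For what it's worth, the per-user and per-database settings described above would presumably use the usual GUC machinery. A minimal sketch, assuming the proposed synchronous_replication parameter behaves like an ordinary settable GUC (the role and database names are hypothetical):

```sql
-- Sketch only: assumes synchronous_replication can be attached to roles
-- and databases like any other configuration parameter.
ALTER ROLE billing_app SET synchronous_replication = on;   -- critical writers wait
ALTER DATABASE chat SET synchronous_replication = off;     -- bulk data stays async

-- Or dynamically, from application code, for the current session:
SET synchronous_replication = on;
```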
If more than one standby server specifies synchronous_replication, then
whichever standby replies first will release waiting commits.
I don't want you to think I am setting an expectation, but I'm curious about the possibility of requiring more than 1 server to reply?
Turning this setting off for a standby allows the administrator to
exclude certain standby servers from releasing waiting transactions.
This is useful if not all standby servers are designated as potential
future primary servers. On the standby, this parameter only takes effect
at server start.
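A minimal sketch of the basic configuration described above, using the parameter name from the proposal:

```
# postgresql.conf on the primary
synchronous_replication = on    # commits wait for the first standby reply

# postgresql.conf on each standby allowed to release waiting commits
synchronous_replication = on    # read only at server start
```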
25.2.6.2. Planning for Performance
Synchronous replication usually requires carefully planned and placed
standby servers to ensure applications perform acceptably. Waiting
doesn't utilise system resources, but transaction locks continue to be
held until the transfer is confirmed. As a result, incautious use of
synchronous replication will reduce performance for database
applications because of increased response times and higher contention.
PostgreSQL allows the application developer to specify the durability
level required via replication. This can be specified for the system
overall, though it can also be specified for specific users or
connections, or even individual transactions.
For example, an application workload might consist of: 10% of changes
are important customer details, while 90% of changes are less important
data that the business can more easily survive if it is lost, such as
chat messages between users.
With synchronous replication options specified at the application level
(on the master) we can offer sync rep for the most important changes,
without slowing down the bulk of the total workload. Application level
options are an important and practical tool for allowing the benefits of
synchronous replication for high performance applications. This feature
is unique to PostgreSQL.
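The 10%/90% workload split described above might look like this in application code (a sketch: the table and column names are hypothetical, and the server-wide default is assumed to be asynchronous):

```sql
-- Important customer change: request sync rep for this transaction only
BEGIN;
SET LOCAL synchronous_replication = on;
UPDATE customers SET email = 'new@example.com' WHERE customer_id = 42;
COMMIT;   -- blocks until at least one standby acknowledges the commit record

-- Bulk, lower-value traffic takes the default asynchronous path
INSERT INTO chat_messages (sender_id, body) VALUES (17, 'hello');
```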
25.2.6.3. Planning for High Availability
The easiest and safest method of gaining High Availability using
synchronous replication is to configure at least two standby servers. To
understand why, we need to examine what can happen when you lose all
standby servers.
Commits made when synchronous_replication is set will wait until at
least one standby responds. The response may never occur if the last, or
only, standby should crash or the network drops. What should we do in
that situation?
Sitting and waiting will typically cause operational problems because it
is an effective outage of the primary server. Allowing the primary
server to continue processing in the absence of a standby puts those
latest data changes at risk. How we handle this situation is controlled
by allow_standalone_primary. The default setting is on, allowing
processing to continue, though there is no recommended setting. Choosing
the best setting for allow_standalone_primary is a difficult decision
and best left to those with combined business responsibility for both
data and applications. The difficulty of this choice is the reason why
we recommend that you reduce the possibility of this situation occurring
by using multiple standby servers.
When the primary is started with allow_standalone_primary enabled, the
primary will not allow connections until a standby connects that also
has synchronous_replication enabled. This is a convenience to ensure
that we don't allow connections before write transactions will return
successfully.
I think you mean "disabled" at the start here, right?
When allow_standalone_primary is set, a user will stop waiting once the
replication_timeout has been reached for their specific session. Users
are not waiting for a specific standby to reply, they are waiting for a
reply from any standby, so the unavailability of any one standby is not
significant to a user. It is possible for user sessions to hit timeout
even though standbys are communicating normally. In that case, the
setting of replication_timeout is probably too low.
Will a notice or warning be thrown in these cases? I'm thinking of something like the checkpoint timeout warning, but it could be something else; it just seems to me you need some way to know you're timing out.
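For reference, the interaction being asked about involves two primary-side settings. A sketch, using the names from this part of the excerpt (note that section 18.5.5 calls the timeout synchronous_replication_timeout instead; whether anything is logged on timeout is exactly the open question here):

```
# postgresql.conf on the primary (sketch)
allow_standalone_primary = on    # waiting commits may give up
replication_timeout = 10000      # ms a session waits before returning anyway
```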
The standby sends regular status messages to the primary. If no status
messages have been received for replication_timeout the primary server
will assume the connection is dead and terminate it. This happens
whatever the setting of allow_standalone_primary.
Does the standby attempt to reconnect in these scenarios?
If the primary crashes while commits are waiting for acknowledgement, those
transactions will be marked fully committed if the primary database
recovers, no matter how allow_standalone_primary is set.
This seems backwards; if you are waiting for acknowledgement, wouldn't the normal assumption be that the transactions *didn't* make it to any standby, and should be rolled back?
There is no way
to be certain that all standbys have received all outstanding WAL data
at time of the crash of the primary. Some transactions may not show as
committed on the standby, even though they show as committed on the
primary. The guarantee we offer is that the application will not receive
explicit acknowledgement of the successful commit of a transaction until
the WAL data is known to be safely received by the standby. Hence this
mechanism is technically "semi synchronous" rather than "fully
synchronous" replication. Note that replication still not be fully
synchronous even if we wait for all standby servers, though this would
reduce availability, as described previously.
I think we ought to have an example of the best configuration for "cannot afford to lose any data" scenarios, where we would prefer an overall service interruption over the chance of having the primary / secondary out of synch.
If you need to re-create a standby server while transactions are
waiting, make sure that the commands to run pg_start_backup() and
pg_stop_backup() are run in a session with synchronous_replication =
off, otherwise those requests will wait forever for the standby to
appear.
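The base-backup caveat above amounts to something like the following session on the primary (a sketch; the backup label is arbitrary):

```sql
-- Run in the session taking the base backup, so its calls commit
-- without waiting for a standby that does not exist yet.
SET synchronous_replication = off;
SELECT pg_start_backup('rebuild_standby');
-- ... copy the data directory to the new standby here ...
SELECT pg_stop_backup();
```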
18.5.5. Synchronous Replication
These settings control the behavior of the built-in synchronous
replication feature. These parameters would be set on the primary server
that is to send replication data to one or more standby servers.
synchronous_replication (boolean)
Specifies whether transaction commit will wait for WAL records
to be replicated before the command returns a "success"
indication to the client. The default setting is off. When on,
there will be a delay while the client waits for confirmation of
successful replication. That delay will increase depending upon
the physical distance and network activity between primary and
standby. The commit wait will last until the first reply from
any standby. Multiple standby servers allow increased
availability and possibly increase performance as well.
The parameter must be set on both primary and standby.
On the primary, this parameter can be changed at any time; the
behavior for any one transaction is determined by the setting in
effect when it commits. It is therefore possible, and useful, to
have some transactions replicate synchronously and others
asynchronously. For example, to make a single multistatement
transaction commit asynchronously when the default is
synchronous replication, issue SET LOCAL synchronous_replication
TO OFF within the transaction.
On the standby, the parameter value is taken only at server
start.
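Spelled out, the SET LOCAL example above looks like this (the INSERT target is hypothetical; SET LOCAL reverts automatically at the end of the transaction):

```sql
BEGIN;
SET LOCAL synchronous_replication TO OFF;   -- affects only this transaction
INSERT INTO activity_log (event) VALUES ('low-value event');
COMMIT;   -- returns without waiting for a standby reply
```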
synchronous_replication_timeout (boolean)
If the client has synchronous_replication set, and
allow_standalone_primary is also set, then the commit will wait
for up to synchronous_replication_timeout milliseconds before it
returns a "success", or will wait forever if
synchronous_replication_timeout is set to -1.
If a standby server does not reply for
synchronous_replication_timeout the primary will terminate the
replication connection.
allow_standalone_primary (boolean)
If allow_standalone_primary is not set, then the server will not
allow connections until a standby connects that has
synchronous_replication enabled.
allow_standalone_primary also affects the behaviour when the
synchronous_replication_timeout is reached.
Somewhat concerned that we seem to need to use double negatives to describe what's going on here. It makes me think we ought to rename this to require_synchronous_standby or similar.
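To make the trade-off concrete, the two policies under discussion might be expressed roughly like this (a sketch using the proposal's parameter names; the second block is also the sort of "cannot afford to lose any data" configuration requested earlier in this review):

```
# Availability first: waiting commits give up after the timeout
allow_standalone_primary = on
synchronous_replication_timeout = 20000   # ms; -1 would mean wait forever

# Durability first: no connections accepted until a synchronous
# standby attaches, and commits never return unacknowledged
allow_standalone_primary = off
```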
25.5.2. Handling query conflicts
….
Remedial possibilities exist if the number of standby-query
cancellations is found to be unacceptable. Typically the best option is
to enable hot_standby_feedback. This prevents VACUUM from removing
recently-dead rows and so cleanup conflicts do not occur. If you do
this, you should note that this will delay cleanup of dead rows on the
primary, which may result in undesirable table bloat. However, the
cleanup situation will be no worse than if the standby queries were
running directly on the primary server. You are still getting the
benefit of off-loading execution onto the standby and the query may
complete faster than it would have done on the primary server.
max_standby_archive_delay must be kept large in this case, because
delayed WAL files might already contain entries that conflict with the
desired standby queries.
…
18.5.6. Standby Servers
These settings control the behavior of a standby server that is to
receive replication data.
hot_standby (boolean)
Specifies whether or not you can connect and run queries during
recovery, as described in Section 25.5. The default value is
off. This parameter can only be set at server start. It only has
effect during archive recovery or in standby mode.
hot_standby_feedback (boolean)
Specifies whether or not a hot standby will send feedback to the
primary about queries currently executing on the standby. This
parameter can be used to eliminate query cancels caused by
cleanup records, though it can cause database bloat on the
primary for some workloads. The default value is off. This
parameter can only be set at server start. It only has effect if
hot_standby is enabled.
I was expecting this section to mention the synchronous_replication (bool) somewhere, to control whether the standby will participate synchronously or asynchronously; granted, it's the same config as listed in 18.5.5, right? Just that the heading of that section specifically targets the primary.
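Pulling the standby-side pieces together, a sketch of a standby's postgresql.conf (synchronous_replication here is the same parameter documented in 18.5.5, read on the standby only at server start):

```
# postgresql.conf on a standby
hot_standby = on               # allow read-only queries during recovery
hot_standby_feedback = on      # fewer query cancels; possible bloat on primary
synchronous_replication = on   # this standby may release waiting commits
```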
HTH, looks pretty good at first glance.
Robert Treat
Most of your doc uses the terms "primary" and "standby", but a few instances of "master" and "slave" have slipped in. I think it's better to stick to consistent terminology.

On Thu, Dec 30, 2010 at 21:04, Simon Riggs <simon@2ndquadrant.com> wrote:
> With synchronous replication options specified at the application level
> (on the master) we can offer sync rep for the most important changes,
> without slowing down the bulk of the total workload. Application level
> options are an important and practical tool for allowing the benefits of
> synchronous replication for high performance applications. This feature
> is unique to PostgreSQL.

I think a comment about the "head-of-line blocking" nature of streaming replication is in order. If you execute massive writes in async mode and then run a transaction in sync mode, its commit will be delayed until all the async transactions before it have been applied on the slave.

> synchronous_replication_timeout (boolean)

Doesn't look like a boolean to me :)

Regards,
Marti
On Thu, 2010-12-30 at 15:07 -0500, Robert Treat wrote:
> > If more than one standby server specifies synchronous_replication, then
> > whichever standby replies first will release waiting commits.
>
> I don't want you to think I am setting an expectation, but I'm curious
> about the possibility of requiring more than 1 server to reply?

I was initially interested in this myself, but after a long discussion on "quorum commit" it was decided to go with "first past post". That is easier to manage, requires one less parameter, performs better and doesn't really add that much additional confidence.

It was also discussed that we would have a plugin API, but I'm less sure about that now. Perhaps we can add that option in the future, but it's not high on my list of things for this release.

--
Simon Riggs                   http://www.2ndQuadrant.com/books/
PostgreSQL Development, 24x7 Support, Training and Services
On Thu, 2010-12-30 at 15:07 -0500, Robert Treat wrote:
> > We use a single parameter to enable synchronous replication, set in
> > postgresql.conf on both primary and standby servers:
> >
> > synchronous_replication = off (default) | on
> >
> > On the primary, synchronous_replication can be set for particular users
> > or databases, or dynamically by applications programs.
>
> This seems like a potential issue, where I start a server with this
> off, and then I start turning it on for specific transactions; it
> isn't exactly clear what happens, since there may or may not be a
> running synchronous rep slave available. (I love the idea though)

Not really an issue. Even if there was a standby there a moment ago, the standby can go away at any time. So we must cope gracefully with what happens if you do this. By default, the parameters specify that in the case you mention we will just use async replication (no wait!). Options exist to change that, since some people want to wait until the sysadmin adds a standby.

--
Simon Riggs                   http://www.2ndQuadrant.com/books/
PostgreSQL Development, 24x7 Support, Training and Services
On 12/30/2010 08:04 PM, Simon Riggs wrote: > On Thu, 2010-12-30 at 18:42 +0100, Stefan Kaltenbrunner wrote: > >> it would help if this would just be a simple text-only description of >> the design that people can actually comment on inline. I don't think >> sending technical design proposals as a pdf (which seems to be written >> in doc-style as well) is a good idea to encourage discussion on -hackers :( > > 25.2.6. Synchronous Replication > Streaming replication is by default asynchronous. Transactions on the > primary server write commit records to WAL, yet do not know whether or > when a standby has received and processed those changes. So with > asynchronous replication, if the primary crashes, transactions committed > on the primary might not have been received by any standby. As a result, > failover from primary to standby could cause data loss because > transaction completions are absent, relative to the primary. The amount > of data loss is proportional to the replication delay at the time of > failover. > > Synchronous replication offers the ability to guarantee that all changes > made by a transaction have been transferred to at least one remote > standby server. This is an extension to the standard level of durability > offered by a transaction commit. This is referred to as semi-synchronous > replication. > > When synchronous replication is requested, the commit of a write > transaction will wait until confirmation that the commit record has been > transferred successfully to at least one standby server. Waiting for > confirmation increases the user's confidence that the changes will not > be lost in the event of server crashes but it also necessarily increases > the response time for the requesting transaction. The minimum wait time > is the roundtrip time from primary to standby. hmm this is one of the main problems I see with the proposed "master is sometimes aware of the standby"(as in the feedback mode) concept this proposal has. 
If it waits for only one of the standbys there is some issue with the terminology. As a DBA I would expect the master to only return if ALL of the "sync replication" declared nodes replied ok. > > Read only transactions and transaction rollbacks need not wait for > replies from standby servers. Subtransaction commits do not wait for > responses from standby servers, only final top-level commits. Long > running actions such as data loading or index building do not wait until > the very final commit message. > > > 25.2.6.1. Basic Configuration > Synchronous replication must be enabled on both the primary and at least > one standby server. If synchronous replication is disabled on the > master, or enabled on the primary but not enabled on any slaves, the > primary will use asynchronous replication by default. > > We use a single parameter to enable synchronous replication, set in > postgresql.conf on both primary and standby servers: this reads as if you can only set it there > > synchronous_replication = off (default) | on > > On the primary, synchronous_replication can be set for particular users > or databases, or dynamically by application programs. this says otherwise > > If more than one standby server specifies synchronous_replication, then > whichever standby replies first will release waiting commits. see above for why I think this violates the configuration promise - if I say "this is a sync standby" I better expect it to be... > > Turning this setting off for a standby allows the administrator to > exclude certain standby servers from releasing waiting transactions. > This is useful if not all standby servers are designated as potential > future primary servers. On the standby, this parameter only takes effect > at server start. > > > 25.2.6.2. Planning for Performance > Synchronous replication usually requires carefully planned and placed > standby servers to ensure applications perform acceptably.
Waiting > doesn't utilise system resources, but transaction locks continue to be > held until the transfer is confirmed. As a result, incautious use of > synchronous replication will reduce performance for database > applications because of increased response times and higher contention. > > PostgreSQL allows the application developer to specify the durability > level required via replication. This can be specified for the system > overall, though it can also be specified for specific users or > connections, or even individual transactions. > > For example, an application workload might consist of: 10% of changes > are important customer details, while 90% of changes are less important > data that the business can more easily survive if it is lost, such as > chat messages between users. > > With synchronous replication options specified at the application level > (on the master) we can offer sync rep for the most important changes, > without slowing down the bulk of the total workload. Application level > options are an important and practical tool for allowing the benefits of > synchronous replication for high performance applications. This feature > is unique to PostgreSQL. that seems to be a bit too much marketing for a reference level document > > > 25.2.6.3. Planning for High Availability > The easiest and safest method of gaining High Availability using > synchronous replication is to configure at least two standby servers. To > understand why, we need to examine what can happen when you lose all > standby servers. > > Commits made when synchronous_replication is set will wait until at > least one standby responds. The response may never occur if the last, or > only, standby should crash or the network drops. What should we do in > that situation? > > Sitting and waiting will typically cause operational problems because it > is an effective outage of the primary server. 
Allowing the primary > server to continue processing in the absence of a standby puts those > latest data changes at risk. How we handle this situation is controlled > by allow_standalone_primary. The default setting is on, allowing > processing to continue, though there is no recommended setting. Choosing > the best setting for allow_standalone_primary is a difficult decision > and best left to those with combined business responsibility for both > data and applications. The difficulty of this choice is the reason why > we recommend that you reduce the possibility of this situation occurring > by using multiple standby servers. if there is no recommended setting what will be the default? [...] > > 25.5.2. Handling query conflicts > …. > > > Remedial possibilities exist if the number of standby-query > cancellations is found to be unacceptable. Typically the best option is > to enable hot_standby_feedback. This prevents VACUUM from removing > recently-dead rows and so cleanup conflicts do not occur. If you do > this, you should note that this will delay cleanup of dead rows on the > primary, which may result in undesirable table bloat. However, the > cleanup situation will be no worse than if the standby queries were > running directly on the primary server. You are still getting the > benefit of off-loading execution onto the standby and the query may > complete faster than it would have done on the primary server. > max_standby_archive_delay must be kept large in this case, because > delayed WAL files might already contain entries that conflict with the > desired standby queries. > > > … > > > 18.5.6. Standby Servers > These settings control the behavior of a standby server that is to > receive replication data. > > > hot_standby (boolean) > Specifies whether or not you can connect and run queries during > recovery, as described in Section 25.5. The default value is > off. This parameter can only be set at server start. 
It only has > effect during archive recovery or in standby mode. > hot_standby_feedback (boolean) > Specifies whether or not a hot standby will send feedback to the > primary about queries currently executing on the standby. This > parameter can be used to eliminate query cancels caused by > cleanup records, though it can cause database bloat on the > primary for some workloads. The default value is off. This > parameter can only be set at server start. It only has effect if > hot_standby is enabled. so if this is enabled - suddenly the master becomes (kinda) aware of the specifics of a given standby - but what happens when one of the standbys is offline for a while - how does the master know that? What I'm really missing with that proposal is how people expect that solution to be managed - given there is only sometimes a feedback channel into the master you can't do the monitoring. Even if you could (which we really need!) there is nothing in the proposal yet that will help to determine which the most recent standby (in the case of more than one sync standby) might be. It also does not address the more general (not sync rep specific) problem of how to deal with max_keep_segments which is a wart and I was hoping we could get rid of in 9.1 - but it would require a real standby registration or at least standby management possibility on the master, not a halfway done one - so do we really need hot_standby_feedback as part of the initial sync-rep patch? Stefan
On Thu, Dec 30, 2010 at 3:07 PM, Robert Treat <rob@xzilla.net> wrote: >> If primary crashes while commits are waiting for acknowledgement, those >> transactions will be marked fully committed if the primary database >> recovers, no matter how allow_standalone_primary is set. > > This seems backwards; if you are waiting for acknowledgement, wouldn't the > normal assumption be that the transactions *didnt* make it to any standby, > and should be rolled back ? This is the standard 2-phase commit problem. The primary server *has* committed it, its fsync has returned, and the only thing keeping it from returning the commit to the client is that it's waiting on a synchronous "ack" from a slave. You've got 2 options: 1) initiate fsync on the slave first - In this case, the slave is farther ahead than the primary, and if primary fails, you're *forced* to have a failover. The standby is ahead of the primary, so the primary recovering can cause divergence. And you'll likely have to do a base-backup style sync to get a new primary/standby setup. 2) initiate fsync on the primary first - In this case, the slave is always slightly behind. If your primary falls over, you don't give commit messages to the clients, but if it recovers, it might have committed data, and slaves will still be able to catch up. The thing is that currently, even without replication, #2 can happen. If your db falls over before it gets the commit packet stuffed out the network, you're in the same boat. The data might be committed, even though you didn't get the commit packet, and when your DB recovers, it's got the committed data that you never "knew" was committed. a. -- Aidan Van Dyk Create like a god, aidan@highrise.ca command like a king, http://www.highrise.ca/ work like a slave.
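[Editor's note: Aidan's option 2 can be sketched as a toy simulation. All names here are illustrative, not PostgreSQL internals: the primary fsyncs its own commit record first, so a crash before the standby's ack leaves a durable-but-unacknowledged commit, which is exactly the window that already exists today when a commit packet is lost on the wire.]

```python
# Toy model of "initiate fsync on the primary first" (Aidan's option 2).
# WAL streams are modeled as plain lists; a crash is modeled as
# returning before the standby ever hears about the commit.

def commit(primary_wal, standby_wal, standby_acks, crash_before_ack=False):
    primary_wal.append("commit")       # primary fsyncs its commit record first
    if crash_before_ack:
        return None                    # crash: client never receives an ack
    standby_wal.append("commit")       # standby receives and fsyncs the WAL
    standby_acks.append("ack")         # standby replies
    return "committed"                 # only now does the client hear back

# Normal case: both sides hold the commit before the client is told.
primary, standby, acks = [], [], []
assert commit(primary, standby, acks) == "committed"

# Crash window: the primary already holds the commit record, so on
# recovery the transaction is committed even though no ack was sent
# and the standby is slightly behind -- it can still catch up.
p2, s2, a2 = [], [], []
assert commit(p2, s2, a2, crash_before_ack=True) is None
assert p2 == ["commit"] and s2 == []
```

Option 1 would be the same sketch with the two `append` calls swapped, which is why the standby can end up ahead of a recovering primary and force a failover.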
On Thu, Dec 30, 2010 at 3:36 PM, Simon Riggs <simon@2ndquadrant.com> wrote: > > On Thu, 2010-12-30 at 15:07 -0500, Robert Treat wrote: > > > If more than one standby server specifies synchronous_replication, > > then > > > whichever standby replies first will release waiting commits. > > > I don't want you to think I am setting an expectation, but I'm curious > > about the possibility of requiring more than 1 server to reply? > > I was initially interested in this myself, but after a long discussion > on "quorum commit" it was decided to go with "first past post". > > That is easier to manage, requires one less parameter, performs better > and doesn't really add that much additional confidence. > Yes, I think with a single master, you are probably right (been dealing with more than my fair share of multi-master based nosql solutions lately) Still, one thing that has me concerned is that in the case of two slaves, you don't know which one is the more up-to-date one if you need to failover. It'd be nice if you could just guarantee they both are, but in lieu of that, I guess whatever decision tree is being used, it needs to look at current xlog location of any potential failover targets. > It was also discussed that we would have a plugin API, but I'm less sure > about that now. Perhaps we can add that option in the future, but its > not high on my list of things for this release. > Agreed. Robert Treat http://www.xzilla.net
On Thu, 2010-12-30 at 15:07 -0500, Robert Treat wrote: > > When allow_standalone_primary is set, a user will stop waiting once > the > > replication_timeout has been reached for their specific session. > Users > > are not waiting for a specific standby to reply, they are waiting > for a > > reply from any standby, so the unavailability of any one standby is > not > > significant to a user. It is possible for user sessions to hit > timeout > > even though standbys are communicating normally. In that case, the > > setting of replication_timeout is probably too low. > > > > > will a notice or warning be thrown in these cases? I'm thinking > something > like the checkpoint timeout warning, but could be something else; it > just > seems to me you need some way to know you're timing out. We can do that, yes. > > The standby sends regular status messages to the primary. If no > status > > messages have been received for replication_timeout the primary > server > > will assume the connection is dead and terminate it. This happens > > whatever the setting of allow_standalone_primary. > > > > > Does the standby attempt to reconnect in these scenarios? Yes it would, but the reason why we terminated the connection was it wasn't talking any more, so it is probably dead. > > If primary crashes while commits are waiting for acknowledgement, > those > > transactions will be marked fully committed if the primary database > > recovers, no matter how allow_standalone_primary is set. > > > This seems backwards; if you are waiting for acknowledgement, wouldn't > the > normal assumption be that the transactions *didnt* make it to any > standby, > and should be rolled back ? Well, we can't roll it back. We have already written the commit record to WAL. > > There is no way > > to be certain that all standbys have received all outstanding WAL > data > > at time of the crash of the primary. 
Some transactions may not show > as > > committed on the standby, even though they show as committed on the > > primary. The guarantee we offer is that the application will not > receive > > explicit acknowledgement of the successful commit of a transaction > until > > the WAL data is known to be safely received by the standby. Hence > this > > mechanism is technically "semi synchronous" rather than "fully > > synchronous" replication. Note that replication will still not be fully > > synchronous even if we wait for all standby servers, though this > would > > reduce availability, as described previously. > > > > > I think we ought to have an example of the best configuration for > "cannot > afford to lose any data" scenarios, where we would prefer an overall > service > interruption over the chance of having the primary / secondary out of > synch. I say "use two or more standbys" more than once... > >> > > > somewhat concerned that we seem to need to use double negatives to > describe > what's going on here. it makes me think we ought to rename this to > require_synchronous_standby or similar. Don't see why we can't use double negatives. ;-) The parameter is named directly from Fujii Masao's suggestion. > > 18.5.6. Standby Servers > > These settings control the behavior of a standby server that is to > > receive replication data. > > ... > i was expecting this section to mention the synchronous_replication > (bool) > somewhere, to control if the standby will participate synchronously or > asynch; granted it's the same config as listed in 18.5.5 right? Just > that > the heading of that section specifically targets the primary. OK, good idea. > HTH, looks pretty good at first glance. Thanks. -- Simon Riggs http://www.2ndQuadrant.com/books/ PostgreSQL Development, 24x7 Support, Training and Services
On Thu, 2010-12-30 at 15:51 -0500, Robert Treat wrote: > Still, one thing that has me concerned is that in the case of two > slaves, you don't know which one is the more up-to-date one if you > need to failover. It'd be nice if you could just guarantee they both > are... Regrettably, nobody can know that, without checking. -- Simon Riggs http://www.2ndQuadrant.com/books/ PostgreSQL Development, 24x7 Support, Training and Services
On 12/30/2010 10:01 PM, Simon Riggs wrote: > On Thu, 2010-12-30 at 15:51 -0500, Robert Treat wrote: > >> Still, one thing that has me concerned is that in the case of two >> slaves, you don't know which one is the more up-to-date one if you >> need to failover. It'd be nice if you could just guarantee they both >> are... > > Regrettably, nobody can know that, without checking. how exactly would you check? - this seems like something that needs to be done from the SQL and the CLI level and also very well documented (which I cannot see in your proposal). Stefan
On Thu, 2010-12-30 at 22:11 +0200, Marti Raudsepp wrote: > I think a comment about the "head-of-line blocking" nature of > streaming replication is in order. If you execute massive writes in > async mode and then run a transaction in sync mode, its commit will be > delayed until all the async transactions before it have been applied > on the slave. Not really sure I understand what you want me to add there. The case you mention is identical whether we use the word "async" or "sync" where you mention "in async mode". Replication doesn't wait until a sync commit is requested, it is continuously active. Sync rep's only addition is the reply messages. -- Simon Riggs http://www.2ndQuadrant.com/books/ PostgreSQL Development, 24x7 Support, Training and Services
On Thu, 2010-12-30 at 21:42 +0100, Stefan Kaltenbrunner wrote: > > > > Synchronous replication offers the ability to guarantee that all changes > > made by a transaction have been transferred to at least one remote > > standby server. This is an extension to the standard level of durability > > offered by a transaction commit. This is referred to as semi-synchronous > > replication. > > > > When synchronous replication is requested, the commit of a write > > transaction will wait until confirmation that the commit record has been > > transferred successfully to at least one standby server. Waiting for > > confirmation increases the user's confidence that the changes will not > > be lost in the event of server crashes but it also necessarily increases > > the response time for the requesting transaction. The minimum wait time > > is the roundtrip time from primary to standby. > > hmm this is one of the main problems I see with the proposed "master is > sometimes aware of the standby"(as in the feedback mode) concept this > proposal has. If it waits for only one of the standbys there is some > issue with the terminology. As a DBA I would expect the master to only > return if ALL of the "sync replication" declared nodes replied ok. Well, as a DBA, I expect it to work with just one. That's how MySQL and Oracle work at least. If ALL standbys reply, it takes longer, makes the code harder, how do you determine what "all" is robustly etc. Plus it's been discussed already. > What I'm really missing with that proposal is how people expect that > solution to be managed - What aspect do you wish to monitor? I'm happy to consider your suggestions. > given there is only sometimes a feedback > channel into the master you can't do the monitoring. Not sure what you mean. Please explain more. > Even if you could (which we really need!) there is nothing in the > proposal yet that will help to determine on what the most recent standby > (in the case of more than one sync standby) might be.
Functions to determine that already exist. > - but it would require a real standby > registration or at least standby management possibility on the master > not a halfway done one - so do we really need hot_standby_feedback as > part of the initial sync-rep patch? It is a Hot Standby feature, but so tightly integrated with this code that it isn't possible for me to submit as two separate patches. -- Simon Riggs http://www.2ndQuadrant.com/books/ PostgreSQL Development, 24x7 Support, Training and Services
On Thu, Dec 30, 2010 at 2:04 PM, Simon Riggs <simon@2ndquadrant.com> wrote: > We use a single parameter to enable synchronous replication, set in > postgresql.conf on both primary and standby servers: > > synchronous_replication = off (default) | on > > On the primary, synchronous_replication can be set for particular users > or databases, or dynamically by application programs. > > If more than one standby server specifies synchronous_replication, then > whichever standby replies first will release waiting commits. > > Turning this setting off for a standby allows the administrator to > exclude certain standby servers from releasing waiting transactions. > This is useful if not all standby servers are designated as potential > future primary servers. On the standby, this parameter only takes effect > at server start. I think it's a bad idea to use the same parameter to mean different things on the master and standby. You proposed this kind of double meaning for the hot_standby parameter (possibly back when it was called standby_connections, or something like that) and we (rightly, I think) did not adopt that, instead ending up with wal_level to control the master's behavior and hot_standby to control the slave's behavior. > synchronous_replication (boolean) > Specifies whether transaction commit will wait for WAL records > to be replicated before the command returns a "success" > indication to the client. The word "replicated" here could be taken to mean different things, most obviously: - slave has received the WAL - slave has fsync'd the WAL - slave has applied the WAL -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
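[Editor's note: Robert's three readings of "replicated" can be modeled as ordered durability levels. This is purely an illustrative sketch, not actual parameters from the patch; the point is that the ambiguity matters because each reading is a strictly stronger guarantee than the previous one.]

```python
from enum import IntEnum

# Three possible meanings of "replicated", weakest to strongest.
# Names are illustrative only.
class SyncLevel(IntEnum):
    RECEIVED = 1   # standby has received the WAL into memory
    FSYNCED = 2    # standby has flushed the WAL to disk
    APPLIED = 3    # standby has replayed the WAL

def satisfies(standby_level, required):
    # WAL is flushed before it is replayed, so a standby that has
    # applied the WAL has necessarily received and fsynced it:
    # stronger levels imply all weaker ones.
    return standby_level >= required

assert satisfies(SyncLevel.APPLIED, SyncLevel.RECEIVED)
assert not satisfies(SyncLevel.RECEIVED, SyncLevel.FSYNCED)
```

Until the docs pin down which of these levels the commit wait actually covers, "replicated" could honestly describe any of the three.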
On Thu, 2010-12-30 at 22:08 +0100, Stefan Kaltenbrunner wrote: > On 12/30/2010 10:01 PM, Simon Riggs wrote: > > On Thu, 2010-12-30 at 15:51 -0500, Robert Treat wrote: > > > >> Still, one thing that has me concerned is that in the case of two > >> slaves, you don't know which one is the more up-to-date one if you > >> need to failover. It'd be nice if you could just guarantee they both > >> are... > > > > Regrettably, nobody can know that, without checking. > > how exactly would you check? - this seems like something that needs to > be done from the SQL and the CLI level and also very well documented > (which I cannot see in your proposal). This is a proposal for sync rep, not multi-node failover. I'm definitely not going to widen the scope of this project. Functions already exist to check the thing you're asking. -- Simon Riggs http://www.2ndQuadrant.com/books/PostgreSQL Development, 24x7 Support, Training and Services
On Thu, Dec 30, 2010 at 3:42 PM, Stefan Kaltenbrunner <stefan@kaltenbrunner.cc> wrote: >> synchronous replication for high performance applications. This feature >> is unique to PostgreSQL. > > that seems to be a bit too much marketing for a reference level document +1. > It also does not address the more general (not sync rep specific) problem of > how to deal with max_keep_segments which is a wart and I was hoping we could > get rid of in 9.1 - but it would require a real standby registration or at > least standby management possibility on the master not a halfway done one - > so do we really need hot_standby_feedback as part of the initial sync-rep > patch? And this is really the key point on which previous discussions of sync rep stalled. Simon is clearly of the opinion that any system where the slaves have individual identities (aka "standby registration") is a bad idea, but the only justification he's offered for that position is the assertion that it doesn't allow any added functionality. As you point out, and as has been pointed out before, this is not true, but unless Simon has changed his position since the last time we discussed this, he will not only refuse to include any kind of standby identifier in any of his proposals, but will also argue against including any such code even if it is written by someone else. I don't understand why, but that's how it is. Synchronous replication would probably be done and committed by now if it weren't for this issue. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Dec 30, 2010, at 3:27 PM, Robert Haas wrote: >> synchronous_replication (boolean) >> Specifies whether transaction commit will wait for WAL records >> to be replicated before the command returns a "success" >> indication to the client. > > The word "replicated" here could be taken to mean different things, > most obviously: > > - slave has received the WAL > - slave has fsync'd the WAL > - slave has applied the WAL I think that comment is valid for the entire set of docs, actually. The document goes out of its way to avoid simple phrases like "replicated", but doesn't spell out exactly what is happening, ie: "Synchronous replication offers the ability to guarantee that all changes made by a transaction have been transferred to at least one remote standby server. This is an extension to the standard level of durability offered by a transaction commit. This is referred to as semi-synchronous replication." Reading that, I'm left with the sense that this isn't a simple matter of "Oh, the data has been replicated to the slave before commit returns", but nothing does a good job of clearly explaining what the distinction is and what it means. This section: "The guarantee we offer is that the application will not receive explicit acknowledgement of the successful commit of a transaction until the WAL data is known to be safely received by the standby. Hence this mechanism is technically "semi synchronous" rather than "fully synchronous" replication." does provide some enlightenment, but it's at the end of the section. I think it would be best if there was a section right at the beginning that talked about the data quality issue of sync replication and how we're avoiding it with our semi-sync solution. -- Jim C. Nasby, Database Architect jim@nasby.net 512.569.9461 (cell) http://jim.nasby.net
On Thu, 2010-12-30 at 16:27 -0500, Robert Haas wrote: > I think it's a bad idea to use the same parameter to mean different > things on the master and standby. Obviously if you phrase it like that, nobody would disagree. I would say I have used the same parameter on both sides in a balanced way to simplify the configuration, which had been an important factor in the debate. "You need to set parameter X on both primary and standby" seems simple and clear. It certainly works OK for MySQL. It's no bother to change, whichever way we decide and I'm happy to do so. My previous patch had two parameters: primary: synchronous_replication = ... standby: synchronous_replication_service = on | off Which do people prefer? -- Simon Riggs http://www.2ndQuadrant.com/books/PostgreSQL Development, 24x7 Support, Training and Services
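[Editor's note: for comparison, the two-parameter variant Simon mentions from his previous patch would split the roles explicitly, along these lines. This is a sketch reconstructed from the parameter names he gives; the comments are illustrative.]

```ini
# primary: whether (and for which sessions) commits wait for a standby
synchronous_replication = on

# standby: whether this server offers the sync service at all,
# i.e. whether its replies may release waiting commits
synchronous_replication_service = on
```

This mirrors the wal_level / hot_standby split Robert points to: one parameter per role, at the cost of one extra name to document.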
On Thu, 2010-12-30 at 16:47 -0500, Robert Haas wrote: > On Thu, Dec 30, 2010 at 3:42 PM, Stefan Kaltenbrunner > <stefan@kaltenbrunner.cc> wrote: > >> synchronous replication for high performance applications. This feature > >> is unique to PostgreSQL. > > > > that seems to be a bit too much marketing for a reference level document > > +1. I've removed the "This feature is unique to PostgreSQL", which I agree belongs in a press release, not docs. The explanation of a use case that would benefit from the feature seems valid and I've left that in. PostgreSQL docs are more technical and precise than any other DBMS, even DB2. Having read everybody else's docs, I'm inclined to say it would be easier to explain if I left out the details, as they do. You won't find a detailed explanation of commit guarantees in MySQL docs, for example. -- Simon Riggs http://www.2ndQuadrant.com/books/PostgreSQL Development, 24x7 Support, Training and Services
On Thu, 2010-12-30 at 18:47 -0600, Jim Nasby wrote: > On Dec 30, 2010, at 3:27 PM, Robert Haas wrote: > >> synchronous_replication (boolean) > >> Specifies whether transaction commit will wait for WAL records > >> to be replicated before the command returns a "success" > >> indication to the client. > > > > The word "replicated" here could be taken to mean different things, > > most obviously: > > > > - slave has received the WAL > > - slave has fsync'd the WAL > > - slave has applied the WAL > > I think that comment is valid for the entire set of docs, actually. The document goes out of its way to avoid simple phrases like "replicated", but doesn't spell out exactly what is happening, ie: > > "Synchronous replication offers the ability to guarantee that all changes > made by a transaction have been transferred to at least one remote > standby server. This is an extension to the standard level of durability > offered by a transaction commit. This is referred to as semi-synchronous > replication." > > Reading that, I'm left with the sense that this isn't a simple matter of "Oh, the data has been replicated to the slave before commit returns", but nothing does a good job of clearly explaining what the distinction is and what it means. This section: > > "The guarantee we offer is that the application will not receive > explicit acknowledgement of the successful commit of a transaction until > the WAL data is known to be safely received by the standby. Hence this > mechanism is technically "semi synchronous" rather than "fully > synchronous" replication." > > does provide some enlightenment, but it's at the end of the section. I think it would be best if there was a section right at the beginning that talked about the data quality issue of sync replication and how we're avoiding it with our semi-sync solution. I'm happy to change the docs. It's the first draft... If that's the only problem you've got, then I'm feeling good. Any problems with the user interface itself?
-- Simon Riggs http://www.2ndQuadrant.com/books/ PostgreSQL Development, 24x7 Support, Training and Services
On Thu, 2010-12-30 at 16:47 -0500, Robert Haas wrote: > > It also does not address the more general (not sync rep specific) problem of > > how to deal with max_keep_segments which is a wart and I was hoping we could > > get rid of in 9.1 - but it would require a real standby registration or at > > least standby management possibility on the master not a halfway done one - > > so do we really need hot_standby_feedback as part of the inital sync-rep > > patch? > > And this is really the key point on which previous discussions of sync > rep stalled. Simon is clearly of the opinion that any system where > the slaves have an individual identities (aka "standby registration") > is a bad idea, but the only justification he's offered for that > position is the assertion that it doesn't allow any added > functionality. As you point out, and as has been pointed out before, > this is not true, but unless Simon has changed his position since the > last time we discussed this, he will not only refuse to include any > kind of standby identifier in any of his proposals, but will also > argue against including any such code even if it is written by someone > else. I don't understand why, but that's how it is. > > Synchronous replication would probably be done and committed by now if > it weren't for this issue. I'm not very clear what your response has to do with Stefan's comments. My general perspective is that MySQL released a simple design a year ahead of us, which should be to our collective shame. I will be working towards delivering something useful in this release. Standby registration is complicated and not necessary. If anybody needs to justify anything, it is the people that claim it is somehow essential. If you want increased complexity and features, you can have it, one day, but don't prevent everybody else from benefiting from simplicity, now. 
What we do need is performance, otherwise the feature is mostly unusable for production systems, without splitting your application into pieces. I would rather concentrate on a minimal set of functionality that we can all agree on. To show that, I have gone out of my way to include features specified by others, including exact names and behaviours of parameters. -- Simon Riggs http://www.2ndQuadrant.com/books/PostgreSQL Development, 24x7 Support, Training and Services
On Thu, Dec 30, 2010 at 03:24:09PM -0500, Aidan Van Dyk wrote: > On Thu, Dec 30, 2010 at 3:07 PM, Robert Treat <rob@xzilla.net> wrote: > > >> If primary crashes while commits are waiting for acknowledgement, those > >> transactions will be marked fully committed if the primary database > >> recovers, no matter how allow_standalone_primary is set. > > > > This seems backwards; if you are waiting for acknowledgement, wouldn't the > > normal assumption be that the transactions *didnt* make it to any standby, > > and should be rolled back ? > > This is the standard 2-phase commit problem. The primary server *has* > committed it, it's fsync has returned, and the only thing keeping it > from returning the commit to the client is that it's waiting on a > synchronous "ack" from a slave. <snip> > 2) initiate fsync on the primary first > - In this case, the slave is always slightly behind. If your > primary falls over, you don't give commit messages to the clients, but > if it recovers, it might have committed data, and slaves will still be > able to catch up. > > The thing is that currently, even without replication, #2 can happen. For what little it's worth, I vote for this option, because it's a problem that can already happen (as opposed to adding an entirely new type of problem to the mix). -- Joshua Tolley / eggyknap End Point Corporation http://www.endpoint.com
On Thu, Dec 30, 2010 at 8:57 PM, Simon Riggs <simon@2ndquadrant.com> wrote: > I'm not very clear what your response has to do with Stefan's comments. > > My general perspective is that MySQL released a simple design a year > ahead of us, which should be to our collective shame. I will be working > towards delivering something useful in this release. I don't feel ashamed of our feature set and I am not out to beat MySQL or anyone else, just to deliver the best product that we can. Our community has different interests than the MySQL community and that is fine. Still, I don't disagree that we should be aiming at feature parity. <reads MySQL documentation> I see now that you've tried to design this feature in a way that is similar to MySQL's offering, which does have some value. But it appears to me that the documentation you've written here is substantially similar to the MySQL 5.5 reference documentation. That could get us into a world of legal trouble - that documentation is not even open source, let alone BSD. http://dev.mysql.com/doc/refman/5.5/en/replication-semisync.html > I would rather concentrate on a minimal set of functionality that we can > all agree on. Me too; and perhaps your proposal is it. But I think it's a shame we didn't put more work into standby registration when we had time to get that done. It might not be necessary, but it would have delivered some nice functionality that we are now not going to have for 9.1. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On 30.12.2010 22:27, Robert Haas wrote: > On Thu, Dec 30, 2010 at 2:04 PM, Simon Riggs<simon@2ndquadrant.com> wrote: >> synchronous_replication (boolean) >> Specifies whether transaction commit will wait for WAL records >> to be replicated before the command returns a "success" >> indication to the client. > The word "replicated" here could be taken to mean different things, > most obviously: > > - slave has received the WAL > - slave has fsync'd the WAL > - slave has applied the WAL Perhaps the level of "replication guarantee" should be decided on the slave side, by having a configuration parameter there report_as_replicated = received|written_to_disk|fsynced|applied since different types of hosts may have wildly different guarantees and performance parameters for these. One could envision a WAL-archive type "standby" which is there for data persistence only and will never "apply" WAL. Of course we could put a bitmap in the status update messages from the slave and have some quorum options on the master for when the data is "in sync", say "need 5 received or (1 applied and 1 fsynced)", but I am pretty sure that trying to get anywhere with this before applying the basic sync rep patch would push back sync rep to at least 9.2 if not 9.5 --------------------- Hannu Krosing
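Hannu's quorum idea above can be sketched in a few lines. This is purely illustrative: the `report_as_replicated` levels and the example rule are taken from his message, but the rule representation and function names are invented, and nothing like this exists in the proposed patch.

```python
# Ack levels reported by standbys, ordered from weakest to strongest.
LEVELS = ["received", "written_to_disk", "fsynced", "applied"]

def count_at_least(acks, level):
    """Count standbys whose reported level is at least `level`.

    A standby that has applied the WAL has implicitly also received
    and fsynced it, so a stronger level satisfies a weaker requirement.
    """
    threshold = LEVELS.index(level)
    return sum(1 for lvl in acks.values() if LEVELS.index(lvl) >= threshold)

def quorum_met(acks, rule):
    """Evaluate a rule given as a list of alternatives (OR-ed together),
    each alternative being a list of (count, level) requirements (AND-ed)."""
    return any(all(count_at_least(acks, level) >= n for n, level in alt)
               for alt in rule)

# "need 5 received or (1 applied and 1 fsynced)"
rule = [[(5, "received")],
        [(1, "applied"), (1, "fsynced")]]

acks = {"s1": "applied", "s2": "fsynced", "s3": "received"}
print(quorum_met(acks, rule))  # True: s1 applied, s2 fsynced
```

Even this toy version shows why the master would need per-standby status messages to evaluate such rules, which is part of the complexity Hannu wants to defer.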
On 31.12.2010 6:02, Robert Haas wrote: > On Thu, Dec 30, 2010 at 8:57 PM, Simon Riggs<simon@2ndquadrant.com> wrote: >> I'm not very clear what your response has to do with Stefan's comments. >> >> My general perspective is that MySQL released a simple design a year >> ahead of us, which should be to our collective shame. I will be working >> towards delivering something useful in this release. > I don't feel ashamed of our feature set and I am not out to beat MySQL > or anyone else, just to deliver the best product that we can. The key word here is "deliver" . The aim is to "deliver" sync rep, not "specify, leaving out controversial details". The registration part has been left out for a reason - while the registration itself is easy, deciding all the interactions with already running replication is not. Doing just the minimal support for sync rep (need acknowledge from at least one standby) and leaving the management of standbys to user enables us to get to actual working code instead of a pie-in-the-sky wishlist. > Our > community has different interests than the MySQL community and that is > fine. Still, I don't disagree that we should be aiming at feature > parity. > > <reads MySQL documentation> > > I see now that you've tried to design this feature in a way that is > similar to MySQL's offering, which does have some value. But it > appears to me that the documentation you've written here is > substantially similar to the MySQL 5.5 reference documentation. That > could get us into a world of legal trouble - that documentation is not > even open source, let alone BSD. > > http://dev.mysql.com/doc/refman/5.5/en/replication-semisync.html Maybe we should get someone who has not read mysql docs to re-write a spec in a "clean room" fashion, by just inspecting code and asking Simon et.al. >> I would rather concentrate on a minimal set of functionality that we can >> all agree on. > Me too; and perhaps your proposal is it. 
But I think it's a shame we > didn't put more work into standby registration when we had time to get > that done. When you need _just_ the registration, then make a table and two functions pg_standby_register(name) and pg_standby_unregister(name) For a little more added functionality add a third one pg_standby_last_seen(name) to update a last-seen timestamp, and a script that polls all standbys and calls this. > It might not be necessary, but it would have delivered > some nice functionality that we are now not going to have for 9.1. There are tons of "nice functionality we are not going to have for 9.1", let's just not make this cause even more nice functionality to be left out! --------------------- Hannu Krosing
On 12/30/2010 10:27 PM, Simon Riggs wrote: > On Thu, 2010-12-30 at 22:08 +0100, Stefan Kaltenbrunner wrote: >> On 12/30/2010 10:01 PM, Simon Riggs wrote: >>> On Thu, 2010-12-30 at 15:51 -0500, Robert Treat wrote: >>> >>>> Still, one thing that has me concerned is that in the case of two >>>> slaves, you don't know which one is the more up-to-date one if you >>>> need to failover. It'd be nice if you could just guarantee they both >>>> are... >>> >>> Regrettably, nobody can know that, without checking. >> >> how exactly would you check? - this seems like something that needs to >> be done from the SQL and the CLI level and also very well documented >> (which I cannot see in your proposal). > > This is a proposal for sync rep, not multi-node failover. I'm definitely > not going to widen the scope of this project. > > Functions already exist to check the thing you're asking. well your proposal includes a lot of stuff on how to avoid data loss and getting High Availability - so I think it is a requirement for us to tell the DBA what the procedures are for both setting it up (which is what is in the docs - but only 50% of the thing) and what to do in the case of a disaster (which is the other part of the problem). Or said otherwise - sync rep is not very useful if there is no easy and reliable way to get that information; if that stuff is already available even better, but I'm not aware of what is there and what not, so I expect others to have the same problem. Stefan
On 12/30/2010 10:23 PM, Simon Riggs wrote: > On Thu, 2010-12-30 at 21:42 +0100, Stefan Kaltenbrunner wrote: >>> >>> Synchronous replication offers the ability to guarantee that all changes >>> made by a transaction have been transferred to at least one remote >>> standby server. This is an extension to the standard level of durability >>> offered by a transaction commit. This is referred to as semi-synchronous >>> replication. >>> >>> When synchronous replication is requested, the commit of a write >>> transaction will wait until confirmation that the commit record has been >>> transferred successfully to at least one standby server. Waiting for >>> confirmation increases the user's confidence that the changes will not >>> be lost in the event of server crashes but it also necessarily increases >>> the response time for the requesting transaction. The minimum wait time >>> is the roundtrip time from primary to standby. >> >> hmm this is one of the main problems I see with the proposed "master is >> sometimes aware of the standby"(as in the feedback mode) concept this >> proposal has. If it waits for only one of the standbys there is some >> issue with the terminology. As a DBA I would expect the master to only >> return if ALL of the "sync replication" declared nodes replied ok. > > Well, as a DBA, I expect it to work with just one. That's how MySQL and > Oracle work at least. If ALL standbys reply, it takes longer, makes the > code harder, how do you determine what "all" is robustly etc.. Plus its > been discussed already. Maybe it has been discussed but I still don't see why it makes any sense. If I declare a standby a sync standby I better want it sync - not "maybe sync". Consider the case of 1 master and two identical sync standbys - one sync standby is in the same datacenter, the other is in a backup location say 15km away. 
Given there is a small constant latency to the second box (even if you have fast networks) the end effect is that the second standby will NEVER be sync (because the local one will always be faster) and you end up with an async slave that cannot be used per your business rules? > >> What I'm really missing with that proposal is how people expect that >> solution to be managed - > > What aspect do you wish to monitor? I'm happy to consider your > suggestions. > >> given there is only sometimes a feedback >> channel into the master you can't do the monitoring. > > Not sure what you mean. Please explain more. well hot_standby_feedback (not sure I like the name but I can't think of anything better either) - provides feedback to the master (like sync rep) and the master is actually acting on that feedback. so in effect we have stuff affecting the master that we need to be able to monitor clearly. But to make that information useful it would help to see what standby provided what kind of feedback, so we are back to having the master being aware of what standbys are connected and what standbys are supposed to be there - which is the issue sync rep got stalled on before... > >> Even if you could (which we really need!) there is nothing in the >> proposal yet that will help to determine on what the most recent standby >> (in the case of more>1 sync standby) might be. > > Functions to determine that already exist. > >> - but it would require a real standby >> registration or at least standby management possibility on the master >> not a halfway done one - so do we really need hot_standby_feedback as >> part of the inital sync-rep patch? > > It is a Hot Standby feature, but so tightly integrated with this code > that it isn't possible for me to submit as two separate patches. well you are the developer of that feature but if it is already a knob that can be turned on and off it seems not too hard to submit as a separate feature... Stefan
On 31.12.2010 09:50, Hannu Krosing wrote: > On 30.12.2010 22:27, Robert Haas wrote: >> On Thu, Dec 30, 2010 at 2:04 PM, Simon Riggs<simon@2ndquadrant.com> >> wrote: >>> synchronous_replication (boolean) >>> Specifies whether transaction commit will wait for WAL records >>> to be replicated before the command returns a "success" >>> indication to the client. >> The word "replicated" here could be taken to mean different things, >> most obviously: >> >> - slave has received the WAL >> - slave has fsync'd the WAL >> - slave has applied the WAL > Perhaps the level of "replication guarantee" should be decided on the > slave side, by > having a configuration parameter there > > report_as_replicated = received|written_to_disk|fsynced|applied > > for different types of hosts may have wildly different guarantees and > performance > parameters for these. One could envision a WAL-archive type "standby" > which is > there for data persistence only will and never "apply" WAL. Agreed, it feels natural to specify when a piece of WAL is acknowledged in the standby. Regarding the rest of the proposal, I would still prefer the UI discussed here: http://archives.postgresql.org/message-id/4CAE030A.2060701@enterprisedb.com It ought to be the same amount of work to implement, and provides the same feature set, but makes administration a bit easier by being able to name the standbys. Also, I dislike the idea of having the standby specify that it's a synchronous standby that the master has to wait for. Behavior on the master should be configured on the master. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
On Fri, 2010-12-31 at 09:27 +0100, Stefan Kaltenbrunner wrote: > Maybe it has been discussed but I still don't see way it makes any > sense. If I declare a standby a sync standby I better want it sync - not > "maybe sync". consider the case of a 1 master and two identical sync > standbys - one sync standby is in the same datacenter the other is in a > backup location say 15km away. > Given there is a small constant latency to the second box (even if you > have fast networks) the end effect is that the second standby will NEVER > be sync (because the local one will always be faster) and you end up > with an async slave that cannot be used per your business rules? Your picture above is a common misconception. I will add something to the docs to explain this. 1. "sync" is a guarantee about how we respond to the client when we commit. If we wait for more than one response that slows things down, makes the cluster more fragile, complicates the code and doesn't appreciably improve the guarantee. 2. "sync" does not guarantee that the updates to the standbys are in any way coordinated. You can run a query on one standby and get one answer and at the exact same time run the same query on another standby and get a different answer (slightly ahead/behind). That also means that if the master crashes one of the servers will be ahead or behind. You can use pg_last_xlog_receive_location() to check which one that is. When people say they want *all* servers to respond, its usually because they want (2), but that is literally impossible in a distributed system. -- Simon Riggs http://www.2ndQuadrant.com/books/PostgreSQL Development, 24x7 Support, Training and Services
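Point 1 above can be illustrated with a toy model (the function and the numbers are invented for illustration, not taken from the patch): under first-response-wins the commit wait is governed by the fastest standby's round trip, whereas a wait-for-all rule is governed by the slowest one, so adding a remote standby to the required set adds its full latency to every synchronous commit.

```python
def commit_wait(ack_times_ms, wait_for_all=False):
    """Return how long the master waits for standby acks before
    confirming a commit to the client.

    ack_times_ms: per-standby round-trip time in milliseconds.
    First-response-wins waits for the fastest ack; wait-for-all
    is gated by the slowest standby.
    """
    times = ack_times_ms.values()
    return max(times) if wait_for_all else min(times)

# Same-datacenter standby vs. a standby 15 km away (made-up numbers).
rtts = {"local": 0.4, "remote": 2.5}
print(commit_wait(rtts))                     # first ack wins: 0.4
print(commit_wait(rtts, wait_for_all=True))  # slowest gates commit: 2.5
```

This is the crux of the disagreement with Stefan: first-response-wins keeps commits fast and the cluster responsive to a slow standby, at the cost of not knowing which standby holds the acknowledged data.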
On 12/31/2010 11:06 AM, Heikki Linnakangas wrote: > On 31.12.2010 09:50, Hannu Krosing wrote: >> On 30.12.2010 22:27, Robert Haas wrote: >>> On Thu, Dec 30, 2010 at 2:04 PM, Simon Riggs<simon@2ndquadrant.com> >>> wrote: >>>> synchronous_replication (boolean) >>>> Specifies whether transaction commit will wait for WAL records >>>> to be replicated before the command returns a "success" >>>> indication to the client. >>> The word "replicated" here could be taken to mean different things, >>> most obviously: >>> >>> - slave has received the WAL >>> - slave has fsync'd the WAL >>> - slave has applied the WAL >> Perhaps the level of "replication guarantee" should be decided on the >> slave side, by >> having a configuration parameter there >> >> report_as_replicated = received|written_to_disk|fsynced|applied >> >> for different types of hosts may have wildly different guarantees and >> performance >> parameters for these. One could envision a WAL-archive type "standby" >> which is >> there for data persistence only will and never "apply" WAL. > > Agreed, it feels natural to specify when a piece of WAL is acknowledged > in the standby. > > Regarding the rest of the proposal, I would still prefer the UI > discussed here: > > http://archives.postgresql.org/message-id/4CAE030A.2060701@enterprisedb.com > > It ought to be the same amount of work to implement, and provides the > same feature set, but makes administration a bit easier by being able to > name the standbys. Also, I dislike the idea of having the standby > specify that it's a synchronous standby that the master has to wait for. > Behavior on the master should be configured on the master. well that proposal is much closer to what I want as an admin - except that it would be nice to configure that through actual DDL. My wish would be more like: * standby provides a unique name identifier * standby has a flag to say the maximum(or minimum?) 
replication_reported support it can do
* standby connects to the master async by default and the master registers the standby automatically
* on the master I can do the following with every standby that is visible to the master or has been in the past:
  * enable/disable and add/remove permanently (if not added permanently the registration is transient) - enabled if not set explicitly
  * sync_rep_enabled (boolean) (so you can still do per-transaction or per-database or whatever sync rep) - disabled if not set explicitly
  * sync_reply_required (boolean) (per-sync-standby flag to require a reply before the master returns - if there is only one sync standby this is the default behaviour, if there are more the admin can choose)
  * wait_forever or timeout per standby
  * maybe a way to set the report_as_replicated from the master (if feasible), up to the max of what the standby can do

so you would get the proposed "semi sync rep" mode by simply setting more than one standby as "sync_rep_enabled" with sync_reply_required false for all of them (ie any one of them can reply and the master returns) - if you want better than that just require a reply from a specific one or more than one. this would also help in us providing a simple view with a nice and handy status report on the slaves (which ones are there, which ones should be there, how far are they in terms of applying WAL, what status do they have). 
Imho an interface like this would be:

a) convenient because it would not require any additional setup requirements for async rep except providing a "name" on the standby by default
b) it would enable the master to specify the business rules clearly
c) would still support the simple "one sync reply is enough" semisync replication case people like to have
d) would also enable the admin to get more than ONE sync standby that is really sync - not maybe sync
e) hot_standby_feedback (if enabled) would look at only the permanently enabled slaves so only a "DBA approved" standby would be able to affect the master for table bloat
f) provide the necessary meta information for providing the handy "quick & nice" replication status overview reporting feature people want and need
g) for all the permanently enabled async nodes we could keep track of the oldest required WAL and keep that (optionally) so we could get rid of the hard-to-size max_keep_segments and maintain that automatically.

the only disadvantage I can see would be that you would have to manually remove a non-functional slave from the master (and only one that you set some explicit configuration for!) if you decide you don't need it any more. Stefan
On Fri, 2010-12-31 at 12:06 +0200, Heikki Linnakangas wrote: > On 31.12.2010 09:50, Hannu Krosing wrote: > > On 30.12.2010 22:27, Robert Haas wrote: > >> On Thu, Dec 30, 2010 at 2:04 PM, Simon Riggs<simon@2ndquadrant.com> > >> wrote: > >>> synchronous_replication (boolean) > >>> Specifies whether transaction commit will wait for WAL records > >>> to be replicated before the command returns a "success" > >>> indication to the client. > >> The word "replicated" here could be taken to mean different things, > >> most obviously: > >> > >> - slave has received the WAL > >> - slave has fsync'd the WAL > >> - slave has applied the WAL > > Perhaps the level of "replication guarantee" should be decided on the > > slave side, by > > having a configuration parameter there > > > > report_as_replicated = received|written_to_disk|fsynced|applied > > > > for different types of hosts may have wildly different guarantees and > > performance > > parameters for these. One could envision a WAL-archive type "standby" > > which is > > there for data persistence only will and never "apply" WAL. > > Agreed, it feels natural to specify when a piece of WAL is acknowledged > in the standby. That can also be done, its not a problem. Many people asked for just "on" or "off". Currently "on" <--> "slave has fsynced" WAL. > Also, I dislike the idea of having the standby > specify that it's a synchronous standby that the master has to wait for. > Behavior on the master should be configured on the master. The parameter on the standby affects the behaviour of the standby. The standby is saying "don't pick me, I'm not a good target". -- Simon Riggs http://www.2ndQuadrant.com/books/PostgreSQL Development, 24x7 Support, Training and Services
On Fri, 2010-12-31 at 12:06 +0200, Heikki Linnakangas wrote: > Regarding the rest of the proposal, I would still prefer the UI > discussed here: > > http://archives.postgresql.org/message-id/4CAE030A.2060701@enterprisedb.com > > It ought to be the same amount of work to implement, and provides the > same feature set, but makes administration a bit easier by being able to > name the standbys. Also, I dislike the idea of having the standby > specify that it's a synchronous standby that the master has to wait for. > Behavior on the master should be configured on the master. Good point; I've added the people on the copy list from that post. This question is the key one, so please respond after careful thought on my points below. There are ways to blend together the two approaches, discussed later, though first we need to look at the reasons behind my proposals. I see significant real-world issues with configuring replication using multiple named servers, as described in the link above: 1. Syncing to multiple standbys does not guarantee that the updates to the standbys are in any way coordinated. You can run a query on one standby and get one answer and at the exact same time run the same query on another standby and get a different answer (slightly ahead/behind). That also means that if the master crashes one of the servers can still be ahead or behind, even though you asked them to be the same. So you don't actually get what you think you're getting. 2. Availability of the cluster just went down. If *any* of the "important nodes" goes down, then everything just freezes. (I accept that you want that, and have provided that as an option). 3. Administrative complexity just jumped a huge amount. 
There is no way to test that you have configured the parameters correctly without a testbed that exactly mirrors production with same names etc., or trying it directly in production. So availability takes another potential hit because of user error. (b) After failover, the list of synchronous_standbys needs to be re-specified, yet what is the correct list of servers? The only way to make that config work is with complex middleware that automatically generates new config files. I don't think that is "the same amount of work to implement", it's an order of magnitude harder overall. 4. As a result of the administrative complexity, testing the full range of function will take significantly longer and that is likely to have a direct impact on the robustness of PostgreSQL 9.1. 5. Requesting sync from more than one server performs poorly, since you must wait for additional servers. If there are sporadic or systemic network performance issues you will be badly hit by them. Monitoring that just got harder also. First-response-wins is more robust in the case of volatile resources since it implies responsiveness to changing conditions. 6. You just lost the ability to control performance on the master, with a userset. Performance is a huge issue with sync rep. If you can't control it, you'll simply turn it off. 
So I don't have any objection to supporting that functionality in the future, but I believe it is not something we should be encouraging (ever), nor is it something we need for this release. I suppose we might regard the feature set I am proposing as being the same as making synchronous_standbys a USERSET parameter, and allowing just two options: "none" - allowing the user to specify async if they wish it "*" - allowing people to specify that syncing to *any* standby is acceptable We can blend the two approaches together, if we wish, by having two parameters (plus server naming) synchronous_replication = on | off (USERSET) synchronous_standbys = '...' If synchronous_standbys is not set and synchronous_replication = on then we sync to any standby. If synchronous_replication = off then we use async replication, whatever synchronous_standbys is set to. If synchronous_standbys is set, then we use sync rep to all listed servers. My proposal amounts to "lets add synchronous_standbys as a parameter in 9.2". If you really think that we need that functionality in this release, lets get the basic stuff added now and then fold in those ideas on top afterwards. If we do that, I will help. However, my only insistence is that we explain the above points very clearly in the docs to specifically dissuade people from using those features for typical cases. If you wondered why I ignored your post previously, its because I understood that Fujii's post of 15 Oct, one week later, effectively accepted my approach, albeit with two additional parameters. That is the UI that I had been following. http://archives.postgresql.org/pgsql-hackers/2010-10/msg01009.php -- Simon Riggs http://www.2ndQuadrant.com/books/PostgreSQL Development, 24x7 Support, Training and Services
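The blended proposal above might look something like this in a master's postgresql.conf (parameter names as given in the post; the standby name is invented for illustration, and synchronous_standbys is the part deferred to a later release):

```
# Wait for a sync ack before reporting commit success; USERSET, so an
# individual session can SET synchronous_replication = off for async
# behaviour on its own transactions.
synchronous_replication = on

# Not part of the initial patch. If set, sync only against the listed
# standby(s); if unset, an ack from *any* standby is sufficient.
#synchronous_standbys = 'dr_site'
```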
On Thu, 2010-12-30 at 20:26 -0700, Joshua Tolley wrote: > 2) initiate fsync on the primary first > > - In this case, the slave is always slightly behind. If if your > > primary falls over, you don't give commit messages to the clients, > but > > if it recovers, it might have committed data, and slaves will still > be > > able to catch up. > > > > The thing is that currently, even without replication, #2 can > happen. > > For what little it's worth, I vote for this option, because it's a > problem that can already happen (as opposed to adding an entirely new > type of problem to the mix). This proposal provides #2, so your wish is met. -- Simon Riggs http://www.2ndQuadrant.com/books/PostgreSQL Development, 24x7 Support, Training and Services
On Fri, Dec 31, 2010 at 5:26 AM, Simon Riggs <simon@2ndquadrant.com> wrote: > Your picture above is a common misconception. I will add something to > the docs to explain this. > 2. "sync" does not guarantee that the updates to the standbys are in any > way coordinated. You can run a query on one standby and get one answer > and at the exact same time run the same query on another standby and get > a different answer (slightly ahead/behind). That also means that if the > master crashes one of the servers will be ahead or behind. You can use > pg_last_xlog_receive_location() to check which one that is. > > When people say they want *all* servers to respond, its usually because > they want (2), but that is literally impossible in a distributed system. Just to try and be clear again, in "sync" that Stefan and I are talking about, we really don't care that the slave could be a "hot standby" answering queries. In fact, mine wouldn't be. Mine would likely be pg_streamrecv or something. I'm just looking for a guarantee that I've got a copy of the data safely in the next rack, and a separate building before I tell the client I've moved his money. I want a synchronous replication of the *data*, and not a system where I can distribute queries. I'm looking for disaster mitigation, not load mitigation. A replacement for clustered/replicated devices/filesystems under pg_xlog. Having the next rack slave be "hot" in terms of applying WAL and ready to take over instantly would be a bonus, as long as I can guarantee it's current (i.e has all data a primary's COMMIT has acknowledged). So, that's what I want, and that's what your docs suggest is impossible currently; 1st past post means that I can only ever reliably configure 1 sync slave and be sure it will have all acknowledged commits. 
I can likely get *close* to that by putting only my "slowest" slave as the only sync slave, and monitoring the heck out of my "asynchronous but I want to be synchronous" slave, but I'd rather trust the PG community to build robust synchronization than myself to build robust enough monitoring to catch that my slave is farther behind than the slower synchronous one. That said, I think the expectation is that if I were building a query-able "hot standby" cluster in sync rep mode, once I get a commit confirmation, I should be able to then initiate a new transaction on any member of that sync rep cluster and see the data I just committed. Yes, I know I could see *newer* data. And I know that the primary could already have newer data. Yes, we have the problem even on a single pg cluster on a single machine. But the point is that if you've committed, any new transactions see *at least* that data or newer. But no chance of older. But personally, I'm not interested in that ;-) -- Aidan Van Dyk Create like a god, aidan@highrise.ca command like a king, http://www.highrise.ca/ work like a slave.
On Fri, Dec 31, 2010 at 6:48 AM, Simon Riggs <simon@2ndquadrant.com> wrote: > I suppose we might regard the feature set I am proposing as being the > same as making synchronous_standbys a USERSET parameter, and allowing > just two options: > "none" - allowing the user to specify async if they wish it > "*" - allowing people to specify that syncing to *any* standby is > acceptable > > We can blend the two approaches together, if we wish, by having two > parameters (plus server naming) > synchronous_replication = on | off (USERSET) > synchronous_standbys = '...' > If synchronous_standbys is not set and synchronous_replication = on then > we sync to any standby. If synchronous_replication = off then we use > async replication, whatever synchronous_standbys is set to. > If synchronous_standbys is set, then we use sync rep to all listed > servers. > > My proposal amounts to "lets add synchronous_standbys as a parameter in > 9.2". FWIW, this plan of attack would be OK with me. I had taken your previous comments to imply that you were opposed to EVER having a parameter like this on the master, but I could see deferring it. In my ideal world, we'd not have the synchronous_replication_service parameter on the standbys at all - all standbys would be candidates, unless someone decides to name them and set synchronous_standbys on the master. But maybe having a simple on/off on the standby is reasonable. *thinks a little bit* Someone may have proposed this before, but one way of getting standby naming "for free" would be to make the standby names the same as the roles used to log in, rather than adding a separate parameter. We could just recommend to people that they use a separate, descriptive role name for each standby. Then the synchronous_standbys parameter - when added - would mean "a standby from one of these roles". -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On 31.12.2010 13:48, Simon Riggs wrote: > On Fri, 2010-12-31 at 12:06 +0200, Heikki Linnakangas wrote: > >> Regarding the rest of the proposal, I would still prefer the UI >> discussed here: >> >> http://archives.postgresql.org/message-id/4CAE030A.2060701@enterprisedb.com >> >> It ought to be the same amount of work to implement, and provides the >> same feature set, but makes administration a bit easier by being able to >> name the standbys. Also, I dislike the idea of having the standby >> specify that it's a synchronous standby that the master has to wait for. >> Behavior on the master should be configured on the master. > > Good point; I've added the people on the copy list from that post. This > question is they key, so please respond after careful thought on my > points below. > > There are ways to blend together the two approaches, discussed later, > though first we need to look at the reasons behind my proposals. > > I see significant real-world issues with configuring replication using > multiple named servers, as described in the link above: All of these points only apply to specifying *multiple* named servers in the synchronous_standbys='...' list. That's certainly a more complicated scenario, and the configuration is more complicated as a result. With your proposal, it's not possible in the first place. Multiple synchronous standbys probably isn't needed by most people, so I'm fine with leaving that out for now, keeping the design the same otherwise. I included it in the proposal because it easily falls out of the design. So, if you're worried about the complexities of multiple synchronous standbys, let's keep the UI exactly the same as what I described in the link above, but only allow one name in the synchronous_standbys setting, instead of a list. > 3. Administrative complexity just jumped a huge amount. 
> > (a) If you add or remove servers to the config you need to respecify all > the parameters, which need to be specific to the exact set of servers. Hmm, this could be alleviated by allowing the master to have a name too. All the configs could then be identical, except for the unique name for each server. For example, for a configuration with three servers that are all synchronous with each other, each server would have "synchronous_standbys='server1, server2, server3'" in the config file. The master would simply ignore the entry for itself. > (b) After failover, the list of synchronous_standbys needs to be > re-specified, yet what is the correct list of servers? The only way to > make that config work is with complex middleware that automatically > generates new config files. It depends on what you want. I think you're envisioning that the original server is taken out of the system and not waited for, meaning that you accept a lower level of persistence after failover. Yes, then you need to change the config. Or more likely you prepare the config file in the standby that way to begin with. > I don't think that is "the same amount of > work to implement", its an order of magnitude harder overall. I meant it's the same amount of work to implement the feature in PostgreSQL. No doubt that maintaining such a setup in production is more complicated. > 5. Requesting sync from more than one server performs poorly, since you > must wait for additional servers. If there are sporadic or systemic > network performance issues you will be badly hit by them. Monitoring > that just got harder also. First-response-wins is more robust in the > case of volatile resources since it implies responsiveness to changing > conditions. > > 6. You just lost the ability to control performance on the master, with > a userset. Performance is a huge issue with sync rep. If you can't > control it, you'll simply turn it off. 
Having a feature that we daren't > ever use because it performs poorly helps nobody. This is not a tick-box > in our marketing checklist, I want it to be genuinely real-world usable. You could make synchronous_standbys a user-settable GUC, just like your proposed boolean switch. You could then control on a per-transaction basis which servers you want to wait to respond. Although perhaps it would be more user-friendly to just have an additional boolean GUC, similar to synchronous_commit=on/off. Or maybe synchronous_commit is enough to control that. > I suppose we might regard the feature set I am proposing as being the > same as making synchronous_standbys a USERSET parameter, and allowing > just two options: > "none" - allowing the user to specify async if they wish it > "*" - allowing people to specify that syncing to *any* standby is > acceptable > > We can blend the two approaches together, if we wish, by having two > parameters (plus server naming) > synchronous_replication = on | off (USERSET) > synchronous_standbys = '...' > If synchronous_standbys is not set and synchronous_replication = on then > we sync to any standby. If synchronous_replication = off then we use > async replication, whatever synchronous_standbys is set to. > If synchronous_standbys is set, then we use sync rep to all listed > servers. Sounds good. I still don't like the synchronous_standbys='' and synchronous_replication=on combination, though. IMHO that still amounts to letting the standby control the behavior on master, and it makes it impossible to temporarily add an asynchronous standby to the mix. I could live with it, you wouldn't be forced to use it that way after all, but I would still prefer to throw an error on that combination. Or at least document the pitfalls and recommend always naming the standbys. > My proposal amounts to "lets add synchronous_standbys as a parameter in > 9.2". 
If you really think that we need that functionality in this > release, lets get the basic stuff added now and then fold in those ideas > on top afterwards. If we do that, I will help. However, my only > insistence is that we explain the above points very clearly in the docs > to specifically dissuade people from using those features for typical > cases. Huh, wait, if you leave out synchronous_standbys, that's a completely different UI again. I think we've finally reached agreement on how this should be configured, let's stick to that, please. (I would be fine with limiting synchronous_standbys to just one server in this release though.) > If you wondered why I ignored your post previously, its because I > understood that Fujii's post of 15 Oct, one week later, effectively > accepted my approach, albeit with two additional parameters. That is the > UI that I had been following. > http://archives.postgresql.org/pgsql-hackers/2010-10/msg01009.php That thread makes no mention of how to specify which standbys are synchronous and which are not. It's about specifying the timeout and whether to wait for a disconnected standby. Yeah, Fujii-san's proposal seems reasonable for configuring that. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
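The blended two-parameter behavior quoted in this exchange can be summarized as a small decision table. The sketch below is purely illustrative of the semantics under discussion in the thread, not actual PostgreSQL code; the parameter names and return values were still being debated at this point.

```python
# Illustrative sketch of the blended semantics quoted above: a per-commit
# decision on whether to wait, and for which standbys. Not PostgreSQL code;
# behavior follows Simon's proposal as quoted in this mail.

def commit_wait_mode(synchronous_replication: bool, synchronous_standbys: str):
    """Return what a commit would wait for under the proposed parameters.

    synchronous_replication -- the proposed USERSET on/off switch
    synchronous_standbys    -- '' (unset), or a comma-separated server list
    """
    if not synchronous_replication:
        # Async replication, whatever synchronous_standbys is set to.
        return "async"
    if synchronous_standbys.strip() == "":
        # Sync to *any* standby: the first acknowledgement releases the commit.
        return "wait-for-any"
    # Sync rep to all listed servers.
    names = [s.strip() for s in synchronous_standbys.split(",")]
    return ("wait-for-all", names)

print(commit_wait_mode(False, "server1"))          # async wins over the list
print(commit_wait_mode(True, ""))                  # any standby may ack
print(commit_wait_mode(True, "server1, server2"))  # wait for both named servers
```

The case Heikki objects to is exactly the second call: an empty list plus synchronous_replication=on, where which standby satisfies the wait is decided by whoever connects and acks first.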
On 31.12.2010 14:40, Robert Haas wrote: > Someone may have proposed this before, but one way of getting standby > naming "for free" would be to make the standby names the same as the > roles used to log in, rather than adding a separate parameter. We > could just recommend to people that they use a separate, descriptive > role name for each standby. Then the synchronous_standbys parameter - > when added - would mean "a standby from one of these roles". Seems a bit weird. It's not a lot of effort to give each standby a name. But if you want something automatic, how about gethostname() ? -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
On Fri, Dec 31, 2010 at 7:57 AM, Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> wrote: > On 31.12.2010 14:40, Robert Haas wrote: >> >> Someone may have proposed this before, but one way of getting standby >> naming "for free" would be to make the standby names the same as the >> roles used to log in, rather than adding a separate parameter. We >> could just recommend to people that they use a separate, descriptive >> role name for each standby. Then the synchronous_standbys parameter - >> when added - would mean "a standby from one of these roles". > > Seems a bit weird. It's not a lot of effort to give each standby a name. But > if you want something automatic, how about gethostname() ? Uh, wow, no way. That would be making a critical aspect of system reliability depend on something way, way outside of our configuration. What's weird about using the role name? That's our standard way of distinguishing between two or more users. Why invent something new? -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On 12/31/2010 02:39 PM, Robert Haas wrote: > On Fri, Dec 31, 2010 at 7:57 AM, Heikki Linnakangas > <heikki.linnakangas@enterprisedb.com> wrote: >> On 31.12.2010 14:40, Robert Haas wrote: >>> >>> Someone may have proposed this before, but one way of getting standby >>> naming "for free" would be to make the standby names the same as the >>> roles used to log in, rather than adding a separate parameter. We >>> could just recommend to people that they use a separate, descriptive >>> role name for each standby. Then the synchronous_standbys parameter - >>> when added - would mean "a standby from one of these roles". >> >> Seems a bit weird. It's not a lot of effort to give each standby a name. But >> if you want something automatic, how about gethostname() ? > > Uh, wow, no way. That would be making a critical aspect of system > reliability depend on something way, way outside of our configuration. +1 > > What's weird about using the role name? That's our standard way of > distinguishing between two or more users. Why invent something new? well, a user is not a host/server for me - given there is no real benefit from using distinct roles for each standby yet, I don't see why we should complicate the replication setup procedure by requiring one role per standby. So I'm all for giving each standby a name, but please make it an explicit one and not something that is only vaguely related to the actual standby host. Stefan
On Fri, Dec 31, 2010 at 8:48 AM, Stefan Kaltenbrunner <stefan@kaltenbrunner.cc> wrote: >> What's weird about using the role name? That's our standard way of >> distinguishing between two or more users. Why invent something new? > > wel a user is not a host/server for me - given there is no real benefit from > using distinct roles for each standby yet I don't see why we should > complicate the replication setup procedure by requiring 1 role per standby. > So I'm all for giving each standby a name but please make it an explicit one > and not something that is only vaguely related to the actual standby host. OK. If that's the consensus, can someone post a patch? -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Fri, 2010-12-31 at 07:33 -0500, Aidan Van Dyk wrote: > On Fri, Dec 31, 2010 at 5:26 AM, Simon Riggs <simon@2ndquadrant.com> wrote: > > > Your picture above is a common misconception. I will add something to > > the docs to explain this. > > > 2. "sync" does not guarantee that the updates to the standbys are in any > > way coordinated. You can run a query on one standby and get one answer > > and at the exact same time run the same query on another standby and get > > a different answer (slightly ahead/behind). That also means that if the > > master crashes one of the servers will be ahead or behind. You can use > > pg_last_xlog_receive_location() to check which one that is. > > > > When people say they want *all* servers to respond, its usually because > > they want (2), but that is literally impossible in a distributed system. > > Just to try and be clear again, in "sync" that Stefan and I are > talking about, we really don't care that the slave could be a "hot > standby" answering queries. In fact, mine wouldn't be. Mine would > likely be pg_streamrecv or something. I'm just looking for a > guarantee that I've got a copy of the data safely in the next rack, > and a separate building before I tell the client I've moved his money. > > I want a synchronous replication of the *data*, and not a system where > I can distribute queries. I'm looking for disaster mitigation, not > load mitigation. A replacement for clustered/replicated > devices/filesystems under pg_xlog. > > Having the next rack slave be "hot" in terms of applying WAL and ready > to take over instantly would be a bonus, as long as I can guarantee > it's current (i.e has all data a primary's COMMIT has acknowledged). > So, that's what I want, and that's what your docs suggest is > impossible currently; 1st past post means that I can only ever > reliably configure 1 sync slave and be sure it will have all > acknowledged commits. 
I can likely get *close* to that by putting > only my "slowest" slave as the only sync slave, and monitoring the > heck out of my "asynchronous but I want to be synchronous" slave, but > I'd rather trust the PG community to build robust synchronization > than myself to build robust enough monitoring to catch that my slave > is farther behind than the slower synchronous one. > That said, I think the expectation is that if I were building a > query-able "hot standby" cluster in sync rep mode, once I get a commit > confirmation, I should be able to then initiate a new transaction on > any member of that sync rep cluster and see the data I just committed. > Yes, I know I could see *newer* data. And I know that the primary > could already have newer data. Yes, we have the problem even on a > single pg cluster on a single machine. But the point is that if > you've committed, any new transactions see *at least* that data or > newer. But no chance of older. > > But personally, I'm not interested in that ;-) I understand your requirements, listed above. There are good technical reasons why trying to achieve *all* of the above lets slip the other unstated requirements of availability, complexity, performance etc. Inventing parameter combinations merely hides the fact that these things aren't all simultaneously achievable. In light of that, I have been espousing a simple approach for the typical case, and for the first release. I can see that people may assume my words have various other reasons behind them, but that's not the case. If I could give it all to you, I would. -- Simon Riggs http://www.2ndQuadrant.com/books/ PostgreSQL Development, 24x7 Support, Training and Services
On Fri, 2010-12-31 at 14:40 +0200, Heikki Linnakangas wrote: > On 31.12.2010 13:48, Simon Riggs wrote: > > > > I see significant real-world issues with configuring replication using > > multiple named servers, as described in the link above: > > All of these points only apply to specifying *multiple* named servers in > the synchronous_standbys='...' list. Unfortunately, some of the points apply to using named servers ever, even if there is only one. > That's certainly a more complicated > scenario, and the configuration is more complicated as a result. > With your proposal, it's not possible in the first place. > > Multiple synchronous standbys probably isn't needed by most people, so > I'm fine with leaving that out for now, keeping the design the same > otherwise. I included it in the proposal because it easily falls out of > the design. So, if you're worried about the complexities of multiple > synchronous standbys, let's keep the UI exactly the same as what I > described in the link above, but only allow one name in the > synchronous_standbys setting, instead of a list. The best usage recommendation is still to have 2+ standbys, *any* of which can be used to provide sync rep. That is the best performance, best availability and easiest to configure that I know of. That best usage is not achievable with uniquely named servers; using non-unique names defeats the point of having names in the first place. I accept that the "best usage" is a general case and there may be circumstances that make the difficulties of named servers worth the trouble. So replicating to multiple synchronous standbys is definitely needed in this release. *Confirming* replication to multiple named sync standbys is the thing we don't need in this release. 
> > I suppose we might regard the feature set I am proposing as being the > > same as making synchronous_standbys a USERSET parameter, and allowing > > just two options: > > "none" - allowing the user to specify async if they wish it > > "*" - allowing people to specify that syncing to *any* standby is > > acceptable > > > > We can blend the two approaches together, if we wish, by having two > > parameters (plus server naming) > > synchronous_replication = on | off (USERSET) > > synchronous_standbys = '...' > > If synchronous_standbys is not set and synchronous_replication = on then > > we sync to any standby. If synchronous_replication = off then we use > > async replication, whatever synchronous_standbys is set to. > > If synchronous_standbys is set, then we use sync rep to all listed > > servers. > > Sounds good. > I still don't like the synchronous_standbys='' and > synchronous_replication=on combination, though. > IMHO that still amounts > to letting the standby control the behavior on master, and it makes it > impossible to temporarily add an asynchronous standby to the mix. I > could live with it, you wouldn't be forced to use it that way after all, > but I would still prefer to throw an error on that combination. Or at > least document the pitfalls and recommend always naming the standbys. We need a parameter set that makes the best practice easy/easiest to specify, and yet more complicated configurations possible. So I'm happy to add "synchronous_standbys" parameter, as long as it is possible to specify "any" (for which I would use "*"), which would be the default. Initially that would be restricted to just one name. Will pass the server name as an option after IDENTIFY SYSTEM <name>. Anyway, lets continue the discussion next year. -- Simon Riggs http://www.2ndQuadrant.com/books/PostgreSQL Development, 24x7 Support, Training and Services
On 31.12.2010 13:40, Heikki Linnakangas wrote: > > Sounds good. > > I still don't like the synchronous_standbys='' and > synchronous_replication=on combination, though. IMHO that still > amounts to letting the standby control the behavior on master, and it > makes it impossible to temporarily add an asynchronous standby to the mix. A sync standby _will_have_ the ability to control the master anyway by simply being there or not. What is currently proposed is having dual power lines / dual UPS' and working happily on when one of them fails. Requiring both of them to be present defeats the original purpose of doubling them. So following Simons design of 2 standbys and only one required to ACK to commit you get 2X reliability of single standby. In a design where you have 2 standbys and both are required to ACK to commit you get only 1/2 the reliability of single standby. Having a list of 10 standbys and requiring ACK from all, you get only 10% of the reliability. I agree that there can be scenarios where you may need 10 sync copies before committing on master - usually for non-technical reasons like some accounting law or whatever - these are far rarer than requirement to have reasonable performance and 99.999% system uptime when using only 99.99% reliable hardware. And in such cases where you need multiple copies you will need some out-of-database technology (like signed timestamps) to make the data non-falsifiable as well, so you can't solve this with just configuring sync rep. > I could live with it, you wouldn't be forced to use it that way after > all, but I would still prefer to throw an error on that combination. > Or at least document the pitfalls and recommend always naming the > standbys. > >> My proposal amounts to "lets add synchronous_standbys as a parameter in >> 9.2". If you really think that we need that functionality in this >> release, lets get the basic stuff added now and then fold in those ideas >> on top afterwards. If we do that, I will help. 
However, my only >> insistence is that we explain the above points very clearly in the docs >> to specifically dissuade people from using those features for typical >> cases. > > Huh, wait, if you leave out synchronous_standbys, that's a completely > different UI again. I think we've finally reached agreement on how > this should be configured, let's stick to that, please. > > (I would be fine with limiting synchronous_standbys to just one server > in this release though.) > >> If you wondered why I ignored your post previously, its because I >> understood that Fujii's post of 15 Oct, one week later, effectively >> accepted my approach, albeit with two additional parameters. That is the >> UI that I had been following. >> http://archives.postgresql.org/pgsql-hackers/2010-10/msg01009.php > > That thread makes no mention of how to specify which standbys are > synchronous and which are not. The simplest way would be to have separate database users for sync and async standbys? That would allow any standby with the right credentials to act as a sync standby, while those that are not eligible are not accepted even if they try to act as "a synchronicity provider". > It's about specifying the timeout and whether to wait for a > disconnected standby. Yeah, Fujii-san's proposal seems reasonable for > configuring that. > -------------------- Hannu Krosing
On 12/31/10 4:40 AM, Robert Haas wrote: > Someone may have proposed this before, but one way of getting standby > naming "for free" would be to make the standby names the same as the > roles used to log in, rather than adding a separate parameter. We > could just recommend to people that they use a separate, descriptive > role name for each standby. Then the synchronous_standbys parameter - > when added - would mean "a standby from one of these roles". I like this idea; it has an elegant simplicity about it and right now I can't think of any real faults. It would have the added benefit that each standby "group" would show up by the group name in ps and on pg_stat_activity. However, I agree strongly with Simon that we really want the simplest possible synch rep implementation for 9.1, given that we're currently 15 days away from the *last* commitfest. Nobody, at this point, has really even tested any of the sync rep patches for reliability or performance.

Here's how I could see us developing:

Simplest (9.1):
-- synch/async capability set on the standby in recovery.conf
-- synch/async transaction status (async, recv, fsync, apply) declared as a userset by the executing session (from Simon's patch)
-- only one ack for sync, regardless of the number of standbys

This would allow users to have a single sync standby plus a pool of async standbys, which is what I think 90% of users who care about sync standby want.

More Complex (9.2):
-- all of the above, *plus* the ability to have standbys with ROLEs and require acks from one server in each ROLE.

"Web-Scale" (9.3):
-- all of the above, plus "group commit", the ability to specify a number of servers in each ROLE who must ack.

... but the most important thing is to make the feature for 9.1 simple, simple, simple. We know we won't get it right on the first try.

--Josh Berkus
On Fri, 2010-12-31 at 22:18 +0100, Hannu Krosing wrote: > On 31.12.2010 13:40, Heikki Linnakangas wrote: > > > > Sounds good. > > > > I still don't like the synchronous_standbys='' and > > synchronous_replication=on combination, though. IMHO that still > > amounts to letting the standby control the behavior on master, and it > > makes it impossible to temporarily add an asynchronous standby to the mix. > A sync standby _will_have_ the ability to control the master anyway by > simply being there or not. > > What is currently proposed is having dual power lines / dual UPS' and > working happily on when one of them fails. > Requiring both of them to be present defeats the original purpose of > doubling them. > > So following Simons design of 2 standbys and only one required to ACK to > commit you get 2X reliability of single standby. ... Yes, working out the math is a good idea. Things are much clearer if we do that. Let's assume we have 98% availability on any single server. 1. Having one primary and 2 standbys, either of which can acknowledge, and we never lock up if both standbys fail, then we will have 99.9992% server availability. (So PostgreSQL hits "5 Nines", with data guarantees). ("Maximised availability") 2. Having one primary and 2 standbys, either of which can acknowledge, and we lock up if both standbys fail to protect the data, then we will have 99.996% availability. Slightly less availability, but we don't put data at risk at any time, since any commit is always covered by at least 2 servers. ("Maximised protection") 3. If we have a primary and a single standby which must acknowledge, and we choose to lock up if the standby fails, then we will have only 96.04% availability. 4. If we have a primary and two standbys (named or otherwise), both of which must acknowledge or we lock up the master, then we have an awesome 94.12% availability. 
On the last two, there is also an increased likelihood of administrative cock-ups because of more specific and complex config requirements. -- Simon Riggs http://www.2ndQuadrant.com/books/ PostgreSQL Development, 24x7 Support, Training and Services
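Simon's availability figures can be checked with a little arithmetic. The sketch below assumes the simplest possible model: each server is independently available 98% of the time, and the system is down only when the stated lock-up condition holds. Case 2's 99.996% figure depends on additional assumptions about failover behavior, so it is not reproduced here.

```python
# Check the availability arithmetic in Simon's mail, assuming each server
# is independently available with probability 0.98.

p_up = 0.98
p_down = 1 - p_up

# Case 1: primary + 2 standbys, either may ack, never lock up on standby
# failure. With failover, the system is down only if all three are down.
case1 = 1 - p_down ** 3

# Case 3: primary + 1 standby that must ack, lock up if the standby fails.
# The system is up only when both servers are up.
case3 = p_up ** 2

# Case 4: primary + 2 standbys that must both ack, lock up otherwise.
case4 = p_up ** 3

print(f"case 1: {case1:.4%}")  # ~99.9992% - the "5 nines" figure
print(f"case 3: {case3:.2%}")  # 96.04%
print(f"case 4: {case4:.4%}")  # 94.1192%, i.e. the quoted 94.12%
```

The striking gap between case 1 and cases 3/4 is the whole argument: requiring acks from every listed server multiplies failure probabilities together on the wrong side of the ledger.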
On 12/31/10, Simon Riggs <simon@2ndquadrant.com> wrote: > On Fri, 2010-12-31 at 09:27 +0100, Stefan Kaltenbrunner wrote: > >> Maybe it has been discussed but I still don't see why it makes any >> sense. If I declare a standby a sync standby I better want it sync - not >> "maybe sync". consider the case of 1 master and two identical sync >> standbys - one sync standby is in the same datacenter the other is in a >> backup location say 15km away. >> Given there is a small constant latency to the second box (even if you >> have fast networks) the end effect is that the second standby will NEVER >> be sync (because the local one will always be faster) and you end up >> with an async slave that cannot be used per your business rules? > > Your picture above is a common misconception. I will add something to > the docs to explain this. > > 1. "sync" is a guarantee about how we respond to the client when we > commit. If we wait for more than one response that slows things down, > makes the cluster more fragile, complicates the code and doesn't > appreciably improve the guarantee. Whether it is more fragile depends on whether you look at up-time fragility or durability fragility. I think it can appreciably improve the guarantee. > > 2. "sync" does not guarantee that the updates to the standbys are in any > way coordinated. You can run a query on one standby and get one answer > and at the exact same time run the same query on another standby and get > a different answer (slightly ahead/behind). That also means that if the > master crashes one of the servers will be ahead or behind. You can use > pg_last_xlog_receive_location() to check which one that is. If at least one of the standbys is in the same smoking crater as the primary, then pg_last_xlog_receive_location on it is unlikely to respond. The guarantee goes away precisely when it is needed. Cheers, Jeff
On 12/31/2010 08:15 PM, Simon Riggs wrote: > On Fri, 2010-12-31 at 14:40 +0200, Heikki Linnakangas wrote: >> On 31.12.2010 13:48, Simon Riggs wrote: >>> >>> I see significant real-world issues with configuring replication using >>> multiple named servers, as described in the link above: >> >> All of these points only apply to specifying *multiple* named servers in >> the synchronous_standbys='...' list. > > Unfortunately, some of the points apply to using named servers ever, > even if there is only one. > >> That's certainly a more complicated >> scenario, and the configuration is more complicated as a result. >> With your proposal, it's not possible in the first place. >> >> Multiple synchronous standbys probably isn't needed by most people, so >> I'm fine with leaving that out for now, keeping the design the same >> otherwise. I included it in the proposal because it easily falls out of >> the design. So, if you're worried about the complexities of multiple >> synchronous standbys, let's keep the UI exactly the same as what I >> described in the link above, but only allow one name in the >> synchronous_standbys setting, instead of a list. > > The best usage recommendation is still to have 2+ standbys, *any* of > which can be used to provide sync rep. That is the best performance, > best availability and easiest to configure that I know of. That best > usage is not achievable with uniquely named servers; using non-unique > names defeats the point of having names in the first place. I disagree with that usage recommendation. If we ask for sync we should get sync; your definition is more like "we should have fsync=on only do fsync sometimes and still claim it is safe". Also, it is very much possible to do that semi-sync-style replication feature with named servers (see my post about the design I would like to see as a dba) and STILL keep the flexibility to do what other people (like me) in that thread want (at least from a UI perspective).
As I said before, I would very much prefer to have us restricted to exactly ONE sync-capable standby and x async ones if we cannot agree on a reasonable interface :( > > I accept that the "best usage" is a general case and there may be > circumstances that make the difficulties of named servers worth the > trouble. > > So replicating to multiple synchronous standbys is definitely needed in > this release. *Confirming* replication to multiple named sync standbys > is the thing we don't need in this release. well, you keep saying that, but to be honest I cannot really even see a usecase for me - what is the use of "only a random one of a set of servers is sync at any time and I don't really know which one"? My usecases would all involve 2 sync standbys and 1 or more async ones, but the second sync one would be in a different datacenter and I NEED to protect against a datacenter failure, which your proposal says I cannot do :( Stefan
On 01/01/2011 02:13 PM, Jeff Janes wrote: > On 12/31/10, Simon Riggs<simon@2ndquadrant.com> wrote: >> On Fri, 2010-12-31 at 09:27 +0100, Stefan Kaltenbrunner wrote: >> >>> Maybe it has been discussed but I still don't see way it makes any >>> sense. If I declare a standby a sync standby I better want it sync - not >>> "maybe sync". consider the case of a 1 master and two identical sync >>> standbys - one sync standby is in the same datacenter the other is in a >>> backup location say 15km away. >>> Given there is a small constant latency to the second box (even if you >>> have fast networks) the end effect is that the second standby will NEVER >>> be sync (because the local one will always be faster) and you end up >>> with an async slave that cannot be used per your business rules? >> >> Your picture above is a common misconception. I will add something to >> the docs to explain this. >> >> 1. "sync" is a guarantee about how we respond to the client when we >> commit. If we wait for more than one response that slows things down, >> makes the cluster more fragile, complicates the code and doesn't >> appreciably improve the guarantee. > > Whether it is more fragile depends on if you look at up-time fragility > or durability fragility. I think it can appreciably improve the > guarantee. > >> >> 2. "sync" does not guarantee that the updates to the standbys are in any >> way coordinated. You can run a query on one standby and get one answer >> and at the exact same time run the same query on another standby and get >> a different answer (slightly ahead/behind). That also means that if the >> master crashes one of the servers will be ahead or behind. You can use >> pg_last_xlog_receive_location() to check which one that is. > > If at least one of the standbys is in the same smoking crater as the > primary, then pg_last_xlog_receive_location on it is unlikely to > respond. > > The guarantee goes away precisely when it is needed. 
that is exactly my point - if you have no guarantee that your SYNC standby is actually sync, there is no use in deploying it for business cases that require sync replication. If we cannot support that usecase I would either like to see us restricting to only one sync-capable standby, or putting a big CAVEAT into the docs saying that sync replication in pg is only a hint, not a guarantee, which might or might not be honored in the case of more than one standby. Stefan
On Sat, Jan 1, 2011 at 9:03 AM, Stefan Kaltenbrunner <stefan@kaltenbrunner.cc> wrote: > that is exactly my point - if you have no guarantee that your SYNC standby is > actually sync there is no use for it being used in business cases that > require sync replication. > If we cannot support that usecase I would either like to see us restricting > to only one sync capable standby or by putting a big CAVEAT into the docs > saying that sync replication in pg only is a hint and not a guarantee that > might or might not be honored in the case of more than one standby. I think it's clear that different people want different things. I understand Simon's point, but I think the point Stefan and Jeff are making is equally valid. I think the solution is:

- Simon gets to implement his version first because he's writing the code. If someone else writes the code then they get to pick.

- Whoever wants to make the other thing work can write a patch for that after.

- The docs should not allege that either setup is preferable to the other, because there is not now and will never be consensus that this is in fact true.

-- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On 01/01/2011 03:15 PM, Robert Haas wrote: > On Sat, Jan 1, 2011 at 9:03 AM, Stefan Kaltenbrunner > <stefan@kaltenbrunner.cc> wrote: >> that is exactly my point - if have no guarantee that your SYNC standby is >> actually sync there is no use for it being used in business cases that >> require sync replication. >> If we cannot support that usecase I would either like to see us restricting >> to only one sync capable standby or by putting a big CAVEAT into the docs >> saying that sync replication in pg only is a hint and not a guarantee that >> might or might not be honored in the case of more than one standby. > > I think it's clear that different people want to different things. I > understand Simon's point, but I think the point Stefan and Jeff are > making is equally valid. I think the solution is: > > - Simon gets to implement his version first because he's writing the > code. If someone else writes the code then they get to pick.

fair point ;)

> > - Whoever wants to make the other thing work can write a patch for that after.

yeah, but I still would like to get a statement on why Simon rejects the design Heikki and others have proposed for supporting multiple sync standbys that are actually sync (a design that also supports semi-sync, as his patch does, which I consider a degraded case of full sync).
if you take the synchronous_standbys=<list> thing as an example, with foo, bar and baz all being sync-capable standbys and entries of the form <standbyname>[:<sync required (bool)>], what about considering it as:

synchronous_standbys=foo,bar,baz - you get semi-sync: whatever standby returns first causes the master to return as well (as in what Simon's patch does)

synchronous_standbys=foo:true,bar:true,baz - require at least foo and bar to reply before the master returns

** the syntax chosen is just a random example and could be anything ** - it could as well be used for other per-standby configuration (timeouts, wait behaviour etc), or be not only a synchronous_standbys=<list> thing but more a standby_list=<list> thingy that also includes async slaves (defaulting to * or whatever, so everything is async with default settings unless anything else is specified)

> > - The docs should not allege that either setup is preferable to the > other, because there is not now and will never be consensus that this > is in fact true.

well, I should think we need to clearly spell out everything that affects reliability, and if we only support semi-sync for more than one standby, that is the only setup we have :) Anyway, as long as sync rep is disabled by default I'm fine with that.

Stefan
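[Editorial sketch] Stefan's hypothetical list syntax is simple enough to parse. A toy sketch in Python - the GUC name, the ':true' flag format and the parser itself are all illustrative assumptions from this message, not anything implemented:

```python
def parse_standby_list(value):
    """Parse a hypothetical 'name[:required]' list such as
    'foo:true,bar:true,baz' into a {name: must_ack} dict.
    Entries without a ':true' flag are semi-sync: any one of
    them replying first is enough for the master to return."""
    standbys = {}
    for entry in value.split(','):
        name, _, flag = entry.strip().partition(':')
        standbys[name] = flag.strip().lower() == 'true'
    return standbys
```

Under this reading, foo and bar must both acknowledge before commit returns, while baz only takes part in the first-past-the-post race.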
Stefan Kaltenbrunner <stefan@kaltenbrunner.cc> writes: > well you keep saying that but to be honest I cannot really even see a > usecase for me - what is "only a random one of a set of servers is sync at > any time and I don't really know which one".

It looks easy enough to get to know which one it is. Surely the primary knows and could update something visible through a system view for users? This has been asked for before and I was thinking there was a consensus on this.

Regards, -- Dimitri Fontaine http://2ndQuadrant.fr PostgreSQL : Expertise, Formation et Support
On Sat, 2011-01-01 at 05:13 -0800, Jeff Janes wrote: > On 12/31/10, Simon Riggs <simon@2ndquadrant.com> wrote: > > On Fri, 2010-12-31 at 09:27 +0100, Stefan Kaltenbrunner wrote: > > > >> Maybe it has been discussed but I still don't see way it makes any > >> sense. If I declare a standby a sync standby I better want it sync - not > >> "maybe sync". consider the case of a 1 master and two identical sync > >> standbys - one sync standby is in the same datacenter the other is in a > >> backup location say 15km away. > >> Given there is a small constant latency to the second box (even if you > >> have fast networks) the end effect is that the second standby will NEVER > >> be sync (because the local one will always be faster) and you end up > >> with an async slave that cannot be used per your business rules? > > > > Your picture above is a common misconception. I will add something to > > the docs to explain this. > > > > 1. "sync" is a guarantee about how we respond to the client when we > > commit. If we wait for more than one response that slows things down, > > makes the cluster more fragile, complicates the code and doesn't > > appreciably improve the guarantee. > > Whether it is more fragile depends on if you look at up-time fragility > or durability fragility. I think it can appreciably improve the > guarantee. Yes, agreed. That is why I proposed quorum commit earlier in 2010, as a way to improve the durability guarantee. That was bogged down by the requirement for named servers, which I see as unnecessary. > > 2. "sync" does not guarantee that the updates to the standbys are in any > > way coordinated. You can run a query on one standby and get one answer > > and at the exact same time run the same query on another standby and get > > a different answer (slightly ahead/behind). That also means that if the > > master crashes one of the servers will be ahead or behind. You can use > > pg_last_xlog_receive_location() to check which one that is. 
> > If at least one of the standbys is in the same smoking crater as the > primary, then pg_last_xlog_receive_location on it is unlikely to > respond. > > The guarantee goes away precisely when it is needed.

Fairly obviously, I would not be advocating anything that forced you to use a server in the "same smoking crater". I can't see any guarantee that goes away precisely when it is needed. Perhaps you could explain the issue you see, because your comments seem unrelated to my point above.

-- Simon Riggs http://www.2ndQuadrant.com/books/ PostgreSQL Development, 24x7 Support, Training and Services
On 01/01/2011 05:28 PM, Dimitri Fontaine wrote: > Stefan Kaltenbrunner<stefan@kaltenbrunner.cc> writes: >> well you keep saying that but to be honest I cannot really even see a >> usecase for me - what is "only a random one of a set of servers is sync at >> any time and I don't really know which one". > > It looks easy enough to get to know which one it is. Surely the primary > knows and could update something visible through a system view for > users? This as been asked for before and I was thinking there was a > consensus on this.

well, as Jeff Janes already said - anything that requires the master to still exist is not useful in a disaster. Consider the now often mentioned two-sync-standby scenario, with one standby in the same location as the master and one in a secondary location. If you have a disaster (fire, water, explosion, admin fail, ...) at the primary location and you have no access to either the master or the local standby, you will never be sure that the standby at the secondary location is actually "in sync" - it could be, but you will never know whether you lost that 1B$ invoice just committed on the master and the nearby standby and therefore confirmed to the client...

Most of my use cases have very hard requirements on the integrity of the data, very high requirements on read-only availability, and somewhat high requirements on the availability of a master for writes - but data integrity will always trump that.

Stefan
On Sat, 2011-01-01 at 16:12 +0100, Stefan Kaltenbrunner wrote: > I still would like to get a statement on why simon thinks that > the design heikki and others have proposed

I've explained in huge detail why I think what I think, and have not avoided any technical issue.

It appears to me there has been substantial confusion over alternatives, because of a misunderstanding about how synchronisation works. Requiring confirmation that standbys are in sync is *not* the same thing as them actually being in sync. Every single proposal made by anybody here on hackers that supports multiple standby servers suffers from the same issue: when the primary crashes you need to work out which standby server is ahead.

> > - The docs should not allege that either setup is preferable to the > > other, because there is not now and will never be consensus that this > > is in fact true.

I remain hopeful that people will read what I have written and understand it. Having taken the trouble to do that publicly, my conscience is clear that I've done the very best to explain things and make it easy for users to avoid error. If I am prevented from putting sound advice into the docs, I'll not worry too much.

> well I should think we need to clearly spell out everything that affects > reliability and if we only support semi-sync for more than 1 standby we > have only that setup :)

You can use sync rep with 1 or more standby servers.

At the end of the day, I can't stop anyone from saying "What an idiot, he designed something that gave the same durability and availability as Oracle and MySQL do, yet with additional performance management features".

-- Simon Riggs http://www.2ndQuadrant.com/books/ PostgreSQL Development, 24x7 Support, Training and Services
On Sat, 2011-01-01 at 17:37 +0100, Stefan Kaltenbrunner wrote: > On 01/01/2011 05:28 PM, Dimitri Fontaine wrote: > > Stefan Kaltenbrunner<stefan@kaltenbrunner.cc> writes: > >> well you keep saying that but to be honest I cannot really even see a > >> usecase for me - what is "only a random one of a set of servers is sync at > >> any time and I don't really know which one". > > > > It looks easy enough to get to know which one it is. Surely the primary > > knows and could update something visible through a system view for > > users? This as been asked for before and I was thinking there was a > > consensus on this. > > well as jeff janes already said - anything that requires the master to > still exist is not useful for a desaster.

Nobody has suggested that the master needs to still exist after a disaster.

> Consider the now often > mentioned 2 sync standby scenario with one standby in the same location > and one in a secondary location. > If you have a desaster(fire,water,explosion,admin fail,...) at the > primary location and you have no access to either the master or the > standby you will never be sure that the standby on the secondary > location is actually "in sync" - it could be but you will never know if > you lost that 1B$ invoice just commited on the master and the closeby > standby and therefor confirmed to the client...

I've never suggested you configure your systems like that. It would of course be stupid. This is not a sensible technical discussion. I'll go back to coding.

-- Simon Riggs http://www.2ndQuadrant.com/books/ PostgreSQL Development, 24x7 Support, Training and Services
On 01/01/2011 05:55 PM, Simon Riggs wrote: > On Sat, 2011-01-01 at 16:12 +0100, Stefan Kaltenbrunner wrote: > >> I still would like to get a statement on why simon thinks that >> the design heikki and others have proposed > > I've explained in huge detail why I think what I think, nor avoided any > technical issue. > > It appears to me there has been substantial confusion over alternatives, > because of a misunderstanding about how synchronisation works. Requiring > confirmation that standbys are in sync is *not* the same thing as them > actually being in sync. Every single proposal made by anybody here on > hackers that supports multiple standby servers suffers from the same > issue: when the primary crashes you need to work out which standby > server is ahead.

aaah, that was exactly what I was after - so the problem is that when you have a sync standby it will technically always be "in front" of the master (because it needs to fsync/apply/whatever before the master does). In the end the question boils down to what is "the bigger problem" in the case of a lost master:

a) a transaction that was confirmed on the master but might not be on any of the surviving sync standbys (or you will never know if it is) - this is how I understand the proposal so far

b) a transaction that was not yet confirmed on the master but might have been applied on the surviving standby before the disaster - this is what I understand "confirm from all sync standbys" could result in.

Spelling that out more clearly makes me reconsider a bit what I said before, but I still wonder whether ultimately we will have to provide both modes to satisfy different business requirements ((a) might provide the more accurate answer on average, but (b) might at least provide a way to identify the "wild" transaction by looking at additional data).

>> - The docs should not allege that either setup is preferable to the >>> other, because there is not now and will never be consensus that this >>> is in fact true.
> > I remain hopeful that people will read what I have read and understand > it. Having taken the trouble to do that publicly, my conscious is clear > that I've done the very best to explain things and make it easy for > users to avoid error. If I am prevented from putting sound advice into > the docs, I'll not worry too much. > > >> well I should think we need to clearly spell out everything that affects >> reliability and if we only support semi-sync for more than 1 standby we >> have only that setup :) > > You can use sync rep with 1 or more standby servers. > > At the end of the day, I can't stop anyone from saying "What an idiot, > he designed something that gave the same durability and availability as > Oracle and MySQL do, yet with additional performance management > features". ok Stefan
On Sat, 2011-01-01 at 18:13 +0100, Stefan Kaltenbrunner wrote: > On 01/01/2011 05:55 PM, Simon Riggs wrote: > > > > It appears to me there has been substantial confusion over alternatives, > > because of a misunderstanding about how synchronisation works. Requiring > > confirmation that standbys are in sync is *not* the same thing as them > > actually being in sync. Every single proposal made by anybody here on > > hackers that supports multiple standby servers suffers from the same > > issue: when the primary crashes you need to work out which standby > > server is ahead.

> aaah that was exactly what I was after - so the problem is that when you > have a sync standby it will technically always be "in front" of the > master (because it needs to fsync/apply/whatever before the master). > In the end the question boils down to what is "the bigger problem" in > the case of a lost master: > a) a transaction that was confirmed on the master but might not be on > any of the surviving sync standbys (or you will never know if it is) - > this is how I understand the proposal so far

No, that cannot happen. The current situation is that we fsync WAL on the master, then fsync WAL on the standby, and then the standby replies to the master. The standby is never ahead of the master, at any point.

> b) a transaction that was not yet confirmed on the master but might have > been applied on the surving standby before the desaster - this is what I > understand "confirm from all sync standbys" could result in.

Yes, that is described in the docs changes I published.

(a) was discussed, but ruled out, since it would require any crash/immediate shutdown of the master to become a failover, or some kind of weird back channel to give the missing data back.

There hasn't been any difference of opinion in this area, that I am aware of. All proposals have offered (b).

-- Simon Riggs http://www.2ndQuadrant.com/books/ PostgreSQL Development, 24x7 Support, Training and Services
On Sat, 2011-01-01 at 17:28 +0100, Dimitri Fontaine wrote: > something visible through a system view for > users? This as been asked for before and I was thinking there was a > consensus on this.

Yes, it will be there.

-- Simon Riggs http://www.2ndQuadrant.com/books/ PostgreSQL Development, 24x7 Support, Training and Services
On 01/01/2011 06:29 PM, Simon Riggs wrote: > On Sat, 2011-01-01 at 18:13 +0100, Stefan Kaltenbrunner wrote: >> On 01/01/2011 05:55 PM, Simon Riggs wrote: >>> >>> It appears to me there has been substantial confusion over alternatives, >>> because of a misunderstanding about how synchronisation works. Requiring >>> confirmation that standbys are in sync is *not* the same thing as them >>> actually being in sync. Every single proposal made by anybody here on >>> hackers that supports multiple standby servers suffers from the same >>> issue: when the primary crashes you need to work out which standby >>> server is ahead. >> >> aaah that was exactly what I was after - so the problem is that when you >> have a sync standby it will technically always be "in front" of the >> master (because it needs to fsync/apply/whatever before the master). >> In the end the question boils down to what is "the bigger problem" in >> the case of a lost master: > >> a) a transaction that was confirmed on the master but might not be on >> any of the surviving sync standbys (or you will never know if it is) - >> this is how I understand the proposal so far > > No that cannot happen, the current situation is that we will fsync WAL > on the master, then fsync WAL on the standby, then reply to the master. > The standby is never ahead of the master, at any point.

hmm, maybe my "surviving" standbys (the case I'm wondering about is whole-datacenter failures, which might take out more than just the master) was not clear - consider three boxes, one master and two standbys, and semi-sync replication (i.e. any one of the standbys is enough to reply):

1. master fsyncs WAL
2. standby #1 fsyncs and replies
3. master confirms the commit
4. disaster strikes and destroys the master and standby #1, while standby #2 never had time to apply the change (IO/CPU load, latency, whatever)
5.
now you have a sync standby that is missing something that was committed on the master and confirmed to the client, and no way to verify that this happened (same problem with more than two standbys - as long as you lose ONE standby and the master at the same time, you will never be sure).

what is it that I'm missing here?

> >> b) a transaction that was not yet confirmed on the master but might have >> been applied on the surving standby before the desaster - this is what I >> understand "confirm from all sync standbys" could result in. > > Yes, that is described in the docs changes I published. > > (a) was discussed, but ruled out, since it would require any crash/immed > shutdown of the master to become a failover, or have some kind of weird > back channel to give the missing data back. > > There hasn't been any difference of opinion in this area, that I am > aware of. All proposals have offered (b).

hmm, I'm confused now - any chance you mixed up (a) and (b) here? Because in (a) no back channel is needed: the standby could just fetch the missing data from the master. If that is the case, I agree that it would be hard to get replication up again after a crash of the master with a standby that is ahead, but in the end it would be a business decision (as in conflict resolution) what to do - take the "ahead" standby's data and use it, or destroy the old standby and recreate it.

Stefan
On Sat, 2011-01-01 at 18:49 +0100, Stefan Kaltenbrunner wrote: > hmm maybe my "surviving" standbys(the case I'm wondering about is > whole > datacenter failures which might take out more than just the master) > was > not clear - consider three boxes, one master and two standby and > semisync replication(ie any one of the standbys is enough to reply). > > 1. master fsyncs wal > 2. standby #1 fsyncs and replies > 3. master confirms commit > 4. desaster strikes and destroys master and standby #1 while standby > m2 > never had time to apply the change(IO/CPU load, latency, whatever) > 5. now you have a sync standby that is missing something that was > commited on the master and confirmed to the client and no way to > verify > that this thing happened (same problem with more than two standbys - > as > long as you lose ONE standby and the master at the same time you will > never be sure)

This is an obvious misconfiguration that anybody with HA experience would spot immediately. If you have local standbys, then you should mark them as not available for sync rep, as described in the docs I've written.

> what is it that I'm missing here?

The fact that we've discussed this already and agreed to do 9.1 with quorum_commit = 1. I proposed making this a parameter; other solutions were also proposed, but it was considered too complex for this release.

This is a trade-off between availability and data guarantees. MySQL and Oracle "suffer" from exactly this "problem". DB2 supports only one master, and SQL Server doesn't have sync rep at all.

-- Simon Riggs http://www.2ndQuadrant.com/books/ PostgreSQL Development, 24x7 Support, Training and Services
On 01.01.2011 19:03, Simon Riggs wrote: > On Sat, 2011-01-01 at 17:37 +0100, Stefan Kaltenbrunner wrote: >> On 01/01/2011 05:28 PM, Dimitri Fontaine wrote: >>> Stefan Kaltenbrunner<stefan@kaltenbrunner.cc> writes: >>>> well you keep saying that but to be honest I cannot really even see a >>>> usecase for me - what is "only a random one of a set of servers is sync at >>>> any time and I don't really know which one". >>> >>> It looks easy enough to get to know which one it is. Surely the primary >>> knows and could update something visible through a system view for >>> users? This as been asked for before and I was thinking there was a >>> consensus on this. >> >> well as jeff janes already said - anything that requires the master to >> still exist is not useful for a desaster. > > Nobody has suggested that the master needs to still exist after a > disaster. Dimitri just did, see above. I agree it's not very useful. I don't think there's any other solution to knowing which standby is ahead than connect to both standbys and ask how far each is. I don't see a problem with that, whatever middleware handles the failover and STONITH etc. should be able to do that too. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
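[Editorial sketch] Heikki's suggestion - have the failover middleware connect to each surviving standby and compare how far each has received - could look roughly like this. The helper names are invented, and it assumes the results of pg_last_xlog_receive_location() (LSN strings such as '0/3000060') have already been collected from each standby:

```python
def lsn_to_int(lsn):
    """Convert an LSN string of the form 'X/Y' (two hex numbers:
    the high and low 32 bits of the WAL location) to an integer."""
    hi, lo = lsn.split('/')
    return (int(hi, 16) << 32) | int(lo, 16)

def furthest_ahead(receive_locations):
    """Given {standby_name: LSN string}, return the name of the
    standby that has received the most WAL - the one to promote."""
    return max(receive_locations,
               key=lambda name: lsn_to_int(receive_locations[name]))
```

The same comparison is what any STONITH/failover script would do before picking a promotion target.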
On 31.12.2010 23:18, Hannu Krosing wrote: > On 31.12.2010 13:40, Heikki Linnakangas wrote: >> That thread makes no mention of how to specify which standbys are >> synchronous and which are not. > The simplest way would be to have separate database users for sync and > async standbys ? > > That would allow any standby with right credentials act as a sync user, > and those who are not eligible are not accepted even if they try to act > as "a synchronity (?) provider". Hmm, access control... We haven't yet discussed what privileges a standby needs to become synchronous. Perhaps it needs to be a separate privilege that can be granted, in addition to the replication privilege? Robert's suggestion of using the roles instead of server names would also solve that. With that you would list the roles in synchronous_standbys, and no-one else could become a synchronous standby. The downside is that if you want to have two standbys in the mode that it's enough that either one acknowledges a commit, they would have to use the same user account. If we don't adopt Robert's suggestion, do we want to restrict what standby name a user can claim, to stop one standby from spoofing another? -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
On 1/1/11 5:59 AM, Stefan Kaltenbrunner wrote: > well you keep saying that but to be honest I cannot really even see a > usecase for me - what is "only a random one of a set of servers is sync > at any time and I don't really know which one". > My usecases would al involved 2 sync standbys and 1 or more async ones. > but the second sync one would be in a different datacenter and I NEED to > protect against a datacenter failure which your proposals says I cannot > do :(

As far as I know, *nobody* has written the bookkeeping code to actually track which standbys have ack'd. We need to get single-ack synch standby merged, tested and working before we add anything as complicated as "each standby on this list must ack". That means that it's extremely unlikely for 9.1 at this point. Frankly, if Simon hadn't already submitted code, I'd be pushing for single-standby-only for 9.1, instead of "any one".

Standby in general deals with the A,D,R triangle (Availability, Durability, Response time). "Any one" configuration is the A,R configuration, and the only reason to go out with it for 9.1 is because it's simpler to implement than the D,R configuration (all standbys must ack).

> Hmm, access control... We haven't yet discussed what privileges a > standby needs to become synchronous. Perhaps it needs to be a separate > privilege that can be granted, in addition to the replication privilege?

No, I don't think so. An additional priv would just complicate life for DBAs without providing any real benefit. You'd be guarding against the very narrow hypothetical case where there's a server admin with limited privs on the master, and authorization to create async standbys, but not the authorization to create a synch standby. How likely is that to *ever* happen?

> Robert's suggestion of using the roles instead of server names would > also solve that. With that you would list the roles in > synchronous_standbys, and no-one else could become a synchronous > standby.
> The downside is that if you want to have two standbys in the > mode that it's enough that either one acknowledges a commit, they would > have to use the same user account.

I really don't think that Robert was suggesting that we have predetermined Roles with "magic names" like synchronous_standbys (were you, Robert?). That would defeat eventually having the feature which people like Stefan want: the ability to define pools of servers with custom names to represent various data centers. It also overloads and "perverts" the concept of Roles.

While I quite like the idea of having Roles for replication, synch/async should be a property (CREATE ROLE seattle_dac NOLOGIN REPLICATION SYNCHRONOUS*), not a special role.

Also, I'll repeat: I see this as 9.2 work, not 9.1 work. There's going to be far too much bookkeeping code to write to make it happen without severely delaying 9.1.

BTW, I don't see *any* of this requiring us to have a synchronous_standbys = "list" GUC if we embrace the Roles concept.

--Josh

( here's how I can see Roles working:

1) Create a group role for the synch standbys (NOLOGIN REPLICATION)
2) Create one or more replication roles which are members of that group role.
3) File-synch those standbys and get them replicating (asynchronously)
4) ALTER ROLE synch_standbys SYNCHRONOUS
5) The postmaster scans the list of Roles for synchronous roles. For each synchronous role, there must be one replication ack in order to commit.

Since the above means that it would be possible to have a replication connection which was a member of more than one synch group, you can see that the bookkeeping involved will be substantial. So, 9.2. )
On Sat, 2011-01-01 at 21:41 +0200, Heikki Linnakangas wrote: > On 31.12.2010 23:18, Hannu Krosing wrote: > > On 31.12.2010 13:40, Heikki Linnakangas wrote: > >> That thread makes no mention of how to specify which standbys are > >> synchronous and which are not. > > The simplest way would be to have separate database users for sync and > > async standbys ? > > > > That would allow any standby with right credentials act as a sync user, > > and those who are not eligible are not accepted even if they try to act > > as "a synchronity (?) provider". > > Hmm, access control... We haven't yet discussed what privileges a > standby needs to become synchronous. Perhaps it needs to be a separate > privilege that can be granted, in addition to the replication privilege?

Perhaps we don't need it, also. Why is that essential in this release?

-- Simon Riggs http://www.2ndQuadrant.com/books/ PostgreSQL Development, 24x7 Support, Training and Services
On Sat, 2011-01-01 at 14:40 -0800, Josh Berkus wrote: > Standby in general deals with the A,D,R triangle (Availability, > Durability, Response time). "Any one" configuration is the A,R > configuration, and the only reason to go out with it for 9.1 is > because it's simpler to implement than the D,R configuration (all > standbys must ack).

Nicely put. Not the "only reason" though...

As I showed earlier, the AR gives you 99.999% availability and the DR gives you 94% availability, considering a 3 server config. If you add more servers, the availability of the DR option gets much worse, very quickly.

The performance of AR is much better also, and stays same or better as cluster size increases. DR choice makes performance degrade as cluster size increases, since it works at the speed of the slowest node.

The fact that I can get "5 Nines" with simpler code makes it even sweeter. quorum commit > 1 can improve the durability guarantee of data, but it also seems that many people wouldn't understand its implications and would mis-configure it, to our collective embarrassment.

-- Simon Riggs http://www.2ndQuadrant.com/books/ PostgreSQL Development, 24x7 Support, Training and Services
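[Editorial sketch] A back-of-the-envelope check of the figures in this message, assuming (as earlier in the thread) 98% availability per server and independent failures - both assumptions, not measurements:

```python
p = 0.98  # assumed availability of any single server

# "AR" (any one standby may ack, and failover keeps the cluster up):
# with 3 servers, service is lost only if all three are down at once.
ar = 1 - (1 - p) ** 3
print(f"{ar:.6%}")  # ~99.9992%

# "DR" (all standbys must ack before commit completes): a commit can
# only complete while the master and both standbys are all up.
dr = p ** 3
print(f"{dr:.4%}")  # ~94.12%
```

Robert's reply further down reaches lower numbers for the AR case because he additionally discounts the scenario where a lone surviving standby cannot safely be promoted.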
On Sat, Jan 1, 2011 at 6:08 PM, Simon Riggs <simon@2ndquadrant.com> wrote: > On Sat, 2011-01-01 at 14:40 -0800, Josh Berkus wrote: > >> Standby in general deals with the A,D,R triangle (Availability, >> Durability, Response time). "Any one" configuration is the A,R >> configuration, and the only reason to go out with it for 9.1 is >> because it's simpler to implement than the D,R configuration (all >> standbys must ack). > > Nicely put. Not the "only reason" though... > > As I showed earlier, the AR gives you 99.999% availability and the DR > gives you 94% availability, considering a 3 server config. If you add > more servers, the availability of the DR option gets much worse, very > quickly. > > The performance of AR is much better also, and stays same or better as > cluster size increases. DR choice makes performance degrade as cluster > size increases, since it works at the speed of the slowest node.

I'm all for getting first-past-post in for 9.1. Otherwise I fear we'll get nothing.

Stephen and I will only be able to use 1 sync slave, the "DR-site" one. That's fine. I can live with it, and make my local slave be async. Or replicate the FS/block under WAL. I can monitor the **** out of it, and unless it goes "down", it should easily be able to keep up with the remote sync one behind a slower WAN link.

And I think both Stephen and I understand your availability math. We're not disputing that first-past-post gives both better query availability and cluster-scale performance.

But when the primary datacenter servers are dust in the crater (or boats in the flood, or ash in the fire), I either keep my job, or I don't. And that depends on whether there is a chance I (my database system) confirmed a transaction that I can't recover.

So sync rep with first-past-post already makes my job easier. I'll take it over nothing ;-)

a. -- Aidan Van Dyk Create like a god, aidan@highrise.ca command like a king, http://www.highrise.ca/ work like a slave.
On Sat, Jan 1, 2011 at 6:54 AM, Simon Riggs <simon@2ndquadrant.com> wrote: > Yes, working out the math is a good idea. Things are much clearer if we > do that. > > Let's assume we have 98% availability on any single server. > > 1. Having one primary and 2 standbys, either of which can acknowledge, > and we never lock up if both standbys fail, then we will have 99.9992% > server availability. (So PostgreSQL hits "5 Nines", with data > guarantees). ("Maximised availability") I don't agree with this math. If the master and one standby fail simultaneously, the other standby is useless, because it may or may not be caught up with the master. You know that the last transaction acknowledged as committed by the master is on at least one of the two standbys, but you don't know which one, and so you can't safely promote the surviving standby. (If you are working in an environment where promoting the surviving standby when it's possibly not caught up is OK, then you don't need sync rep in the first place: you can just run async rep and get much better performance.) So the availability is 98% (you are up when the master is up) + 98%^2 * 2% (you are up when both slaves are up and the master is down) = 99.92%. If you had only a single standby, then you could be certain that any commit acknowledged by the master was on that standby. Thus your availability would be 98% (up when master is up) + 98% * 2% (you are up when the master is down and the slave is up) = 99.96%. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
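[Editorial sketch] Robert's arithmetic can be checked directly under his stated assumptions: 98% per-server availability, independent failures, and the rule that a lone surviving standby cannot safely be promoted because you cannot tell whether it was the caught-up one:

```python
p = 0.98      # assumed availability of any single server
q = 1 - p     # probability a given server is down

# Two standbys, "any one acks": usable when the master is up, or when
# the master is down and BOTH standbys survive (so you know the last
# acked commit is on at least one of them and can compare the two).
two_standbys = p + q * p ** 2
print(round(two_standbys, 6))  # 0.999208 -> ~99.92%

# One standby: every acknowledged commit is known to be on it, so it
# is always safe to promote when the master is down.
one_standby = p + q * p
print(round(one_standby, 6))   # 0.9996 -> 99.96%
```

The single-standby configuration comes out ahead precisely because its survivor is always known to be caught up.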
On 02.01.2011 00:40, Josh Berkus wrote: > On 1/1/11 5:59 AM, Stefan Kaltenbrunner wrote: >> well you keep saying that but to be honest I cannot really even see a >> usecase for me - what is "only a random one of a set of servers is sync >> at any time and I don't really know which one". >> My usecases would al involved 2 sync standbys and 1 or more async ones. >> but the second sync one would be in a different datacenter and I NEED to >> protect against a datacenter failure which your proposals says I cannot >> do :( > > As far as I know, *nobody* has written the bookkeeping code to actually > track which standbys have ack'd. We need to get single-ack synch > standby merged, tested and working before we add anything as complicated > as "each standby on this list must ack". That means that it's extremely > unlikely for 9.1 at this point. The bookkeeping will presumably consist of an XLogRecPtr in shared memory for each standby, tracking how far the standby has acknowledged. At commit, you scan the standby slots in shared memory and check that the required standbys have acknowledged your commit record. The bookkeeping required is the same whether or not we support a list of standbys that must ack or just one. > Frankly, if Simon hadn't already submitted code, I'd be pushing for > single-standby-only for 9.1, instead of "any one". Yes, we are awfully late, but let's not panic. BTW, there's a bunch of replication related stuff that we should work to close, that are IMHO more important than synchronous replication. Like making the standby follow timeline changes, to make failovers smoother, and the facility to stream a base-backup over the wire. I wish someone worked on those... >> Hmm, access control... We haven't yet discussed what privileges a >> standby needs to become synchronous. Perhaps it needs to be a separate >> privilege that can be granted, in addition to the replication privilege? > > No, I don't think so. 
An additional priv would just complicate life for > DBAs without providing any real benefit. You'd be guarding against the > very narrow hypothetical case where there's a server admin with limited > privs on the master, and authorization to create async standbies, but > not the authorization to create a synch standby. How likely is that to > *ever* happen? Very likely. A synchronous standby can bring the master to a halt, while an asynchronous one is rather harmless. If I were a DBA, and the data wasn't very sensitive, I would liberally hand out async privileges to my colleagues to set up reporting standbys, test servers etc. But I would *not* give them synchronous privileges, because sooner or later one would go "hmm, I wonder what happens if I make this synchronous", or haphazardly copy the config file from a synchronous standby. That would either bring down the master, or act as a "fake" standby, acknowledging commits before they're flushed to the real synchronous standby. Either one would be bad. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
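[Editorial note: the bookkeeping Heikki sketches two messages up amounts to one acknowledged-LSN slot per standby plus a scan at commit. A toy illustration follows, with a plain Python dict standing in for the shared-memory XLogRecPtr array; the function names are invented, not PostgreSQL's.]

```python
ack_lsn = {}  # standby name -> highest WAL position it has acknowledged

def record_ack(standby, lsn):
    # Called when a standby reports progress; keep the furthest position.
    ack_lsn[standby] = max(ack_lsn.get(standby, 0), lsn)

def commit_released_any(commit_lsn, standbys):
    # "first-past-the-post": one ack at or past the commit record is enough.
    return any(ack_lsn.get(s, 0) >= commit_lsn for s in standbys)

def commit_released_all(commit_lsn, standbys):
    # "wait-for-all-to-ack": every listed standby must have reached it.
    return all(ack_lsn.get(s, 0) >= commit_lsn for s in standbys)
```

As Heikki notes, the scan has the same shape either way; only the quantifier differs, which is why supporting a required-standby list costs no extra bookkeeping.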
On 01/02/2011 09:35 AM, Heikki Linnakangas wrote: > On 02.01.2011 00:40, Josh Berkus wrote: >> On 1/1/11 5:59 AM, Stefan Kaltenbrunner wrote: >>> well you keep saying that but to be honest I cannot really even see a >>> usecase for me - what is "only a random one of a set of servers is sync >>> at any time and I don't really know which one". >>> My usecases would al involved 2 sync standbys and 1 or more async ones. >>> but the second sync one would be in a different datacenter and I NEED to >>> protect against a datacenter failure which your proposals says I cannot >>> do :( >> >> As far as I know, *nobody* has written the bookkeeping code to actually >> track which standbys have ack'd. We need to get single-ack synch >> standby merged, tested and working before we add anything as complicated >> as "each standby on this list must ack". That means that it's extremely >> unlikely for 9.1 at this point. > > The bookkeeping will presumably consist of an XLogRecPtr in shared > memory for each standby, tracking how far the standby has acknowledged. > At commit, you scan the standby slots in shared memory and check that > the required standbys have acknowledged your commit record. The > bookkeeping required is the same whether or not we support a list of > standbys that must ack or just one. > >> Frankly, if Simon hadn't already submitted code, I'd be pushing for >> single-standby-only for 9.1, instead of "any one". > > Yes, we are awfully late, but let's not panic. > > BTW, there's a bunch of replication related stuff that we should work to > close, that are IMHO more important than synchronous replication. Like > making the standby follow timeline changes, to make failovers smoother, > and the facility to stream a base-backup over the wire. I wish someone > worked on those... yeah I agree that those two are much more of a problem for the general user base. 
Whatever people think about our current system - it is very easy to configure (in terms of knobs to toggle) but extremely hard to get set up and deal with during failovers (and I know nobody who got it right the first few times or has not fucked up one thing in the process). Syncrep is important, but I would argue that getting those two fixed is even more so ;) Stefan
On Sun, 2011-01-02 at 10:35 +0200, Heikki Linnakangas wrote: > BTW, there's a bunch of replication related stuff that we should work > to close, that are IMHO more important than synchronous replication. > Like making the standby follow timeline changes, to make failovers > smoother, and the facility to stream a base-backup over the wire. I > wish someone worked on those... Hopefully, you'll be allowed to work on those, if they are more important? -- Simon Riggs http://www.2ndQuadrant.com/books/ PostgreSQL Development, 24x7 Support, Training and Services
On Sat, 2011-01-01 at 22:11 -0500, Aidan Van Dyk wrote: > On Sat, Jan 1, 2011 at 6:08 PM, Simon Riggs <simon@2ndquadrant.com> wrote: > > On Sat, 2011-01-01 at 14:40 -0800, Josh Berkus wrote: > > > >> Standby in general deals with the A,D,R triangle (Availability, > >> Durability, Response time). "Any one" configuration is the A,R > >> configuration, and the only reason to go out with it for 9.1 is > >> because it's simpler to implement than the D,R configuration (all > >> standbys must ack). > > > > Nicely put. Not the "only reason" though... > > > > As I showed earlier, the AR gives you 99.999% availability and the DR > > gives you 94% availability, considering a 3 server config. If you add > > more servers, the availability of the DR option gets much worse, very > > quickly. > > > > The performance of AR is much better also, and stays same or better as > > cluster size increases. DR choice makes performance degrade as cluster > > size increases, since it works at the speed of the slowest node. > > I'm all for getting first-past-post in for 9.1. Otherwise I fear > we'll get nothing. > > Stephen and I will only be able to use 1 sync slave, the "DR-site" > one. No, the AR and DR options are identical with just one sync standby. You've been requesting the DR option with 2 standbys, which is what gives you 94% availability. > That's fine. I can live with it, and make my local slave be > async. Or replicate the FS/block under WAL. I can monitor the **** > out of it, and unless it goes "down", it should easily be able to keep > up with the remote sync one behind a slower WAN link. > > And I think both Stephen and I understand your availability math. > We're not arguing that the 1st past post gives both better query > availability and cluster-scale performance. > > But when the primary datacenter servers are dust in the crater (or > boats in the flood, or ash in the fire), I either keep my job, or I > don't. 
And that depends on whether there is a chance I (my database > system) confirmed a transaction that I can't recover. I'm not impressed. You neglect to mention that Oracle and MySQL would put you in exactly the same position. You also neglect to say that if the local standby goes down, you were advocating a design that would take the whole application down. If you actually did what you have been suggesting, and the cluster went down as it inevitably would do, once your colleagues realise that you knowingly configured the cluster to have only 94% availability, you won't have a job anymore, you'll be escorted off the premises while shouting "but while it was down, it lost no data". When that never happens, thank me. There are people that need more durability than availability, but not many. If the database handles high value transactions, they very probably want it to keep on processing high value transactions. You'll have the choice of how to configure it, because of me listening to other people's views and selecting only the ideas that make sense. -- Simon Riggs http://www.2ndQuadrant.com/books/ PostgreSQL Development, 24x7 Support, Training and Services
On Sun, 2011-01-02 at 10:35 +0200, Heikki Linnakangas wrote: > > > Frankly, if Simon hadn't already submitted code, I'd be pushing for > > single-standby-only for 9.1, instead of "any one". > > Yes, we are awfully late, but let's not panic. Yes, we're about a year late. Getting a simple feature like this into the code could have been done in 9.0. We must stop returning to overcomplex features, especially if they aren't backed up with solid analysis of things like server availability and query visibility. -- Simon Riggs http://www.2ndQuadrant.com/books/ PostgreSQL Development, 24x7 Support, Training and Services
On 2.1.2011 5:36, Robert Haas wrote: > On Sat, Jan 1, 2011 at 6:54 AM, Simon Riggs<simon@2ndquadrant.com> wrote: >> Yes, working out the math is a good idea. Things are much clearer if we >> do that. >> >> Let's assume we have 98% availability on any single server. >> >> 1. Having one primary and 2 standbys, either of which can acknowledge, >> and we never lock up if both standbys fail, then we will have 99.9992% >> server availability. (So PostgreSQL hits "5 Nines", with data >> guarantees). ("Maximised availability") > I don't agree with this math. If the master and one standby fail > simultaneously, the other standby is useless, because it may or may > not be caught up with the master. You know that the last transaction > acknowledged as committed by the master is on at least one of the two > standbys, but you don't know which one, and so you can't safely > promote the surviving standby. > (If you are working in an environment where promoting the surviving > standby when it's possibly not caught up is OK, then you don't need > sync rep in the first place: you can just run async rep and get much > better performance.) > So the availability is 98% (you are up when the master is up) + 98%^2 > * 2% (you are up when both slaves are up and the master is down) = > 99.92%. If you had only a single standby, then you could be certain > that any commit acknowledged by the master was on that standby. Thus > your availability would be 98% (up when master is up) + 98% * 2% (you > are up when the master is down and the slave is up) = 99.96%. > OTOH, in the case where you need _all_ the slaves to confirm, any failing slave brings the master down, so adding a slave reduces availability by an extra 2%. The solution to achieving good durability AND availability is requiring N past the post instead of 1 past the post. In this case you can get to 99.9992% availability with a master + 3 sync slaves, 2 of which must ACK. 
--------------------------------------- Hannu Krosing Performance and Infinite Scalability Consultant http://www.2ndQuadrant.com/books/
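[Editorial note: Hannu's N-of-M quorum idea can be explored with ordinary binomial arithmetic. A sketch only; the exact availability figure you arrive at depends on which failure combinations you count as "up", which is precisely the crux of the disagreement above.]

```python
from math import comb

def at_least_k_up(k, m, p):
    """Probability that at least k of m independent servers,
    each up with probability p, are available."""
    return sum(comb(m, i) * p ** i * (1 - p) ** (m - i)
               for i in range(k, m + 1))

p = 0.98
# With a 2-of-3 ack quorum, any acknowledged commit is on at least two of
# the three standbys, so any two survivors include a safe promotion candidate.
print(at_least_k_up(2, 3, p))
```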
Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes: > BTW, there's a bunch of replication related stuff that we should work to > close, that are IMHO more important than synchronous replication. Like > making the standby follow timeline changes, to make failovers smoother, and > the facility to stream a base-backup over the wire. I wish someone worked on > those... So, we've been talking about base backup streaming at conferences and we have a working prototype. We even have a needed piece of it in core now, that's the pg_read_binary_file() function. What we still miss is an overall design and some integration effort. Let's design first. I propose the following new pg_ctl command to initiate the cloning: pg_ctl clone [-D datadir] [-s on|off] [-t filename] "primary_conninfo" As far as user are concerned, that would be the only novelty. Once that command is finished (successfully) they would edit postgresql.conf and start the service as usual. A basic recovery.conf file is created with the given options, standby_mode is driven by -s and defaults to off, and trigger_file defaults to being omitted and is given by -t. Of course the primary_conninfo given on the command line is what ends up into the recovery.conf file. That alone would allow for making base backups for recovery purposes and for standby preparing. To support for this new tool, the simplest would be to just copy what I've been doing in the prototype, that is run a query to get the primary file listing (per tablespace, not done in the prototype) then get their bytea content over the wire. That means there's no further backend support code to write. https://github.com/dimitri/pg_basebackup We could prefer to have a backend function prepare a tar archive and stream it using the COPY protocol, with some compression support, but that's more complex to code now and to parallelize down the road. Regards, -- Dimitri Fontaine http://2ndQuadrant.fr PostgreSQL : Expertise, Formation et Support
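[Editorial note: for concreteness, the recovery.conf that the proposed `pg_ctl clone` would write might look like the output of this hypothetical generator. The option names and defaults are taken from Dimitri's synopsis above; the command itself was a proposal, not a released interface.]

```python
def make_recovery_conf(primary_conninfo, standby_mode="off", trigger_file=None):
    # -s drives standby_mode (default off); -t gives trigger_file
    # (omitted by default); primary_conninfo comes from the command line.
    lines = [f"standby_mode = '{standby_mode}'",
             f"primary_conninfo = '{primary_conninfo}'"]
    if trigger_file is not None:
        lines.append(f"trigger_file = '{trigger_file}'")
    return "\n".join(lines) + "\n"

print(make_recovery_conf("host=primary port=5432", standby_mode="on",
                         trigger_file="/tmp/promote"))
```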
On Sat, 2011-01-01 at 23:36 -0500, Robert Haas wrote: > On Sat, Jan 1, 2011 at 6:54 AM, Simon Riggs <simon@2ndquadrant.com> wrote: > > Yes, working out the math is a good idea. Things are much clearer if we > > do that. > > > > Let's assume we have 98% availability on any single server. > > > > 1. Having one primary and 2 standbys, either of which can acknowledge, > > and we never lock up if both standbys fail, then we will have 99.9992% > > server availability. (So PostgreSQL hits "5 Nines", with data > > guarantees). ("Maximised availability") > > I don't agree with this math. ...(snip by Simon)... 99.96%. OK, so that is at least 99.96%. Cool. The key point here is not (1), but option (4). The approach advocated by Heikki and yourself gives us 94% availability. IMHO that is ridiculous, and I will not accept that as the *only* way forwards, for that reason, no matter who advocates it or how long they keep arguing. I do accept that some wish that as an option. If we are to have a sensible technical debate with an eventual end, you must answer the points placed in front of you, not just sidestep and try to point out problems somewhere else. All analysis must be applied to all options, not just those options advocated by someone else. I've been asking for a failure mode analysis for months and it never comes in full. I'm more than happy to discuss your additional points once we are clear on the 94%, because it is pivotal to everything I've been proposing. -- Simon Riggs http://www.2ndQuadrant.com/books/ PostgreSQL Development, 24x7 Support, Training and Services
Simon Riggs wrote: > On Sat, 2011-01-01 at 23:36 -0500, Robert Haas wrote: >> On Sat, Jan 1, 2011 at 6:54 AM, Simon Riggswrote: >>> Yes, working out the math is a good idea. Things are much clearer >>> if we do that. >>> >>> Let's assume we have 98% availability on any single server. >>> >>> 1. Having one primary and 2 standbys, either of which can >>> acknowledge, and we never lock up if both standbys fail, then we >>> will have 99.9992% server availability. (So PostgreSQL hits "5 >>> Nines", with data guarantees). ("Maximised availability") >> >> I don't agree with this math. ...(snip by Simon)... 99.96%. > > OK, so that is at least 99.96%. Cool. I think you're talking about different metrics, and you're both right. With two servers configured in sync rep your chance of having an available (running) server is 99.9992%. The chance that you know that you have one that is totally up to date, with no lost transactions is 99.9208%. The chance that you *actually* have up-to-date data would be higher, but you'd have no way to be sure. The 99.96% number is your certainty that you have a running server with up-to-date data if only one machine is sync rep. It's a matter of whether your shop needs five nines of availability or the highest probability of not losing data. You get to choose. -Kevin
On Sun, 2011-01-02 at 08:08 -0600, Kevin Grittner wrote: > I think you're talking about different metrics, and you're both > right. With two servers configured in sync rep your chance of having > an available (running) server is 99.9992%. The chance that you know > that you have one that is totally up to date, with no lost > transactions is 99.9208%. The chance that you *actually* have > up-to-date data would be higher, but you'd have no way to be sure. > The 99.96% number is your certainty that you have a running server > with up-to-date data if only one machine is sync rep. > > It's a matter of whether your shop needs five nines of availability > or the highest probability of not losing data. You get to choose. Thanks for those calculations. Do you agree that requiring response from 2 sync standbys, or locking up, gives us 94% server availability, but 99.9992% data durability? And that adding additional async servers would not increase the server availability of that cluster? Now let's look at what happens when we first start a standby: we do the base backup, configure the standby, it connects and then <wham> we cannot process any new transactions until the standby has caught up, which could well be hours on a big database. So if we don't have a processing mode that allows work to continue, how will we ever enable synchronous replication on a 24/7 database? How will we ever allow standbys to catch up if they drop out for a while? We should factor that into the availability calcs as well. -- Simon Riggs http://www.2ndQuadrant.com/books/ PostgreSQL Development, 24x7 Support, Training and Services
On 02.01.2011 14:47, Dimitri Fontaine wrote: > Heikki Linnakangas<heikki.linnakangas@enterprisedb.com> writes: >> BTW, there's a bunch of replication related stuff that we should work to >> close, that are IMHO more important than synchronous replication. Like >> making the standby follow timeline changes, to make failovers smoother, and >> the facility to stream a base-backup over the wire. I wish someone worked on >> those... > > So, we've been talking about base backup streaming at conferences and we > have a working prototype. We even have a needed piece of it in core > now, that's the pg_read_binary_file() function. What we still miss is > an overall design and some integration effort. Let's design first. We even have a rudimentary patch to add the required backend support: http://archives.postgresql.org/message-id/4C80D9B8.2020301@enterprisedb.com That just needs to be polished into shape, and documentation. > I propose the following new pg_ctl command to initiate the cloning: > > pg_ctl clone [-D datadir] [-s on|off] [-t filename] "primary_conninfo" > > As far as user are concerned, that would be the only novelty. Once that > command is finished (successfully) they would edit postgresql.conf and > start the service as usual. A basic recovery.conf file is created with > the given options, standby_mode is driven by -s and defaults to off, and > trigger_file defaults to being omitted and is given by -t. Of course > the primary_conninfo given on the command line is what ends up into the > recovery.conf file. > > That alone would allow for making base backups for recovery purposes and > for standby preparing. +1. Or maybe it would be better make it a separate binary, rather than part of pg_ctl. > To support for this new tool, the simplest would be to just copy what > I've been doing in the prototype, that is run a query to get the primary > file listing (per tablespace, not done in the prototype) then get their > bytea content over the wire. 
That means there's no further backend > support code to write. It would be so much nicer to have something more integrated, like the patch I linked above. Running queries requires connecting to a real database, which means that the user needs to have privileges to do that and you need to know the name of a valid database. Ideally this would all work through a replication connection. I think we should go with that from day one. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
On 02.01.2011 15:41, Simon Riggs wrote: > On Sat, 2011-01-01 at 23:36 -0500, Robert Haas wrote: >> On Sat, Jan 1, 2011 at 6:54 AM, Simon Riggs<simon@2ndquadrant.com> wrote: >>> Yes, working out the math is a good idea. Things are much clearer if we >>> do that. >>> >>> Let's assume we have 98% availability on any single server. >>> >>> 1. Having one primary and 2 standbys, either of which can acknowledge, >>> and we never lock up if both standbys fail, then we will have 99.9992% >>> server availability. (So PostgreSQL hits "5 Nines", with data >>> guarantees). ("Maximised availability") >> >> I don't agree with this math. ...(snip by Simon)... 99.96%. > > OK, so that is at least 99.96%. Cool. > > The key point here is not (1), but option (4). > > The approach advocated by Heikki and yourself gives us 94% availability. > IMHO that is ridiculous, and I will not accept that as the *only* way > forwards, for that reason, whoever advocates it or for how long they > keep arguing. I do accept that some wish that as an option. No-one is suggesting that to be the only option. The "wait-for-all-to-ack" looks a lot less ridiculous if you also configure a timeout and don't wait for disconnected standbys. I'm not sure what the point of such a timeout in general is, but people have requested that. Also, setting synchronous_standbys="room1, room2" doesn't necessarily mean that you have just two standby servers, room1 and room2 might both represent a group of servers. I believe we all agree that there's different use cases that require different setups. Both "first-past-the-post" and "wait-for-all-to-ack" have their uses. There's no point in arguing over which is better. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
Simon Riggs wrote: > Do you agree that requiring response from 2 sync standbys, or > locking up, gives us 94% server availability, but 99.9992% data > durability? I'm not sure how to answer that. The calculations so far have been based around up-time and the probabilities that you have a machine up at any moment and whether you can have confidence that if you do, you have all committed transactions represented. There's been an implied assumption that the down time is unplanned, but not much else. The above question seems to me to get into too many implied assumptions to feel safe throwing out a number without pinning those down a whole lot better. If, for example, that 2% downtime always means the machine irretrievably went up in smoke, hitting unavailable means things are unrecoverable. That's probably not the best assumption (at least outside of a combat zone), but what is? -Kevin
On Sun, 2011-01-02 at 11:11 -0600, Kevin Grittner wrote: > Simon Riggs wrote: > > > Do you agree that requiring response from 2 sync standbys, or > > locking up, gives us 94% server availability, but 99.9992% data > > durability? > > I'm not sure how to answer that. The calculations so far have been > based around up-time and the probabilities that you have a machine up > at any moment and whether you can have confidence that if you do, you > have all committed transactions represented. There's been an implied > assumption that the down time is unplanned, but not much else. The > above question seems to me to get into too many implied assumptions > to feel safe throwing out a number without pinning those down a whole > lot better. If, for example, that 2% downtime always means the > machine irretrievably went up in smoke, hitting unavailable means > things are unrecoverable. That's probably not the best assumption > (at least outside of a combat zone), but what is? Not really relevant. There's no room at all for downtime of any kind in a situation where all servers must be available. -- Simon Riggs http://www.2ndQuadrant.com/books/ PostgreSQL Development, 24x7 Support, Training and Services
Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes: > http://archives.postgresql.org/message-id/4C80D9B8.2020301@enterprisedb.com > > That just needs to be polished into shape, and documentation. Wow, cool! I don't know how but I've missed it. > +1. Or maybe it would be better make it a separate binary, rather than part > of pg_ctl. Well the thinking was that nowadays we support initdb from pg_ctl, and this is another kind of initdb, really. > I linked above. Running queries requires connecting to a real database, > which means that the user needs to have privileges to do that and you need > to know the name of a valid database. Ideally this would all work through a > replication connection. I think we should go with that from day one. I didn't think about the "connecting to a real database" part of it, versus using a dedicated REPLICATION connection/protocol, and to be honest, I feared it was too much work. Seeing that you already did it, though, +1. Regards, -- Dimitri Fontaine http://2ndQuadrant.fr PostgreSQL : Expertise, Formation et Support
On Thu, Dec 30, 2010 at 9:02 PM, Robert Haas <robertmhaas@gmail.com> wrote: > <reads MySQL documentation> > > I see now that you've tried to design this feature in a way that is > similar to MySQL's offering, which does have some value. But it > appears to me that the documentation you've written here is > substantially similar to the MySQL 5.5 reference documentation. That > could get us into a world of legal trouble - that documentation is not > even open source, let alone BSD. > > http://dev.mysql.com/doc/refman/5.5/en/replication-semisync.html The docs originate from work done by my former team at Google. The content license on this is CC 3.0 BY-SA, so I don't think that should be a concern. http://code.google.com/p/google-mysql-tools/wiki/SemiSyncReplication http://code.google.com/p/google-mysql-tools/wiki/SemiSyncReplicationDesign Judging from http://dev.mysql.com/doc/refman/5.5/en/replication-semisync.html, the MySQL docs don't mention that other transactions can view the committed data on the master between steps 1 and 2. Is that possible in this case? As described in the MySQL docs, semi-sync has another benefit for some deployments. It rate limits busy clients to prevent them from creating replication lag between the primary and standby servers. I also provided the text for that (http://bugs.mysql.com/bug.php?id=57911) if you are concerned about copying. -- Mark Callaghan mdcallag@gmail.com
On Sat, Jan 1, 2011 at 8:35 AM, Simon Riggs <simon@2ndquadrant.com> wrote: > On Sat, 2011-01-01 at 05:13 -0800, Jeff Janes wrote: >> On 12/31/10, Simon Riggs <simon@2ndquadrant.com> wrote: > >> > 2. "sync" does not guarantee that the updates to the standbys are in any >> > way coordinated. You can run a query on one standby and get one answer >> > and at the exact same time run the same query on another standby and get >> > a different answer (slightly ahead/behind). That also means that if the >> > master crashes one of the servers will be ahead or behind. You can use >> > pg_last_xlog_receive_location() to check which one that is. >> >> If at least one of the standbys is in the same smoking crater as the >> primary, then pg_last_xlog_receive_location on it is unlikely to >> respond. >> >> The guarantee goes away precisely when it is needed. > > Fairly obviously, I would not be advocating anything that forced you to > use a server in the "same smoking crater". You are forced to use the standby which is further ahead, otherwise you might lose transactions which have been reported to have been committed. The mere existence of a commit-releasing stand-by in the same data center as the primary means that a remote standby is not very useful for data preservation after campus-wide disasters. It is probably behind (due to higher latency) and even if it is not behind, there is no way to *know* that is not behind if the on-site standby cannot be contacted. I understand that you are not advocating the use of one local standby and one remote standby, both synchronous. But I think we need to *explicitly* warn against it. After all, the docs do explicitly recommend the use of two standbys. If we assume that the readers are already experts, then they don't need that advice. 
If they are not experts, then that advice could lead them to shoot themselves in the foot, both kneecaps, and a femur (metaphorically speaking, unlike the smoking crater, which is a literal scenario some people need to plan for). If durability is more important than availability, what would you recommend? Only one synchronous rep, in a remote data center? Two (or more) synchronous reps all in the same remote data center? In two different remote data centers? > I can't see any guarantee > that goes away precisely when it is needed. In order to know that you are not losing data, you have to be able to contact every single semi-synchronous standby and invoke pg_last_xlog_receive_location on it. If your goal is to have data durability protected from major catastrophes (and why else would you do synchronous rep to remote data centers?), then it is expecting a lot to have every single standby survive that major catastrophe. That expectation is an unavoidable consequence of going with single-confirmation-releases. Perhaps you think this consequence is too obvious to document--if so I disagree on that. > Perhaps you could explain the issue you see, because your comments seem > unrelated to my point above. It is directly related to the part of your point about using pg_last_xlog_receive_location. When planning for disaster recovery, it is little comfort that you can do something in a non-disaster case, if you can't also do it in likely disaster cases. It probably wasn't relevant to the first part of your point, but I must admit I did not understand the first part of your point. Obviously they are coordinated in *some* way (I believe commits occur in the same order on each server, for example). Different read-only standbys could give different results, but only from among the universe of results made possible by a given commit sequence. 
But that is not the part I had intended to comment on, and I don't think it is what other people concerned about durability after major catastrophes were focusing on, either. Cheers, Jeff
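[Editorial note: Jeff's point can be restated operationally. Under "any one may ack", a provably-lossless failover requires comparing the receive positions of every sync standby, so a single unreachable standby blocks safe promotion. A toy illustration follows; the helper is hypothetical, and in practice the position would come from pg_last_xlog_receive_location().]

```python
def safe_promotion_candidate(receive_lsn, sync_standbys):
    """receive_lsn maps each *reachable* standby to its last received
    WAL position; sync_standbys lists all configured sync standbys."""
    if set(receive_lsn) != set(sync_standbys):
        # An unreachable standby might be the one that acknowledged the
        # last commit, so no survivor is provably up to date.
        return None
    # All standbys reachable: the furthest-ahead one is a safe candidate.
    return max(receive_lsn, key=receive_lsn.get)
```

This is why the guarantee, as Jeff puts it, goes away precisely when it is needed: the disasters that motivate a remote sync standby are exactly the ones likely to take a local standby offline.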
On Sun, 2011-01-02 at 12:13 -0800, MARK CALLAGHAN wrote: > On Thu, Dec 30, 2010 at 9:02 PM, Robert Haas <robertmhaas@gmail.com> wrote: > > <reads MySQL documentation> > > > > I see now that you've tried to design this feature in a way that is > > similar to MySQL's offering, which does have some value. But it > > appears to me that the documentation you've written here is > > substantially similar to the MySQL 5.5 reference documentation. That > > could get us into a world of legal trouble - that documentation is not > > even open source, let alone BSD. > > > > http://dev.mysql.com/doc/refman/5.5/en/replication-semisync.html Sorry, only just read that bit. Are they that similar? My docs are about 3 times longer and cover all sorts of things. I didn't intentionally copy anything, but that doesn't really matter, what matters is that if you think they are similar, legal people might. I've only read the URL above, not the other links from it. Robert, Can you identify which paragraphs need to be re-written? I won't argue, I will just rewrite them or delete them and start afresh. Thanks for being eagle-eyed. > The docs originate from work done by my former team at Google. The > content license on this is CC 3.0 BY-SA, so I don't think that should > be a concern. > http://code.google.com/p/google-mysql-tools/wiki/SemiSyncReplication > http://code.google.com/p/google-mysql-tools/wiki/SemiSyncReplicationDesign I guess that gets us off the hook a little bit, but not far enough for my liking. Thanks for trying to save me! > >From http://dev.mysql.com/doc/refman/5.5/en/replication-semisync.html) > the MySQL docs don't mention that other transactions can view the > committed data on the master between steps 1 and 2. Is that possible > in this case? Other transactions on the master cannot read data until after the confirmation it is on the sync standby. > As described in the the MySQL docs, semi-sync has another benefit for > some deployments. 
> It rate limits busy clients to prevent them from > creating replication lag between the primary and standby servers. I > also provided the text for that > (http://bugs.mysql.com/bug.php?id=57911) if you are concerned about > copying. Yeah, I'm aware of the effect, but I'm not really seeing slowing down the master as a benefit; it's more an implication of synchronicity. -- Simon Riggs http://www.2ndQuadrant.com/books/ PostgreSQL Development, 24x7 Support, Training and Services
On Sun, Jan 2, 2011 at 18:53, Dimitri Fontaine <dimitri@2ndquadrant.fr> wrote: > Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes: >> http://archives.postgresql.org/message-id/4C80D9B8.2020301@enterprisedb.com >> >> That just needs to be polished into shape, and documentation. I have an updated version of this somewhere. IIRC it also needs things like tablespace support, but it's well on its way. > Wow, cool! I don't know how but I've missed it. Yes, especially since we discussed it in Stuttgart. I guess it may have been during the party... >> +1. Or maybe it would be better make it a separate binary, rather than part >> of pg_ctl. > > Well the thinking was that nowadays we support initdb from pg_ctl, and > this is another kind of initdb, really. Yes, if it should go in any of the current binaries, initdb would be the reasonable place, not pg_ctl ;) That said, if we're going to wrap pg_streamrecv into 9.1 (which I think we should), then *that* is where it should go. -- Magnus Hagander Me: http://www.hagander.net/ Work: http://www.redpill-linpro.com/
On Sun, 2011-01-02 at 18:54 +0200, Heikki Linnakangas wrote: > I believe we all agree that there's different use cases that require > different setups. Both "first-past-the-post" and "wait-for-all-to-ack" > have their uses. Robert's analysis is that "first-past-the-post" doesn't actually improve the durability guarantee (according to his calcs). Which means that 1 primary, 2 sync standbys with first-past-the-post is actually worse than 1 primary, 1 sync and 1 async standby in terms of its durability guarantees. So ISTM that Robert does not agree that both have their uses. > I'm not > sure what the point of such a timeout in general is, but people have > requested that. Again, this sounds like you think a timeout has no measurable benefit, other than to please some people's perceived needs. > The "wait-for-all-to-ack" looks a lot less ridiculous if you also > configure a timeout and don't wait for disconnected standbys Does it? Do Robert, Stefan and Aidan agree? What are the availability and durability percentages if we do that? Based on those, we may decide to do that instead. But I'd like to see some analysis of your ideas, not just a "we could". Since nobody has commented on my analysis, let's see someone else's. > There's no point in arguing over which is better. I'm trying to compare quantifiable benefit of various options to see what goes into Postgres. I don't want to put anything in that we cannot all agree has a measurable benefit to someone (that has the appropriate preference). -- Simon Riggs http://www.2ndQuadrant.com/books/ PostgreSQL Development, 24x7 Support, Training and Services
Magnus Hagander <magnus@hagander.net> writes: > Yes, especially since we discussed it in Stuttgart. I guess it may > have been during the party... I remember we talked about it, I didn't remember a patch had reached the list… > Yes, if it should go in any of the current binaries, initdb would be > the reasonable place, not pg_ctl ;) Well, pg_ctl is able to call initdb for users, but yes. > That said, if we're going to wrap pg_streamrecv into 9.1 (which I > think we should), then *that* is where it should go. They should at least cooperate so that you don't need to set up WAL archiving explicitly while preparing the standby, if at all possible. Regards, -- Dimitri Fontaine http://2ndQuadrant.fr PostgreSQL : Expertise, Formation et Support
On 2.1.2011 5:36, Robert Haas wrote: > On Sat, Jan 1, 2011 at 6:54 AM, Simon Riggs <simon@2ndquadrant.com> wrote: >> Yes, working out the math is a good idea. Things are much clearer if we >> do that. >> >> Let's assume we have 98% availability on any single server. >> >> 1. Having one primary and 2 standbys, either of which can acknowledge, >> and we never lock up if both standbys fail, then we will have 99.9992% >> server availability. (So PostgreSQL hits "5 Nines", with data >> guarantees). ("Maximised availability") > I don't agree with this math. If the master and one standby fail > simultaneously, the other standby is useless, because it may or may > not be caught up with the master. You know that the last transaction > acknowledged as committed by the master is on at least one of the two > standbys, but you don't know which one, and so you can't safely > promote the surviving standby. > (If you are working in an environment where promoting the surviving > standby when it's possibly not caught up is OK, then you don't need > sync rep in the first place: you can just run async rep and get much > better performance.) > So the availability is 98% (you are up when the master is up) + 98%^2 > * 2% (you are up when both slaves are up and the master is down) = > 99.92%. If you had only a single standby, then you could be certain > that any commit acknowledged by the master was on that standby. Thus > your availability would be 98% (up when master is up) + 98% * 2% (you > are up when the master is down and the slave is up) = 99.96%. OTOH, in the case where you need _all_ the slaves to confirm, any failing slave brings the master down, so adding a slave reduces availability by an extra 2%. The solution to achieving good durability AND availability is requiring N past the post instead of 1 past the post. In this case you can get to 99.9992% availability with master + 3 sync slaves, 2 of which must ACK.
--------------------------------------- Hannu Krosing Performance and Infinite Scalability Consultant http://www.2ndQuadrant.com/books/
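The availability arithmetic being traded back and forth in this subthread is easy to check mechanically. Here is a small Python sketch of it (my own working, using the thread's assumed 98% per-server availability; the 2-of-3 quorum model is one reading of Hannu's proposal, not anything in the actual patch):

```python
from math import comb

def p_at_least(k, n, p):
    """Probability that at least k of n independent servers are up,
    each with availability p."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

p = 0.98  # assumed availability of any single server, per the thread

# Robert's 1-of-2 case: a lone surviving standby cannot safely be
# promoted (you don't know which standby holds the last acked commit),
# so the system is up only if the master is up, or if the master is
# down and *both* standbys are up:
avail_1_of_2 = p + (1 - p) * p**2   # ~99.92%

# Single sync standby: any acked commit is known to be on it, so a
# surviving standby is always safe to promote:
avail_1_of_1 = p + (1 - p) * p      # ~99.96%

# Hannu's 2-of-3 idea: every acked commit is on at least 2 of 3 sync
# standbys, so any 2 survivors are guaranteed to contain it.  This is
# the chance that such a safe quorum of standbys is up:
quorum_2_of_3 = p_at_least(2, 3, p)  # ~99.88%
```

Robert's two figures come out as roughly 0.99921 and 0.9996; the 2-of-3 quorum survives with probability about 0.99882 under the same model, so whether the five-nines claim holds depends on how the master's own availability is folded into the calculation.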
On Sun, 2011-01-02 at 20:53 +0000, Simon Riggs wrote: > On Sun, 2011-01-02 at 12:13 -0800, MARK CALLAGHAN wrote: > > On Thu, Dec 30, 2010 at 9:02 PM, Robert Haas <robertmhaas@gmail.com> wrote: > > > <reads MySQL documentation> > > > > > > I see now that you've tried to design this feature in a way that is > > > similar to MySQL's offering, which does have some value. But it > > > appears to me that the documentation you've written here is > > > substantially similar to the MySQL 5.5 reference documentation. That > > > could get us into a world of legal trouble - that documentation is not > > > even open source, let alone BSD. > > > > > > http://dev.mysql.com/doc/refman/5.5/en/replication-semisync.html > > Sorry, only just read that bit. Are they that similar? My docs are about > 3 times longer and cover all sorts of things. I didn't intentionally > copy anything, but that doesn't really matter, what matters is that if > you think they are similar, legal people might. I've only read the URL > above, not the other links from it. > > Robert, Can you identify which paragraphs need to be re-written? I won't > argue, I will just rewrite them or delete them and start afresh. Thanks > for being eagle-eyed. Been through this to check. The first paragraph was pretty similar, so I've reworded that a little. Trying to make a readable paragraph that introduces the need for sync rep, based upon the deficiencies of async rep, is going to be very similar whatever we do. I can't see any other way to introduce a feature other than to explain the potential problem and then explain how the new feature resolves that. Parameters are set on both primary and standby. That design existed long before I looked at the MySQL manual, and yes there is similarity, which led to similar descriptions of how that works. Bear in mind that the two designs are very different. I can't see any other similarities. If anybody else can, please shout.
-- Simon Riggs http://www.2ndQuadrant.com/books/ PostgreSQL Development, 24x7 Support, Training and Services
All, This is a pointless argument. Eventually, we will be implementing all possible sync rep configurations, because different users *need* different configurations. Some users care more about durability, some more about availability, and some more about response time. And you can't have all three, which was my point about A,D,R (also the point of CAP). For that matter, any single configuration will be useful to a large number of users, and an even larger number will be able to "work around" while they wait for 9.2. Further, the knowledge we gain by having some kind of synch rep in the field will allow us to implement the different configurations correctly, which *no* amount of arguing on e-mail will. The perfect is the enemy of the good. The relevant question is: which configuration(s) can we have ready for the next CommitFest and alpha release? -- -- Josh Berkus PostgreSQL Experts Inc. http://www.pgexperts.com
On 1/2/11 12:35 AM, Heikki Linnakangas wrote: > Very likely. A synchronous standby can bring the master to a halt, while > an asynchronous one is rather harmless. If I were a DBA, and the data > wasn't very sensitive, I would liberally hand out async privileges to my > colleagues to set up reporting standbys, test servers etc. But I would > *not* give them synchronous privileges, because sooner or later one > would go "hmm, I wonder what happens if I make this synchronous", or > haphazardly copy the config file from a synchronous standby. That would > either bring down the master, or act as a "fake" standby, acknowledging > commits before they're flushed to the real synchronous standby. Either > one would be bad. That's not "very likely". Shops with enough DBAs to actually have a hierarchy of database access are rare indeed; the average business has trouble employing *one* fulltime DBA. Also, you're mistaken if you think that DBAs wouldn't restrict the ability of development users to set up asynch connections; those have a significant impact on the production server as well. However, the possibility of *accidental* DOS is a bit more likely, per accidentally using the wrong config file. I don't think you can guard against that with permissions, though; if a DBA is copying the config file from a synchronous server accidentally, presumably he's copying the .pgpass file as well, and then the new server is identical to the legit synch rep servers as far as the master is concerned. So, again, I don't agree that a separate synchrep permission is useful, or warranted. However, your arguments *do* make me backpedal on the issue of having a list of synch rep roles on the master. I can easily imagine a DBA needing to rapidly disable synch rep if replication is failing, without taking the time to log in to several separate standbys, one or more of which might be improperly configured and inaccessible.
I can't think of a simpler way to do that than having a synch_rep_roles configuration on the master. That would also handle control issues for the senior DBA, since you'd need superuser access to the master to modify it. --Josh Berkus
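To make that concrete, a master-side setting along the lines Josh describes might look something like this (the parameter name `synch_rep_roles` is taken from this discussion only — it is a hypothetical sketch, not a GUC that exists in any release):

```
# postgresql.conf on the master -- hypothetical sketch
synch_rep_roles = 'standby_nyc, standby_lon'  # roles whose standbys may ack synchronously

# Emergency: standby_lon is failing, so kick it out of sync rep by
# editing the list and reloading, without ever logging in to the standby:
# synch_rep_roles = 'standby_nyc'
```

Because the setting lives on the master and would require superuser access to change, a simple `pg_ctl reload` (or `SELECT pg_reload_conf();`) would be enough to drop a misbehaving standby from sync duty.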
On Tue, 2011-01-04 at 10:28 -0800, Josh Berkus wrote: > The relevant question is: which configuration(s) can we have ready for > the next CommitFest and alpha release? Based upon that series of conversations, I've reworked the design so that there is (currently) only a single standby offering sync rep at any one time. Other standbys can request to be sync standbys but they only become the sync standby if the first one fails. Which was simple to do and bridges the challenges of an exactly identified sync standby and the fragility of too closely specifying the config. I think you're right that trying to please everyone is not going to be possible in this release, because of the explosion of parameter combinations that require testing and because of the explosion in my head that causes. I'm not feeling well now, so I'm going to go to bed, not just to avoid snapping at people. Even given that short interlude, I see no problem about delivery. -- Simon Riggs http://www.2ndQuadrant.com/books/ PostgreSQL Development, 24x7 Support, Training and Services
On Sun, Jan 2, 2011 at 4:19 PM, Simon Riggs <simon@2ndquadrant.com> wrote: > On Sun, 2011-01-02 at 18:54 +0200, Heikki Linnakangas wrote: > >> I believe we all agree that there's different use cases that require >> different setups. Both "first-past-the-post" and "wait-for-all-to-ack" >> have their uses. > > Robert's analysis is that "first-past-the-post" doesn't actually improve > the durability guarantee (according to his calcs). Which means that > 1 primary, 2 sync standbys with first-past-the-post > is actually worse than > 1 primary, 1 sync and 1 async standby > in terms of its durability guarantees. > > So ISTM that Robert does not agree that both have their uses. I think it depends on what failure modes you want to protect against. If you have a primary in New York, a secondary in Los Angeles, and another secondary in London, you might decide that the chances of two standbys being taken out by the same event are negligible, or alternatively that if one event does take out both of them, it'll be something like a meteor where you'll have bigger things to worry about than lost transactions. In that case, requiring one ACK but not two is pretty sensible. If the primary goes down, you'll look at the two remaining machines (which, by presumption, will still be up) and promote whichever one is ahead. In this setup, you get a performance benefit from waiting for either ACK rather than both ACKs, and you haven't compromised any of the cases you care about. However, if you have the traditional close/far setup, things are different. Suppose you have a primary and a secondary in New York and another secondary in Los Angeles. Now it has to be viewed as a reasonable possibility that you could lose the New York site. If that happens, you need to be able to promote the LA standby *without reference to the NY standby*. So you really can't afford to do the 1-of-2 thing, because then when NY goes away you're not sure whether the LA standby is safe to promote. 
So, IMHO, it just depends on what you want to do. >> I'm not >> sure what the point of such a timeout in general is, but people have >> requested that. > > Again, this sounds like you think a timeout has no measurable benefit, > other than to please some people's perceived needs. > >> The "wait-for-all-to-ack" looks a lot less ridiculous if you also >> configure a timeout and don't wait for disconnected standbys > > Does it? Do Robert, Stefan and Aidan agree? What are the availability > and durability percentages if we do that? Based on those, we may decide > to do that instead. But I'd like to see some analysis of your ideas, not > just a "we could". Since nobody has commented on my analysis, lets see > someone else's. Here's my take on this point. I think there is a use case for waiting for a disconnected standby and a use case for not waiting for a disconnected standby. The danger of NOT waiting for a disconnected standby is that if you then go on to irretrievably lose the primary, you lose transactions. But on the other hand, if you do wait, you've made the primary unavailable. I don't know that there's one right answer here. For some people, if they can't be certain of recording the transaction in two places, then it may be better to not process any transactions at all. For other people, it may be better to process transactions unprotected for a while while you get a new standby up. It's not for us to make that judgment; we're here to provide options. Having said that, I am OK with whichever one we want to implement first so long as we keep the door open to doing the other one later. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
> I'm not feeling well now, so I'm going to go to bed, not just to avoid > snapping at people. Even given that short interlude, I see no problem > about delivery. Cool! Thanks Simon. Feel better. Joshua D. Drake -- PostgreSQL.org Major Contributor Command Prompt, Inc: http://www.commandprompt.com/ - 509.416.6579 Consulting, Training, Support, Custom Development, Engineering http://twitter.com/cmdpromptinc | http://identi.ca/commandprompt
On 01/04/2011 07:51 PM, Simon Riggs wrote: > On Tue, 2011-01-04 at 10:28 -0800, Josh Berkus wrote: > >> The relevant question is: which configuration(s) can we have ready for >> the next CommitFest and alpha release? > > Based upon that series of conversations, I've reworked the design so > that there is (currently) only a single standby offering sync rep at any > one time. Other standbys can request to be sync standbys but they only > become the sync standby if the first one fails. Which was simple to do > and bridges the challenges of an exactly identified sync standby and the > fragility of too closely specifying the config. ah cool - like that approach for 9.1! Thanks for working on that! Stefan
On Tue, Jan 4, 2011 at 2:50 PM, Stefan Kaltenbrunner <stefan@kaltenbrunner.cc> wrote: > On 01/04/2011 07:51 PM, Simon Riggs wrote: >> >> On Tue, 2011-01-04 at 10:28 -0800, Josh Berkus wrote: >> >>> The relevant question is: which configuration(s) can we have ready for >>> the next CommitFest and alpha release? >> >> Based upon that series of conversations, I've reworked the design so >> that there is (currently) only a single standby offering sync rep at any >> one time. Other standbys can request to be sync standbys but they only >> become the sync standby if the first one fails. Which was simple to do >> and bridges the challenges of an exactly identified sync standby and the >> fragility of too closely specifying the config. > > ah cool - like that approach for 9.1! Yeah, I like that idea too, on first blush. I think we should think it over and see whether we're committing ourselves to any design decisions we may later regret - what parameters will we need to add from that point to get all the configs we ultimately want to support? But it seems a reasonable starting point, and we can argue about the rest once we have working code. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Josh Berkus <josh@postgresql.org> writes: > So, again, I don't agree that a separate synchrep permission is useful, > or warranted. +1 > However, your arguments *do* make me backpedal on the issue of having a > list of synch rep roles on the master. I can easily imagine a DBA > needing to rapidly disable synch rep if replication is failing, without > taking the time to log in to several separate standbys, one or more of > which might be improperly configured and inaccessible. I can't think of > a simpler way to do that than having a synch_rep_roles configuration on > the master. That would also handle control issues for the senior DBA, > since you'd need superuser access to the master to modify it. What about the HBA here? Regards, -- Dimitri Fontaine http://2ndQuadrant.fr PostgreSQL : Expertise, Formation et Support
> What about the HBA here? Hmmm. That's tempting; a "synchronous" HBA entry instead of a GUC? But that doesn't solve the problem of "standby #6 is failing, I want to kick it off synch rep". I'd be opposed to having a GUC *and* an HBA. Making DBAs set things independently in two places just frustrates our users. --Josh
Josh Berkus <josh@postgresql.org> writes: >> What about the HBA here? > > Hmmm. That's tempting; an "synchronous" HBA instead of a GUC? But that > doesn't solve the problem of "standby #6 is failing, I want to kick it > off synch rep". > > I'd be opposed to having a GUC *and* an HBA. making DBAs set things > independantly in two places just frustrates our users. I was just thinking that you could prepend a reject line at the right place in the file. Regards, -- Dimitri Fontaine http://2ndQuadrant.fr PostgreSQL : Expertise, Formation et Support
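For illustration, Dimitri's "prepend a reject line" idea would look something like this in pg_hba.conf (the role name and addresses here are invented; since 9.0, walsender connections match the special `replication` keyword in the database column, and the first matching line wins):

```
# pg_hba.conf sketch -- kick failing standby #6 out of replication by
# rejecting it ahead of the permissive entry; first match wins.
host  replication  standby6   192.0.2.6/32   reject
host  replication  all        192.0.2.0/24   md5
```

A reload on the master would then refuse new connections from that standby without touching its own configuration — though note this cuts it off from replication entirely, not just from sync duty.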
> I was just thinking that you could prepend a reject line at the right > place in the file. Hmmm, that's worth thinking about. How do others feel about this? -- -- Josh Berkus PostgreSQL Experts Inc. http://www.pgexperts.com
Simon Riggs wrote: > Based upon that series of conversations, I've reworked the design so > that there is (currently) only a single standby offering sync rep at any > one time. Other standbys can request to be sync standbys but they only > become the sync standby if the first one fails. Which was simple to do > and bridges the challenges of an exactly identified sync standby and the > fragility of too closely specifying the config. > That seems like a good enough starting point to cover a lot of cases. Presuming the two-servers-at-each-of-two-sites config that shows up in a lot of these discussions, people in the "I need sync to a remote spot" camp can get that, and if that site is unavailable for long enough to be kicked out they'll smoothly degrade to sync on a secondary local copy. And those who want high-performance local sync and best-effort for the remote site can configure for that too, and if the local secondary dies they'll just degrade to the slow remote commits. -- Greg Smith 2ndQuadrant US greg@2ndQuadrant.com Baltimore, MD PostgreSQL Training, Services and Support www.2ndQuadrant.us "PostgreSQL 9.0 High Performance": http://www.2ndQuadrant.com/books