Thread: Sync Rep Design

Sync Rep Design

From
Simon Riggs
Date:
I've mulled over the design for sync rep for awhile now, and have come
up with a feature set that includes the final detailed feedback from
Fujii Masao, Aidan Van Dyk, Josh Berkus and others.

The design also draws from MySQL concepts to make the two interfaces as
similar and as simple as possible. It should be noted that the design
presented here has many features that the MySQL design does not.

I am currently finishing up my patch to offer these features, so it's
time to begin final discussions.

As an interim step, I enclose a PDF version of relevant excerpts from
the doc patch. The patch will follow on a later post in the near future.

I would like to separate discussions on "user interface" from that of
internal design, to make it easier for more people to get involved.
Please read the following and post your comments. Thank you.

--
 Simon Riggs           http://www.2ndQuadrant.com/books/
 PostgreSQL Development, 24x7 Support, Training and Services



Re: Sync Rep Design

From
Stefan Kaltenbrunner
Date:
On 12/30/2010 06:26 PM, Simon Riggs wrote:
>
> I've mulled over the design for sync rep for awhile now, and have come
> up with a feature set that includes the final detailed feedback from
> Fujii Masao, Aidan Van Dyk, Josh Berkus and others.
>
> The design also draws from MySQL concepts to make the two interfaces as
> similar and as simple as possible. It should be noted that the design
> presented here has many features that the MySQL design does not.
>
> I am currently finishing up my patch to offer these features, so its
> time to begin final discussions.
>
> As an interim step, I enclose a PDF version of relevant excerpts from
> the doc patch. The patch will follow on a later post in the near future.
>
> I would like to separate discussions on "user interface" from that of
> internal design, to make it easier for more people to get involved.
> Please read the following and post your comments. Thank you.

it would help if this would just be a simple text-only description of 
the design that people can actually comment on inline. I don't think 
sending technical design proposals as a pdf (which seems to be written 
in doc-style as well) is a good idea to encourage discussion on -hackers :(


Stefan


Re: Sync Rep Design

From
Simon Riggs
Date:
On Thu, 2010-12-30 at 18:42 +0100, Stefan Kaltenbrunner wrote:

> it would help if this would just be a simple text-only description of
> the design that people can actually comment on inline. I don't think
> sending technical design proposals as a pdf (which seems to be written
> in doc-style as well) is a good idea to encourage discussion on -hackers :(

25.2.6. Synchronous Replication
Streaming replication is by default asynchronous. Transactions on the
primary server write commit records to WAL, yet do not know whether or
when a standby has received and processed those changes. So with
asynchronous replication, if the primary crashes, transactions committed
on the primary might not have been received by any standby. As a result,
failover from primary to standby could cause data loss because
transaction completions are absent, relative to the primary. The amount
of data loss is proportional to the replication delay at the time of
failover.

Synchronous replication offers the ability to guarantee that all changes
made by a transaction have been transferred to at least one remote
standby server. This is an extension to the standard level of durability
offered by a transaction commit. This is referred to as semi-synchronous
replication.

When synchronous replication is requested, the commit of a write
transaction will wait until confirmation that the commit record has been
transferred successfully to at least one standby server. Waiting for
confirmation increases the user's confidence that the changes will not
be lost in the event of server crashes but it also necessarily increases
the response time for the requesting transaction. The minimum wait time
is the roundtrip time from primary to standby.

Read only transactions and transaction rollbacks need not wait for
replies from standby servers. Subtransaction commits do not wait for
responses from standby servers, only final top-level commits. Long
running actions such as data loading or index building do not wait until
the very final commit message.


25.2.6.1. Basic Configuration
Synchronous replication must be enabled on both the primary and at least
one standby server. If synchronous replication is disabled on the
master, or enabled on the primary but not enabled on any slaves, the
primary will use asynchronous replication by default.

We use a single parameter to enable synchronous replication, set in
postgresql.conf on both primary and standby servers:

synchronous_replication = off (default) | on

On the primary, synchronous_replication can be set for particular users
or databases, or dynamically by application programs.
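
For illustration only (assuming the parameter behaves like an ordinary
user-settable GUC, as implied above; the role and database names are
hypothetical), the per-user, per-database and per-session forms would
look like:

    -- default for one role
    ALTER ROLE billing_app SET synchronous_replication = on;

    -- default for one database
    ALTER DATABASE orders SET synchronous_replication = on;

    -- dynamically, from the application, for the current session
    SET synchronous_replication = on;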

If more than one standby server specifies synchronous_replication, then
whichever standby replies first will release waiting commits.

Turning this setting off for a standby allows the administrator to
exclude certain standby servers from releasing waiting transactions.
This is useful if not all standby servers are designated as potential
future primary servers. On the standby, this parameter only takes effect
at server start.


25.2.6.2. Planning for Performance
Synchronous replication usually requires carefully planned and placed
standby servers to ensure applications perform acceptably. Waiting
doesn't utilise system resources, but transaction locks continue to be
held until the transfer is confirmed. As a result, incautious use of
synchronous replication will reduce performance for database
applications because of increased response times and higher contention.

PostgreSQL allows the application developer to specify the durability
level required via replication. This can be specified for the system
overall, though it can also be specified for specific users or
connections, or even individual transactions.

For example, an application workload might consist of: 10% of changes
are important customer details, while 90% of changes are less important
data that the business can more easily survive if it is lost, such as
chat messages between users.

With synchronous replication options specified at the application level
(on the master) we can offer sync rep for the most important changes,
without slowing down the bulk of the total workload. Application level
options are an important and practical tool for allowing the benefits of
synchronous replication for high performance applications. This feature
is unique to PostgreSQL.
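
As a sketch of that application-level control (table names are
hypothetical; the SET LOCAL form is described under
synchronous_replication in Section 18.5.5 below):

    -- postgresql.conf default on the primary: synchronous_replication = on

    BEGIN;
    SET LOCAL synchronous_replication TO OFF;  -- chat traffic: async is fine
    INSERT INTO chat_messages (sender, body) VALUES ('alice', 'hi');
    COMMIT;                                    -- returns without waiting

    BEGIN;                                     -- customer details: wait for a standby
    UPDATE customers SET address = '1 Main St' WHERE id = 42;
    COMMIT;                                    -- returns after a standby confirms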


25.2.6.3. Planning for High Availability
The easiest and safest method of gaining High Availability using
synchronous replication is to configure at least two standby servers. To
understand why, we need to examine what can happen when you lose all
standby servers.

Commits made when synchronous_replication is set will wait until at
least one standby responds. The response may never occur if the last, or
only, standby should crash or the network drops. What should we do in
that situation?

Sitting and waiting will typically cause operational problems because it
is an effective outage of the primary server. Allowing the primary
server to continue processing in the absence of a standby puts those
latest data changes at risk. How we handle this situation is controlled
by allow_standalone_primary. The default setting is on, allowing
processing to continue, though there is no recommended setting. Choosing
the best setting for allow_standalone_primary is a difficult decision
and best left to those with combined business responsibility for both
data and applications. The difficulty of this choice is the reason why
we recommend that you reduce the possibility of this situation occurring
by using multiple standby servers.

When the primary is started with allow_standalone_primary enabled, the
primary will not allow connections until a standby connects that also
has synchronous_replication enabled. This is a convenience to ensure
that we don't allow connections before write transactions will return
successfully.

When allow_standalone_primary is set, a user will stop waiting once the
replication_timeout has been reached for their specific session. Users
are not waiting for a specific standby to reply, they are waiting for a
reply from any standby, so the unavailability of any one standby is not
significant to a user. It is possible for user sessions to hit timeout
even though standbys are communicating normally. In that case, the
setting of replication_timeout is probably too low.
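
For example, a primary that prefers availability over waiting
indefinitely might combine the settings like this (values are purely
illustrative; parameter names follow this section of the proposal):

    # postgresql.conf on the primary
    synchronous_replication = on      # commits wait for a standby reply
    allow_standalone_primary = on     # keep processing if no standby responds
    replication_timeout = 30s         # sessions stop waiting after 30 seconds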

The standby sends regular status messages to the primary. If no status
messages have been received for replication_timeout the primary server
will assume the connection is dead and terminate it. This happens
whatever the setting of allow_standalone_primary.

If the primary crashes while commits are waiting for acknowledgement, those
transactions will be marked fully committed if the primary database
recovers, no matter how allow_standalone_primary is set. There is no way
to be certain that all standbys have received all outstanding WAL data
at the time of the crash of the primary. Some transactions may not show as
committed on the standby, even though they show as committed on the
primary. The guarantee we offer is that the application will not receive
explicit acknowledgement of the successful commit of a transaction until
the WAL data is known to be safely received by the standby. Hence this
mechanism is technically "semi synchronous" rather than "fully
synchronous" replication. Note that replication still not be fully
synchronous even if we wait for all standby servers, though this would
reduce availability, as described previously.

If you need to re-create a standby server while transactions are
waiting, make sure that the commands to run pg_start_backup() and
pg_stop_backup() are run in a session with synchronous_replication =
off, otherwise those requests will wait forever for the standby to
appear.
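
For example (a minimal sketch; the actual file copy step is omitted),
the backup session would opt out of waiting like this:

    SET synchronous_replication = off;
    SELECT pg_start_backup('rebuild-standby');
    -- copy the data directory to the new standby here
    SELECT pg_stop_backup();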



18.5.5. Synchronous Replication
These settings control the behavior of the built-in synchronous
replication feature. These parameters would be set on the primary server
that is to send replication data to one or more standby servers.


synchronous_replication (boolean)
        Specifies whether transaction commit will wait for WAL records
        to be replicated before the command returns a "success"
        indication to the client. The default setting is off. When on,
        there will be a delay while the client waits for confirmation of
        successful replication. That delay will increase depending upon
        the physical distance and network activity between primary and
        standby. The commit wait will last until the first reply from
        any standby. Multiple standby servers allow increased
        availability and possibly increase performance as well.
        The parameter must be set on both primary and standby.
        On the primary, this parameter can be changed at any time; the
        behavior for any one transaction is determined by the setting in
        effect when it commits. It is therefore possible, and useful, to
        have some transactions replicate synchronously and others
        asynchronously. For example, to make a single multistatement
        transaction commit asynchronously when the default is
        synchronous replication, issue SET LOCAL synchronous_replication
        TO OFF within the transaction.
        On the standby, the parameter value is taken only at server
        start.
synchronous_replication_timeout (boolean)
        If the client has synchronous_replication set, and
        allow_standalone_primary is also set, then the commit will wait
        for up to synchronous_replication_timeout milliseconds before it
        returns a "success", or will wait forever if
        synchronous_replication_timeout is set to -1.
        If a standby server does not reply for
        synchronous_replication_timeout the primary will terminate the
        replication connection.
allow_standalone_primary (boolean)
        If allow_standalone_primary is not set, then the server will not
        allow connections until a standby connects that has
        synchronous_replication enabled.
        allow_standalone_primary also affects the behaviour when the
        synchronous_replication_timeout is reached.




25.5.2. Handling query conflicts
….


Remedial possibilities exist if the number of standby-query
cancellations is found to be unacceptable. Typically the best option is
to enable hot_standby_feedback. This prevents VACUUM from removing
recently-dead rows and so cleanup conflicts do not occur. If you do
this, you should note that this will delay cleanup of dead rows on the
primary, which may result in undesirable table bloat. However, the
cleanup situation will be no worse than if the standby queries were
running directly on the primary server. You are still getting the
benefit of off-loading execution onto the standby and the query may
complete faster than it would have done on the primary server.
max_standby_archive_delay must be kept large in this case, because
delayed WAL files might already contain entries that conflict with the
desired standby queries.
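
For instance, the standby-side settings involved (see Section 18.5.6
below) might look like this; the values are only illustrative:

    # postgresql.conf on the standby
    hot_standby = on
    hot_standby_feedback = on         # report standby queries to the primary
    max_standby_archive_delay = -1    # keep this large (or unlimited), per the note above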


…


18.5.6. Standby Servers
These settings control the behavior of a standby server that is to
receive replication data.


hot_standby (boolean)
        Specifies whether or not you can connect and run queries during
        recovery, as described in Section 25.5. The default value is
        off. This parameter can only be set at server start. It only has
        effect during archive recovery or in standby mode.
hot_standby_feedback (boolean)
        Specifies whether or not a hot standby will send feedback to the
        primary about queries currently executing on the standby. This
        parameter can be used to eliminate query cancels caused by
        cleanup records, though it can cause database bloat on the
        primary for some workloads. The default value is off. This
        parameter can only be set at server start. It only has effect if
        hot_standby is enabled.

….


--
 Simon Riggs           http://www.2ndQuadrant.com/books/
 PostgreSQL Development, 24x7 Support, Training and Services



Re: Sync Rep Design

From
Robert Treat
Date:
On Thu, Dec 30, 2010 at 2:04 PM, Simon Riggs <simon@2ndquadrant.com> wrote:
On Thu, 2010-12-30 at 18:42 +0100, Stefan Kaltenbrunner wrote:

> it would help if this would just be a simple text-only description of
> the design that people can actually comment on inline. I don't think
> sending technical design proposals as a pdf (which seems to be written
> in doc-style as well) is a good idea to encourage discussion on -hackers :(

25.2.6. Synchronous Replication
Streaming replication is by default asynchronous. Transactions on the
primary server write commit records to WAL, yet do not know whether or
when a standby has received and processed those changes. So with
asynchronous replication, if the primary crashes, transactions committed
on the primary might not have been received by any standby. As a result,
failover from primary to standby could cause data loss because
transaction completions are absent, relative to the primary. The amount
of data loss is proportional to the replication delay at the time of
failover.

Synchronous replication offers the ability to guarantee that all changes
made by a transaction have been transferred to at least one remote
standby server. This is an extension to the standard level of durability
offered by a transaction commit. This is referred to as semi-synchronous
replication.

When synchronous replication is requested, the commit of a write
transaction will wait until confirmation that the commit record has been
transferred successfully to at least one standby server. Waiting for
confirmation increases the user's confidence that the changes will not
be lost in the event of server crashes but it also necessarily increases
the response time for the requesting transaction. The minimum wait time
is the roundtrip time from primary to standby.

Read only transactions and transaction rollbacks need not wait for
replies from standby servers. Subtransaction commits do not wait for
responses from standby servers, only final top-level commits. Long
running actions such as data loading or index building do not wait until
the very final commit message.


25.2.6.1. Basic Configuration
Synchronous replication must be enabled on both the primary and at least
one standby server. If synchronous replication is disabled on the
master, or enabled on the primary but not enabled on any slaves, the
primary will use asynchronous replication by default.

We use a single parameter to enable synchronous replication, set in
postgresql.conf on both primary and standby servers:

synchronous_replication = off (default) | on

On the primary, synchronous_replication can be set for particular users
or databases, or dynamically by application programs.


This seems like a potential issue, where I start a server with this off, and then I start turning it on for specific transactions; it isn't exactly clear what happens, since there may or may not be a running synchronous rep slave available.  (I love the idea though) 
 
If more than one standby server specifies synchronous_replication, then
whichever standby replies first will release waiting commits.


I don't want you to think I am setting an expectation, but I'm curious about the possibility of requiring more than 1 server to reply? 
 
Turning this setting off for a standby allows the administrator to
exclude certain standby servers from releasing waiting transactions.
This is useful if not all standby servers are designated as potential
future primary servers. On the standby, this parameter only takes effect
at server start.


25.2.6.2. Planning for Performance
Synchronous replication usually requires carefully planned and placed
standby servers to ensure applications perform acceptably. Waiting
doesn't utilise system resources, but transaction locks continue to be
held until the transfer is confirmed. As a result, incautious use of
synchronous replication will reduce performance for database
applications because of increased response times and higher contention.

PostgreSQL allows the application developer to specify the durability
level required via replication. This can be specified for the system
overall, though it can also be specified for specific users or
connections, or even individual transactions.

For example, an application workload might consist of: 10% of changes
are important customer details, while 90% of changes are less important
data that the business can more easily survive if it is lost, such as
chat messages between users.

With synchronous replication options specified at the application level
(on the master) we can offer sync rep for the most important changes,
without slowing down the bulk of the total workload. Application level
options are an important and practical tool for allowing the benefits of
synchronous replication for high performance applications. This feature
is unique to PostgreSQL.


25.2.6.3. Planning for High Availability
The easiest and safest method of gaining High Availability using
synchronous replication is to configure at least two standby servers. To
understand why, we need to examine what can happen when you lose all
standby servers.

Commits made when synchronous_replication is set will wait until at
least one standby responds. The response may never occur if the last, or
only, standby should crash or the network drops. What should we do in
that situation?

Sitting and waiting will typically cause operational problems because it
is an effective outage of the primary server. Allowing the primary
server to continue processing in the absence of a standby puts those
latest data changes at risk. How we handle this situation is controlled
by allow_standalone_primary. The default setting is on, allowing
processing to continue, though there is no recommended setting. Choosing
the best setting for allow_standalone_primary is a difficult decision
and best left to those with combined business responsibility for both
data and applications. The difficulty of this choice is the reason why
we recommend that you reduce the possibility of this situation occurring
by using multiple standby servers.

When the primary is started with allow_standalone_primary enabled, the
primary will not allow connections until a standby connects that also
has synchronous_replication enabled. This is a convenience to ensure
that we don't allow connections before write transactions will return
successfully.


I think you mean "disabled" at the start here, right? 
 
When allow_standalone_primary is set, a user will stop waiting once the
replication_timeout has been reached for their specific session. Users
are not waiting for a specific standby to reply, they are waiting for a
reply from any standby, so the unavailability of any one standby is not
significant to a user. It is possible for user sessions to hit timeout
even though standbys are communicating normally. In that case, the
setting of replication_timeout is probably too low.


will a notice or warning be thrown in these cases? I'm thinking something like the checkpoint timeout warning, but could be something else; it just seems to me you need some way to know you're timing out. 
 
The standby sends regular status messages to the primary. If no status
messages have been received for replication_timeout the primary server
will assume the connection is dead and terminate it. This happens
whatever the setting of allow_standalone_primary.


Does the standby attempt to reconnect in these scenarios? 
 
If the primary crashes while commits are waiting for acknowledgement, those
transactions will be marked fully committed if the primary database
recovers, no matter how allow_standalone_primary is set.

This seems backwards; if you are waiting for acknowledgement, wouldn't the normal assumption be that the transactions *didn't* make it to any standby, and should be rolled back?
 
There is no way
to be certain that all standbys have received all outstanding WAL data
at the time of the crash of the primary. Some transactions may not show as
committed on the standby, even though they show as committed on the
primary. The guarantee we offer is that the application will not receive
explicit acknowledgement of the successful commit of a transaction until
the WAL data is known to be safely received by the standby. Hence this
mechanism is technically "semi synchronous" rather than "fully
synchronous" replication. Note that replication still not be fully
synchronous even if we wait for all standby servers, though this would
reduce availability, as described previously.


I think we ought to have an example of the best configuration for "cannot afford to lose any data" scenarios, where we would prefer an overall service interruption over the chance of having the primary / secondary out of synch. 
 
If you need to re-create a standby server while transactions are
waiting, make sure that the commands to run pg_start_backup() and
pg_stop_backup() are run in a session with synchronous_replication =
off, otherwise those requests will wait forever for the standby to
appear.



18.5.5. Synchronous Replication
These settings control the behavior of the built-in synchronous
replication feature. These parameters would be set on the primary server
that is to send replication data to one or more standby servers.


synchronous_replication (boolean)
       Specifies whether transaction commit will wait for WAL records
       to be replicated before the command returns a "success"
       indication to the client. The default setting is off. When on,
       there will be a delay while the client waits for confirmation of
       successful replication. That delay will increase depending upon
       the physical distance and network activity between primary and
       standby. The commit wait will last until the first reply from
       any standby. Multiple standby servers allow increased
       availability and possibly increase performance as well.
       The parameter must be set on both primary and standby.
       On the primary, this parameter can be changed at any time; the
       behavior for any one transaction is determined by the setting in
       effect when it commits. It is therefore possible, and useful, to
       have some transactions replicate synchronously and others
       asynchronously. For example, to make a single multistatement
       transaction commit asynchronously when the default is
       synchronous replication, issue SET LOCAL synchronous_replication
       TO OFF within the transaction.
       On the standby, the parameter value is taken only at server
       start.
synchronous_replication_timeout (boolean)
       If the client has synchronous_replication set, and
       allow_standalone_primary is also set, then the commit will wait
       for up to synchronous_replication_timeout milliseconds before it
       returns a "success", or will wait forever if
       synchronous_replication_timeout is set to -1.
       If a standby server does not reply for
       synchronous_replication_timeout the primary will terminate the
       replication connection.
allow_standalone_primary (boolean)
       If allow_standalone_primary is not set, then the server will not
       allow connections until a standby connects that has
       synchronous_replication enabled.
       allow_standalone_primary also affects the behaviour when the
       synchronous_replication_timeout is reached.


somewhat concerned that we seem to need to use double negatives to describe what's going on here. It makes me think we ought to rename this to require_synchronous_standby or similar. 
 



25.5.2. Handling query conflicts
….


Remedial possibilities exist if the number of standby-query
cancellations is found to be unacceptable. Typically the best option is
to enable hot_standby_feedback. This prevents VACUUM from removing
recently-dead rows and so cleanup conflicts do not occur. If you do
this, you should note that this will delay cleanup of dead rows on the
primary, which may result in undesirable table bloat. However, the
cleanup situation will be no worse than if the standby queries were
running directly on the primary server. You are still getting the
benefit of off-loading execution onto the standby and the query may
complete faster than it would have done on the primary server.
max_standby_archive_delay must be kept large in this case, because
delayed WAL files might already contain entries that conflict with the
desired standby queries.





18.5.6. Standby Servers
These settings control the behavior of a standby server that is to
receive replication data.


hot_standby (boolean)
       Specifies whether or not you can connect and run queries during
       recovery, as described in Section 25.5. The default value is
       off. This parameter can only be set at server start. It only has
       effect during archive recovery or in standby mode.
hot_standby_feedback (boolean)
       Specifies whether or not a hot standby will send feedback to the
       primary about queries currently executing on the standby. This
       parameter can be used to eliminate query cancels caused by
       cleanup records, though it can cause database bloat on the
       primary for some workloads. The default value is off. This
       parameter can only be set at server start. It only has effect if
       hot_standby is enabled.


I was expecting this section to mention the synchronous_replication (bool) somewhere, to control whether the standby will participate synchronously or async; granted it's the same config as listed in 18.5.5, right? Just that the heading of that section specifically targets the primary. 

HTH, looks pretty good at first glance. 


Robert Treat

Re: Sync Rep Design

From
Marti Raudsepp
Date:
Most of your doc uses the terms "primary" and "standby", but a few
instances of "master" and "slave" have slipped in. I think it's better
to stick to consistent terminology.

On Thu, Dec 30, 2010 at 21:04, Simon Riggs <simon@2ndquadrant.com> wrote:
> With synchronous replication options specified at the application level
> (on the master) we can offer sync rep for the most important changes,
> without slowing down the bulk of the total workload. Application level
> options are an important and practical tool for allowing the benefits of
> synchronous replication for high performance applications. This feature
> is unique to PostgreSQL.

I think a comment about the "head-of-line blocking" nature of
streaming replication is in order. If you execute massive writes in
async mode and then run a transaction in sync mode, its commit will be
delayed until all the async transactions before it have been applied
on the slave.

> synchronous_replication_timeout (boolean)

Doesn't look like a boolean to me :)

Regards,
Marti


Re: Sync Rep Design

From
Simon Riggs
Date:
On Thu, 2010-12-30 at 15:07 -0500, Robert Treat wrote:
> > If more than one standby server specifies synchronous_replication,
> then
> > whichever standby replies first will release waiting commits.

> I don't want you to think I am setting an expectation, but I'm curious
> about the possibility of requiring more than 1 server to reply?

I was initially interested in this myself, but after a long discussion
on "quorum commit" it was decided to go with "first past post".

That is easier to manage, requires one less parameter, performs better
and doesn't really add that much additional confidence.

It was also discussed that we would have a plugin API, but I'm less sure
about that now. Perhaps we can add that option in the future, but it's
not high on my list of things for this release.

--
 Simon Riggs           http://www.2ndQuadrant.com/books/
 PostgreSQL Development, 24x7 Support, Training and Services



Re: Sync Rep Design

From
Simon Riggs
Date:
On Thu, 2010-12-30 at 15:07 -0500, Robert Treat wrote:
> > We use a single parameter to enable synchronous replication, set in
> > postgresql.conf on both primary and standby servers:
> >
> > synchronous_replication = off (default) | on
> >
> > On the primary, synchronous_replication can be set for particular
> users
> > or databases, or dynamically by applications programs.
> >
> >
> This seems like a potential issue, where I start a server with this
> off, and then I start turning it on for specific transactions; it
> isn't exactly clear what happens, since there may or may not be a
> running synchronous rep slave available.  (I love the idea though)

Not really an issue. Even if there was a standby there a moment ago, the
standby can go away at any time. So we must cope gracefully with what
happens if you do this. By default, the parameters specify that in the
case you mention we will just use async replication (no wait!).
Options exist to change that, since some people want to wait until the
sysadmin adds a standby.

--
 Simon Riggs           http://www.2ndQuadrant.com/books/
 PostgreSQL Development, 24x7 Support, Training and Services



Re: Sync Rep Design

From
Stefan Kaltenbrunner
Date:
On 12/30/2010 08:04 PM, Simon Riggs wrote:
> On Thu, 2010-12-30 at 18:42 +0100, Stefan Kaltenbrunner wrote:
>
>> it would help if this would just be a simple text-only description of
>> the design that people can actually comment on inline. I don't think
>> sending technical design proposals as a pdf (which seems to be written
>> in doc-style as well) is a good idea to encourage discussion on -hackers :(
>
> 25.2.6. Synchronous Replication
> Streaming replication is by default asynchronous. Transactions on the
> primary server write commit records to WAL, yet do not know whether or
> when a standby has received and processed those changes. So with
> asynchronous replication, if the primary crashes, transactions committed
> on the primary might not have been received by any standby. As a result,
> failover from primary to standby could cause data loss because
> transaction completions are absent, relative to the primary. The amount
> of data loss is proportional to the replication delay at the time of
> failover.
>
> Synchronous replication offers the ability to guarantee that all changes
> made by a transaction have been transferred to at least one remote
> standby server. This is an extension to the standard level of durability
> offered by a transaction commit. This is referred to as semi-synchronous
> replication.
>
> When synchronous replication is requested, the commit of a write
> transaction will wait until confirmation that the commit record has been
> transferred successfully to at least one standby server. Waiting for
> confirmation increases the user's confidence that the changes will not
> be lost in the event of server crashes but it also necessarily increases
> the response time for the requesting transaction. The minimum wait time
> is the roundtrip time from primary to standby.

hmm this is one of the main problems I see with the proposed "master is 
sometimes aware of the standby" (as in the feedback mode) concept this 
proposal has. If it waits for only one of the standbys there is some 
issue with the terminology. As a DBA I would expect the master to only 
return if ALL of the "sync replication" declared nodes replied ok.


>
> Read only transactions and transaction rollbacks need not wait for
> replies from standby servers. Subtransaction commits do not wait for
> responses from standby servers, only final top-level commits. Long
> running actions such as data loading or index building do not wait until
> the very final commit message.
>
>
> 25.2.6.1. Basic Configuration
> Synchronous replication must be enabled on both the primary and at least
> one standby server. If synchronous replication is disabled on the
> master, or enabled on the primary but not enabled on any slaves, the
> primary will use asynchronous replication by default.
>
> We use a single parameter to enable synchronous replication, set in
> postgresql.conf on both primary and standby servers:

this reads as if you can only set it there

>
> synchronous_replication = off (default) | on
>
> On the primary, synchronous_replication can be set for particular users
> or databases, or dynamically by applications programs.

this says otherwise

>
> If more than one standby server specifies synchronous_replication, then
> whichever standby replies first will release waiting commits.

see above for why I think this violates the configuration promise - if I 
say "this is a sync standby" I better expect it to be...

>
> Turning this setting off for a standby allows the administrator to
> exclude certain standby servers from releasing waiting transactions.
> This is useful if not all standby servers are designated as potential
> future primary servers. On the standby, this parameter only takes effect
> at server start.
>
>
> 25.2.6.2. Planning for Performance
> Synchronous replication usually requires carefully planned and placed
> standby servers to ensure applications perform acceptably. Waiting
> doesn't utilise system resources, but transaction locks continue to be
> held until the transfer is confirmed. As a result, incautious use of
> synchronous replication will reduce performance for database
> applications because of increased response times and higher contention.
>
> PostgreSQL allows the application developer to specify the durability
> level required via replication. This can be specified for the system
> overall, though it can also be specified for specific users or
> connections, or even individual transactions.
>
> For example, an application workload might consist of: 10% of changes
> are important customer details, while 90% of changes are less important
> data that the business can more easily survive if it is lost, such as
> chat messages between users.
>
> With synchronous replication options specified at the application level
> (on the master) we can offer sync rep for the most important changes,
> without slowing down the bulk of the total workload. Application level
> options are an important and practical tool for allowing the benefits of
> synchronous replication for high performance applications. This feature
> is unique to PostgreSQL.

that seems to be a bit too much marketing for a reference level document

>
>
> 25.2.6.3. Planning for High Availability
> The easiest and safest method of gaining High Availability using
> synchronous replication is to configure at least two standby servers. To
> understand why, we need to examine what can happen when you lose all
> standby servers.
>
> Commits made when synchronous_replication is set will wait until at
> least one standby responds. The response may never occur if the last, or
> only, standby should crash or the network drops. What should we do in
> that situation?
>
> Sitting and waiting will typically cause operational problems because it
> is an effective outage of the primary server. Allowing the primary
> server to continue processing in the absence of a standby puts those
> latest data changes at risk. How we handle this situation is controlled
> by allow_standalone_primary. The default setting is on, allowing
> processing to continue, though there is no recommended setting. Choosing
> the best setting for allow_standalone_primary is a difficult decision
> and best left to those with combined business responsibility for both
> data and applications. The difficulty of this choice is the reason why
> we recommend that you reduce the possibility of this situation occurring
> by using multiple standby servers.

if there is no recommended setting what will be the default?

[...]
>
> 25.5.2. Handling query conflicts
> ….
>
>
> Remedial possibilities exist if the number of standby-query
> cancellations is found to be unacceptable. Typically the best option is
> to enable hot_standby_feedback. This prevents VACUUM from removing
> recently-dead rows and so cleanup conflicts do not occur. If you do
> this, you should note that this will delay cleanup of dead rows on the
> primary, which may result in undesirable table bloat. However, the
> cleanup situation will be no worse than if the standby queries were
> running directly on the primary server. You are still getting the
> benefit of off-loading execution onto the standby and the query may
> complete faster than it would have done on the primary server.
> max_standby_archive_delay must be kept large in this case, because
> delayed WAL files might already contain entries that conflict with the
> desired standby queries.
>
>
> …
>
>
> 18.5.6. Standby Servers
> These settings control the behavior of a standby server that is to
> receive replication data.
>
>
> hot_standby (boolean)
>          Specifies whether or not you can connect and run queries during
>          recovery, as described in Section 25.5. The default value is
>          off. This parameter can only be set at server start. It only has
>          effect during archive recovery or in standby mode.
> hot_standby_feedback (boolean)
>          Specifies whether or not a hot standby will send feedback to the
>          primary about queries currently executing on the standby. This
>          parameter can be used to eliminate query cancels caused by
>          cleanup records, though it can cause database bloat on the
>          primary for some workloads. The default value is off. This
>          parameter can only be set at server start. It only has effect if
>          hot_standby is enabled.

so if this is enabled - suddenly the master becomes (kinda) aware of the 
specifics of a given standby - but what happens when one of the standbys 
is offline for a while? How does the master know that?
What I'm really missing with that proposal is how people expect that 
solution to be managed - given there is only sometimes a feedback 
channel into the master you can't do the monitoring.
Even if you could (which we really need!) there is nothing in the 
proposal yet that will help to determine which the most recent standby 
(in the case of more than one sync standby) might be.
It also does not address the more general (not sync rep specific) 
problem of how to deal with max_keep_segments, which is a wart I was 
hoping we could get rid of in 9.1 - but it would require a real standby 
registration or at least standby management possibility on the master, 
not a halfway done one - so do we really need hot_standby_feedback as 
part of the initial sync-rep patch?


Stefan


Re: Sync Rep Design

From
Aidan Van Dyk
Date:
On Thu, Dec 30, 2010 at 3:07 PM, Robert Treat <rob@xzilla.net> wrote:

>> If primary crashes while commits are waiting for acknowledgement, those
>> transactions will be marked fully committed if the primary database
>> recovers, no matter how allow_standalone_primary is set.
>
> This seems backwards; if you are waiting for acknowledgement, wouldn't the
> normal assumption be that the transactions *didnt* make it to any standby,
> and should be rolled back ?

This is the standard 2-phase commit problem.  The primary server *has*
committed it, it's fsync has returned, and the only thing keeping it
from returning the commit to the client is that it's waiting on a
synchronous "ack" from a slave.

You've got 2 options:
1) initiate fsync on the slave first
   - In this case, the slave is farther ahead than the primary, and if the
     primary fails, you're *forced* to have a failover.  The standby is
     ahead of the primary, so the primary recovering can cause divergence.
     And you'll likely have to do a base-backup style sync to get a new
     primary/standby setup.
2) initiate fsync on the primary first
   - In this case, the slave is always slightly behind.  If your
     primary falls over, you don't give commit messages to the clients, but
     if it recovers, it might have committed data, and slaves will still be
     able to catch up.

The thing is that currently, even without replication, #2 can happen.
If your db falls over before it gets the commit packet stuffed out the
network, you're in the same boat.  The data might be committed, even
though you didn't get the commit packet, and when  your DB recovers,
it's got the committed data that you never "knew" was committed.

a.

--
Aidan Van Dyk                                             Create like a god,
aidan@highrise.ca                                       command like a king,
http://www.highrise.ca/                                   work like a slave.


Re: Sync Rep Design

From
Robert Treat
Date:
On Thu, Dec 30, 2010 at 3:36 PM, Simon Riggs <simon@2ndquadrant.com> wrote:
>
> On Thu, 2010-12-30 at 15:07 -0500, Robert Treat wrote:
> > > If more than one standby server specifies synchronous_replication,
> > then
> > > whichever standby replies first will release waiting commits.
>
> > I don't want you to think I am setting an expectation, but I'm curious
> > about the possibility of requiring more than 1 server to reply?
>
> I was initially interested in this myself, but after a long discussion
> on "quorum commit" it was decided to go with "first past post".
>
> That is easier to manage, requires one less parameter, performs better
> and doesn't really add that much additional confidence.
>

Yes, I think with a single master, you are probably right (been
dealing with more than my fair share of multi-master based nosql
solutions lately)

Still, one thing that has me concerned is that in the case of two
slaves, you don't know which one is the more up-to-date one if you
need to failover. It'd be nice if you could just guarantee they both
are, but in lieu of that, I guess whatever decision tree is being
used, it needs to look at current xlog location of any potential
failover targets.

> It was also discussed that we would have a plugin API, but I'm less sure
> about that now. Perhaps we can add that option in the future, but its
> not high on my list of things for this release.
>

Agreed.

Robert Treat
http://www.xzilla.net


Re: Sync Rep Design

From
Simon Riggs
Date:
On Thu, 2010-12-30 at 15:07 -0500, Robert Treat wrote:
> > When allow_standalone_primary is set, a user will stop waiting once
> the
> > replication_timeout has been reached for their specific session.
> Users
> > are not waiting for a specific standby to reply, they are waiting
> for a
> > reply from any standby, so the unavailability of any one standby is
> not
> > significant to a user. It is possible for user sessions to hit
> timeout
> > even though standbys are communicating normally. In that case, the
> > setting of replication_timeout is probably too low.
> >
> >
> will a notice or warning be thrown in these cases? I'm thinking
> something
> like the checkpoint timeout warning, but could be something else; it
> just
> seems to me you need some way to know you're timing out.

We can do that, yes.

> > The standby sends regular status messages to the primary. If no
> status
> > messages have been received for replication_timeout the primary
> server
> > will assume the connection is dead and terminate it. This happens
> > whatever the setting of allow_standalone_primary.
> >
> >
> Does the standby attempt to reconnect in these scenarios?

Yes it would, but the reason why we terminated the connection was it
wasn't talking any more, so it is probably dead.

> > If primary crashes while commits are waiting for acknowledgement,
> those
> > transactions will be marked fully committed if the primary database
> > recovers, no matter how allow_standalone_primary is set.
> 
> 
> This seems backwards; if you are waiting for acknowledgement, wouldn't
> the
> normal assumption be that the transactions *didnt* make it to any
> standby,
> and should be rolled back ?

Well, we can't roll it back. We have already written the commit record
to WAL.

> > There is no way
> > to be certain that all standbys have received all outstanding WAL
> data
> > at time of the crash of the primary. Some transactions may not show
> as
> > committed on the standby, even though they show as committed on the
> > primary. The guarantee we offer is that the application will not
> receive
> > explicit acknowledgement of the successful commit of a transaction
> until
> > the WAL data is known to be safely received by the standby. Hence
> this
> > mechanism is technically "semi synchronous" rather than "fully
> > synchronous" replication. Note that replication still not be fully
> > synchronous even if we wait for all standby servers, though this
> would
> > reduce availability, as described previously.
> >
> >
> I think we ought to have an example of the best configuration for
> "cannot
> afford to lose any data" scenarios, where we would prefer an overall
> service
> interruption over the chance of having the primary / secondary out of
> synch.

I say "use two or more standbys" more than once...

> >>
> >
> somewhat concerned that we seem to need to use double negatives to
> describe
> whats going on here. it makes me think we ought to rename this to
> require_synchronous_standby or similar.

Don't see why we can't use double negatives. ;-)

The parameter is named directly from Fujii Masao's suggestion.

> > 18.5.6. Standby Servers
> > These settings control the behavior of a standby server that is to
> > receive replication data.
> >

...

> i was expecting this section to mention the synchronous_replication
> (bool)
> somewhere, to control if the standby will participate synchronously or
> asynch; granted it's the same config as listed in 18.5.5 right? Just
> that
> the heading of that section specifically targets the primary.

OK, good idea.

> HTH, looks pretty good at first glance. 

Thanks.

--
 Simon Riggs           http://www.2ndQuadrant.com/books/
 PostgreSQL Development, 24x7 Support, Training and Services



Re: Sync Rep Design

From
Simon Riggs
Date:
On Thu, 2010-12-30 at 15:51 -0500, Robert Treat wrote:

> Still, one thing that has me concerned is that in the case of two
> slaves, you don't know which one is the more up-to-date one if you
> need to failover. It'd be nice if you could just guarantee they both
> are...

Regrettably, nobody can know that, without checking.

--
 Simon Riggs           http://www.2ndQuadrant.com/books/
 PostgreSQL Development, 24x7 Support, Training and Services



Re: Sync Rep Design

From
Stefan Kaltenbrunner
Date:
On 12/30/2010 10:01 PM, Simon Riggs wrote:
> On Thu, 2010-12-30 at 15:51 -0500, Robert Treat wrote:
>
>> Still, one thing that has me concerned is that in the case of two
>> slaves, you don't know which one is the more up-to-date one if you
>> need to failover. It'd be nice if you could just guarantee they both
>> are...
>
> Regrettably, nobody can know that, without checking.

how exactly would you check? - this seems like something that needs to 
be done from the SQL and the CLI level and also very well documented 
(which I cannot see in your proposal).



Stefan



Re: Sync Rep Design

From
Simon Riggs
Date:
On Thu, 2010-12-30 at 22:11 +0200, Marti Raudsepp wrote:

> I think a comment about the "head-of-line blocking" nature of
> streaming repliaction is in order. If you execute massive writes in
> async mode and then run a transaction in sync mode, its commit will be
> delayed until all the async transactions before it have been applied
> on the slave.

Not really sure I understand what you want me to add there. The case you
mention is identical whether we use the word "async" or "sync" where you
mention "in async mode".

Replication doesn't wait until a sync commit is requested, it is
continuously active.

Sync rep's only addition are the reply messages.

--
 Simon Riggs           http://www.2ndQuadrant.com/books/
 PostgreSQL Development, 24x7 Support, Training and Services



Re: Sync Rep Design

From
Simon Riggs
Date:
On Thu, 2010-12-30 at 21:42 +0100, Stefan Kaltenbrunner wrote:
> >
> > Synchronous replication offers the ability to guarantee that all changes
> > made by a transaction have been transferred to at least one remote
> > standby server. This is an extension to the standard level of durability
> > offered by a transaction commit. This is referred to as semi-synchronous
> > replication.
> >
> > When synchronous replication is requested, the commit of a write
> > transaction will wait until confirmation that the commit record has been
> > transferred successfully to at least one standby server. Waiting for
> > confirmation increases the user's confidence that the changes will not
> > be lost in the event of server crashes but it also necessarily increases
> > the response time for the requesting transaction. The minimum wait time
> > is the roundtrip time from primary to standby.
> 
> hmm this is one of the main problems I see with the proposed "master is 
> sometimes aware of the standby"(as in the feedback mode) concept this 
> proposal has. If it waits for only one of the standbys there is some 
> issue with the terminology. As a DBA I would expect the master to only 
> return if ALL of the "sync replication" declared nodes replied ok.

Well, as a DBA, I expect it to work with just one. That's how MySQL and
Oracle work at least. If ALL standbys must reply, it takes longer, makes the
code harder, and you have to determine robustly what "all" is, etc. Plus it's
been discussed already.

> What I'm really missing with that proposal is how people expect that 
> solution to be managed - 

What aspect do you wish to monitor? I'm happy to consider your
suggestions.

> given there is only sometimes a feedback 
> channel into the master you can't do the monitoring.

Not sure what you mean. Please explain more.

> Even if you could (which we really need!) there is nothing in the 
> proposal yet that will help to determine on what the most recent standby 
> (in the case of more >1 sync standby) might be.

Functions to determine that already exist.

>  - but it would require a real standby 
> registration or at least standby management possibility on the master 
> not a halfway done one - so do we really need hot_standby_feedback as 
> part of the inital sync-rep patch?

It is a Hot Standby feature, but so tightly integrated with this code
that it isn't possible for me to submit as two separate patches.

--
 Simon Riggs           http://www.2ndQuadrant.com/books/
 PostgreSQL Development, 24x7 Support, Training and Services



Re: Sync Rep Design

From
Robert Haas
Date:
On Thu, Dec 30, 2010 at 2:04 PM, Simon Riggs <simon@2ndquadrant.com> wrote:
> We use a single parameter to enable synchronous replication, set in
> postgresql.conf on both primary and standby servers:
>
> synchronous_replication = off (default) | on
>
> On the primary, synchronous_replication can be set for particular users
> or databases, or dynamically by applications programs.
>
> If more than one standby server specifies synchronous_replication, then
> whichever standby replies first will release waiting commits.
>
> Turning this setting off for a standby allows the administrator to
> exclude certain standby servers from releasing waiting transactions.
> This is useful if not all standby servers are designated as potential
> future primary servers. On the standby, this parameter only takes effect
> at server start.

I think it's a bad idea to use the same parameter to mean different
things on the master and standby.  You proposed this kind of double
meaning for the hot_standby parameter (possibly back when it was
called standby_connections, or something like that) and we (rightly, I
think) did not adopt that, instead ending up with wal_level to control
the master's behavior and hot_standby to control the slave's behavior.

> synchronous_replication (boolean)
>        Specifies whether transaction commit will wait for WAL records
>        to be replicated before the command returns a "success"
>        indication to the client.

The word "replicated" here could be taken to mean different things,
most obviously:

- slave has received the WAL
- slave has fsync'd the WAL
- slave has applied the WAL

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: Sync Rep Design

From
Simon Riggs
Date:
On Thu, 2010-12-30 at 22:08 +0100, Stefan Kaltenbrunner wrote:
> On 12/30/2010 10:01 PM, Simon Riggs wrote:
> > On Thu, 2010-12-30 at 15:51 -0500, Robert Treat wrote:
> >
> >> Still, one thing that has me concerned is that in the case of two
> >> slaves, you don't know which one is the more up-to-date one if you
> >> need to failover. It'd be nice if you could just guarantee they both
> >> are...
> >
> > Regrettably, nobody can know that, without checking.
> 
> how exactly would you check? - this seems like something that needs to 
> be done from the SQL and the CLI level and also very well documented 
> (which I cannot see in your proposal).

This is a proposal for sync rep, not multi-node failover. I'm definitely
not going to widen the scope of this project.

Functions already exist to check the thing you're asking.

--
 Simon Riggs           http://www.2ndQuadrant.com/books/
 PostgreSQL Development, 24x7 Support, Training and Services



Re: Sync Rep Design

From
Robert Haas
Date:
On Thu, Dec 30, 2010 at 3:42 PM, Stefan Kaltenbrunner
<stefan@kaltenbrunner.cc> wrote:
>> synchronous replication for high performance applications. This feature
>> is unique to PostgreSQL.
>
> that seems to be a bit too much marketing for a reference level document

+1.

> It also does not address the more general (not sync rep specific) problem of
> how to deal with max_keep_segments which is a wart and I was hoping we could
> get rid of in 9.1 - but it would require a real standby registration or at
> least standby management possibility on the master not a halfway done one -
> so do we really need hot_standby_feedback as part of the inital sync-rep
> patch?

And this is really the key point on which previous discussions of sync
rep stalled.  Simon is clearly of the opinion that any system where
the slaves have an individual identities (aka "standby registration")
is a bad idea, but the only justification he's offered for that
position is the assertion that it doesn't allow any added
functionality.  As you point out, and as has been pointed out before,
this is not true, but unless Simon has changed his position since the
last time we discussed this, he will not only refuse to include any
kind of standby identifier in any of his proposals, but will also
argue against including any such code even if it is written by someone
else.  I don't understand why, but that's how it is.

Synchronous replication would probably be done and committed by now if
it weren't for this issue.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: Sync Rep Design

From
Jim Nasby
Date:
On Dec 30, 2010, at 3:27 PM, Robert Haas wrote:
>> synchronous_replication (boolean)
>>        Specifies whether transaction commit will wait for WAL records
>>        to be replicated before the command returns a "success"
>>        indication to the client.
>
> The word "replicated" here could be taken to mean different things,
> most obviously:
>
> - slave has received the WAL
> - slave has fsync'd the WAL
> - slave has applied the WAL

I think that comment is valid for the entire set of docs, actually. The document goes out of its way to avoid simple
phrases like "replicated", but doesn't spell out exactly what is happening, ie: 

"Synchronous replication offers the ability to guarantee that all changes
made by a transaction have been transferred to at least one remote
standby server. This is an extension to the standard level of durability
offered by a transaction commit. This is referred to as semi-synchronous
replication."

Reading that, I'm left with the sense that this isn't a simple matter of "Oh, the data has been replicated to the slave
before commit returns", but nothing does a good job of clearly explaining what the distinction is and what it means.
This section: 

"The guarantee we offer is that the application will not receive
explicit acknowledgement of the successful commit of a transaction until
the WAL data is known to be safely received by the standby. Hence this
mechanism is technically "semi synchronous" rather than "fully
synchronous" replication."

does provide some enlightenment, but it's at the end of the section. I think it would be best if there was a section
right at the beginning that talked about the data quality issue of sync replication and how we're avoiding it with our
semi-sync solution. 
--
Jim C. Nasby, Database Architect                   jim@nasby.net
512.569.9461 (cell)                         http://jim.nasby.net




Re: Sync Rep Design

From
Simon Riggs
Date:
On Thu, 2010-12-30 at 16:27 -0500, Robert Haas wrote:

> I think it's a bad idea to use the same parameter to mean different
> things on the master and standby.  

Obviously if you phrase it like that, nobody would disagree. I would say
I have used the same parameter on both sides in a balanced way to
simplify the configuration, which had been an important factor in the
debate.

"You need to set parameter X on both primary and standby" seems simple
and clear. It certainly works OK for MySQL.

It's no bother to change, whichever way we decide and I'm happy to do
so.

My previous patch had two parameters:

primary: synchronous_replication = ...
standby: synchronous_replication_service = on | off
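A minimal sketch of how that variant reads in the config files (both are
proposed names from this thread, not released parameters):

  # postgresql.conf on the primary
  synchronous_replication = on             # wait for standby confirmation

  # postgresql.conf on the standby
  synchronous_replication_service = on     # offer this standby as a sync target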

Which do people prefer?

--
 Simon Riggs           http://www.2ndQuadrant.com/books/
 PostgreSQL Development, 24x7 Support, Training and Services



Re: Sync Rep Design

From
Simon Riggs
Date:
On Thu, 2010-12-30 at 16:47 -0500, Robert Haas wrote:
> On Thu, Dec 30, 2010 at 3:42 PM, Stefan Kaltenbrunner
> <stefan@kaltenbrunner.cc> wrote:
> >> synchronous replication for high performance applications. This feature
> >> is unique to PostgreSQL.
> >
> > that seems to be a bit too much marketing for a reference level document
> 
> +1.

I've removed the "This feature is unique to PostgreSQL", which I agree
belongs in a press release, not docs. The explanation of a use case that
would benefit from the feature seems valid and I've left that in.

PostgreSQL docs are more technical and precise than any other DBMS, even
DB2. Having read everybody else's docs, I'm inclined to say it would be
easier to explain if I left out the details, as they do. You won't find
a detailed explanation of commit guarantees in MySQL docs, for example.

--
 Simon Riggs           http://www.2ndQuadrant.com/books/
 PostgreSQL Development, 24x7 Support, Training and Services



Re: Sync Rep Design

From
Simon Riggs
Date:
On Thu, 2010-12-30 at 18:47 -0600, Jim Nasby wrote:
> On Dec 30, 2010, at 3:27 PM, Robert Haas wrote:
> >> synchronous_replication (boolean)
> >>        Specifies whether transaction commit will wait for WAL records
> >>        to be replicated before the command returns a "success"
> >>        indication to the client.
> > 
> > The word "replicated" here could be taken to mean different things,
> > most obviously:
> > 
> > - slave has received the WAL
> > - slave has fsync'd the WAL
> > - slave has applied the WAL
> 
> I think that comment is valid for the entire set of docs, actually. The document goes out of its way to avoid simple
> phrases like "replicated", but doesn't spell out exactly what is happening, ie:
> 
> "Synchronous replication offers the ability to guarantee that all changes
> made by a transaction have been transferred to at least one remote
> standby server. This is an extension to the standard level of durability
> offered by a transaction commit. This is referred to as semi-synchronous
> replication."
> 
> Reading that, I'm left with the sense that this isn't a simple matter of "Oh, the data has been replicated to the
> slave before commit returns", but nothing does a good job of clearly explaining what the distinction is and what it
> means. This section:
> 
> "The guarantee we offer is that the application will not receive
> explicit acknowledgement of the successful commit of a transaction until
> the WAL data is known to be safely received by the standby. Hence this
> mechanism is technically "semi synchronous" rather than "fully
> synchronous" replication."
> 
> does provide some enlightenment, but it's at the end of the section. I think it would be best if there was a section
> right at the beginning that talked about the data quality issue of sync replication and how we're avoiding it with our
> semi-sync solution.

I'm happy to change the docs. It's the first draft...

If that's the only problem you've got, then I'm feeling good.

Any problems with the user interface itself?

--
 Simon Riggs           http://www.2ndQuadrant.com/books/
 PostgreSQL Development, 24x7 Support, Training and Services



Re: Sync Rep Design

From
Simon Riggs
Date:
On Thu, 2010-12-30 at 16:47 -0500, Robert Haas wrote:

> > It also does not address the more general (not sync rep specific) problem of
> > how to deal with max_keep_segments which is a wart and I was hoping we could
> > get rid of in 9.1 - but it would require a real standby registration or at
> > least standby management possibility on the master not a halfway done one -
> > so do we really need hot_standby_feedback as part of the inital sync-rep
> > patch?
> 
> And this is really the key point on which previous discussions of sync
> rep stalled.  Simon is clearly of the opinion that any system where
> the slaves have an individual identities (aka "standby registration")
> is a bad idea, but the only justification he's offered for that
> position is the assertion that it doesn't allow any added
> functionality.  As you point out, and as has been pointed out before,
> this is not true, but unless Simon has changed his position since the
> last time we discussed this, he will not only refuse to include any
> kind of standby identifier in any of his proposals, but will also
> argue against including any such code even if it is written by someone
> else.  I don't understand why, but that's how it is.
> 
> Synchronous replication would probably be done and committed by now if
> it weren't for this issue.

I'm not very clear what your response has to do with Stefan's comments.

My general perspective is that MySQL released a simple design a year
ahead of us, which should be to our collective shame. I will be working
towards delivering something useful in this release.

Standby registration is complicated and not necessary. If anybody needs
to justify anything, it is the people that claim it is somehow
essential. If you want increased complexity and features, you can have
it, one day, but don't prevent everybody else from benefiting from
simplicity, now. What we do need is performance, otherwise the feature
is mostly unusable for production systems, without splitting your
application into pieces.

I would rather concentrate on a minimal set of functionality that we can
all agree on. To show that, I have gone out of my way to include
features specified by others, including exact names and behaviours of
parameters.

--
 Simon Riggs           http://www.2ndQuadrant.com/books/
 PostgreSQL Development, 24x7 Support, Training and Services



Re: Sync Rep Design

From
Joshua Tolley
Date:
On Thu, Dec 30, 2010 at 03:24:09PM -0500, Aidan Van Dyk wrote:
> On Thu, Dec 30, 2010 at 3:07 PM, Robert Treat <rob@xzilla.net> wrote:
>
> >> If primary crashes while commits are waiting for acknowledgement, those
> >> transactions will be marked fully committed if the primary database
> >> recovers, no matter how allow_standalone_primary is set.
> >
> > This seems backwards; if you are waiting for acknowledgement, wouldn't the
> > normal assumption be that the transactions *didnt* make it to any standby,
> > and should be rolled back ?
>
> This is the standard 2-phase commit problem.  The primary server *has*
> committed it, it's fsync has returned, and the only thing keeping it
> from returning the commit to the client is that it's waiting on a
> synchronous "ack" from a slave.

<snip>

> 2) initiate fsync on the primary first
>    - In this case, the slave is always slightly behind.  If your
> primary falls over, you don't give commit messages to the clients, but
> if it recovers, it might have committed data, and slaves will still be
> able to catch up.
>
> The thing is that currently, even without replication, #2 can happen.

For what little it's worth, I vote for this option, because it's a problem
that can already happen (as opposed to adding an entirely new type of problem
to the mix).

--
Joshua Tolley / eggyknap
End Point Corporation
http://www.endpoint.com

Re: Sync Rep Design

From
Robert Haas
Date:
On Thu, Dec 30, 2010 at 8:57 PM, Simon Riggs <simon@2ndquadrant.com> wrote:
> I'm not very clear what your response has to do with Stefan's comments.
>
> My general perspective is that MySQL released a simple design a year
> ahead of us, which should be to our collective shame. I will be working
> towards delivering something useful in this release.

I don't feel ashamed of our feature set and I am not out to beat MySQL
or anyone else, just to deliver the best product that we can.  Our
community has different interests than the MySQL community and that is
fine.  Still, I don't disagree that we should be aiming at feature
parity.

<reads MySQL documentation>

I see now that you've tried to design this feature in a way that is
similar to MySQL's offering, which does have some value.  But it
appears to me that the documentation you've written here is
substantially similar to the MySQL 5.5 reference documentation.  That
could get us into a world of legal trouble - that documentation is not
even open source, let alone BSD.

http://dev.mysql.com/doc/refman/5.5/en/replication-semisync.html

> I would rather concentrate on a minimal set of functionality that we can
> all agree on.

Me too; and perhaps your proposal is it.  But I think it's a shame we
didn't put more work into standby registration when we had time to get
that done.  It might not be necessary, but it would have delivered
some nice functionality that we are now not going to have for 9.1.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: Sync Rep Design

From
Hannu Krosing
Date:
On 30.12.2010 22:27, Robert Haas wrote:
> On Thu, Dec 30, 2010 at 2:04 PM, Simon Riggs<simon@2ndquadrant.com>  wrote:
>> synchronous_replication (boolean)
>>         Specifies whether transaction commit will wait for WAL records
>>         to be replicated before the command returns a "success"
>>         indication to the client.
> The word "replicated" here could be taken to mean different things,
> most obviously:
>
> - slave has received the WAL
> - slave has fsync'd the WAL
> - slave has applied the WAL
Perhaps the level of "replication guarantee" should be decided on the
slave side, by having a configuration parameter there

report_as_replicated = received|written_to_disk|fsynced|applied

since different types of hosts may have wildly different guarantees and
performance parameters for these. One could envision a WAL-archive type
"standby" which is there for data persistence only and will never
"apply" WAL.
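As a purely illustrative sketch of that idea (report_as_replicated is a
proposed parameter, not an existing one), the standby side might read:

  # standby configuration (hypothetical parameter)
  report_as_replicated = fsynced    # acknowledge once WAL is safely on disk

  # a WAL-archive style standby that never applies WAL might instead use
  # report_as_replicated = received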

Of course we could put a bitmap in the status update messages from the
slave and have some quorum options on the master for when the data is
"in sync", say "need 5 received or (1 applied and 1 fsynced)", but I am
pretty sure that trying to get anywhere with this before applying the
basic sync rep patch would push back sync rep to at least 9.2 if not 9.5.

---------------------
Hannu Krosing



Re: Sync Rep Design

From
Hannu Krosing
Date:
On 31.12.2010 6:02, Robert Haas wrote:
> On Thu, Dec 30, 2010 at 8:57 PM, Simon Riggs<simon@2ndquadrant.com>  wrote:
>> I'm not very clear what your response has to do with Stefan's comments.
>>
>> My general perspective is that MySQL released a simple design a year
>> ahead of us, which should be to our collective shame. I will be working
>> towards delivering something useful in this release.
> I don't feel ashamed of our feature set and I am not out to beat MySQL
> or anyone else, just to deliver the best product that we can.
The key word here is "deliver". The aim is to "deliver" sync rep, not to
"specify, leaving out controversial details". The registration part has
been left out for a reason - while the registration itself is easy,
deciding all the interactions with already running replication is not.
Doing just the minimal support for sync rep (needing an acknowledgement
from at least one standby) and leaving the management of standbys to the
user enables us to get to actual working code instead of a
pie-in-the-sky wishlist.

>   Our
> community has different interests than the MySQL community and that is
> fine.  Still, I don't disagree that we should be aiming at feature
> parity.
>
> <reads MySQL documentation>
>
> I see now that you've tried to design this feature in a way that is
> similar to MySQL's offering, which does have some value.  But it
> appears to me that the documentation you've written here is
> substantially similar to the MySQL 5.5 reference documentation.  That
> could get us into a world of legal trouble - that documentation is not
> even open source, let alone BSD.
>
> http://dev.mysql.com/doc/refman/5.5/en/replication-semisync.html
Maybe we should get someone who has not read the MySQL docs to re-write
a spec in a "clean room" fashion, by just inspecting code and asking
Simon et al.

>> I would rather concentrate on a minimal set of functionality that we can
>> all agree on.
> Me too; and perhaps your proposal is it.  But I think it's a shame we
> didn't put more work into standby registration when we had time to get
> that done.
When you need _just_ the registration, then make a table and two functions,
pg_standby_register(name) and pg_standby_unregister(name).
For a little more added functionality add a third one,
pg_standby_last_seen(name),
to update the last-seen timestamp, and a script that polls all standbys
and calls this.
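A purely illustrative SQL sketch of that minimal scheme (the table name,
columns and function bodies are hypothetical, following the suggestion
above):

  CREATE TABLE standby_registry (
      standby_name text PRIMARY KEY,
      last_seen    timestamptz
  );

  CREATE FUNCTION pg_standby_register(text) RETURNS void AS $$
      INSERT INTO standby_registry VALUES ($1, now());
  $$ LANGUAGE sql;

  CREATE FUNCTION pg_standby_unregister(text) RETURNS void AS $$
      DELETE FROM standby_registry WHERE standby_name = $1;
  $$ LANGUAGE sql;

  CREATE FUNCTION pg_standby_last_seen(text) RETURNS void AS $$
      UPDATE standby_registry SET last_seen = now()
       WHERE standby_name = $1;
  $$ LANGUAGE sql;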
>    It might not be necessary, but it would have delivered
> some nice functionality that we are now not going to have for 9.1.
There are tons of "nice functionality we are not going to have for 9.1",
let's just not make this cause even more nice functionality to be left
out!

---------------------
Hannu Krosing



Re: Sync Rep Design

From
Stefan Kaltenbrunner
Date:
On 12/30/2010 10:27 PM, Simon Riggs wrote:
> On Thu, 2010-12-30 at 22:08 +0100, Stefan Kaltenbrunner wrote:
>> On 12/30/2010 10:01 PM, Simon Riggs wrote:
>>> On Thu, 2010-12-30 at 15:51 -0500, Robert Treat wrote:
>>>
>>>> Still, one thing that has me concerned is that in the case of two
>>>> slaves, you don't know which one is the more up-to-date one if you
>>>> need to failover. It'd be nice if you could just guarantee they both
>>>> are...
>>>
>>> Regrettably, nobody can know that, without checking.
>>
>> how exactly would you check? - this seems like something that needs to
>> be done from the SQL and the CLI level and also very well documented
>> (which I cannot see in your proposal).
>
> This is a proposal for sync rep, not multi-node failover. I'm definitely
> not going to widen the scope of this project.
>
> Functions already exist to check the thing you're asking.

Well, your proposal includes a lot of stuff on how to avoid data loss and
get High Availability - so I think it is a requirement for us to
tell the DBA what the procedures are both for setting it up (which is
what is in the docs - but only 50% of the thing) and for what to do in
the case of a disaster (which is the other part of the problem).
Or said otherwise - sync rep is not very useful if there is no easy and
reliable way to get that information. If that stuff is already available,
even better, but I'm not aware of what is there and what not, so I expect
others to have the same problem.



Stefan


Re: Sync Rep Design

From
Stefan Kaltenbrunner
Date:
On 12/30/2010 10:23 PM, Simon Riggs wrote:
> On Thu, 2010-12-30 at 21:42 +0100, Stefan Kaltenbrunner wrote:
>>>
>>> Synchronous replication offers the ability to guarantee that all changes
>>> made by a transaction have been transferred to at least one remote
>>> standby server. This is an extension to the standard level of durability
>>> offered by a transaction commit. This is referred to as semi-synchronous
>>> replication.
>>>
>>> When synchronous replication is requested, the commit of a write
>>> transaction will wait until confirmation that the commit record has been
>>> transferred successfully to at least one standby server. Waiting for
>>> confirmation increases the user's confidence that the changes will not
>>> be lost in the event of server crashes but it also necessarily increases
>>> the response time for the requesting transaction. The minimum wait time
>>> is the roundtrip time from primary to standby.
>>
>> hmm this is one of the main problems I see with the proposed "master is
>> sometimes aware of the standby"(as in the feedback mode) concept this
>> proposal has. If it waits for only one of the standbys there is some
>> issue with the terminology. As a DBA I would expect the master to only
>> return if ALL of the "sync replication" declared nodes replied ok.
>
> Well, as a DBA, I expect it to work with just one. That's how MySQL and
> Oracle work at least. If ALL standbys reply, it takes longer, makes the
> code harder, how do you determine what "all" is robustly etc.. Plus its
> been discussed already.

Maybe it has been discussed but I still don't see why it makes any
sense. If I declare a standby a sync standby I better want it sync - not
"maybe sync". Consider the case of one master and two identical sync
standbys - one sync standby is in the same datacenter, the other is in a
backup location, say 15km away.
Given there is a small constant latency to the second box (even if you
have fast networks) the end effect is that the second standby will NEVER
be sync (because the local one will always be faster) and you end up
with an async slave that cannot be used per your business rules?

>
>> What I'm really missing with that proposal is how people expect that
>> solution to be managed -
>
> What aspect do you wish to monitor? I'm happy to consider your
> suggestions.
>
>> given there is only sometimes a feedback
>> channel into the master you can't do the monitoring.
>
> Not sure what you mean. Please explain more.

Well, hot_standby_feedback (not sure I like the name but I can't think of
anything better either) provides feedback to the master (like sync rep)
and the master is actually acting on that feedback. So in effect we have
stuff affecting the master that we need to be able to monitor clearly.
But to make that information useful it would help to see which standby
provided what kind of feedback, so we are back to having the master being
aware of what standbys are connected and what standbys are supposed to
be there - which is the issue sync rep got stalled on before...

>
>> Even if you could (which we really need!) there is nothing in the
>> proposal yet that will help to determine on what the most recent standby
>> (in the case of more>1 sync standby) might be.
>
> Functions to determine that already exist.
>
>>   - but it would require a real standby
>> registration or at least standby management possibility on the master
>> not a halfway done one - so do we really need hot_standby_feedback as
>> part of the inital sync-rep patch?
>
> It is a Hot Standby feature, but so tightly integrated with this code
> that it isn't possible for me to submit as two separate patches.

Well, you are the developer of that feature, but if it is already a knob
that can be turned on and off it seems not too hard to submit as a
separate feature...



Stefan


Re: Sync Rep Design

From
Heikki Linnakangas
Date:
On 31.12.2010 09:50, Hannu Krosing wrote:
> On 30.12.2010 22:27, Robert Haas wrote:
>> On Thu, Dec 30, 2010 at 2:04 PM, Simon Riggs<simon@2ndquadrant.com>
>> wrote:
>>> synchronous_replication (boolean)
>>> Specifies whether transaction commit will wait for WAL records
>>> to be replicated before the command returns a "success"
>>> indication to the client.
>> The word "replicated" here could be taken to mean different things,
>> most obviously:
>>
>> - slave has received the WAL
>> - slave has fsync'd the WAL
>> - slave has applied the WAL
> Perhaps the level of "replication guarantee" should be decided on the
> slave side, by
> having a configuration parameter there
>
> report_as_replicated = received|written_to_disk|fsynced|applied
>
> for different types of hosts may have wildly different guarantees and
> performance
> parameters for these. One could envision a WAL-archive type "standby"
> which is
> there for data persistence only will and never "apply" WAL.

Agreed, it feels natural to specify when a piece of WAL is acknowledged 
in the standby.

Regarding the rest of the proposal, I would still prefer the UI 
discussed here:

http://archives.postgresql.org/message-id/4CAE030A.2060701@enterprisedb.com

It ought to be the same amount of work to implement, and provides the 
same feature set, but makes administration a bit easier by being able to 
name the standbys. Also, I dislike the idea of having the standby 
specify that it's a synchronous standby that the master has to wait for. 
Behavior on the master should be configured on the master.

--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com


Re: Sync Rep Design

From
Simon Riggs
Date:
On Fri, 2010-12-31 at 09:27 +0100, Stefan Kaltenbrunner wrote:

> Maybe it has been discussed but I still don't see way it makes any 
> sense. If I declare a standby a sync standby I better want it sync - not 
> "maybe sync". consider the case of a 1 master and two identical sync 
> standbys - one sync standby is in the same datacenter the other is in a 
> backup location say 15km away.
> Given there is a small constant latency to the second box (even if you 
> have fast networks) the end effect is that the second standby will NEVER 
> be sync (because the local one will always be faster) and you end up 
> with an async slave that cannot be used per your business rules?

Your picture above is a common misconception. I will add something to
the docs to explain this.

1. "sync" is a guarantee about how we respond to the client when we
commit. If we wait for more than one response that slows things down,
makes the cluster more fragile, complicates the code and doesn't
appreciably improve the guarantee.

2. "sync" does not guarantee that the updates to the standbys are in any
way coordinated. You can run a query on one standby and get one answer
and at the exact same time run the same query on another standby and get
a different answer (slightly ahead/behind). That also means that if the
master crashes one of the servers will be ahead or behind. You can use
pg_last_xlog_receive_location() to check which one that is.
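For example, a minimal check with the existing function mentioned above,
run against each standby:

  -- the standby reporting the greater WAL location is the more
  -- up-to-date one
  SELECT pg_last_xlog_receive_location();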

When people say they want *all* servers to respond, it's usually because
they want (2), but that is literally impossible in a distributed system.

--
 Simon Riggs           http://www.2ndQuadrant.com/books/
 PostgreSQL Development, 24x7 Support, Training and Services



Re: Sync Rep Design

From
Stefan Kaltenbrunner
Date:
On 12/31/2010 11:06 AM, Heikki Linnakangas wrote:
> On 31.12.2010 09:50, Hannu Krosing wrote:
>> On 30.12.2010 22:27, Robert Haas wrote:
>>> On Thu, Dec 30, 2010 at 2:04 PM, Simon Riggs<simon@2ndquadrant.com>
>>> wrote:
>>>> synchronous_replication (boolean)
>>>> Specifies whether transaction commit will wait for WAL records
>>>> to be replicated before the command returns a "success"
>>>> indication to the client.
>>> The word "replicated" here could be taken to mean different things,
>>> most obviously:
>>>
>>> - slave has received the WAL
>>> - slave has fsync'd the WAL
>>> - slave has applied the WAL
>> Perhaps the level of "replication guarantee" should be decided on the
>> slave side, by
>> having a configuration parameter there
>>
>> report_as_replicated = received|written_to_disk|fsynced|applied
>>
>> for different types of hosts may have wildly different guarantees and
>> performance
>> parameters for these. One could envision a WAL-archive type "standby"
>> which is
>> there for data persistence only will and never "apply" WAL.
>
> Agreed, it feels natural to specify when a piece of WAL is acknowledged
> in the standby.
>
> Regarding the rest of the proposal, I would still prefer the UI
> discussed here:
>
> http://archives.postgresql.org/message-id/4CAE030A.2060701@enterprisedb.com
>
> It ought to be the same amount of work to implement, and provides the
> same feature set, but makes administration a bit easier by being able to
> name the standbys. Also, I dislike the idea of having the standby
> specify that it's a synchronous standby that the master has to wait for.
> Behavior on the master should be configured on the master.

Well, that proposal is much closer to what I want as an admin - except
that it would be nice to configure that through actual DDL.
My wish would be more like:

* standby provides a unique name identifier
* standby has a flag to say the maximum (or minimum?)
replication_reported support it can do
* standby connects to the master async by default and the master
registers the standby automatically
* on the master I can do the following with every standby that is
visible to the master or has been in the past:
  * enable/disable and add/remove permanently (if not added permanently
    the registration is transient) - enabled if not set explicitly
  * sync_rep_enabled (boolean) (so you can still do per transaction or
    per database or whatever sync rep) - disabled if not set explicitly
  * sync_reply_required (boolean) (per sync standby flag to require a
    reply before the master returns - if there is only one sync standby
    this is the default behaviour, if there are more the admin can choose)
  * wait_forever or timeout per standby
  * maybe a way to set the report_as_replicated from the master (if
    feasible) up to the max of what the standby can do

So you would get the proposed "semi sync rep" mode by simply setting
more than one standby as "sync_rep_enabled" with sync_reply_required
false for all of them (i.e. any one of them can reply and the master
returns) - if you want better than that, just require a reply from a
specific one or more than one.

This would also help us provide a simple view with a nice and handy
status report on the slaves (which ones are there, which ones should be
there, how far are they in terms of applying WAL, what status do they have).

IMHO an interface like this would be:

a) convenient because it would not impose any additional setup
requirements for async rep except providing a "name" on the standby by
default
b) it would enable the master to specify the business rules clearly
c) would still support the simple "one sync reply is enough" semisync
replication case people like to have
d) would also enable the admin to get more than ONE sync standby that is
really sync - not maybe sync
e) hot_standby_feedback (if enabled) would look at only the permanently
enabled slaves, so only a "DBA approved" standby would be able to affect
the master for table bloat
f) provide the necessary meta information for providing the handy "quick
& nice" replication status overview reporting feature people want and need
g) for all the permanently enabled async nodes we could keep track of
the oldest required WAL and keep that (optionally), so we could
get rid of the hard-to-size max_keep_segments and maintain that
automatically.

The only disadvantage I can see would be that you would have to manually
remove a non-functional slave from the master (and only one that you set
some explicit configuration for!) if you decide you don't need it any more.
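To make the status-view idea above concrete, a purely illustrative
sketch (the registration table and its columns are hypothetical; nothing
like this exists today):

  SELECT standby_name,
         enabled,
         sync_rep_enabled,
         last_seen,
         last_reported_wal_location
    FROM standby_registry       -- hypothetical registration table
   ORDER BY standby_name;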



Stefan


Re: Sync Rep Design

From
Simon Riggs
Date:
On Fri, 2010-12-31 at 12:06 +0200, Heikki Linnakangas wrote:
> On 31.12.2010 09:50, Hannu Krosing wrote:
> > On 30.12.2010 22:27, Robert Haas wrote:
> >> On Thu, Dec 30, 2010 at 2:04 PM, Simon Riggs<simon@2ndquadrant.com>
> >> wrote:
> >>> synchronous_replication (boolean)
> >>> Specifies whether transaction commit will wait for WAL records
> >>> to be replicated before the command returns a "success"
> >>> indication to the client.
> >> The word "replicated" here could be taken to mean different things,
> >> most obviously:
> >>
> >> - slave has received the WAL
> >> - slave has fsync'd the WAL
> >> - slave has applied the WAL
> > Perhaps the level of "replication guarantee" should be decided on the
> > slave side, by
> > having a configuration parameter there
> >
> > report_as_replicated = received|written_to_disk|fsynced|applied
> >
> > for different types of hosts may have wildly different guarantees and
> > performance
> > parameters for these. One could envision a WAL-archive type "standby"
> > which is
> > there for data persistence only will and never "apply" WAL.
> 
> Agreed, it feels natural to specify when a piece of WAL is acknowledged 
> in the standby.

That can also be done, it's not a problem.

Many people asked for just "on" or "off".

Currently "on" <--> "slave has fsynced" WAL.

> Also, I dislike the idea of having the standby 
> specify that it's a synchronous standby that the master has to wait for. 
> Behavior on the master should be configured on the master.

The parameter on the standby affects the behaviour of the standby. The
standby is saying "don't pick me, I'm not a good target".

--
 Simon Riggs           http://www.2ndQuadrant.com/books/
 PostgreSQL Development, 24x7 Support, Training and Services



Re: Sync Rep Design

From
Simon Riggs
Date:
On Fri, 2010-12-31 at 12:06 +0200, Heikki Linnakangas wrote:

> Regarding the rest of the proposal, I would still prefer the UI 
> discussed here:
> 
> http://archives.postgresql.org/message-id/4CAE030A.2060701@enterprisedb.com
> 
> It ought to be the same amount of work to implement, and provides the 
> same feature set, but makes administration a bit easier by being able to 
> name the standbys. Also, I dislike the idea of having the standby 
> specify that it's a synchronous standby that the master has to wait for. 
> Behavior on the master should be configured on the master.

Good point; I've added the people on the copy list from that post. This
question is the key one, so please respond after careful thought on my
points below.

There are ways to blend together the two approaches, discussed later,
though first we need to look at the reasons behind my proposals.

I see significant real-world issues with configuring replication using
multiple named servers, as described in the link above:

1. Syncing to multiple standbys does not guarantee that the updates to
the standbys are in any way coordinated. You can run a query on one
standby and get one answer and at the exact same time run the same query
on another standby and get a different answer (slightly ahead/behind).
That also means that if the master crashes one of the servers can still
be ahead or behind, even though you asked them to be the same. So you
don't actually get what you think you're getting.

2. Availability of the cluster just went down. If *any* of the
"important nodes" goes down, then everything just freezes. (I accept
that you want that, and have provided that as an option).

3. Administrative complexity just jumped a huge amount. 

(a) If you add or remove servers to the config you need to respecify all
the parameters, which need to be specific to the exact set of servers.
There is no way to test that you have configured the parameters
correctly without a testbed that exactly mirrors production with same
names etc., or trying it in directly in production. So availability
takes another potential hit because of user error.

(b) After failover, the list of synchronous_standbys needs to be
re-specified, yet what is the correct list of servers? The only way to
make that config work is with complex middleware that automatically
generates new config files. I don't think that is "the same amount of
work to implement", its an order of magnitude harder overall.

4. As a result of the administrative complexity, testing the full range
of function will take significantly longer and that is likely to have a
direct impact on the robustness of PostgreSQL 9.1.

5. Requesting sync from more than one server performs poorly, since you
must wait for additional servers. If there are sporadic or systemic
network performance issues you will be badly hit by them. Monitoring
that just got harder also. First-response-wins is more robust in the
case of volatile resources since it implies responsiveness to changing
conditions.

6. You just lost the ability to control performance on the master, with
a userset. Performance is a huge issue with sync rep. If you can't
control it, you'll simply turn it off. Having a feature that we daren't
ever use because it performs poorly helps nobody. This is not a tick-box
in our marketing checklist, I want it to be genuinely real-world usable.

I understand very well that Oracle provides that level of configuration,
though I think it is undesirable in 90% of real world use cases. I also
understand how sexy that level of configuration *sounds*, but I
genuinely believe trying to deliver that would be a mistake for
PostgreSQL. IMHO we should take the same road here as we do in other
things: simplicity encouraged, complexity allowed. So I don't have any
objection to supporting that functionality in the future, but I believe
it is not something we should be encouraging (ever), nor is it something
we need for this release.

I suppose we might regard the feature set I am proposing as being the
same as making synchronous_standbys a USERSET parameter, and allowing
just two options: 
"none" - allowing the user to specify async if they wish it
"*" - allowing people to specify that syncing to *any* standby is
acceptable

We can blend the two approaches together, if we wish, by having two
parameters (plus server naming):
  synchronous_replication = on | off (USERSET)
  synchronous_standbys = '...'
If synchronous_standbys is not set and synchronous_replication = on then
we sync to any standby. If synchronous_replication = off then we use
async replication, whatever synchronous_standbys is set to.
If synchronous_standbys is set, then we use sync rep to all listed
servers.
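To make those combinations concrete, a purely illustrative
postgresql.conf sketch (both parameter names are from the proposal above
and do not exist yet):

  # wait for any standby to confirm, controllable per transaction
  synchronous_replication = on      # USERSET, can be set to off per session
  synchronous_standbys = ''         # empty: any connected standby will do

  # alternative: wait for the named standby only
  # synchronous_standbys = 'london'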

My proposal amounts to "let's add synchronous_standbys as a parameter in
9.2". If you really think that we need that functionality in this
release, let's get the basic stuff added now and then fold in those ideas
on top afterwards. If we do that, I will help. However, my only
insistence is that we explain the above points very clearly in the docs
to specifically dissuade people from using those features for typical
cases.

If you wondered why I ignored your post previously, it's because I
understood that Fujii's post of 15 Oct, one week later, effectively
accepted my approach, albeit with two additional parameters. That is the
UI that I had been following.
http://archives.postgresql.org/pgsql-hackers/2010-10/msg01009.php

--
 Simon Riggs           http://www.2ndQuadrant.com/books/
 PostgreSQL Development, 24x7 Support, Training and Services



Re: Sync Rep Design

From
Simon Riggs
Date:
On Thu, 2010-12-30 at 20:26 -0700, Joshua Tolley wrote:
> > 2) initiate fsync on the primary first
> >    - In this case, the slave is always slightly behind.  If your
> > primary falls over, you don't give commit messages to the clients, but
> > if it recovers, it might have committed data, and slaves will still be
> > able to catch up.
> > 
> > The thing is that currently, even without replication, #2 can happen.
> 
> For what little it's worth, I vote for this option, because it's a
> problem that can already happen (as opposed to adding an entirely new
> type of problem to the mix).

This proposal provides #2, so your wish is met.

--
 Simon Riggs           http://www.2ndQuadrant.com/books/
 PostgreSQL Development, 24x7 Support, Training and Services



Re: Sync Rep Design

From
Aidan Van Dyk
Date:
On Fri, Dec 31, 2010 at 5:26 AM, Simon Riggs <simon@2ndquadrant.com> wrote:

> Your picture above is a common misconception. I will add something to
> the docs to explain this.

> 2. "sync" does not guarantee that the updates to the standbys are in any
> way coordinated. You can run a query on one standby and get one answer
> and at the exact same time run the same query on another standby and get
> a different answer (slightly ahead/behind). That also means that if the
> master crashes one of the servers will be ahead or behind. You can use
> pg_last_xlog_receive_location() to check which one that is.
>
> When people say they want *all* servers to respond, its usually because
> they want (2), but that is literally impossible in a distributed system.

Just to try and be clear again, in the "sync" that Stefan and I are
talking about, we really don't care that the slave could be a "hot
standby" answering queries.  In fact, mine wouldn't be. Mine would
likely be pg_streamrecv or something.   I'm just looking for a
guarantee that I've got a copy of the data safely in the next rack,
and in a separate building, before I tell the client I've moved his money.

I want a synchronous replication of the *data*, and not a system where
I can distribute queries.  I'm looking for disaster mitigation, not
load mitigation.  A replacement for clustered/replicated
devices/filesystems under pg_xlog.

Having the next rack slave be "hot" in terms of applying WAL and ready
to take over instantly would be a bonus, as long as I can guarantee
it's current (i.e. has all data a primary's COMMIT has acknowledged).

So, that's what I want, and that's what your docs suggest is
impossible currently; 1st past post means that I can only ever
reliably configure 1 sync slave and be sure it will have all
acknowledged commits.  I can likely get *close* to that by putting
only my "slowest" slave as the only sync slave, and monitoring the
heck out of my "asynchronous but I want to be synchronous" slave, but
I'd rather trust the PG community to build robust synchronization
than myself to build robust enough monitoring to catch that my slave
is farther behind than the slower synchronous one.


That said, I think the expectation is that if I were building a
query-able "hot standby" cluster in sync rep mode, once I get a commit
confirmation, I should be able to then initiate a new transaction on
any member of that sync rep cluster and see the data I just committed.

Yes, I know I could see *newer* data.  And I know that the primary
could already have newer data. Yes, we have the problem even on a
single pg cluster on a single machine.  But the point is that if
you've committed, any new transactions see *at least* that data or
newer.  But no chance of older.

But personally, I'm not interested in that ;-)
--
Aidan Van Dyk                                             Create like a god,
aidan@highrise.ca                                       command like a king,
http://www.highrise.ca/                                   work like a slave.


Re: Sync Rep Design

From
Robert Haas
Date:
On Fri, Dec 31, 2010 at 6:48 AM, Simon Riggs <simon@2ndquadrant.com> wrote:
> I suppose we might regard the feature set I am proposing as being the
> same as making synchronous_standbys a USERSET parameter, and allowing
> just two options:
> "none" - allowing the user to specify async if they wish it
> "*" - allowing people to specify that syncing to *any* standby is
> acceptable
>
> We can blend the two approaches together, if we wish, by having two
> parameters (plus server naming)
>  synchronous_replication = on | off (USERSET)
>  synchronous_standbys = '...'
> If synchronous_standbys is not set and synchronous_replication = on then
> we sync to any standby. If  synchronous_replication = off then we use
> async replication, whatever synchronous_standbys is set to.
> If synchronous_standbys is set, then we use sync rep to all listed
> servers.
>
> My proposal amounts to "lets add synchronous_standbys as a parameter in
> 9.2".

FWIW, this plan of attack would be OK with me.  I had taken your
previous comments to imply that you were opposed to EVER having a
parameter like this on the master, but I could see deferring it.  In
my ideal world, we'd not have the synchronous_replication_service
parameter on the standbys at all - all standbys would be candidates,
unless someone decides to name them and set synchronous_standbys on
the master.  But maybe having a simple on/off on the standby is
reasonable.

*thinks a little bit*

Someone may have proposed this before, but one way of getting standby
naming "for free" would be to make the standby names the same as the
roles used to log in, rather than adding a separate parameter.  We
could just recommend to people that they use a separate, descriptive
role name for each standby.  Then the synchronous_standbys parameter -
when added - would mean "a standby from one of these roles".
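A rough sketch of that convention (role names and network details are
purely illustrative; granting whatever replication privilege the release
requires is omitted):

  -- one descriptive login role per standby
  CREATE ROLE standby_rack2   LOGIN PASSWORD 'secret';
  CREATE ROLE standby_dr_site LOGIN PASSWORD 'secret';

  # pg_hba.conf on the master: one line per standby role
  host  replication  standby_rack2    192.168.1.0/24  md5
  host  replication  standby_dr_site  10.0.0.0/16     md5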

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: Sync Rep Design

From
Heikki Linnakangas
Date:
On 31.12.2010 13:48, Simon Riggs wrote:
> On Fri, 2010-12-31 at 12:06 +0200, Heikki Linnakangas wrote:
>
>> Regarding the rest of the proposal, I would still prefer the UI
>> discussed here:
>>
>> http://archives.postgresql.org/message-id/4CAE030A.2060701@enterprisedb.com
>>
>> It ought to be the same amount of work to implement, and provides the
>> same feature set, but makes administration a bit easier by being able to
>> name the standbys. Also, I dislike the idea of having the standby
>> specify that it's a synchronous standby that the master has to wait for.
>> Behavior on the master should be configured on the master.
>
> Good point; I've added the people on the copy list from that post. This
> question is they key, so please respond after careful thought on my
> points below.
>
> There are ways to blend together the two approaches, discussed later,
> though first we need to look at the reasons behind my proposals.
>
> I see significant real-world issues with configuring replication using
> multiple named servers, as described in the link above:

All of these points only apply to specifying *multiple* named servers in 
the synchronous_standbys='...' list. That's certainly a more complicated 
scenario, and the configuration is more complicated as a result. With 
your proposal, it's not possible in the first place.

Multiple synchronous standbys probably isn't needed by most people, so 
I'm fine with leaving that out for now, keeping the design the same 
otherwise. I included it in the proposal because it easily falls out of 
the design. So, if you're worried about the complexities of multiple 
synchronous standbys, let's keep the UI exactly the same as what I 
described in the link above, but only allow one name in the 
synchronous_standbys setting, instead of a list.

> 3. Administrative complexity just jumped a huge amount.
>
> (a) If you add or remove servers to the config you need to respecify all
> the parameters, which need to be specific to the exact set of servers.

Hmm, this could be alleviated by allowing the master to have a name too. 
All the configs could then be identical, except for the unique name for 
each server. For example, for a configuration with three servers that 
are all synchronous with each other, each server would have 
"synchronous_standbys='server1, server2, server3'" in the config file. 
The master would simply ignore the entry for itself.
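A purely illustrative sketch of that symmetric setup (the server-naming
parameter is hypothetical; only the server's own name would differ
between the three files):

  # identical postgresql.conf fragment on server1, server2 and server3
  server_name          = 'server1'    # the only per-server difference
  synchronous_standbys = 'server1, server2, server3'
  # whichever server is currently the master ignores its own entry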

> (b) After failover, the list of synchronous_standbys needs to be
> re-specified, yet what is the correct list of servers? The only way to
> make that config work is with complex middleware that automatically
> generates new config files.

It depends on what you want. I think you're envisioning that the 
original server is taken out of the system and not waited for, meaning 
that you accept a lower level of persistence after failover. Yes, then 
you need to change the config. Or more likely you prepare the config 
file in the standby that way to begin with.

> I don't think that is "the same amount of
> work to implement", its an order of magnitude harder overall.

I meant it's the same amount of work to implement the feature in 
PostgreSQL. No doubt that maintaining such a setup in production is more 
complicated.

> 5. Requesting sync from more than one server performs poorly, since you
> must wait for additional servers. If there are sporadic or systemic
> network performance issues you will be badly hit by them. Monitoring
> that just got harder also. First-response-wins is more robust in the
> case of volatile resources since it implies responsiveness to changing
> conditions.
>
> 6. You just lost the ability to control performance on the master, with
> a userset. Performance is a huge issue with sync rep. If you can't
> control it, you'll simply turn it off. Having a feature that we daren't
> ever use because it performs poorly helps nobody. This is not a tick-box
> in our marketing checklist, I want it to be genuinely real-world usable.

You could make synchronous_standbys a user-settable GUC, just like your 
proposed boolean switch. You could then control on a per-transaction 
basis which servers you want to wait to respond. Although perhaps it 
would be more user-friendly to just have an additional boolean GUC, 
similar to synchronous_commit=on/off. Or maybe synchronous_commit is 
enough to control that.

> I suppose we might regard the feature set I am proposing as being the
> same as making synchronous_standbys a USERSET parameter, and allowing
> just two options:
> "none" - allowing the user to specify async if they wish it
> "*" - allowing people to specify that syncing to *any* standby is
> acceptable
>
> We can blend the two approaches together, if we wish, by having two
> parameters (plus server naming)
>    synchronous_replication = on | off (USERSET)
>    synchronous_standbys = '...'
> If synchronous_standbys is not set and synchronous_replication = on then
> we sync to any standby. If  synchronous_replication = off then we use
> async replication, whatever synchronous_standbys is set to.
> If synchronous_standbys is set, then we use sync rep to all listed
> servers.

Sounds good.

I still don't like the synchronous_standbys='' and 
synchronous_replication=on combination, though. IMHO that still amounts 
to letting the standby control the behavior on the master, and it makes it 
impossible to temporarily add an asynchronous standby to the mix. I 
could live with it, you wouldn't be forced to use it that way after all, 
but I would still prefer to throw an error on that combination. Or at 
least document the pitfalls and recommend always naming the standbys.

> My proposal amounts to "lets add synchronous_standbys as a parameter in
> 9.2". If you really think that we need that functionality in this
> release, lets get the basic stuff added now and then fold in those ideas
> on top afterwards. If we do that, I will help. However, my only
> insistence is that we explain the above points very clearly in the docs
> to specifically dissuade people from using those features for typical
> cases.

Huh, wait, if you leave out synchronous_standbys, that's a completely 
different UI again. I think we've finally reached agreement on how this 
should be configured, let's stick to that, please.

(I would be fine with limiting synchronous_standbys to just one server 
in this release though.)

> If you wondered why I ignored your post previously, its because I
> understood that Fujii's post of 15 Oct, one week later, effectively
> accepted my approach, albeit with two additional parameters. That is the
> UI that I had been following.
> http://archives.postgresql.org/pgsql-hackers/2010-10/msg01009.php

That thread makes no mention of how to specify which standbys are 
synchronous and which are not. It's about specifying the timeout and 
whether to wait for a disconnected standby. Yeah, Fujii-san's proposal 
seems reasonable for configuring that.

--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com


Re: Sync Rep Design

From
Heikki Linnakangas
Date:
On 31.12.2010 14:40, Robert Haas wrote:
> Someone may have proposed this before, but one way of getting standby
> naming "for free" would be to make the standby names the same as the
> roles used to log in, rather than adding a separate parameter.  We
> could just recommend to people that they use a separate, descriptive
> role name for each standby.  Then the synchronous_standbys parameter -
> when added - would mean "a standby from one of these roles".

Seems a bit weird. It's not a lot of effort to give each standby a name. 
But if you want something automatic, how about gethostname() ?

--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com


Re: Sync Rep Design

From
Robert Haas
Date:
On Fri, Dec 31, 2010 at 7:57 AM, Heikki Linnakangas
<heikki.linnakangas@enterprisedb.com> wrote:
> On 31.12.2010 14:40, Robert Haas wrote:
>>
>> Someone may have proposed this before, but one way of getting standby
>> naming "for free" would be to make the standby names the same as the
>> roles used to log in, rather than adding a separate parameter.  We
>> could just recommend to people that they use a separate, descriptive
>> role name for each standby.  Then the synchronous_standbys parameter -
>> when added - would mean "a standby from one of these roles".
>
> Seems a bit weird. It's not a lot of effort to give each standby a name. But
> if you want something automatic, how about gethostname() ?

Uh, wow, no way.  That would be making a critical aspect of system
reliability depend on something way, way outside of our configuration.

What's weird about using the role name?  That's our standard way of
distinguishing between two or more users.  Why invent something new?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: Sync Rep Design

From
Stefan Kaltenbrunner
Date:
On 12/31/2010 02:39 PM, Robert Haas wrote:
> On Fri, Dec 31, 2010 at 7:57 AM, Heikki Linnakangas
> <heikki.linnakangas@enterprisedb.com>  wrote:
>> On 31.12.2010 14:40, Robert Haas wrote:
>>>
>>> Someone may have proposed this before, but one way of getting standby
>>> naming "for free" would be to make the standby names the same as the
>>> roles used to log in, rather than adding a separate parameter.  We
>>> could just recommend to people that they use a separate, descriptive
>>> role name for each standby.  Then the synchronous_standbys parameter -
>>> when added - would mean "a standby from one of these roles".
>>
>> Seems a bit weird. It's not a lot of effort to give each standby a name. But
>> if you want something automatic, how about gethostname() ?
>
> Uh, wow, no way.  That would be making a critical aspect of system
> reliability depend on something way, way outside of our configuration.

+1

>
> What's weird about using the role name?  That's our standard way of
> distinguishing between two or more users.  Why invent something new?

Well, a user is not a host/server for me - given there is no real benefit
from using distinct roles for each standby yet, I don't see why we should
complicate the replication setup procedure by requiring 1 role per standby.
So I'm all for giving each standby a name but please make it an explicit 
one and not something that is only vaguely related to the actual standby 
host.


Stefan


Re: Sync Rep Design

From
Robert Haas
Date:
On Fri, Dec 31, 2010 at 8:48 AM, Stefan Kaltenbrunner
<stefan@kaltenbrunner.cc> wrote:
>> What's weird about using the role name?  That's our standard way of
>> distinguishing between two or more users.  Why invent something new?
>
> Well, a user is not a host/server for me - given there is no real benefit from
> using distinct roles for each standby yet I don't see why we should
> complicate the replication setup procedure by requiring 1 role per standby.
> So I'm all for giving each standby a name but please make it an explicit one
> and not something that is only vaguely related to the actual standby host.

OK.  If that's the consensus, can someone post a patch?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: Sync Rep Design

From
Simon Riggs
Date:
On Fri, 2010-12-31 at 07:33 -0500, Aidan Van Dyk wrote:
> On Fri, Dec 31, 2010 at 5:26 AM, Simon Riggs <simon@2ndquadrant.com> wrote:
> 
> > Your picture above is a common misconception. I will add something to
> > the docs to explain this.
> 
> > 2. "sync" does not guarantee that the updates to the standbys are in any
> > way coordinated. You can run a query on one standby and get one answer
> > and at the exact same time run the same query on another standby and get
> > a different answer (slightly ahead/behind). That also means that if the
> > master crashes one of the servers will be ahead or behind. You can use
> > pg_last_xlog_receive_location() to check which one that is.
> >
> > When people say they want *all* servers to respond, its usually because
> > they want (2), but that is literally impossible in a distributed system.
> 
> Just to try and be clear again, in "sync" that Stefan and I are
> talking about, we really don't care that the slave could be a "hot
> standby" answering queries.  In fact, mine wouldn't be. Mine would
> likely be pg_streamrecv or something.   I'm just looking for a
> guarantee that I've got a copy of the data safely in the next rack,
> and a separate building before I tell the client I've moved his money.
> 
> I want a synchronous replication of the *data*, and not a system where
> I can distribute queries.  I'm looking for disaster mitigation, not
> load mitigation.  A replacement for clustered/replicated
> devices/filesystems under pg_xlog.
> 
> Having the next rack slave be "hot" in terms of applying WAL and ready
> to take over instantly would be a bonus, as long as I can guarantee
> it's current (i.e has all data a primary's COMMIT has acknowledged).

> So, that's what I want, and that's what your docs suggest is
> impossible currently; 1st past post means that I can only ever
> reliably configure 1 sync slave and be sure it will have all
> acknowledged commits.  I can likely get *close* to that by putting
> only my "slowest" slave as the only sync slave, and monitoring the
> heck out of my "asynchronous but I want to be synchronous" slave, but
> I'ld rather trust the PG community to build robust synchronization
> than myself to build robust enough monitoring to catch that my slave
> is farther behind than the slower synchronous one.

> That said, I think the expectation is that if I were building a
> query-able "hot standby" cluster in sync rep mode, once I get a commit
> confirmation, I should be able to then initiate a new transaction on
> any member of that sync rep cluster and see the data I just committed.

> Yes, I know I could see *newer* data.  And I know that the primary
> could already have newer data. Yes, we have the problem even on a
> single pg cluster on a single machine.  But the point is that if
> you've committed, any new transactions see *at least* that data or
> newer.  But no chance of older.
> 
> But personally, I'm not interested in that ;-)

I understand your requirements, listed above.

There are good technical reasons why trying to achieve *all* of the
above lets slip the other, unstated requirements of availability,
complexity, performance etc.

Inventing parameter combinations merely hides the fact that these things
aren't all simultaneously achievable. In light of that, I have been
espousing a simple approach to the typical case, and for the first
release. I can see that people may assume my words have various other
reasons behind them, but that's not the case. If I could give it all to
you, I would.

--
 Simon Riggs           http://www.2ndQuadrant.com/books/
 PostgreSQL Development, 24x7 Support, Training and Services



Re: Sync Rep Design

From
Simon Riggs
Date:
On Fri, 2010-12-31 at 14:40 +0200, Heikki Linnakangas wrote:
> On 31.12.2010 13:48, Simon Riggs wrote:
> >
> > I see significant real-world issues with configuring replication using
> > multiple named servers, as described in the link above:
> 
> All of these points only apply to specifying *multiple* named servers in 
> the synchronous_standbys='...' list.

Unfortunately, some of the points apply to using named servers at all,
even if there is only one.

>  That's certainly a more complicated 
> scenario, and the configuration is more complicated as a result. 
> With your proposal, it's not possible in the first place.
> 
> Multiple synchronous standbys probably isn't needed by most people, so 
> I'm fine with leaving that out for now, keeping the design the same 
> otherwise. I included it in the proposal because it easily falls out of 
> the design. So, if you're worried about the complexities of multiple 
> synchronous standbys, let's keep the UI exactly the same as what I 
> described in the link above, but only allow one name in the 
> synchronous_standbys setting, instead of a list.

The best usage recommendation is still to have 2+ standbys, *any* of
which can be used to provide sync rep. That is the best-performing, most
available and easiest-to-configure setup that I know of. That best
usage is not achievable with uniquely named servers; using non-unique
names defeats the point of having names in the first place.

I accept that the "best usage" is a general case and there may be
circumstances that make the difficulties of named servers worth the
trouble.

So replicating to multiple synchronous standbys is definitely needed in
this release. *Confirming* replication to multiple named sync standbys
is the thing we don't need in this release.

> > I suppose we might regard the feature set I am proposing as being the
> > same as making synchronous_standbys a USERSET parameter, and allowing
> > just two options:
> > "none" - allowing the user to specify async if they wish it
> > "*" - allowing people to specify that syncing to *any* standby is
> > acceptable
> >
> > We can blend the two approaches together, if we wish, by having two
> > parameters (plus server naming)
> >    synchronous_replication = on | off (USERSET)
> >    synchronous_standbys = '...'
> > If synchronous_standbys is not set and synchronous_replication = on then
> > we sync to any standby. If  synchronous_replication = off then we use
> > async replication, whatever synchronous_standbys is set to.
> > If synchronous_standbys is set, then we use sync rep to all listed
> > servers.
> 
> Sounds good.

> I still don't like the synchronous_standbys='' and 
> synchronous_replication=on combination, though. 
> IMHO that still amounts 
> to letting the standby control the behavior on master, and it makes it 
> impossible to temporarily add an asynchronous standby to the mix. I 
> could live with it, you wouldn't be forced to use it that way after all, 
> but I would still prefer to throw an error on that combination. Or at 
> least document the pitfalls and recommend always naming the standbys.

We need a parameter set that makes the best practice easiest to
specify, yet keeps more complicated configurations possible. So I'm happy
to add a "synchronous_standbys" parameter, as long as it is possible to
specify "any" (for which I would use "*"), which would be the default.
Initially that would be restricted to just one name.
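
To make that concrete, a minimal sketch of how the primary's postgresql.conf
could look under this proposal (the parameter names and the standby name are
illustrative only; nothing here is committed syntax):

    # on the primary -- proposed parameters, still under discussion
    synchronous_replication = on      # USERSET; off = async for that transaction
    synchronous_standbys = '*'        # default: any connected standby may ack
    #synchronous_standbys = 'london'  # or require the ack from one named standby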

Will pass the server name as an option after IDENTIFY SYSTEM <name>.

Anyway, let's continue the discussion next year.

--
 Simon Riggs           http://www.2ndQuadrant.com/books/
 PostgreSQL Development, 24x7 Support, Training and Services



Re: Sync Rep Design

From
Hannu Krosing
Date:
On 31.12.2010 13:40, Heikki Linnakangas wrote:
>
> Sounds good.
>
> I still don't like the synchronous_standbys='' and 
> synchronous_replication=on combination, though. IMHO that still 
> amounts to letting the standby control the behavior on master, and it 
> makes it impossible to temporarily add an asynchronous standby to the mix.
A sync standby _will_ have the ability to control the master anyway, by 
simply being there or not.

What is currently proposed is like having dual power lines / dual UPSes and 
working happily on when one of them fails.
Requiring both of them to be present defeats the original purpose of 
doubling them.

So following Simon's design of 2 standbys with only one required to ACK a 
commit, you get twice the reliability of a single standby.

In a design where you have 2 standbys and both are required to ACK a 
commit, you get only half the reliability of a single standby.

With a list of 10 standbys and an ACK required from all of them, you get only 
10% of the reliability.

I agree that there can be scenarios where you may need 10 sync copies 
before committing on the master - usually for non-technical reasons like 
some accounting law or whatever - but these are far rarer than the requirement 
to have reasonable performance and 99.999% system uptime when using only 
99.99% reliable hardware. And in such cases where you need multiple 
copies you will need some out-of-database technology (like signed 
timestamps) to make the data non-falsifiable as well, so you can't solve 
this with just configuring sync rep.

> I could live with it, you wouldn't be forced to use it that way after 
> all, but I would still prefer to throw an error on that combination. 
> Or at least document the pitfalls and recommend always naming the 
> standbys.
>
>> My proposal amounts to "lets add synchronous_standbys as a parameter in
>> 9.2". If you really think that we need that functionality in this
>> release, lets get the basic stuff added now and then fold in those ideas
>> on top afterwards. If we do that, I will help. However, my only
>> insistence is that we explain the above points very clearly in the docs
>> to specifically dissuade people from using those features for typical
>> cases.
>
> Huh, wait, if you leave out synchronous_standbys, that's a completely 
> different UI again. I think we've finally reached agreement on how 
> this should be configured, let's stick to that, please.
>
> (I would be fine with limiting synchronous_standbys to just one server 
> in this release though.)
>
>> If you wondered why I ignored your post previously, its because I
>> understood that Fujii's post of 15 Oct, one week later, effectively
>> accepted my approach, albeit with two additional parameters. That is the
>> UI that I had been following.
>> http://archives.postgresql.org/pgsql-hackers/2010-10/msg01009.php
>
> That thread makes no mention of how to specify which standbys are 
> synchronous and which are not.
The simplest way would be to have separate database users for sync and 
async standbys?

That would allow any standby with the right credentials to act as a sync 
user, while those that are not eligible would be rejected even if they 
tried to act as a provider of synchronous replication.

> It's about specifying the timeout and whether to wait for a 
> disconnected standby. Yeah, Fujii-san's proposal seems reasonable for 
> configuring that.
>
--------------------
Hannu Krosing



Re: Sync Rep Design

From
Josh Berkus
Date:
On 12/31/10 4:40 AM, Robert Haas wrote:
> Someone may have proposed this before, but one way of getting standby
> naming "for free" would be to make the standby names the same as the
> roles used to log in, rather than adding a separate parameter.  We
> could just recommend to people that they use a separate, descriptive
> role name for each standby.  Then the synchronous_standbys parameter -
> when added - would mean "a standby from one of these roles".

I like this idea; it has an elegant simplicity about it and right now I
can't think of any real faults.  It would have the added benefit that
each standby "group" would show up by the group name in ps and on
pg_stat_activity.

However, I agree strongly with Simon that we really want the simplest
possible synch rep implementation for 9.1, given that we're currently 15
days away from the *last* commitfest.  Nobody, at this point, has really
even tested any of the sync rep patches for reliability or performance.

Here's how I could see us developing:

Simplest (9.1):
-- synch/async capability set on the standby in recovery.conf
-- synch/async transaction status (async, recv, fsync, apply) declared
as a userset by the executing session (from Simon's patch)
-- only one ack for sync, regardless of the number of standbys

This would allow users to have a single sync standby plus a pool of
async standbys, which is what I think 90% of users who care about sync
standby want.
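
As a rough sketch of what that simplest setup could look like on the standby
side (assuming the capability really is declared in recovery.conf; the
synchronous_replication_service name below is hypothetical, not a parameter
that exists today):

    # recovery.conf on the one standby that offers synchronous service
    standby_mode = 'on'
    primary_conninfo = 'host=primary user=repl'
    synchronous_replication_service = on   # hypothetical sync-capable flag

    # the async pool uses the same file with the flag off or absent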

More Complex (9.2):
-- all of the above, *plus* the ability to have standbys with ROLEs and
require acks from one server in each ROLE.

"Web-Scale" (9.3):
-- all of the above, plus "group commit", the ability to specify a
number of servers in each ROLE who must ack.

... but the most important thing is to make the feature for 9.1 simple,
simple, simple.  We know we won't get it right on the first try.

--Josh Berkus


Re: Sync Rep Design

From
Simon Riggs
Date:
On Fri, 2010-12-31 at 22:18 +0100, Hannu Krosing wrote:
> On 31.12.2010 13:40, Heikki Linnakangas wrote:
> >
> > Sounds good.
> >
> > I still don't like the synchronous_standbys='' and 
> > synchronous_replication=on combination, though. IMHO that still 
> > amounts to letting the standby control the behavior on master, and it 
> > makes it impossible to temporarily add an asynchronous standby to the mix.
> A sync standby _will_have_ the ability to control the master anyway by 
> simply being there or not.
> 
> What is currently proposed is having dual power lines / dual UPS' and 
> working happily on when one of them fails.
> Requiring both of them to be present defeats the original purpose  of 
> doubling them.
> 
> So following Simons design of 2 standbys and only one required to ACK to 
> commit you get 2X reliability of single standby.
...

Yes, working out the math is a good idea. Things are much clearer if we
do that.

Let's assume we have 98% availability on any single server.

1. With one primary and 2 standbys, either of which can acknowledge,
and no lock-up if both standbys fail, we will have 99.9992%
server availability. (So PostgreSQL hits "5 Nines", with data
guarantees). ("Maximised availability")

2. With one primary and 2 standbys, either of which can acknowledge,
and a lock-up if both standbys fail (to protect the data), we will
have 99.996% availability. Slightly less availability, but we don't put
data at risk at any time, since any commit is always covered by at least
2 servers. ("Maximised protection")

3. If we have a primary and a single standby which must acknowledge, and
we choose to lock up if the standby fails, then we will have only 96.04%
availability.

4. If we have a primary and two standbys (named or otherwise), both of
which must acknowledge or we lock up the master, then we have an awesome
94.12% availability.

On the last two, there is also an increased likelihood of administrative
cock-ups because of more specific and complex config requirements.
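
For what it's worth, the arithmetic behind three of those figures, assuming
each server is independently available 98% of the time (how case 2 is counted,
and whether case 1 really deserves to be called availability, is debated
later in this thread):

    Case 1:  1 - 0.02^3         = 0.999992  ->  99.9992%  (up if any server survives)
    Case 3:  0.98 * 0.98        = 0.9604    ->  96.04%    (need master and the one standby)
    Case 4:  0.98 * 0.98 * 0.98 = 0.941192  ->  94.12%    (need master and both standbys)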

--
 Simon Riggs           http://www.2ndQuadrant.com/books/
 PostgreSQL Development, 24x7 Support, Training and Services



Re: Sync Rep Design

From
Jeff Janes
Date:
On 12/31/10, Simon Riggs <simon@2ndquadrant.com> wrote:
> On Fri, 2010-12-31 at 09:27 +0100, Stefan Kaltenbrunner wrote:
>
>> Maybe it has been discussed but I still don't see way it makes any
>> sense. If I declare a standby a sync standby I better want it sync - not
>> "maybe sync". consider the case of a 1 master and two identical sync
>> standbys - one sync standby is in the same datacenter the other is in a
>> backup location say 15km away.
>> Given there is a small constant latency to the second box (even if you
>> have fast networks) the end effect is that the second standby will NEVER
>> be sync (because the local one will always be faster) and you end up
>> with an async slave that cannot be used per your business rules?
>
> Your picture above is a common misconception. I will add something to
> the docs to explain this.
>
> 1. "sync" is a guarantee about how we respond to the client when we
> commit. If we wait for more than one response that slows things down,
> makes the cluster more fragile, complicates the code and doesn't
> appreciably improve the guarantee.

Whether it is more fragile depends on if you look at up-time fragility
or durability fragility.  I think it can appreciably improve the
guarantee.

>
> 2. "sync" does not guarantee that the updates to the standbys are in any
> way coordinated. You can run a query on one standby and get one answer
> and at the exact same time run the same query on another standby and get
> a different answer (slightly ahead/behind). That also means that if the
> master crashes one of the servers will be ahead or behind. You can use
> pg_last_xlog_receive_location() to check which one that is.

If at least one of the standbys is in the same smoking crater as the
primary, then pg_last_xlog_receive_location on it is unlikely to
respond.

The guarantee goes away precisely when it is needed.

Cheers,

Jeff


Re: Sync Rep Design

From
Stefan Kaltenbrunner
Date:
On 12/31/2010 08:15 PM, Simon Riggs wrote:
> On Fri, 2010-12-31 at 14:40 +0200, Heikki Linnakangas wrote:
>> On 31.12.2010 13:48, Simon Riggs wrote:
>>>
>>> I see significant real-world issues with configuring replication using
>>> multiple named servers, as described in the link above:
>>
>> All of these points only apply to specifying *multiple* named servers in
>> the synchronous_standbys='...' list.
>
> Unfortunately, some of the points apply to using named servers ever,
> even if there is only one.
>
>>   That's certainly a more complicated
>> scenario, and the configuration is more complicated as a result.
>> With your proposal, it's not possible in the first place.
>>
>> Multiple synchronous standbys probably isn't needed by most people, so
>> I'm fine with leaving that out for now, keeping the design the same
>> otherwise. I included it in the proposal because it easily falls out of
>> the design. So, if you're worried about the complexities of multiple
>> synchronous standbys, let's keep the UI exactly the same as what I
>> described in the link above, but only allow one name in the
>> synchronous_standbys setting, instead of a list.
>
> The best usage recommendation is still to have 2+ standbys, *any* of
> which can be used to provide sync rep. That is the best performance,
> best availability and easiest to configure that I know of. That best
> usage is not achievable with uniquely named servers; using non-unique
> names defeats the point of having names in the first place.

I disagree with that usage recommendation. If we ask for sync we should 
get sync; your definition is more like "we should have fsync=on, only do 
the fsync sometimes, and still claim it is safe". Also, it is very much 
possible to do that semi-sync style replication feature with named 
servers (see my post about the design I would like to see as a DBA) and 
STILL keep the flexibility to do what other people (like me) in that 
thread want (at least from a UI perspective).
As I said before, I would very much prefer to have us restricted to 
exactly ONE sync-capable standby and x async ones if we cannot agree on 
a reasonable interface :(

>
> I accept that the "best usage" is a general case and there may be
> circumstances that make the difficulties of named servers worth the
> trouble.
>
> So replicating to multiple synchronous standbys is definitely needed in
> this release. *Confirming* replication to multiple named sync standbys
> is the thing we don't need in this release.

well, you keep saying that, but to be honest I cannot really even see a 
use case for me - what is "only a random one of a set of servers is sync 
at any time and I don't really know which one"?
My use cases would all involve 2 sync standbys and 1 or more async ones, 
but the second sync one would be in a different datacenter, and I NEED to 
protect against a datacenter failure, which your proposal says I cannot 
do :(

Stefan


Re: Sync Rep Design

From
Stefan Kaltenbrunner
Date:
On 01/01/2011 02:13 PM, Jeff Janes wrote:
> On 12/31/10, Simon Riggs<simon@2ndquadrant.com>  wrote:
>> On Fri, 2010-12-31 at 09:27 +0100, Stefan Kaltenbrunner wrote:
>>
>>> Maybe it has been discussed but I still don't see way it makes any
>>> sense. If I declare a standby a sync standby I better want it sync - not
>>> "maybe sync". consider the case of a 1 master and two identical sync
>>> standbys - one sync standby is in the same datacenter the other is in a
>>> backup location say 15km away.
>>> Given there is a small constant latency to the second box (even if you
>>> have fast networks) the end effect is that the second standby will NEVER
>>> be sync (because the local one will always be faster) and you end up
>>> with an async slave that cannot be used per your business rules?
>>
>> Your picture above is a common misconception. I will add something to
>> the docs to explain this.
>>
>> 1. "sync" is a guarantee about how we respond to the client when we
>> commit. If we wait for more than one response that slows things down,
>> makes the cluster more fragile, complicates the code and doesn't
>> appreciably improve the guarantee.
>
> Whether it is more fragile depends on if you look at up-time fragility
> or durability fragility.  I think it can appreciably improve the
> guarantee.
>
>>
>> 2. "sync" does not guarantee that the updates to the standbys are in any
>> way coordinated. You can run a query on one standby and get one answer
>> and at the exact same time run the same query on another standby and get
>> a different answer (slightly ahead/behind). That also means that if the
>> master crashes one of the servers will be ahead or behind. You can use
>> pg_last_xlog_receive_location() to check which one that is.
>
> If at least one of the standbys is in the same smoking crater as the
> primary, then pg_last_xlog_receive_location on it is unlikely to
> respond.
>
> The guarantee goes away precisely when it is needed.

that is exactly my point - if you have no guarantee that your SYNC standby 
is actually sync, there is no use in using it for business cases 
that require sync replication.
If we cannot support that use case I would like to see us either 
restricted to only one sync-capable standby, or putting a big CAVEAT 
into the docs saying that sync replication in pg is only a hint, not 
a guarantee, and might or might not be honored in the case of more than 
one standby.



Stefan


Re: Sync Rep Design

From
Robert Haas
Date:
On Sat, Jan 1, 2011 at 9:03 AM, Stefan Kaltenbrunner
<stefan@kaltenbrunner.cc> wrote:
> that is exactly my point - if have no guarantee that your SYNC standby is
> actually sync there is no use for it being used in business cases that
> require sync replication.
> If we cannot support that usecase I would either like to see us restricting
> to only one sync capable standby or by putting a big CAVEAT into the docs
> saying that sync replication in pg only is a hint and not a guarantee that
> might or might not be honored in the case of more than one standby.

I think it's clear that different people want different things.  I
understand Simon's point, but I think the point Stefan and Jeff are
making is equally valid.  I think the solution is:

- Simon gets to implement his version first because he's writing the
code.  If someone else writes the code then they get to pick.

- Whoever wants to make the other thing work can write a patch for that after.

- The docs should not allege that either setup is preferable to the
other, because there is not now and will never be consensus that this
is in fact true.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: Sync Rep Design

From
Stefan Kaltenbrunner
Date:
On 01/01/2011 03:15 PM, Robert Haas wrote:
> On Sat, Jan 1, 2011 at 9:03 AM, Stefan Kaltenbrunner
> <stefan@kaltenbrunner.cc>  wrote:
>> that is exactly my point - if have no guarantee that your SYNC standby is
>> actually sync there is no use for it being used in business cases that
>> require sync replication.
>> If we cannot support that usecase I would either like to see us restricting
>> to only one sync capable standby or by putting a big CAVEAT into the docs
>> saying that sync replication in pg only is a hint and not a guarantee that
>> might or might not be honored in the case of more than one standby.
>
> I think it's clear that different people want to different things.  I
> understand Simon's point, but I think the point Stefan and Jeff are
> making is equally valid.  I think the solution is:
>
> - Simon gets to implement his version first because he's writing the
> code.  If someone else writes the code then they get to pick.

fair point ;)

>
> - Whoever wants to make the other thing work can write a patch for that after.

yeah, but I still would like to get a statement on why Simon thinks that 
the design Heikki and others have proposed - supporting multiple sync 
standbys that are actually sync (and also supporting semi-sync as his patch 
does, which I consider a degraded case of full sync) - is the wrong approach.

if you take the synchronous_standbys=<list> thing as an example, what 
about considering it as:

foo: sync capable standby
bar: sync capable standby
baz: sync capable standby

with

synchronous_standbys=<standbyname>:<sync required (bool)>

synchronous_standbys=foo,bar,baz gets you semi-sync - whatever standby 
replies first causes the master to return as well (as in what Simon's 
patch does)
synchronous_standbys=foo:true,bar:true,baz - require at least foo and bar 
to reply before the master returns

** the syntax chosen is just a random example and could be anything **


that one could as well be used for other per-standby configuration 
(timeouts, wait behaviour etc), or to be not only a 
synchronous_standbys=<list> thing but more a standby_list = <list> thingy 
that also includes async slaves (defaulting to * or whatever, so that 
everything is async with default settings unless anything else is specified)

>
> - The docs should not allege that either setup is preferable to the
> other, because there is not now and will never be consensus that this
> is in fact true.

well I should think we need to clearly spell out everything that affects 
reliability and if we only support semi-sync for more than 1 standby we 
have only that setup :)
Anyway as long as sync rep is disabled by default I'm fine with that.


Stefan


Re: Sync Rep Design

From
Dimitri Fontaine
Date:
Stefan Kaltenbrunner <stefan@kaltenbrunner.cc> writes:
> well you keep saying that but to be honest I cannot really even see a
> usecase for me - what is "only a random one of a set of servers is sync at
> any time and I don't really know which one".

It looks easy enough to get to know which one it is.  Surely the primary
knows and could update something visible through a system view for
users?  This has been asked for before and I was thinking there was a
consensus on this.

Regards,
-- 
Dimitri Fontaine
http://2ndQuadrant.fr     PostgreSQL : Expertise, Formation et Support


Re: Sync Rep Design

From
Simon Riggs
Date:
On Sat, 2011-01-01 at 05:13 -0800, Jeff Janes wrote:
> On 12/31/10, Simon Riggs <simon@2ndquadrant.com> wrote:
> > On Fri, 2010-12-31 at 09:27 +0100, Stefan Kaltenbrunner wrote:
> >
> >> Maybe it has been discussed but I still don't see way it makes any
> >> sense. If I declare a standby a sync standby I better want it sync - not
> >> "maybe sync". consider the case of a 1 master and two identical sync
> >> standbys - one sync standby is in the same datacenter the other is in a
> >> backup location say 15km away.
> >> Given there is a small constant latency to the second box (even if you
> >> have fast networks) the end effect is that the second standby will NEVER
> >> be sync (because the local one will always be faster) and you end up
> >> with an async slave that cannot be used per your business rules?
> >
> > Your picture above is a common misconception. I will add something to
> > the docs to explain this.
> >
> > 1. "sync" is a guarantee about how we respond to the client when we
> > commit. If we wait for more than one response that slows things down,
> > makes the cluster more fragile, complicates the code and doesn't
> > appreciably improve the guarantee.
> 
> Whether it is more fragile depends on if you look at up-time fragility
> or durability fragility.  I think it can appreciably improve the
> guarantee.

Yes, agreed. That is why I proposed quorum commit earlier in 2010, as a
way to improve the durability guarantee. That was bogged down by the
requirement for named servers, which I see as unnecessary.

> > 2. "sync" does not guarantee that the updates to the standbys are in any
> > way coordinated. You can run a query on one standby and get one answer
> > and at the exact same time run the same query on another standby and get
> > a different answer (slightly ahead/behind). That also means that if the
> > master crashes one of the servers will be ahead or behind. You can use
> > pg_last_xlog_receive_location() to check which one that is.
> 
> If at least one of the standbys is in the same smoking crater as the
> primary, then pg_last_xlog_receive_location on it is unlikely to
> respond.
> 
> The guarantee goes away precisely when it is needed.

Fairly obviously, I would not be advocating anything that forced you to
use a server in the "same smoking crater". I can't see any guarantee
that goes away precisely when it is needed.

Perhaps you could explain the issue you see, because your comments seem
unrelated to my point above.

--
 Simon Riggs           http://www.2ndQuadrant.com/books/
 PostgreSQL Development, 24x7 Support, Training and Services



Re: Sync Rep Design

From
Stefan Kaltenbrunner
Date:
On 01/01/2011 05:28 PM, Dimitri Fontaine wrote:
> Stefan Kaltenbrunner<stefan@kaltenbrunner.cc>  writes:
>> well you keep saying that but to be honest I cannot really even see a
>> usecase for me - what is "only a random one of a set of servers is sync at
>> any time and I don't really know which one".
>
> It looks easy enough to get to know which one it is.  Surely the primary
> knows and could update something visible through a system view for
> users?  This as been asked for before and I was thinking there was a
> consensus on this.

well as jeff janes already said - anything that requires the master to 
still exist is not useful for a desaster. Consider the now often 
mentioned 2 sync standby scenario with one standby in the same location 
and one in a secondary location.
If you have a desaster(fire,water,explosion,admin fail,...) at the 
primary location and you have no access to either the master or the 
standby you will never be sure that the standby on the secondary 
location is actually "in sync" - it could be but you will never know if 
you lost that 1B$ invoice just commited on the master and the closeby 
standby and therefor confirmed to the client...
Most of my requirements have very hard requirements on the integrity of 
the data, very high requirements on the read-only availability and 
somewhat high requirements on the availability of a master for writes, 
but data integrity will always trump that.



Stefan


Re: Sync Rep Design

From
Simon Riggs
Date:
On Sat, 2011-01-01 at 16:12 +0100, Stefan Kaltenbrunner wrote:

> I still would like to get a statement on why simon thinks that 
> the design heikki and others have proposed 

I've explained in huge detail why I think what I think, and have not
avoided any technical issue.

It appears to me there has been substantial confusion over alternatives,
because of a misunderstanding about how synchronisation works. Requiring
confirmation that standbys are in sync is *not* the same thing as them
actually being in sync. Every single proposal made by anybody here on
hackers that supports multiple standby servers suffers from the same
issue: when the primary crashes you need to work out which standby
server is ahead.


> > - The docs should not allege that either setup is preferable to the
> > other, because there is not now and will never be consensus that this
> > is in fact true.

I remain hopeful that people will read what I have written and understand
it. Having taken the trouble to do that publicly, my conscience is clear
that I've done the very best to explain things and make it easy for
users to avoid error. If I am prevented from putting sound advice into
the docs, I'll not worry too much.


> well I should think we need to clearly spell out everything that affects 
> reliability and if we only support semi-sync for more than 1 standby we 
> have only that setup :)

You can use sync rep with 1 or more standby servers.

At the end of the day, I can't stop anyone from saying "What an idiot,
he designed something that gave the same durability and availability as
Oracle and MySQL do, yet with additional performance management
features".

--
 Simon Riggs           http://www.2ndQuadrant.com/books/
 PostgreSQL Development, 24x7 Support, Training and Services



Re: Sync Rep Design

From
Simon Riggs
Date:
On Sat, 2011-01-01 at 17:37 +0100, Stefan Kaltenbrunner wrote:
> On 01/01/2011 05:28 PM, Dimitri Fontaine wrote:
> > Stefan Kaltenbrunner<stefan@kaltenbrunner.cc>  writes:
> >> well you keep saying that but to be honest I cannot really even see a
> >> usecase for me - what is "only a random one of a set of servers is sync at
> >> any time and I don't really know which one".
> >
> > It looks easy enough to get to know which one it is.  Surely the primary
> > knows and could update something visible through a system view for
> > users?  This as been asked for before and I was thinking there was a
> > consensus on this.
> 
> well as jeff janes already said - anything that requires the master to 
> still exist is not useful for a desaster.

Nobody has suggested that the master needs to still exist after a
disaster.

>  Consider the now often 
> mentioned 2 sync standby scenario with one standby in the same location 
> and one in a secondary location.
> If you have a desaster(fire,water,explosion,admin fail,...) at the 
> primary location and you have no access to either the master or the 
> standby you will never be sure that the standby on the secondary 
> location is actually "in sync" - it could be but you will never know if 
> you lost that 1B$ invoice just commited on the master and the closeby 
> standby and therefor confirmed to the client...

I've never suggested you configure your systems like that. It would of
course be stupid.

This is not a sensible technical discussion. I'll go back to coding.

--
 Simon Riggs           http://www.2ndQuadrant.com/books/
 PostgreSQL Development, 24x7 Support, Training and Services



Re: Sync Rep Design

From
Stefan Kaltenbrunner
Date:
On 01/01/2011 05:55 PM, Simon Riggs wrote:
> On Sat, 2011-01-01 at 16:12 +0100, Stefan Kaltenbrunner wrote:
>
>> I still would like to get a statement on why simon thinks that
>> the design heikki and others have proposed
>
> I've explained in huge detail why I think what I think, nor avoided any
> technical issue.
>
> It appears to me there has been substantial confusion over alternatives,
> because of a misunderstanding about how synchronisation works. Requiring
> confirmation that standbys are in sync is *not* the same thing as them
> actually being in sync. Every single proposal made by anybody here on
> hackers that supports multiple standby servers suffers from the same
> issue: when the primary crashes you need to work out which standby
> server is ahead.

aaah that was exactly what I was after - so the problem is that when you 
have a sync standby it will technically always be "in front" of the 
master (because it needs to fsync/apply/whatever before the master).
In the end the question boils down to what is "the bigger problem" in 
the case of a lost master:

a) a transaction that was confirmed on the master but might not be on 
any of the surviving sync standbys (or you will never know if it is) - 
this is how I understand the proposal so far
b) a transaction that was not yet confirmed on the master but might have 
been applied on the surviving standby before the disaster - this is what I 
understand "confirm from all sync standbys" could result in.

Spelled out more clearly like that, it makes me reconsider a bit what I 
said before, but I still wonder if ultimately we will have to provide 
both modes to satisfy different business requirements (a might provide 
the more accurate answer on average, but b might at least provide a way 
to identify the "wild" transaction by looking at additional data).


>
>
>>> - The docs should not allege that either setup is preferable to the
>>> other, because there is not now and will never be consensus that this
>>> is in fact true.
>
> I remain hopeful that people will read what I have read and understand
> it. Having taken the trouble to do that publicly, my conscious is clear
> that I've done the very best to explain things and make it easy for
> users to avoid error. If I am prevented from putting sound advice into
> the docs, I'll not worry too much.
>
>
>> well I should think we need to clearly spell out everything that affects
>> reliability and if we only support semi-sync for more than 1 standby we
>> have only that setup :)
>
> You can use sync rep with 1 or more standby servers.
>
> At the end of the day, I can't stop anyone from saying "What an idiot,
> he designed something that gave the same durability and availability as
> Oracle and MySQL do, yet with additional performance management
> features".

ok


Stefan


Re: Sync Rep Design

From
Simon Riggs
Date:
On Sat, 2011-01-01 at 18:13 +0100, Stefan Kaltenbrunner wrote:
> On 01/01/2011 05:55 PM, Simon Riggs wrote:
> >
> > It appears to me there has been substantial confusion over alternatives,
> > because of a misunderstanding about how synchronisation works. Requiring
> > confirmation that standbys are in sync is *not* the same thing as them
> > actually being in sync. Every single proposal made by anybody here on
> > hackers that supports multiple standby servers suffers from the same
> > issue: when the primary crashes you need to work out which standby
> > server is ahead.
> 
> aaah that was exactly what I was after - so the problem is that when you 
> have a sync standby it will technically always be "in front" of the 
> master (because it needs to fsync/apply/whatever before the master).
> In the end the question boils down to what is "the bigger problem" in 
> the case of a lost master:

> a) a transaction that was confirmed on the master but might not be on 
> any of the surviving sync standbys (or you will never know if it is) - 
> this is how I understand the proposal so far

No that cannot happen, the current situation is that we will fsync WAL
on the master, then fsync WAL on the standby, then reply to the master.
The standby is never ahead of the master, at any point.

> b) a transaction that was not yet confirmed on the master but might have 
> been applied on the surving standby before the desaster - this is what I 
> understand "confirm from all sync standbys" could result in.

Yes, that is described in the docs changes I published.

(a) was discussed, but ruled out, since it would require any crash/immediate
shutdown of the master to become a failover, or have some kind of weird
back channel to give the missing data back.

There hasn't been any difference of opinion in this area, that I am
aware of. All proposals have offered (b).

--
 Simon Riggs           http://www.2ndQuadrant.com/books/
 PostgreSQL Development, 24x7 Support, Training and Services



Re: Sync Rep Design

From
Simon Riggs
Date:
On Sat, 2011-01-01 at 17:28 +0100, Dimitri Fontaine wrote:
> something visible through a system view for
> users?  This as been asked for before and I was thinking there was a
> consensus on this. 

Yes, it will be there.

--
 Simon Riggs           http://www.2ndQuadrant.com/books/
 PostgreSQL Development, 24x7 Support, Training and Services



Re: Sync Rep Design

From
Stefan Kaltenbrunner
Date:
On 01/01/2011 06:29 PM, Simon Riggs wrote:
> On Sat, 2011-01-01 at 18:13 +0100, Stefan Kaltenbrunner wrote:
>> On 01/01/2011 05:55 PM, Simon Riggs wrote:
>>>
>>> It appears to me there has been substantial confusion over alternatives,
>>> because of a misunderstanding about how synchronisation works. Requiring
>>> confirmation that standbys are in sync is *not* the same thing as them
>>> actually being in sync. Every single proposal made by anybody here on
>>> hackers that supports multiple standby servers suffers from the same
>>> issue: when the primary crashes you need to work out which standby
>>> server is ahead.
>>
>> aaah that was exactly what I was after - so the problem is that when you
>> have a sync standby it will technically always be "in front" of the
>> master (because it needs to fsync/apply/whatever before the master).
>> In the end the question boils down to what is "the bigger problem" in
>> the case of a lost master:
>
>> a) a transaction that was confirmed on the master but might not be on
>> any of the surviving sync standbys (or you will never know if it is) -
>> this is how I understand the proposal so far
>
> No that cannot happen, the current situation is that we will fsync WAL
> on the master, then fsync WAL on the standby, then reply to the master.
> The standby is never ahead of the master, at any point.

hmm, maybe my "surviving" standbys (the case I'm wondering about is whole 
datacenter failures, which might take out more than just the master) was 
not clear - consider three boxes, one master and two standbys, and 
semi-sync replication (i.e. any one of the standbys is enough to reply).

1. master fsyncs WAL
2. standby #1 fsyncs and replies
3. master confirms the commit
4. disaster strikes and destroys the master and standby #1, while standby #2 
never had time to apply the change (IO/CPU load, latency, whatever)
5. now you have a sync standby that is missing something that was 
committed on the master and confirmed to the client, and no way to verify 
that this happened (same problem with more than two standbys - as 
long as you lose ONE standby and the master at the same time you will 
never be sure)



what is it that I'm missing here?


>
>> b) a transaction that was not yet confirmed on the master but might have
>> been applied on the surving standby before the desaster - this is what I
>> understand "confirm from all sync standbys" could result in.
>
> Yes, that is described in the docs changes I published.
>
> (a) was discussed, but ruled out, since it would require any crash/immed
> shutdown of the master to become a failover, or have some kind of weird
> back channel to give the missing data back.
>
> There hasn't been any difference of opinion in this area, that I am
> aware of. All proposals have offered (b).

hmm, I'm confused now - any chance you mixed up a & b here? Because in a) 
no back channel is needed, since the standby could just fetch the 
missing data from the master.
If that is the case I agree that it would be hard to get the replication 
up again after a crash of the master with a standby that is ahead, but in 
the end it would be a business decision (as in conflict resolution) on 
what to do - take the "ahead" standby's data and use that, or destroy the 
old standby and recreate it.



Stefan


Re: Sync Rep Design

From
Simon Riggs
Date:
On Sat, 2011-01-01 at 18:49 +0100, Stefan Kaltenbrunner wrote:
> hmm maybe my "surviving" standbys(the case I'm wondering about is
> whole 
> datacenter failures which might take out more than just the master)
> was 
> not clear - consider three boxes, one master and two standby and 
> semisync replication(ie any one of the standbys is enough to reply).
> 
> 1. master fsyncs wal
> 2. standby #1 fsyncs and replies
> 3. master confirms commit
> 4. desaster strikes and destroys master and standby #1 while standby
> m2 
> never had time to apply the change(IO/CPU load, latency, whatever)
> 5. now you have a sync standby that is missing something that was 
> commited on the master and confirmed to the client and no way to
> verify 
> that this thing happened (same problem with more than two standbys -
> as 
> long as you lose ONE standby and the master at the same time you will 
> never be sure)

This is obvious misconfiguration that anybody with HA experience would
spot immediately. If you have local standbys then you should mark them
as not available for sync rep, as described in the docs I've written.

> what is it that I'm missing here? 

The fact that we've discussed this already and agreed to do 9.1 with
quorum_commit = 1. I proposed making this a parameter; other solutions
were also proposed, but it was considered too complex for this release.

This is a trade-off between availability and data guarantees.

MySQL and Oracle "suffer" from exactly this "problem". DB2 supports only
one master and SQLServer doesn't have sync rep at all.

--
 Simon Riggs           http://www.2ndQuadrant.com/books/
 PostgreSQL Development, 24x7 Support, Training and Services



Re: Sync Rep Design

From
Heikki Linnakangas
Date:
On 01.01.2011 19:03, Simon Riggs wrote:
> On Sat, 2011-01-01 at 17:37 +0100, Stefan Kaltenbrunner wrote:
>> On 01/01/2011 05:28 PM, Dimitri Fontaine wrote:
>>> Stefan Kaltenbrunner<stefan@kaltenbrunner.cc>   writes:
>>>> well you keep saying that but to be honest I cannot really even see a
>>>> usecase for me - what is "only a random one of a set of servers is sync at
>>>> any time and I don't really know which one".
>>>
>>> It looks easy enough to get to know which one it is.  Surely the primary
>>> knows and could update something visible through a system view for
>>> users?  This as been asked for before and I was thinking there was a
>>> consensus on this.
>>
>> well as jeff janes already said - anything that requires the master to
>> still exist is not useful for a desaster.
>
> Nobody has suggested that the master needs to still exist after a
> disaster.

Dimitri just did, see above. I agree it's not very useful.

I don't think there's any other solution to knowing which standby is 
ahead than to connect to both standbys and ask how far along each is. I don't see 
a problem with that, whatever middleware handles the failover and 
STONITH etc. should be able to do that too.
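
In practice that just means whatever does the failover runs something like
this on each surviving standby and promotes the one that reports the greater
position (sketch only):

    -- run on each candidate standby; promote the one furthest ahead
    SELECT pg_last_xlog_receive_location();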

--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com


Re: Sync Rep Design

From
Heikki Linnakangas
Date:
On 31.12.2010 23:18, Hannu Krosing wrote:
> On 31.12.2010 13:40, Heikki Linnakangas wrote:
>> That thread makes no mention of how to specify which standbys are
>> synchronous and which are not.
> The simplest way would be to have separate database users for sync and
> async standbys ?
>
> That would allow any standby with right credentials act as a sync user,
> and those who are not eligible are not accepted even if they try to act
> as "a synchronity (?) provider".

Hmm, access control... We haven't yet discussed what privileges a 
standby needs to become synchronous. Perhaps it needs to be a separate 
privilege that can be granted, in addition to the replication privilege?

Robert's suggestion of using the roles instead of server names would 
also solve that. With that you would list the roles in 
synchronous_standbys, and no-one else could become a synchronous 
standby. The downside is that if you want to have two standbys in the 
mode that it's enough that either one acknowledges a commit, they would 
have to use the same user account.

If we don't adopt Robert's suggestion, do we want to restrict what 
standby name a user can claim, to stop one standby from spoofing another?

--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com


Re: Sync Rep Design

From
Josh Berkus
Date:
On 1/1/11 5:59 AM, Stefan Kaltenbrunner wrote:
> well you keep saying that but to be honest I cannot really even see a
> usecase for me - what is "only a random one of a set of servers is sync
> at any time and I don't really know which one".
> My usecases would al involved 2 sync standbys and 1 or more async ones.
> but the second sync one would be in a different datacenter and I NEED to
> protect against a datacenter failure which your proposals says I cannot
> do :(

As far as I know, *nobody* has written the bookkeeping code to actually
track which standbys have ack'd.  We need to get single-ack synch
standby merged, tested and working before we add anything as complicated
as "each standby on this list must ack".  That means that it's extremely
unlikely for 9.1 at this point.

Frankly, if Simon hadn't already submitted code, I'd be pushing for
single-standby-only for 9.1, instead of "any one".

Standby in general deals with the A,D,R triangle (Availability,
Durability, Response time).  "Any one" configuration is the A,R
configuration, and the only reason to go out with it for 9.1 is because
it's simpler to implement than the D,R configuration (all standbys must
ack).

> Hmm, access control... We haven't yet discussed what privileges a
> standby needs to become synchronous. Perhaps it needs to be a separate
> privilege that can be granted, in addition to the replication privilege?

No, I don't think so.  An additional priv would just complicate life for
DBAs without providing any real benefit.  You'd be guarding against the
very narrow hypothetical case where there's a server admin with limited
privs on the master, and authorization to create async standbys, but
not the authorization to create a synch standby.  How likely is that to
*ever* happen?

> Robert's suggestion of using the roles instead of server names would
> also solve that. With that you would list the roles in
> synchronous_standbys, and no-one else could become a synchronous
> standby. The downside is that if you want to have two standbys in the
> mode that it's enough that either one acknowledges a commit, they would
> have to use the same user account.

I really don't think that Robert was suggesting that we have
predetermined Roles with "magic names" like synchronous_standbys (were
you, Robert?).  That would defeat eventually having the feature which
people like Stefan want: the ability to define pools of servers with
custom names to represent various data centers.  It also overloads and
"perverts" the concept of Roles.

While I quite like the idea of having Roles for replication, synch/async
should be a property (CREATE ROLE seattle_dac NOLOGIN REPLICATION
SYNCHRONOUS*), not a special role.  Also, I'll repeat: I see this as 9.2
work, not 9.1 work.  There's going to be far too much bookkeeping code
to write to make it happen without severely delaying 9.1.

BTW, I don't see *any* of this requiring us to have a
synchronous_standbys = "list" GUC if we embrace the Roles concept.

--Josh

( here's how I can see Roles working:

1) Create a group role for the synch standbys (NOLOGIN REPLICATION)
2) Create one or more replication roles which are members of that group
role.
3) File-synch those standbys and get them replicating (asynchronously)
4) ALTER ROLE synch_standbys SYNCHRONOUS
5) The postmaster scans the list of Roles for synchronous roles.  For
each synchronous role, there must be one replication ack in order to commit.

Since the above means that it would be possible to have a replication
connection which was a member of more than one synch group, you can see
that the bookkeeping involved will be substantial.  So, 9.2.
)
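
A minimal SQL sketch of steps 1, 2 and 4 above (the standby role names are
made up, and the SYNCHRONOUS role attribute is hypothetical syntax, per the
asterisk earlier):

    -- step 1: group role for the sync standbys
    CREATE ROLE synch_standbys NOLOGIN REPLICATION;

    -- step 2: one login role per standby, as a member of the group
    CREATE ROLE standby_rack2 LOGIN REPLICATION PASSWORD 'secret'
        IN ROLE synch_standbys;
    CREATE ROLE standby_dc2 LOGIN REPLICATION PASSWORD 'secret'
        IN ROLE synch_standbys;

    -- step 4: hypothetical -- no such role attribute exists today
    ALTER ROLE synch_standbys SYNCHRONOUS;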






Re: Sync Rep Design

From
Simon Riggs
Date:
On Sat, 2011-01-01 at 21:41 +0200, Heikki Linnakangas wrote:
> On 31.12.2010 23:18, Hannu Krosing wrote:
> > On 31.12.2010 13:40, Heikki Linnakangas wrote:
> >> That thread makes no mention of how to specify which standbys are
> >> synchronous and which are not.
> > The simplest way would be to have separate database users for sync and
> > async standbys ?
> >
> > That would allow any standby with right credentials act as a sync user,
> > and those who are not eligible are not accepted even if they try to act
> > as "a synchronity (?) provider".
> 
> Hmm, access control... We haven't yet discussed what privileges a 
> standby needs to become synchronous. Perhaps it needs to be a separate 
> privilege that can be granted, in addition to the replication privilege?

Perhaps we don't need it, also. Why is that essential in this release?

--
 Simon Riggs           http://www.2ndQuadrant.com/books/
 PostgreSQL Development, 24x7 Support, Training and Services



Re: Sync Rep Design

From
Simon Riggs
Date:
On Sat, 2011-01-01 at 14:40 -0800, Josh Berkus wrote:

> Standby in general deals with the A,D,R triangle (Availability,
> Durability, Response time).  "Any one" configuration is the A,R
> configuration, and the only reason to go out with it for 9.1 is
> because it's simpler to implement than the D,R configuration (all
> standbys must ack).

Nicely put. Not the "only reason" though...

As I showed earlier, the AR gives you 99.999% availability and the DR
gives you 94% availability, considering a 3 server config. If you add
more servers, the availability of the DR option gets much worse, very
quickly.

The performance of AR is much better also, and stays the same or better as
cluster size increases. The DR choice makes performance degrade as cluster
size increases, since it works at the speed of the slowest node.

The fact that I can get "5 Nines" with simpler code makes it even
sweeter.

quorum commit > 1 can improve the durability guarantee of data, but it
also seems that many people wouldn't understand its implications and
would mis-configure it, to our collective embarrassment.
--
 Simon Riggs           http://www.2ndQuadrant.com/books/
 PostgreSQL Development, 24x7 Support, Training and Services



Re: Sync Rep Design

From
Aidan Van Dyk
Date:
On Sat, Jan 1, 2011 at 6:08 PM, Simon Riggs <simon@2ndquadrant.com> wrote:
> On Sat, 2011-01-01 at 14:40 -0800, Josh Berkus wrote:
>
>> Standby in general deals with the A,D,R triangle (Availability,
>> Durability, Response time).  "Any one" configuration is the A,R
>> configuration, and the only reason to go out with it for 9.1 is
>> because it's simpler to implement than the D,R configuration (all
>> standbys must ack).
>
> Nicely put. Not the "only reason" though...
>
> As I showed earlier, the AR gives you 99.999% availability and the DR
> gives you 94% availability, considering a 3 server config. If you add
> more servers, the availability of the DR option gets much worse, very
> quickly.
>
> The performance of AR is much better also, and stays same or better as
> cluster size increases. DR choice makes performance degrade as cluster
> size increases, since it works at the speed of the slowest node.

I'm all for getting first-past-post in for 9.1.  Otherwise I fear
we'll get nothing.

Stephen and I will only be able to use 1 sync slave, the "DR-site"
one.  That's fine.  I can live with it, and make my local slave be
async.  Or replicate the FS/block under WAL.  I can monitor the ****
out of it, and unless it goes "down", it should easily be able to keep
up with the remote sync one behind a slower WAN link.

And I think both Stephen and I understand your availability math.
We're not disputing that 1st past post gives both better query
availability and cluster-scale performance.

But when the primary datacenter servers are dust in the crater (or
boats in the flood, or ash in the fire), I either keep my job, or I
don't.  And that depends on whether there is a chance I (my database
system) confirmed a transaction that I can't recover.

So sync rep with 1st past post already makes my job easier.  I'll take
it over nothing ;-)

a.


--
Aidan Van Dyk                                             Create like a god,
aidan@highrise.ca                                       command like a king,
http://www.highrise.ca/                                   work like a slave.


Re: Sync Rep Design

From
Robert Haas
Date:
On Sat, Jan 1, 2011 at 6:54 AM, Simon Riggs <simon@2ndquadrant.com> wrote:
> Yes, working out the math is a good idea. Things are much clearer if we
> do that.
>
> Let's assume we have 98% availability on any single server.
>
> 1. Having one primary and 2 standbys, either of which can acknowledge,
> and we never lock up if both standbys fail, then we will have 99.9992%
> server availability. (So PostgreSQL hits "5 Nines", with data
> guarantees). ("Maximised availability")

I don't agree with this math.  If the master and one standby fail
simultaneously, the other standby is useless, because it may or may
not be caught up with the master.  You know that the last transaction
acknowledged as committed by the master is on at least one of the two
standbys, but you don't know which one, and so you can't safely
promote the surviving standby.

(If you are working in an environment where promoting the surviving
standby when it's possibly not caught up is OK, then you don't need
sync rep in the first place: you can just run async rep and get much
better performance.)

So the availability is 98% (you are up when the master is up) + 98%^2
* 2% (you are up when both slaves are up and the master is down) =
99.92%.  If you had only a single standby, then you could be certain
that any commit acknowledged by the master was on that standby.  Thus
your availability would be 98% (up when master is up) + 98% * 2% (you
are up when the master is down and the slave is up) = 99.96%.
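
Spelling that out (98% per server, failures independent, and "up" meaning you
can keep committing without risking an acknowledged transaction):

    two standbys, either may ack:  0.98 + 0.02 * 0.98^2 = 0.98 + 0.019208 = 0.999208  (99.92%)
    one standby, which must ack:   0.98 + 0.02 * 0.98   = 0.98 + 0.0196   = 0.9996    (99.96%)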

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: Sync Rep Design

From
Heikki Linnakangas
Date:
On 02.01.2011 00:40, Josh Berkus wrote:
> On 1/1/11 5:59 AM, Stefan Kaltenbrunner wrote:
>> well you keep saying that but to be honest I cannot really even see a
>> usecase for me - what is "only a random one of a set of servers is sync
>> at any time and I don't really know which one".
>> My usecases would al involved 2 sync standbys and 1 or more async ones.
>> but the second sync one would be in a different datacenter and I NEED to
>> protect against a datacenter failure which your proposals says I cannot
>> do :(
>
> As far as I know, *nobody* has written the bookkeeping code to actually
> track which standbys have ack'd.  We need to get single-ack synch
> standby merged, tested and working before we add anything as complicated
> as "each standby on this list must ack".  That means that it's extremely
> unlikely for 9.1 at this point.

The bookkeeping will presumably consist of an XLogRecPtr in shared 
memory for each standby, tracking how far the standby has acknowledged. 
At commit, you scan the standby slots in shared memory and check that 
the required standbys have acknowledged your commit record. The 
bookkeeping required is the same whether or not we support a list of 
standbys that must ack or just one.
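
A highly simplified sketch of that bookkeeping -- not actual PostgreSQL code;
the names are made up and a plain 64-bit integer stands in for a WAL position:

    #include <stdbool.h>
    #include <stdint.h>

    typedef uint64_t WalPosition;       /* stand-in for a WAL location (LSN) */

    /* One slot per connected standby, kept in shared memory by its walsender. */
    typedef struct StandbyAckSlot
    {
        bool        in_use;
        bool        sync_capable;       /* may this standby satisfy sync rep? */
        WalPosition acked;              /* last WAL position it has acknowledged */
    } StandbyAckSlot;

    /*
     * Checked while a committing backend waits: has some sync-capable standby
     * acknowledged WAL at least up to our commit record?  Requiring a list of
     * named standbys would just mean checking each required slot instead.
     */
    static bool
    commit_is_acknowledged(const StandbyAckSlot *slots, int nslots,
                           WalPosition commit_pos)
    {
        for (int i = 0; i < nslots; i++)
        {
            if (slots[i].in_use && slots[i].sync_capable &&
                slots[i].acked >= commit_pos)
                return true;
        }
        return false;
    }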

> Frankly, if Simon hadn't already submitted code, I'd be pushing for
> single-standby-only for 9.1, instead of "any one".

Yes, we are awfully late, but let's not panic.

BTW, there's a bunch of replication related stuff that we should work to 
close, that are IMHO more important than synchronous replication. Like 
making the standby follow timeline changes, to make failovers smoother, 
and the facility to stream a base-backup over the wire. I wish someone 
worked on those...

>> Hmm, access control... We haven't yet discussed what privileges a
>> standby needs to become synchronous. Perhaps it needs to be a separate
>> privilege that can be granted, in addition to the replication privilege?
>
> No, I don't think so.  An additional priv would just complicate life for
> DBAs without providing any real benefit.  You'd be guarding against the
> very narrow hypothetical case where there's a server admin with limited
> privs on the master, and authorization to create async standbies, but
> not the authorization to create s synch standby.  How likely is that to
> *ever* happen?

Very likely. A synchronous standby can bring the master to a halt, while 
an asynchronous one is rather harmless. If I were a DBA, and the data 
wasn't very sensitive, I would liberally hand out async privileges to my 
colleagues to set up reporting standbys, test servers etc. But I would 
*not* give them synchronous privileges, because sooner or later one 
would go "hmm, I wonder what happens if I make this synchronous", or 
haphazardly copy the config file from a synchronous standby. That would 
either bring down the master, or act as a "fake" standby, acknowledging 
commits before they're flushed to the real synchronous standby. Either 
one would be bad.

--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com


Re: Sync Rep Design

From
Stefan Kaltenbrunner
Date:
On 01/02/2011 09:35 AM, Heikki Linnakangas wrote:
> On 02.01.2011 00:40, Josh Berkus wrote:
>> On 1/1/11 5:59 AM, Stefan Kaltenbrunner wrote:
>>> well you keep saying that but to be honest I cannot really even see a
>>> use case for me - what is "only a random one of a set of servers is sync
>>> at any time and I don't really know which one".
>>> My use cases would all involve 2 sync standbys and 1 or more async ones,
>>> but the second sync one would be in a different datacenter and I NEED to
>>> protect against a datacenter failure, which your proposal says I cannot
>>> do :(
>>
>> As far as I know, *nobody* has written the bookkeeping code to actually
>> track which standbys have ack'd. We need to get single-ack synch
>> standby merged, tested and working before we add anything as complicated
>> as "each standby on this list must ack". That means that it's extremely
>> unlikely for 9.1 at this point.
>
> The bookkeeping will presumably consist of an XLogRecPtr in shared
> memory for each standby, tracking how far the standby has acknowledged.
> At commit, you scan the standby slots in shared memory and check that
> the required standbys have acknowledged your commit record. The
> bookkeeping required is the same whether or not we support a list of
> standbys that must ack or just one.
>
>> Frankly, if Simon hadn't already submitted code, I'd be pushing for
>> single-standby-only for 9.1, instead of "any one".
>
> Yes, we are awfully late, but let's not panic.
>
> BTW, there's a bunch of replication related stuff that we should work to
> close, that are IMHO more important than synchronous replication. Like
> making the standby follow timeline changes, to make failovers smoother,
> and the facility to stream a base-backup over the wire. I wish someone
> worked on those...

yeah I agree that those two are much more of a problem for the general
user base. Whatever people think about our current system - it is very
easy to configure (in terms of knobs to toggle) but extremely hard to
set up and deal with during failovers (and I know nobody who got it
right the first few times or has not fucked up one thing in the process).
Syncrep is important but I would argue that getting those two fixed
is even more so ;)



Stefan


Re: Sync Rep Design

From
Simon Riggs
Date:
On Sun, 2011-01-02 at 10:35 +0200, Heikki Linnakangas wrote:

> BTW, there's a bunch of replication related stuff that we should work
> to close, that are IMHO more important than synchronous replication.
> Like making the standby follow timeline changes, to make failovers
> smoother, and the facility to stream a base-backup over the wire. I
> wish someone worked on those... 

Hopefully, you'll be allowed to work on those, if they are more
important?

--
 Simon Riggs           http://www.2ndQuadrant.com/books/
 PostgreSQL Development, 24x7 Support, Training and Services



Re: Sync Rep Design

From
Simon Riggs
Date:
On Sat, 2011-01-01 at 22:11 -0500, Aidan Van Dyk wrote:
> On Sat, Jan 1, 2011 at 6:08 PM, Simon Riggs <simon@2ndquadrant.com> wrote:
> > On Sat, 2011-01-01 at 14:40 -0800, Josh Berkus wrote:
> >
> >> Standby in general deals with the A,D,R triangle (Availability,
> >> Durability, Response time).  "Any one" configuration is the A,R
> >> configuration, and the only reason to go out with it for 9.1 is
> >> because it's simpler to implement than the D,R configuration (all
> >> standbys must ack).
> >
> > Nicely put. Not the "only reason" though...
> >
> > As I showed earlier, the AR gives you 99.999% availability and the DR
> > gives you 94% availability, considering a 3 server config. If you add
> > more servers, the availability of the DR option gets much worse, very
> > quickly.
> >
> > The performance of AR is much better also, and stays same or better as
> > cluster size increases. DR choice makes performance degrade as cluster
> > size increases, since it works at the speed of the slowest node.
> 
> I'm all for getting first-past-post in for 9.1.  Otherwise I fear
> we'll get nothing.
> 
> Stephen and I will only be able to use 1 sync slave, the "DR-site"
> one.  

No, the AR and DR options are identical with just one sync standby.

You've been requesting the DR option with 2 standbys, which is what
gives you 94% availability.

> That's fine.  I can live with it, and make my local slave be
> async.  Or replicate the FS/block under WAL.  I can monitor the ****
> out of it, and unless it goes "down", it should easily be able to keep
> up with the remote sync one behind a slower WAN link.
> 
> And I think both Stephen and I understand your availability math.
> We're not arguing that the 1st past post gives both better query
> availability and cluster-scale performance.
> 
> But when the primary datacenter servers are dust in the crater (or
> boats in the flood, or ash in the fire), I either keep my job, or I
> don't.  And that depends on whether there is a chance I (my database
> system) confirmed a transaction that I can't recover.

I'm not impressed. You neglect to mention that Oracle and MySQL would
put you in exactly the same position.

You also neglect to say that if the local standby goes down, you were
advocating a design that would take the whole application down. If you
actually did what you have been suggesting, and the cluster went down as
it inevitably would do, once your colleagues realise that you knowingly
configured the cluster to have only 94% availability, you won't have a
job anymore, you'll be escorted off the premises while shouting "but
while it was down, it lost no data". When that never happens, thank me.

There are people that need more durability than availability, but not
many. If the database handles high value transactions, they very
probably want it to keep on processing high value transactions.

You'll have the choice of how to configure it, because of me listening
to other people's views and selecting only the ideas that make sense.

--
 Simon Riggs           http://www.2ndQuadrant.com/books/
 PostgreSQL Development, 24x7 Support, Training and Services



Re: Sync Rep Design

From
Simon Riggs
Date:
On Sun, 2011-01-02 at 10:35 +0200, Heikki Linnakangas wrote:
> 
> > Frankly, if Simon hadn't already submitted code, I'd be pushing for
> > single-standby-only for 9.1, instead of "any one".
> 
> Yes, we are awfully late, but let's not panic.

Yes, we're about a year late. Getting a simple feature like this into
the code could have been done in 9.0.

We must stop returning to overcomplex features, especially if they
aren't backed up with solid analysis of things like server availability
and query visibility. 

--
 Simon Riggs           http://www.2ndQuadrant.com/books/
 PostgreSQL Development, 24x7 Support, Training and Services



Re: Sync Rep Design

From
Hannu Krosing
Date:
On 2.1.2011 5:36, Robert Haas wrote:
> On Sat, Jan 1, 2011 at 6:54 AM, Simon Riggs<simon@2ndquadrant.com>  wrote:
>> Yes, working out the math is a good idea. Things are much clearer if we
>> do that.
>>
>> Let's assume we have 98% availability on any single server.
>>
>> 1. Having one primary and 2 standbys, either of which can acknowledge,
>> and we never lock up if both standbys fail, then we will have 99.9992%
>> server availability. (So PostgreSQL hits "5 Nines", with data
>> guarantees). ("Maximised availability")
> I don't agree with this math.  If the master and one standby fail
> simultaneously, the other standby is useless, because it may or may
> not be caught up with the master.  You know that the last transaction
> acknowledged as committed by the master is on at least one of the two
> standbys, but you don't know which one, and so you can't safely
> promote the surviving standby.
> (If you are working in an environment where promoting the surviving
> standby when it's possibly not caught up is OK, then you don't need
> sync rep in the first place: you can just run async rep and get much
> better performance.)
> So the availability is 98% (you are up when the master is up) + 98%^2
> * 2% (you are up when both slaves are up and the master is down) =
> 99.92%.  If you had only a single standby, then you could be certain
> that any commit acknowledged by the master was on that standby.  Thus
> your availability would be 98% (up when master is up) + 98% * 2% (you
> are up when the master is down and the slave is up) = 99.96%.
>
OTOH, in the case where you need _all_ the slaves to confirm, any failing
slave brings the master down, so adding a slave reduces availability by an
extra 2%.

The solution to achieving good durability AND availability is requiring
N past the post instead of 1 past the post.

In this case you can get to 99.9992% availability with master + 3 sync
slaves, 2 of which have ACK.

---------------------------------------
Hannu Krosing
Performance and Infinite Scalability Consultant
http://www.2ndQuadrant.com/books/





Base Backup Streaming (was: Sync Rep Design)

From
Dimitri Fontaine
Date:
Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes:
> BTW, there's a bunch of replication related stuff that we should work to
> close, that are IMHO more important than synchronous replication. Like
> making the standby follow timeline changes, to make failovers smoother, and
> the facility to stream a base-backup over the wire. I wish someone worked on
> those...

So, we've been talking about base backup streaming at conferences and we
have a working prototype.  We even have a needed piece of it in core
now, that's the pg_read_binary_file() function.  What we still miss is
an overall design and some integration effort.  Let's design first.

I propose the following new pg_ctl command to initiate the cloning:
pg_ctl clone [-D datadir] [-s on|off] [-t filename]  "primary_conninfo"

As far as users are concerned, that would be the only novelty.  Once that
command is finished (successfully) they would edit postgresql.conf and
start the service as usual.  A basic recovery.conf file is created with
the given options: standby_mode is driven by -s and defaults to off, and
trigger_file defaults to being omitted and is given by -t.  Of course
the primary_conninfo given on the command line is what ends up in the
recovery.conf file.
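
To make that concrete, a hypothetical invocation under this proposal
(host, user and paths are invented for illustration) might be:

    pg_ctl clone -D /srv/pgsql/standby1 -s on -t /srv/pgsql/standby1.trigger \
        "host=master.example.com port=5432 user=replicator"

which, per the defaults described above, would leave behind a recovery.conf
roughly like:

    standby_mode = 'on'
    primary_conninfo = 'host=master.example.com port=5432 user=replicator'
    trigger_file = '/srv/pgsql/standby1.trigger'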

That alone would allow for making base backups for recovery purposes and
for preparing standbys.

To support this new tool, the simplest approach would be to just copy what
I've been doing in the prototype, that is, run a query to get the primary's
file listing (per tablespace, not done in the prototype) and then fetch their
bytea content over the wire.  That means there's no further backend
support code to write.
 https://github.com/dimitri/pg_basebackup

We could prefer to have a backend function prepare a tar archive and
stream it using the COPY protocol, with some compression support, but
that's more complex to code now and to parallelize down the road.

Regards,
-- 
Dimitri Fontaine
http://2ndQuadrant.fr     PostgreSQL : Expertise, Formation et Support


Re: Sync Rep Design

From
Simon Riggs
Date:
On Sat, 2011-01-01 at 23:36 -0500, Robert Haas wrote:
> On Sat, Jan 1, 2011 at 6:54 AM, Simon Riggs <simon@2ndquadrant.com> wrote:
> > Yes, working out the math is a good idea. Things are much clearer if we
> > do that.
> >
> > Let's assume we have 98% availability on any single server.
> >
> > 1. Having one primary and 2 standbys, either of which can acknowledge,
> > and we never lock up if both standbys fail, then we will have 99.9992%
> > server availability. (So PostgreSQL hits "5 Nines", with data
> > guarantees). ("Maximised availability")
> 
> I don't agree with this math. ...(snip by Simon)... 99.96%.

OK, so that is at least 99.96%. Cool.

The key point here is not (1), but option (4).

The approach advocated by Heikki and yourself gives us 94% availability.
IMHO that is ridiculous, and I will not accept that as the *only* way
forward, for that reason, no matter who advocates it or how long they
keep arguing. I do accept that some wish to have that as an option.

If we are to have a sensible technical debate with an eventual end, you
must answer the points placed in front of you, not just sidestep and try
to point out problems somewhere else. All analysis must be applied to
all options, not just those options advocated by someone else. I've been
asking for a failure mode analysis for months and it never comes in
full.

I'm more than happy to discuss your additional points once we are clear
on the 94% because it is pivotal to everything I've been proposing.

--
 Simon Riggs           http://www.2ndQuadrant.com/books/
 PostgreSQL Development, 24x7 Support, Training and Services



Re: Sync Rep Design

From
"Kevin Grittner"
Date:
Simon Riggs  wrote:
> On Sat, 2011-01-01 at 23:36 -0500, Robert Haas wrote:
>> On Sat, Jan 1, 2011 at 6:54 AM, Simon Riggs wrote:
>>> Yes, working out the math is a good idea. Things are much clearer
>>> if we do that.
>>>
>>> Let's assume we have 98% availability on any single server.
>>>
>>> 1. Having one primary and 2 standbys, either of which can
>>> acknowledge, and we never lock up if both standbys fail, then we
>>> will have 99.9992% server availability. (So PostgreSQL hits "5
>>> Nines", with data guarantees). ("Maximised availability")
>>
>> I don't agree with this math. ...(snip by Simon)... 99.96%.
> 
> OK, so that is at least 99.96%. Cool.
I think you're talking about different metrics, and you're both
right.  With two servers configured in sync rep your chance of having
an available (running) server is 99.9992%.  The chance that you know
that you have one that is totally up to date, with no lost
transactions is 99.9208%.  The chance that you *actually* have
up-to-date data would be higher, but you'd have no way to be sure. 
The 99.96% number is your certainty that you have a running server
with up-to-date data if only one machine is sync rep.
It's a matter of whether your shop needs five nines of availability
or the highest probability of not losing data.  You get to choose.
-Kevin


Re: Sync Rep Design

From
Simon Riggs
Date:
On Sun, 2011-01-02 at 08:08 -0600, Kevin Grittner wrote:

> I think you're talking about different metrics, and you're both
> right.  With two servers configured in sync rep your chance of having
> an available (running) server is 99.9992%.  The chance that you know
> that you have one that is totally up to date, with no lost
> transactions is 99.9208%.  The chance that you *actually* have
> up-to-date data would be higher, but you'd have no way to be sure. 
> The 99.96% number is your certainty that you have a running server
> with up-to-date data if only one machine is sync rep.
>  
> It's a matter of whether your shop needs five nines of availability
> or the highest probability of not losing data.  You get to choose.

Thanks for those calculations.

Do you agree that requiring response from 2 sync standbys, or locking
up, gives us 94% server availability, but 99.9992% data durability? And
that adding additional async servers would not increase the server
availability of that cluster?

Now let's look at what happens when we first start a standby: we do the
base backup, configure the standby, it connects and then <wham> we
cannot process any new transactions until the standby has caught up,
which could well be hours on a big database. So if we don't have a
processing mode that allows work to continue, how will we ever enable
synchronous replication on a 24/7 database? How will we ever allow
standbys to catch up if they drop out for a while?

We should factor that into the availability calcs as well.

--
 Simon Riggs           http://www.2ndQuadrant.com/books/
 PostgreSQL Development, 24x7 Support, Training and Services



Re: Base Backup Streaming

From
Heikki Linnakangas
Date:
On 02.01.2011 14:47, Dimitri Fontaine wrote:
> Heikki Linnakangas<heikki.linnakangas@enterprisedb.com>  writes:
>> BTW, there's a bunch of replication related stuff that we should work to
>> close, that are IMHO more important than synchronous replication. Like
>> making the standby follow timeline changes, to make failovers smoother, and
>> the facility to stream a base-backup over the wire. I wish someone worked on
>> those...
>
> So, we've been talking about base backup streaming at conferences and we
> have a working prototype.  We even have a needed piece of it in core
> now, that's the pg_read_binary_file() function.  What we still miss is
> an overall design and some integration effort.  Let's design first.

We even have a rudimentary patch to add the required backend support:

http://archives.postgresql.org/message-id/4C80D9B8.2020301@enterprisedb.com

That just needs to be polished into shape, and documentation.

> I propose the following new pg_ctl command to initiate the cloning:
>
>   pg_ctl clone [-D datadir] [-s on|off] [-t filename]  "primary_conninfo"
>
> As far as user are concerned, that would be the only novelty.  Once that
> command is finished (successfully) they would edit postgresql.conf and
> start the service as usual.  A basic recovery.conf file is created with
> the given options, standby_mode is driven by -s and defaults to off, and
> trigger_file defaults to being omitted and is given by -t.  Of course
> the primary_conninfo given on the command line is what ends up into the
> recovery.conf file.
>
> That alone would allow for making base backups for recovery purposes and
> for standby preparing.

+1. Or maybe it would be better to make it a separate binary, rather than
part of pg_ctl.

> To support for this new tool, the simplest would be to just copy what
> I've been doing in the prototype, that is run a query to get the primary
> file listing (per tablespace, not done in the prototype) then get their
> bytea content over the wire.  That means there's no further backend
> support code to write.

It would be so much nicer to have something more integrated, like the 
patch I linked above. Running queries requires connecting to a real 
database, which means that the user needs to have privileges to do that 
and you need to know the name of a valid database. Ideally this would 
all work through a replication connection. I think we should go with 
that from day one.

--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com


Re: Sync Rep Design

From
Heikki Linnakangas
Date:
On 02.01.2011 15:41, Simon Riggs wrote:
> On Sat, 2011-01-01 at 23:36 -0500, Robert Haas wrote:
>> On Sat, Jan 1, 2011 at 6:54 AM, Simon Riggs<simon@2ndquadrant.com>  wrote:
>>> Yes, working out the math is a good idea. Things are much clearer if we
>>> do that.
>>>
>>> Let's assume we have 98% availability on any single server.
>>>
>>> 1. Having one primary and 2 standbys, either of which can acknowledge,
>>> and we never lock up if both standbys fail, then we will have 99.9992%
>>> server availability. (So PostgreSQL hits "5 Nines", with data
>>> guarantees). ("Maximised availability")
>>
>> I don't agree with this math. ...(snip by Simon)... 99.96%.
>
> OK, so that is at least 99.96%. Cool.
>
> The key point here is not (1), but option (4).
>
> The approach advocated by Heikki and yourself gives us 94% availability.
> IMHO that is ridiculous, and I will not accept that as the *only* way
> forwards, for that reason, whoever advocates it or for how long they
> keep arguing. I do accept that some wish that as an option.

No-one is suggesting that to be the only option.

The "wait-for-all-to-ack" looks a lot less ridiculous if you also 
configure a timeout and don't wait for disconnected standbys. I'm not 
sure what the point of such a timeout in general is, but people have 
requested that. Also, setting synchronous_standbys="room1, room2" 
doesn't necessarily mean that you have just two standby servers, room1 
and room2 might both represent a group of servers.

I believe we all agree that there's different use cases that require 
different setups. Both "first-past-the-post" and "wait-for-all-to-ack" 
have their uses. There's no point in arguing over which is better.

--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com


Re: Sync Rep Design

From
"Kevin Grittner"
Date:
Simon Riggs  wrote:
> Do you agree that requiring response from 2 sync standbys, or
> locking up, gives us 94% server availability, but 99.9992% data
> durability?
I'm not sure how to answer that.  The calculations so far have been
based around up-time and the probabilities that you have a machine up
at any moment and whether you can have confidence that if you do, you
have all committed transactions represented.  There's been an implied
assumption that the down time is unplanned, but not much else.  The
above question seems to me to get into too many implied assumptions
to feel safe throwing out a number without pinning those down a whole
lot better.  If, for example, that 2% downtime always means the
machine irretrievably went up in smoke, hitting unavailable means
things are unrecoverable.  That's probably not the best assumption
(at least outside of a combat zone), but what is?
-Kevin


Re: Sync Rep Design

From
Simon Riggs
Date:
On Sun, 2011-01-02 at 11:11 -0600, Kevin Grittner wrote:
> Simon Riggs  wrote:
>  
> > Do you agree that requiring response from 2 sync standbys, or
> > locking up, gives us 94% server availability, but 99.9992% data
> > durability?
>  
> I'm not sure how to answer that.  The calculations so far have been
> based around up-time and the probabilities that you have a machine up
> at any moment and whether you can have confidence that if you do, you
> have all committed transactions represented.  There's been an implied
> assumption that the down time is unplanned, but not much else.  The
> above question seems to me to get into too many implied assumptions
> to feel safe throwing out a number without pinning those down a whole
> lot better.  If, for example, that 2% downtime always means the
> machine irretrievably went up in smoke, hitting unavailable means
> things are unrecoverable.  That's probably not the best assumption
> (at least outside of a combat zone), but what is?

Not really relevant. There's no room at all for downtime of any kind in
a situation where all servers must be available.

--
 Simon Riggs           http://www.2ndQuadrant.com/books/
 PostgreSQL Development, 24x7 Support, Training and Services



Re: Base Backup Streaming

From
Dimitri Fontaine
Date:
Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes:
> http://archives.postgresql.org/message-id/4C80D9B8.2020301@enterprisedb.com
>
> That just needs to be polished into shape, and documentation.

Wow, cool!  I don't know how but I've missed it.

> +1. Or maybe it would be better make it a separate binary, rather than part
> of pg_ctl.

Well the thinking was that nowadays we support initdb from pg_ctl, and
this is another kind of initdb, really.

> I linked above. Running queries requires connecting to a real database,
> which means that the user needs to have privileges to do that and you need
> to know the name of a valid database. Ideally this would all work through a
> replication connection. I think we should go with that from day one.

I didn't think about the "connecting to a real database" part of it,
versus using a dedicated REPLICATION connection/protocol, and to be
honest, I feared it was too much work.  Seeing that you already did it,
though, +1.

Regards,
-- 
Dimitri Fontaine
http://2ndQuadrant.fr     PostgreSQL : Expertise, Formation et Support


Re: Sync Rep Design

From
MARK CALLAGHAN
Date:
On Thu, Dec 30, 2010 at 9:02 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> <reads MySQL documentation>
>
> I see now that you've tried to design this feature in a way that is
> similar to MySQL's offering, which does have some value.  But it
> appears to me that the documentation you've written here is
> substantially similar to the MySQL 5.5 reference documentation.  That
> could get us into a world of legal trouble - that documentation is not
> even open source, let alone BSD.
>
> http://dev.mysql.com/doc/refman/5.5/en/replication-semisync.html

The docs originate from work done by my former team at Google. The
content license on this is CC 3.0 BY-SA, so I don't think that should
be a concern.
http://code.google.com/p/google-mysql-tools/wiki/SemiSyncReplication
http://code.google.com/p/google-mysql-tools/wiki/SemiSyncReplicationDesign

(From http://dev.mysql.com/doc/refman/5.5/en/replication-semisync.html)
the MySQL docs don't mention that other transactions can view the
committed data on the master between steps 1 and 2. Is that possible
in this case?

As described in the MySQL docs, semi-sync has another benefit for
some deployments. It rate limits busy clients to prevent them from
creating replication lag between the primary and standby servers. I
also provided the text for that
(http://bugs.mysql.com/bug.php?id=57911) if you are concerned about
copying.

--
Mark Callaghan
mdcallag@gmail.com


Re: Sync Rep Design

From
Jeff Janes
Date:
On Sat, Jan 1, 2011 at 8:35 AM, Simon Riggs <simon@2ndquadrant.com> wrote:
> On Sat, 2011-01-01 at 05:13 -0800, Jeff Janes wrote:
>> On 12/31/10, Simon Riggs <simon@2ndquadrant.com> wrote:
>
>> > 2. "sync" does not guarantee that the updates to the standbys are in any
>> > way coordinated. You can run a query on one standby and get one answer
>> > and at the exact same time run the same query on another standby and get
>> > a different answer (slightly ahead/behind). That also means that if the
>> > master crashes one of the servers will be ahead or behind. You can use
>> > pg_last_xlog_receive_location() to check which one that is.
>>
>> If at least one of the standbys is in the same smoking crater as the
>> primary, then pg_last_xlog_receive_location on it is unlikely to
>> respond.
>>
>> The guarantee goes away precisely when it is needed.
>
> Fairly obviously, I would not be advocating anything that forced you to
> use a server in the "same smoking crater".

You are forced to use the standby which is further ahead, otherwise
you might lose transactions which have been reported to have been
committed.

The mere existence of a commit-releasing stand-by in the same data
center as the primary means that a remote standby is not very useful
for data preservation after campus-wide disasters.  It is probably
behind (due to higher latency) and even if it is not behind, there is
no way to *know* that is not behind if the on-site standby cannot be
contacted.

I understand that you are not advocating the use of one local standby
and one remote standby, both synchronous. But I think we need to
*explicitly* warn against it.  After all, the docs do explicitly
recommend the use of two standbys.  If we assume that the readers are
already experts, then they don't need that advice.  If they are not
experts, then that advice could lead them to shoot themselves in the
foot, both kneecaps, and a femur (metaphorically speaking, unlike the
smoking crater, which is a literal scenario some people need to plan
for).

If durability is more important than availability, what would you
recommend?  Only one synchronous rep, in a remote data center?  Two
(or more) synchronous reps all in the same remote data center?  In two
different remote data centers?


> I can't see any guarantee
> that goes away precisely when it is needed.

In order to know that you are not losing data, you have to be able to
contact every single semi-synchronous standby and invoke
pg_last_xlog_receive_location on it.

If your goal is to have data durability protected from major
catastrophes (and why else would you do synchronous rep to remote data
centers?), then it is expecting a lot to have every single standby
survive that major catastrophe.  That expectation is an unavoidable
consequence of going with single-confirmation-releases.  Perhaps you
think this consequence is too obvious to document--if so I disagree on
that.

> Perhaps you could explain the issue you see, because your comments seem
> unrelated to my point above.

It is directly related to the part of your point about using
pg_last_xlog_receive_location.  When planning for disaster recovery,
it is little comfort that you can do something in a non-disaster case,
if you can't also do it in likely disaster cases.

It probably wasn't relevant to the first part of your point, but I
must admit I did not understand the first part of your point.
Obviously they are coordinated in *some* way (I believe commits occur
in the same order on each server, for example).  Different read-only
standbys could give different results, but only from among the
universe of results made possible by a given commit sequence.  But
that is not the part I had intended to comment on, and I don't think
it is what other people concerned about durability after major
catastrophes were focusing on, either.

Cheers,

Jeff


Re: Sync Rep Design

From
Simon Riggs
Date:
On Sun, 2011-01-02 at 12:13 -0800, MARK CALLAGHAN wrote:
> On Thu, Dec 30, 2010 at 9:02 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> > <reads MySQL documentation>
> >
> > I see now that you've tried to design this feature in a way that is
> > similar to MySQL's offering, which does have some value.  But it
> > appears to me that the documentation you've written here is
> > substantially similar to the MySQL 5.5 reference documentation.  That
> > could get us into a world of legal trouble - that documentation is not
> > even open source, let alone BSD.
> >
> > http://dev.mysql.com/doc/refman/5.5/en/replication-semisync.html

Sorry, only just read that bit. Are they that similar? My docs are about
3 times longer and cover all sorts of things. I didn't intentionally
copy anything, but that doesn't really matter, what matters is that if
you think they are similar, legal people might. I've only read the URL
above, not the other links from it.

Robert, Can you identify which paragraphs need to be re-written? I won't
argue, I will just rewrite them or delete them and start afresh. Thanks
for being eagle-eyed.

> The docs originate from work done by my former team at Google. The
> content license on this is CC 3.0 BY-SA, so I don't think that should
> be a concern.
> http://code.google.com/p/google-mysql-tools/wiki/SemiSyncReplication
> http://code.google.com/p/google-mysql-tools/wiki/SemiSyncReplicationDesign

I guess that gets us off the hook a little bit, but not far enough for
my liking. Thanks for trying to save me!

> >From http://dev.mysql.com/doc/refman/5.5/en/replication-semisync.html)
> the MySQL docs don't mention that other transactions can view the
> committed data on the master between steps 1 and 2. Is that possible
> in this case?

Other transactions on the master cannot read data until after the
confirmation it is on the sync standby.

> As described in the the MySQL docs, semi-sync has another benefit for
> some deployments. It rate limits busy clients to prevent them from
> creating replication lag between the primary and standby servers. I
> also provided the text for that
> (http://bugs.mysql.com/bug.php?id=57911) if you are concerned about
> copying.

Yeah, I'm aware of the effect, but I'm not really seeing slowing down the
master as a benefit; it's more an implication of synchronicity.

--
 Simon Riggs           http://www.2ndQuadrant.com/books/
 PostgreSQL Development, 24x7 Support, Training and Services



Re: Base Backup Streaming

From
Magnus Hagander
Date:
On Sun, Jan 2, 2011 at 18:53, Dimitri Fontaine <dimitri@2ndquadrant.fr> wrote:
> Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes:
>> http://archives.postgresql.org/message-id/4C80D9B8.2020301@enterprisedb.com
>>
>> That just needs to be polished into shape, and documentation.

I have an updated version of this somewhere. IIRC it also needs things
like tablespace support, but it's well on its way.

> Wow, cool!  I don't know how but I've missed it.

Yes, especially since we discussed it in Stuttgart. I guess it may
have been during the party...


>> +1. Or maybe it would be better make it a separate binary, rather than part
>> of pg_ctl.
>
> Well the thinking was that nowadays we support initdb from pg_ctl, and
> this is another kind of initdb, really.

Yes, if it should go in any of the current binaries, initdb would be
the reasonable place, not pg_ctl ;)

That said, if we're going to wrap pg_streamrecv into 9.1 (which I
think we should), then *that* is where it should go.

--
 Magnus Hagander
 Me: http://www.hagander.net/
 Work: http://www.redpill-linpro.com/


Re: Sync Rep Design

From
Simon Riggs
Date:
On Sun, 2011-01-02 at 18:54 +0200, Heikki Linnakangas wrote:

> I believe we all agree that there's different use cases that require 
> different setups. Both "first-past-the-post" and "wait-for-all-to-ack" 
> have their uses. 

Robert's analysis is that "first-past-the-post" doesn't actually improve
the durability guarantee (according to his calcs). Which means that
  1 primary, 2 sync standbys with first-past-the-post
is actually worse than
  1 primary, 1 sync and 1 async standby
in terms of its durability guarantees.

So ISTM that Robert does not agree that both have their uses.

> I'm not 
> sure what the point of such a timeout in general is, but people have 
> requested that. 

Again, this sounds like you think a timeout has no measurable benefit,
other than to please some people's perceived needs.

> The "wait-for-all-to-ack" looks a lot less ridiculous if you also 
> configure a timeout and don't wait for disconnected standbys

Does it? Do Robert, Stefan and Aidan agree? What are the availability
and durability percentages if we do that? Based on those, we may decide
to do that instead. But I'd like to see some analysis of your ideas, not
just a "we could". Since nobody has commented on my analysis, lets see
someone else's.

> There's no point in arguing over which is better.

I'm trying to compare the quantifiable benefits of various options to see
what goes into Postgres. I don't want to put anything in that we cannot
all agree has a measurable benefit to someone (that has the appropriate
preference).

--
 Simon Riggs           http://www.2ndQuadrant.com/books/
 PostgreSQL Development, 24x7 Support, Training and Services



Re: Base Backup Streaming

From
Dimitri Fontaine
Date:
Magnus Hagander <magnus@hagander.net> writes:
> Yes, especially since we discussed it in Stuttgart. I guess it may
> have been during the party...

I remember we talked about it, I didn't remember a patch had reached the list…

> Yes, if it should go in any of the current binaries, initdb would be
> the reasonable place, not pg_ctl ;)

Well, pg_ctl is able to call initdb for users, but yes.

> That said, if we're going to wrap pg_streamrecv into 9.1 (which I
> think we should), then *that* is where it should go.

They should at least cooperate so that you don't need to set up WAL
archiving explicitly while preparing the standby, if at all possible.

Regards,
--
Dimitri Fontaine
http://2ndQuadrant.fr     PostgreSQL : Expertise, Formation et Support


Re: Sync Rep Design

From
Simon Riggs
Date:
On Sun, 2011-01-02 at 20:53 +0000, Simon Riggs wrote:
> On Sun, 2011-01-02 at 12:13 -0800, MARK CALLAGHAN wrote:
> > On Thu, Dec 30, 2010 at 9:02 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> > > <reads MySQL documentation>
> > >
> > > I see now that you've tried to design this feature in a way that is
> > > similar to MySQL's offering, which does have some value.  But it
> > > appears to me that the documentation you've written here is
> > > substantially similar to the MySQL 5.5 reference documentation.  That
> > > could get us into a world of legal trouble - that documentation is not
> > > even open source, let alone BSD.
> > >
> > > http://dev.mysql.com/doc/refman/5.5/en/replication-semisync.html
> 
> Sorry, only just read that bit. Are they that similar? My docs are about
> 3 times longer and cover all sorts of things. I didn't intentionally
> copy anything, but that doesn't really matter, what matters is that if
> you think they are similar, legal people might. I've only read the URL
> above, not the other links from it.
> 
> Robert, Can you identify which paragraphs need to be re-written? I won't
> argue, I will just rewrite them or delete them and start afresh. Thanks
> for being eagle-eyed.

Been through this to check. The first paragraph was pretty similar, so
I've reworded that a little. Trying to make a readable paragraph that
introduces the need for sync rep, based upon the deficiencies of async
rep is going to be very similar, whatever we do. I can't see any other
way to introduce a feature other than to explain the potential problem
and then explain how the new feature resolves that.

Parameters are set on both primary and standby. That design existed long
before I looked at the MySQL manual, and yes there is similarity, which
led to similar descriptions of how that works. Bear in mind that the two
designs are very different.

I can't see any other similarities. If anybody else can, please shout.

--
 Simon Riggs           http://www.2ndQuadrant.com/books/
 PostgreSQL Development, 24x7 Support, Training and Services



Re: Sync Rep Design

From
Josh Berkus
Date:
All,

This is a pointless argument.

Eventually, we will be implementing all possible sync rep
configurations, because different users *need* different configurations.
Some users care more about durability, some more about availability,
and some more about response time.  And you can't have all three, which
was my point about A,D,R (also the point of CAP).

For that matter, any single configuration will be useful to a large
number of users, and an even larger number will be able to "work around"
while they wait for 9.2.  Further, the knowledge we gain by having some
kind of synch rep in the field will allow us to implement the different
configurations correctly, which *no* amount of arguing on e-mail will.

The perfect is the enemy of the good.

The relevant question is: which configuration(s) can we have ready for
the next CommitFest and alpha release?

--
  Josh Berkus
  PostgreSQL Experts Inc.
  http://www.pgexperts.com


Re: Sync Rep Design

From
Josh Berkus
Date:
On 1/2/11 12:35 AM, Heikki Linnakangas wrote:
> Very likely. A synchronous standby can bring the master to a halt, while
> an asynchronous one is rather harmless. If I were a DBA, and the data
> wasn't very sensitive, I would liberally hand out async privileges to my
> colleagues to set up reporting standbys, test servers etc. But I would
> *not* give them synchronous privileges, because sooner or later one
> would go "hmm, I wonder what happens if I make this synchronous", or
> haphazardly copy the config file from a synchronous standby. That would
> either bring down the master, or act as a "fake" standby, acknowledging
> commits before they're flushed to the real synchronous standby. Either
> one would be bad.

That's not "very likely".  Shops with enough DBAs to actually have a
heirarchy of database access are rare indeed; the average business has
trouble employing *one* fulltime DBA.  Also, you're mistaken if you
think that DBA's wouldn't restrict the ability of development users to
set up asynch connections; those have a significant impact on the
production server as well.

However, the possibility of *accidental* DoS is a bit more likely, e.g.
by accidentally using the wrong config file.  I don't think you can guard
against that with permissions, though; if a DBA is copying the config
file from a synchronous server accidentally, presumably he's copying the
.pgpass file as well, and then the new server is identical to the legit
synch rep servers as far as the master is concerned.

So, again, I don't agree that a separate synchrep permission is useful,
or warranted.

However, your arguments *do* make me backpedal on the issue of having a
list of synch rep roles on the master.  I can easily imagine a DBA
needing to rapidly disable synch rep if replication is failing, without
taking the time to log in to several separate standbys, one or more of
which might be improperly configured and inaccessible.  I can't think of
a simpler way to do that than having a synch_rep_roles configuration on
the master.  That would also handle control issues for the senior DBA,
since you'd need superuser access to the master to modify it.

--Josh Berkus


Re: Sync Rep Design

From
Simon Riggs
Date:
On Tue, 2011-01-04 at 10:28 -0800, Josh Berkus wrote:

> The relevant question is: which configuration(s) can we have ready for
> the next CommitFest and alpha release?

Based upon that series of conversations, I've reworked the design so
that there is (currently) only a single standby offering sync rep at any
one time. Other standbys can request to be sync standbys but they only
become the sync standby if the first one fails. Which was simple to do
and bridges the challenges of an exactly identified sync standby and the
fragility of too closely specifying the config.

I think you're right that trying to please everyone is not going to be
possible in this release, because of the explosion of parameter
combinations that require testing and because of the explosion in my
head that causes.

I'm not feeling well now, so I'm going to go to bed, not just to avoid
snapping at people. Even given that short interlude, I see no problem
about delivery.

--
 Simon Riggs           http://www.2ndQuadrant.com/books/
 PostgreSQL Development, 24x7 Support, Training and Services



Re: Sync Rep Design

From
Robert Haas
Date:
On Sun, Jan 2, 2011 at 4:19 PM, Simon Riggs <simon@2ndquadrant.com> wrote:
> On Sun, 2011-01-02 at 18:54 +0200, Heikki Linnakangas wrote:
>
>> I believe we all agree that there's different use cases that require
>> different setups. Both "first-past-the-post" and "wait-for-all-to-ack"
>> have their uses.
>
> Robert's analysis is that "first-past-the-post" doesn't actually improve
> the durability guarantee (according to his calcs). Which means that
>  1 primary, 2 sync standbys with first-past-the-post
> is actually worse than
>  1 primary, 1 sync and 1 async standby
> in terms of its durability guarantees.
>
> So ISTM that Robert does not agree that both have their uses.

I think it depends on what failure modes you want to protect against.
If you have a primary in New York, a secondary in Los Angeles, and
another secondary in London, you might decide that the chances of two
standbys being taken out by the same event are negligible, or
alternatively that if one event does take out both of them, it'll be
something like a meteor where you'll have bigger things to worry about
than lost transactions.  In that case, requiring one ACK but not two
is pretty sensible.  If the primary goes down, you'll look at the two
remaining machines (which, by presumption, will still be up) and
promote whichever one is ahead.  In this setup, you get a performance
benefit from waiting for either ACK rather than both ACKs, and you
haven't compromised any of the cases you care about.

However, if you have the traditional close/far setup, things are
different.  Suppose you have a primary and a secondary in New York and
another secondary in Los Angeles.  Now it has to be viewed as a
reasonable possibility that you could lose the New York site.  If that
happens, you need to be able to promote the LA standby *without
reference to the NY standby*.  So you really can't afford to do the
1-of-2 thing, because then when NY goes away you're not sure whether
the LA standby is safe to promote.

So, IMHO, it just depends on what you want to do.

>> I'm not
>> sure what the point of such a timeout in general is, but people have
>> requested that.
>
> Again, this sounds like you think a timeout has no measurable benefit,
> other than to please some people's perceived needs.
>
>> The "wait-for-all-to-ack" looks a lot less ridiculous if you also
>> configure a timeout and don't wait for disconnected standbys
>
> Does it? Do Robert, Stefan and Aidan agree? What are the availability
> and durability percentages if we do that? Based on those, we may decide
> to do that instead. But I'd like to see some analysis of your ideas, not
> just a "we could". Since nobody has commented on my analysis, lets see
> someone else's.

Here's my take on this point.  I think there is a use case for waiting
for a disconnected standby and a use case for not waiting for a
disconnected standby.  The danger of NOT waiting for a disconnected
standby is that if you then go on to irretrievably lose the primary,
you lose transactions.  But on the other hand, if you do wait, you've
made the primary unavailable.  I don't know that there's one right
answer here.  For some people, if they can't be certain of recording
the transaction in two places, then it may be better to not process
any transactions at all.  For other people, it may be better to
process transactions unprotected for a while while you get a new
standby up.  It's not for us to make that judgment; we're here to
provide options.

Having said that, I am OK with whichever one we want to implement
first so long as we keep the door open to doing the other one later.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: Sync Rep Design

From
"Joshua D. Drake"
Date:
> I'm not feeling well now, so I'm going to go to bed, not just to avoid
> snapping at people. Even given that short interlude, I see no problem
> about delivery.

Cool! Thanks Simon. Feel better.

Joshua D. Drake


-- 
PostgreSQL.org Major Contributor
Command Prompt, Inc: http://www.commandprompt.com/ - 509.416.6579
Consulting, Training, Support, Custom Development, Engineering
http://twitter.com/cmdpromptinc | http://identi.ca/commandprompt



Re: Sync Rep Design

From
Stefan Kaltenbrunner
Date:
On 01/04/2011 07:51 PM, Simon Riggs wrote:
> On Tue, 2011-01-04 at 10:28 -0800, Josh Berkus wrote:
>
>> The relevant question is: which configuration(s) can we have ready for
>> the next CommitFest and alpha release?
>
> Based upon that series of conversations, I've reworked the design so
> that there is (currently) only a single standby offering sync rep at any
> one time. Other standbys can request to be sync standbys but they only
> become the sync standby if the first one fails. Which was simple to do
> and bridges the challenges of an exactly identified sync standby and the
> fragility of too closely specifying the config.

ah cool - like that approach for 9.1!


Thanks for working on that!


Stefan


Re: Sync Rep Design

From
Robert Haas
Date:
On Tue, Jan 4, 2011 at 2:50 PM, Stefan Kaltenbrunner
<stefan@kaltenbrunner.cc> wrote:
> On 01/04/2011 07:51 PM, Simon Riggs wrote:
>>
>> On Tue, 2011-01-04 at 10:28 -0800, Josh Berkus wrote:
>>
>>> The relevant question is: which configuration(s) can we have ready for
>>> the next CommitFest and alpha release?
>>
>> Based upon that series of conversations, I've reworked the design so
>> that there is (currently) only a single standby offering sync rep at any
>> one time. Other standbys can request to be sync standbys but they only
>> become the sync standby if the first one fails. Which was simple to do
>> and bridges the challenges of an exactly identified sync standby and the
>> fragility of too closely specifying the config.
>
> ah cool - like that approach for 9.1!

Yeah, I like that idea too, on first blush.  I think we should think
it over and see whether we're committing ourselves to any design
decisions we may later regret - what parameters will we need to add
from that point to get all the configs we ultimately want to support?
But it seems a reasonable starting point, and we can argue about the
rest once we have working code.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: Sync Rep Design

From
Dimitri Fontaine
Date:
Josh Berkus <josh@postgresql.org> writes:
> So, again, I don't agree that a separate synchrep permission is useful,
> or warranted.

+1

> However, your arguments *do* make me backpedal on the issue of having a
> list of synch rep roles on the master.  I can easily imagine a DBA
> needing to rapidly disable synch rep if replication is failing, without
> taking the time to log in to several separate standbys, one or more of
> which might be improperly configured and inaccessible.  I can't think of
> a simpler way to do that than having a synch_rep_roles configuration on
> the master.  That would also handle control issues for the senior DBA,
> since you'd need superuser access to the master to modify it.

What about the HBA here?

Regards,
-- 
Dimitri Fontaine
http://2ndQuadrant.fr     PostgreSQL : Expertise, Formation et Support


Re: Sync Rep Design

From
Josh Berkus
Date:
> What about the HBA here?

Hmmm.  That's tempting; a "synchronous" HBA entry instead of a GUC?  But that
doesn't solve the problem of "standby #6 is failing, I want to kick it
off synch rep".

I'd be opposed to having a GUC *and* an HBA entry.  Making DBAs set things
independently in two places just frustrates our users.

--Josh


Re: Sync Rep Design

From
Dimitri Fontaine
Date:
Josh Berkus <josh@postgresql.org> writes:
>> What about the HBA here?
>
> Hmmm.  That's tempting; an "synchronous" HBA instead of a GUC?  But that
> doesn't solve the problem of "standby #6 is failing, I want to kick it
> off synch rep".
>
> I'd be opposed to having a GUC *and* an HBA.  making DBAs set things
> independantly in two places just frustrates our users.

I was just thinking that you could prepend a reject line at the right
place in the file.
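
For example, with a hypothetical standby connecting as user 'replicator'
from 10.0.0.6, prepending something like this to pg_hba.conf would shut
that one standby out while leaving the others alone:

    # temporarily kick the failing standby off replication
    host    replication    replicator    10.0.0.6/32    reject
    # existing rule that still admits the other standbys
    host    replication    replicator    10.0.0.0/24    md5

(Note that a reject line blocks the standby's replication connection
entirely, not just its synchronous status.)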

Regards,
-- 
Dimitri Fontaine
http://2ndQuadrant.fr     PostgreSQL : Expertise, Formation et Support


Re: Sync Rep Design

From
Josh Berkus
Date:
> I was just thinking that you could prepend a reject line at the right
> place in the file.

Hmmm, that's worth thinking about.  How do others feel about this?

--
  Josh Berkus
  PostgreSQL Experts Inc.
  http://www.pgexperts.com


Re: Sync Rep Design

From
Greg Smith
Date:
Simon Riggs wrote:
> Based upon that series of conversations, I've reworked the design so
> that there is (currently) only a single standby offering sync rep at any
> one time. Other standbys can request to be sync standbys but they only
> become the sync standby if the first one fails. Which was simple to do
> and bridges the challenges of an exactly identified sync standby and the
> fragility of too closely specifying the config.
>   

That seems like a good enough starting point to cover a lot of cases.
Presuming the two-servers-at-each-of-two-sites config that shows up in a
lot of these discussions, people in the "I need sync to a remote spot"
camp can get that, and if that site is unavailable for long enough to be
kicked out they'll smoothly degrade to sync on a secondary local copy.
And those who want high-performance local sync and best effort for the
remote site can configure for that too, and if the local secondary dies
they'll just degrade to the slow remote commits.

-- 
Greg Smith   2ndQuadrant US    greg@2ndQuadrant.com   Baltimore, MD
PostgreSQL Training, Services and Support        www.2ndQuadrant.us
"PostgreSQL 9.0 High Performance": http://www.2ndQuadrant.com/books