Thread: Support for N synchronous standby servers
Hi all,

Please find attached a patch to add support of synchronous replication for multiple standby servers. This is controlled by the addition of a new GUC parameter called synchronous_standby_num, which makes the server wait for transaction commit on the first N standbys defined in synchronous_standby_names. The implementation is really straightforward, and has needed only a couple of modifications in walsender.c for pg_stat_get_wal_senders and in syncrep.c.

When a commit is cancelled manually by the user or when ProcDiePending shows up, the message returned to the user does not show the list of walsenders on which the commit has not been confirmed, as it was only partially confirmed. I have not done anything for that, but let me know if it would be useful; it would need a scan of the walsenders to get their application_name.

Thanks,
--
Michael
Attachment
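A minimal postgresql.conf sketch of the behavior just described (the standby names here are placeholders, not taken from the patch):

# Commit waits for confirmation from the first 2 standbys listed below.
synchronous_standby_names = 'node_a,node_b,node_c'   # placeholder names
synchronous_standby_num = 2                          # GUC proposed by this patch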
On Sat, Aug 9, 2014 at 3:03 PM, Michael Paquier <michael.paquier@gmail.com> wrote:
> Hi all,
>
> Please find attached a patch to add support of synchronous replication
> for multiple standby servers. This is controlled by the addition of a
> new GUC parameter called synchronous_standby_num, which makes the server
> wait for transaction commit on the first N standbys defined in
> synchronous_standby_names. The implementation is really
> straightforward, and has needed only a couple of modifications in
> walsender.c for pg_stat_get_wal_senders and in syncrep.c.

Great! This is really a feature I want. Though I forget why we left this feature out when we added synchronous replication; it may be worth reading the old discussion, which may point out the potential problems of N sync standbys.

I just tested this feature with synchronous_standby_num = 2. I started up only one synchronous standby and ran a write transaction. The transaction completed successfully, i.e., it didn't wait for two standbys. Probably this is a bug in the patch.

And you forgot to add the line for synchronous_standby_num to postgresql.conf.sample.

Regards,

--
Fujii Masao
On Mon, Aug 11, 2014 at 1:31 AM, Fujii Masao <masao.fujii@gmail.com> wrote:
> Great! This is really a feature I want. Though I forget why we left
> this feature out when we added synchronous replication; it may be
> worth reading the old discussion, which may point out the potential
> problems of N sync standbys.

Sure, I'll double check. Thanks for your comments.

> I just tested this feature with synchronous_standby_num = 2. I started
> up only one synchronous standby and ran a write transaction. The
> transaction completed successfully, i.e., it didn't wait for two
> standbys. Probably this is a bug in the patch.

Oh OK, yes this is a bug in what I did. The number of standbys to wait for should take precedence over the number of standbys found in the list of active WAL senders. I changed the patch to take that behavior into account. So for example if you have only one sync standby connected and synchronous_standby_num = 2, the client waits indefinitely.

> And you forgot to add the line for synchronous_standby_num to
> postgresql.conf.sample.

Yep, right. On top of that, I refactored the code in such a way that pg_stat_get_wal_senders and SyncRepReleaseWaiters rely on a single API to get the list of synchronous standbys found. This reduces code duplication, duplication that already exists in HEAD...

Regards,
--
Michael
Attachment
On Mon, Aug 11, 2014 at 11:54 AM, Michael Paquier <michael.paquier@gmail.com> wrote:
> Oh OK, yes this is a bug in what I did. The number of standbys to wait
> for should take precedence over the number of standbys found in the
> list of active WAL senders. I changed the patch to take that behavior
> into account. So for example if you have only one sync standby
> connected and synchronous_standby_num = 2, the client waits
> indefinitely.

Thanks for updating the patch! Again I tested the feature and found something wrong. I set synchronous_standby_num to 2 and started three standbys. Two of them are included in synchronous_standby_names, i.e., they are synchronous standbys. That is, the other standby is always asynchronous. When I shut down one of the synchronous standbys and executed a write transaction, the transaction completed successfully. So the transaction doesn't wait for two sync standbys in that case. Probably this is a bug.

Regards,

--
Fujii Masao
On Mon, Aug 11, 2014 at 1:26 PM, Fujii Masao <masao.fujii@gmail.com> wrote:
> Thanks for updating the patch! Again I tested the feature and found
> something wrong. I set synchronous_standby_num to 2 and started three
> standbys. Two of them are included in synchronous_standby_names, i.e.,
> they are synchronous standbys. That is, the other standby is always
> asynchronous. When I shut down one of the synchronous standbys and
> executed a write transaction, the transaction completed successfully.
> So the transaction doesn't wait for two sync standbys in that case.
> Probably this is a bug.
Well, that's working in my case :)
Please see below with 4 nodes: 1 master and 3 standbys on the same host. The master listens on 5432, the other nodes on 5433, 5434 and 5435. Each standby's application_name is node_$PORT.
=# show synchronous_standby_names ;
synchronous_standby_names
---------------------------
node_5433,node_5434
(1 row)
=# show synchronous_standby_num ;
synchronous_standby_num
-------------------------
2
(1 row)
=# SELECT application_name,
pg_xlog_location_diff(sent_location, flush_location) AS replay_delta,
sync_priority,
sync_state
FROM pg_stat_replication ORDER BY replay_delta ASC, application_name;
 application_name | replay_delta | sync_priority | sync_state
------------------+--------------+---------------+------------
 node_5433        |            0 |             1 | sync
 node_5434        |            0 |             2 | sync
 node_5435        |            0 |             0 | async
(3 rows)
=# create table aa (a int);
CREATE TABLE
[...]
-- Stopped node with port 5433:
[...]
=# SELECT application_name,
pg_xlog_location_diff(sent_location, flush_location) AS replay_delta,
sync_priority,
sync_state
FROM pg_stat_replication ORDER BY replay_delta ASC, application_name;
 application_name | replay_delta | sync_priority | sync_state
------------------+--------------+---------------+------------
 node_5434        |            0 |             2 | sync
 node_5435        |            0 |             0 | async
(2 rows)
=# create table ab (a int);
^CCancel request sent
WARNING: 01000: canceling wait for synchronous replication due to user request
DETAIL: The transaction has already committed locally, but might not have been replicated to the standby(s).
LOCATION: SyncRepWaitForLSN, syncrep.c:227
CREATE TABLE
--
Michael
On Mon, Aug 11, 2014 at 2:10 PM, Michael Paquier <michael.paquier@gmail.com> wrote:
> Well, that's working in my case :)

Oh, it worked on my machine too, this time... I did something wrong. Sorry for the noise.

Regards,

--
Fujii Masao
On Mon, Aug 11, 2014 at 4:26 PM, Fujii Masao <masao.fujii@gmail.com> wrote:
> Oh, it worked on my machine too, this time... I did something wrong.
> Sorry for the noise.

No problem, thanks for spending time testing.
--
Michael
On Mon, Aug 11, 2014 at 4:38 PM, Michael Paquier <michael.paquier@gmail.com> wrote:
> No problem, thanks for spending time testing.

I probably hit a similar but distinct problem. I set synchronous_standby_num to 2 and started up two synchronous standbys. When I ran write transactions, they completed successfully. That's OK.

I then sent the SIGSTOP signal to the walreceiver process in one of the sync standbys, and ran write transactions again. In this case, they must not complete, because their WAL cannot be replicated to the standby whose walreceiver was stopped. But they completed successfully.

Regards,

--
Fujii Masao
On Wed, Aug 13, 2014 at 2:10 PM, Fujii Masao <masao.fujii@gmail.com> wrote:
> I sent the SIGSTOP signal to the walreceiver process in one of the sync
> standbys, and ran write transactions again. In this case, they must not
> complete, because their WAL cannot be replicated to the standby whose
> walreceiver was stopped. But they completed successfully.

At the end of SyncRepReleaseWaiters, SYNC_REP_WAIT_WRITE and SYNC_REP_WAIT_FLUSH in walsndctl could be updated with only one WAL sender in sync, waking up backends even if the other standbys had not caught up. But we need to scan all the synchronous WAL senders, find the minimum write and flush positions, and update walsndctl with those values. Well, that's a code path I forgot to cover.

Attached is an updated patch fixing the problem you reported.

Regards,
--
Michael
Attachment
On Wed, Aug 13, 2014 at 4:10 PM, Michael Paquier <michael.paquier@gmail.com> wrote:
> At the end of SyncRepReleaseWaiters, SYNC_REP_WAIT_WRITE and
> SYNC_REP_WAIT_FLUSH in walsndctl could be updated with only one WAL
> sender in sync, waking up backends even if the other standbys had not
> caught up. But we need to scan all the synchronous WAL senders, find
> the minimum write and flush positions, and update walsndctl with those
> values. Well, that's a code path I forgot to cover.
>
> Attached is an updated patch fixing the problem you reported.

+ At any one time there will be at a number of active synchronous standbys
+ defined by <varname>synchronous_standby_num</>; transactions waiting

It's better to use <xref linkend="guc-synchronous-standby-num">, instead.

+ for commit will be allowed to proceed after those standby servers
+ confirms receipt of their data. The synchronous standbys will be

Typo: confirms -> confirm

+ <para>
+ Specifies the number of standbys that support
+ <firstterm>synchronous replication</>, as described in
+ <xref linkend="synchronous-replication">, and listed as the first
+ elements of <xref linkend="guc-synchronous-standby-names">.
+ </para>
+ <para>
+ Default value is 1.
+ </para>

synchronous_standby_num is defined with PGC_SIGHUP. So the following should be added to the document:

This parameter can only be set in the postgresql.conf file or on the server command line.

The name of the parameter "synchronous_standby_num" sounds to me like the transaction must wait for its WAL to be replicated to s_s_num standbys. But that's not true in your patch. If s_s_names is empty, replication works asynchronously whatever the value of s_s_num is. I'm afraid that it's confusing.

The description of s_s_num is not sufficient. I'm afraid that users can easily misunderstand that they can use a quorum commit feature by using s_s_names and s_s_num, i.e., that the transaction waits for its WAL to be replicated to any s_s_num standbys listed in s_s_names.

When s_s_num is set to a value larger than max_wal_senders, shouldn't we warn about that?

+ for (i = 0; i < num_sync; i++)
+ {
+     volatile WalSnd *walsndloc = &WalSndCtl->walsnds[sync_standbys[i]];
+
+     if (min_write_pos > walsndloc->write)
+         min_write_pos = walsndloc->write;
+     if (min_flush_pos > walsndloc->flush)
+         min_flush_pos = walsndloc->flush;
+ }

I don't think that it's safe to read those shared values without a spinlock.

Regards,

--
Fujii Masao
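A minimal sketch of the spinlock-protected variant being asked for here, reusing the variable names from the quoted hunk and the locking pattern walsender.c uses elsewhere; min_write_pos and min_flush_pos are assumed initialized by the caller, and this is an illustration, not the patch's final code:

for (i = 0; i < num_sync; i++)
{
    volatile WalSnd *walsndloc = &WalSndCtl->walsnds[sync_standbys[i]];
    XLogRecPtr  write_pos;
    XLogRecPtr  flush_pos;

    /* copy the shared positions while holding the walsender's spinlock */
    SpinLockAcquire(&walsndloc->mutex);
    write_pos = walsndloc->write;
    flush_pos = walsndloc->flush;
    SpinLockRelease(&walsndloc->mutex);

    if (min_write_pos > write_pos)
        min_write_pos = write_pos;
    if (min_flush_pos > flush_pos)
        min_flush_pos = flush_pos;
}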
On Thu, Aug 14, 2014 at 8:34 PM, Fujii Masao <masao.fujii@gmail.com> wrote:
> + At any one time there will be at a number of active synchronous standbys
> + defined by <varname>synchronous_standby_num</>; transactions waiting
>
> It's better to use <xref linkend="guc-synchronous-standby-num">, instead.

Fixed.

> + for commit will be allowed to proceed after those standby servers
> + confirms receipt of their data. The synchronous standbys will be
>
> Typo: confirms -> confirm

Fixed.

> synchronous_standby_num is defined with PGC_SIGHUP. So the following
> should be added to the document:
>
> This parameter can only be set in the postgresql.conf file or on
> the server command line.

Fixed.

> The name of the parameter "synchronous_standby_num" sounds to me like
> the transaction must wait for its WAL to be replicated to s_s_num
> standbys. But that's not true in your patch. If s_s_names is empty,
> replication works asynchronously whatever the value of s_s_num is. I'm
> afraid that it's confusing.
>
> The description of s_s_num is not sufficient. I'm afraid that users can
> easily misunderstand that they can use a quorum commit feature by using
> s_s_names and s_s_num, i.e., that the transaction waits for its WAL to
> be replicated to any s_s_num standbys listed in s_s_names.

I reworked the docs to mention all that. Yes, things are a bit different from any quorum commit facility (how to parametrize that simply without a parameter mapping one-to-one to the items of s_s_names?), as this facility relies on the order of the items of s_s_names and on which standbys are connected at a given time.

> When s_s_num is set to a value larger than max_wal_senders, shouldn't
> we warn about that?

Actually I have done a bit more than that by forbidding setting s_s_num to a value higher than max_wal_senders. Thoughts?

Now that we discuss the interactions with other parameters, another thing that I am wondering about is: what should we do if we set s_s_num to a number higher than the number of elements in s_s_names? Currently, the patch gives the priority to s_s_num; in short, if we set s_s_num to 100, the server will wait for 100 servers to confirm commit even if there are fewer than 100 elements in s_s_names. I chose this way because it looks saner, particularly if s_s_names = '*'. Thoughts once again?

> + for (i = 0; i < num_sync; i++)
> + {
> +     volatile WalSnd *walsndloc = &WalSndCtl->walsnds[sync_standbys[i]];
> +
> +     if (min_write_pos > walsndloc->write)
> +         min_write_pos = walsndloc->write;
> +     if (min_flush_pos > walsndloc->flush)
> +         min_flush_pos = walsndloc->flush;
> + }
>
> I don't think that it's safe to read those shared values without a
> spinlock.

Looking at walsender.c you are right. I have updated the code to take the mutex lock of the walsender whose values are being read.

Regards,
--
Michael
Attachment
On Fri, Aug 15, 2014 at 4:05 PM, Michael Paquier <michael.paquier@gmail.com> wrote:
>> When s_s_num is set to a value larger than max_wal_senders, shouldn't
>> we warn about that?
> Actually I have done a bit more than that by forbidding setting
> s_s_num to a value higher than max_wal_senders. Thoughts?

You added check_synchronous_standby_num() as the GUC check function for synchronous_standby_num, and checked that there. But that seems to be wrong. You can easily see the following error messages even if synchronous_standby_num is smaller than max_wal_senders. The point is that synchronous_standby_num should be located before max_wal_senders in postgresql.conf.

LOG: invalid value for parameter "synchronous_standby_num": 0
DETAIL: synchronous_standby_num cannot be higher than max_wal_senders.

> Now that we discuss the interactions with other parameters, another
> thing that I am wondering about is: what should we do if we set
> s_s_num to a number higher than the number of elements in s_s_names?
> Currently, the patch gives the priority to s_s_num; in short, if we set
> s_s_num to 100, the server will wait for 100 servers to confirm commit
> even if there are fewer than 100 elements in s_s_names. I chose this
> way because it looks saner, particularly if s_s_names = '*'. Thoughts
> once again?

I'm fine with this. As you gave an example, the number of entries in s_s_names can be smaller than the number of actual active sync standbys. For example, when s_s_names is set to 'hoge', more than one standby with the name 'hoge' can connect to the server in sync mode.

I still think that it's strange that replication can be async even when s_s_num is larger than zero. That is, I think that the transaction must wait for s_s_num sync standbys whether s_s_names is empty or not. OTOH, if s_s_num is zero, replication must be async whether s_s_names is empty or not. At least for me, it's intuitive to use s_s_num primarily to control the sync mode. Of course, other hackers may have different thoughts, so we need to keep our ears open for them.

In the above design, one problem is that the number of parameters that those who want to set up only one sync standby need to change is incremented by one. That is, they need to change s_s_num additionally. If we are really concerned about this, we can treat a value of -1 in s_s_num as a special value, which allows us to control sync replication only by s_s_names as we do now. That is, if s_s_names is empty, replication would be async. Otherwise, only the standby with the highest priority in s_s_names becomes the sync one. Probably the default of s_s_num should be -1. Thoughts?

The source code comments at the top of syncrep.c need to be updated. It's worth checking whether there are other comments to be updated.

Regards,

--
Fujii Masao
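To make the ordering problem concrete: a GUC check hook that validates against another GUC sees whatever value that other GUC happens to hold at that point of postgresql.conf processing, so the result depends on file order. A sketch of the problematic pattern (the hook body is a guess at the patch's code, not a copy of it):

static bool
check_synchronous_standby_num(int *newval, void **extra, GucSource source)
{
    /*
     * Unreliable: if synchronous_standby_num appears before
     * max_wal_senders in postgresql.conf, max_wal_senders still holds
     * its previous value here, so this check can fire spuriously.
     */
    if (*newval > max_wal_senders)
    {
        GUC_check_errdetail("synchronous_standby_num cannot be higher than max_wal_senders.");
        return false;
    }
    return true;
}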
On Fri, Aug 15, 2014 at 8:28 AM, Fujii Masao <masao.fujii@gmail.com> wrote:
>> Currently, the patch gives the priority to s_s_num; in short, if we
>> set s_s_num to 100, the server will wait for 100 servers to confirm
>> commit even if there are fewer than 100 elements in s_s_names. I chose
>> this way because it looks saner, particularly if s_s_names = '*'.
>> Thoughts once again?
>
> I'm fine with this. As you gave an example, the number of entries in
> s_s_names can be smaller than the number of actual active sync
> standbys. For example, when s_s_names is set to 'hoge', more than one
> standby with the name 'hoge' can connect to the server in sync mode.

This is a bit tricky. Suppose there is one standby connected which has reached the relevant WAL position. We then lose that connection, and a new standby connects. When or if the second standby is known to have reached the relevant WAL position, can we release waiters? It depends. If the old and new connections are to two different standbys that happen to have the same name, yes. But if it's the same standby reconnecting, then no.

> I still think that it's strange that replication can be async even when
> s_s_num is larger than zero. That is, I think that the transaction must
> wait for s_s_num sync standbys whether s_s_names is empty or not.

+1.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Fri, Aug 15, 2014 at 9:28 PM, Fujii Masao <masao.fujii@gmail.com> wrote:
> You added check_synchronous_standby_num() as the GUC check function for
> synchronous_standby_num, and checked that there. But that seems to be wrong.
> You can easily see the following error messages even if synchronous_standby_num
> is smaller than max_wal_senders. The point is that synchronous_standby_num
> should be located before max_wal_senders in postgresql.conf.
>
> LOG: invalid value for parameter "synchronous_standby_num": 0
> DETAIL: synchronous_standby_num cannot be higher than max_wal_senders.
I am not sure what I can do here, so I am removing this check from the code and simply adding a note in the docs that a value of _num higher than max_wal_senders does not have much meaning.
> I still think that it's strange that replication can be async even when
> s_s_num is larger than zero. That is, I think that the transaction must
> wait for s_s_num sync standbys whether s_s_names is empty or not.
> OTOH, if s_s_num is zero, replication must be async whether s_s_names
> is empty or not. At least for me, it's intuitive to use s_s_num primarily
> to control the sync mode. Of course, other hackers may have different
> thoughts, so we need to keep our ears open for them.
Sure, the compromise looks to be what you propose, and I am fine with that.
> In the above design, one problem is that the number of parameters
> that those who want to set up only one sync standby need to change is
> incremented by one. That is, they need to change s_s_num additionally.
> If we are really concerned about this, we can treat a value of -1 in
> s_s_num as a special value, which allows us to control sync replication
> only by s_s_names as we do now. That is, if s_s_names is empty,
> replication would be async. Otherwise, only the standby with the
> highest priority in s_s_names becomes the sync one. Probably the
> default of s_s_num should be -1. Thoughts?
Taking into account those comments, attached is a patch doing the following things depending on the values of _num and _names:
- If _num = -1 and _names is empty, all the standbys are considered as async (same behavior as 9.1~, and default).
- If _num = -1 and _names has at least one item, wait for one standby, even if it is not connected at the time of commit. If one node is found as sync, other standbys listed in _names with higher priority than the sync one are in potential state (same as existing behavior).
- If _num = 0, all the standbys are async, whatever the values in _names. Priority is enforced to 0 for all the standbys. SyncStandbysDefined is set to false in this case.
- If _num > 0, the server must wait for _num standbys, whatever the values in _names.
The default value of _num is -1. Documentation has been updated accordingly.
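A compact sketch of the commit-wait decision these rules imply (a hypothetical helper for illustration, not the patch's actual function; SyncStandbysDefined() is the existing check on s_s_names):

/* Hypothetical summary of the rules above, not code from the patch. */
static bool
SyncRepCommitNeedsWait(void)
{
    if (synchronous_standby_num == 0)
        return false;                   /* all standbys async, _names ignored */
    if (synchronous_standby_num == -1)
        return SyncStandbysDefined();   /* legacy behavior driven by _names */
    return true;                        /* _num > 0: wait for _num standbys */
}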
> The source code comments at the top of syncrep.c need to be updated.
> It's worth checking whether there are other comments to be updated.
Done. I have updated some comments in other places than the header.
Regards,
--
Michael
Attachment
On 09 August 2014 11:33, Michael Paquier Wrote:
> Please find attached a patch to add support of synchronous replication
> for multiple standby servers. This is controlled by the addition of a
> new GUC parameter called synchronous_standby_num, which makes the server
> wait for transaction commit on the first N standbys defined in
> synchronous_standby_names. The implementation is really
> straightforward, and has needed only a couple of modifications in
> walsender.c for pg_stat_get_wal_senders and in syncrep.c.

I have just started looking into this patch. Please find below my first level of observations from the patch:

1. Allocation of memory for sync_nodes in function SyncRepGetSynchronousNodes should be equivalent to allowed_sync_nodes instead of max_wal_senders, as anyway we are not going to store more sync standbys than allowed_sync_nodes.
sync_nodes = (int *) palloc(allowed_sync_nodes * sizeof(int));

2. The logic for deciding the highest-priority nodes seems to be incorrect.
Assume s_s_num = 3, s_s_names = 3,4,2,1,
and standby nodes in the order 1,2,3,4,5,6,7.
As per the logic in the patch, node 4 with priority 2 will not be added to the list, whereas 1,2,3 will be added.
The problem is that the priority tracked for the next comparison is not the highest priority seen so far in that iteration, just the priority of the last node added to the list. So it may happen that a node with higher priority is left out while we are comparing against some other, smaller priority.

3. Can we optimize the function SyncRepGetSynchronousNodes in such a way that it gets the number of standby nodes from s_s_names itself? With this it will be useful to stop scanning the moment we get the first s_s_num potential standbys.

Thanks and Regards,
Kumar Rajeev Rastogi
On Fri, Aug 22, 2014 at 7:14 PM, Rajeev rastogi <rajeev.rastogi@huawei.com> wrote:
> I have just started looking into this patch.
> Please find below my first level of observations from the patch:
Thanks! Updated patch attached.
> 1. Allocation of memory for sync_nodes in function SyncRepGetSynchronousNodes should be equivalent to allowed_sync_nodes instead of max_wal_senders, as anyway we are not going to store more sync standbys than allowed_sync_nodes.
> sync_nodes = (int *) palloc(allowed_sync_nodes * sizeof(int));
Fixed.
> 2. The logic for deciding the highest-priority nodes seems to be incorrect.
> Assume s_s_num = 3, s_s_names = 3,4,2,1,
> and standby nodes in the order 1,2,3,4,5,6,7.
> As per the logic in the patch, node 4 with priority 2 will not be added to the list, whereas 1,2,3 will be added.
> The problem is that the priority tracked for the next comparison is not the highest priority seen so far in that iteration, just the priority of the last node added to the list. So it may happen that a node with higher priority is left out while we are comparing against some other, smaller priority.
Fixed. Nice catch!
> 3. Can we optimize the function SyncRepGetSynchronousNodes in such a way that it gets the number of standby nodes from s_s_names itself? With this it will be useful to stop scanning the moment we get the first s_s_num potential standbys.
By doing so, we would need to scan the WAL sender array more than once (or once, if we can find N sync nodes with a name matching the first entry, something unlikely to happen). We would also need to recalculate, for a given item in the _names list, what its priority is and compare it with the existing entries in the WAL sender list. So this is not worth the shot.
Also, using the priority instead of s_s_names is more solid as s_s_names is now used only in SyncRepGetStandbyPriority to calculate the priority for a given WAL sender, and is a function only called by a WAL sender itself when it initializes.
Regards,
Michael
Attachment
On Fri, Aug 22, 2014 at 11:42 PM, Michael Paquier <michael.paquier@gmail.com> wrote:
>> 2. The logic for deciding the highest-priority nodes seems to be incorrect.
>> Assume s_s_num = 3, s_s_names = 3,4,2,1,
>> and standby nodes in the order 1,2,3,4,5,6,7.
>> [...]
>
> Fixed. Nice catch!

Actually, by re-reading the code I wrote yesterday, I found that the fix for that in v6 is not correct. It is really fixed in the attached v7.

Regards,
--
Michael
Attachment
On 23 August 2014 11:22, Michael Paquier Wrote:
> >> 2. The logic for deciding the highest-priority nodes seems to be incorrect.
> >> Assume s_s_num = 3, s_s_names = 3,4,2,1,
> >> and standby nodes in the order 1,2,3,4,5,6,7.
> >> [...]
> >
> > Fixed. Nice catch!
>
> Actually, by re-reading the code I wrote yesterday, I found that the fix
> for that in v6 is not correct. It is really fixed in the attached v7.

I have done some more review, below are my comments:

1. There are currently two loops on *num_sync. Can we simplify the function SyncRepGetSynchronousNodes by moving the priority calculation inside the upper loop?
if (*num_sync == allowed_sync_nodes)
{
    for (j = 0; j < *num_sync; j++)
    {
Anyway we require priority only if the *num_sync == allowed_sync_nodes condition matches. So in this loop itself, we can calculate the priority as well as the assignment of new standbys with lower priority. Let me know if you see any issue with this.

2. Comment inside the function SyncRepReleaseWaiters:
/*
 * We should have found ourselves at least, except if it is not expected
 * to find any synchronous nodes.
 */
Assert(num_sync > 0);
I think "except if it is not expected to find any synchronous nodes" is not required, as reaching this point means at least this node is synchronous.

3. The documentation says that s_s_num should be less than max_wal_senders, but code-wise there is no protection for the same. IMHO, s_s_num should be less than or equal to max_wal_senders, otherwise COMMIT will never return to the console without any knowledge of the user. I see that some discussion has happened regarding this, but I think just adding documentation for this is not enough. I am not sure what issue was observed in adding a check during GUC initialization, but if there is an unavoidable issue during GUC initialization then can't we try to add the check at a later point?

4. Similarly for the interaction between the parameters s_s_names and s_s_num: I see some discussion has happened regarding this, and it is acceptable to have s_s_num greater than s_s_names. But I was thinking we should at least give some notice message to the user for such a case, along with some documentation.

config.sgml

5. "At any one time there will be at a number of active synchronous standbys": this sentence is not proper.

6. When this parameter is set to <literal>0</>, all the standby nodes will be considered as asynchronous.
Can we make this:
When this parameter is set to <literal>0</>, all the standby nodes will be considered as asynchronous irrespective of the value of synchronous_standby_names.

7. Are considered as synchronous the first elements of <xref linkend="guc-synchronous-standby-names"> in number of <xref linkend="guc-synchronous-standby-num"> that are connected.
The start of this sentence looks to be incomplete.

high-availability.sgml

8. Standbys listed after this will take over the role of synchronous standby if the first one should fail.
Should we not make it:
Standbys listed after this will take over the role of synchronous standby if any of the first synchronous-standby-num standbys fails.

Let me know in case something is not clear.

Thanks and Regards,
Kumar Rajeev Rastogi
On Wed, Aug 27, 2014 at 2:46 PM, Rajeev rastogi <rajeev.rastogi@huawei.com> wrote:
> I have done some more review, below are my comments:

Thanks!

> 1. There are currently two loops on *num_sync. Can we simplify the
> function SyncRepGetSynchronousNodes by moving the priority calculation
> inside the upper loop?
> if (*num_sync == allowed_sync_nodes)
> {
>     for (j = 0; j < *num_sync; j++)
>     {
> Anyway we require priority only if the *num_sync == allowed_sync_nodes
> condition matches. So in this loop itself, we can calculate the
> priority as well as the assignment of new standbys with lower priority.
> Let me know if you see any issue with this.

OK, I see; yes this can minimize processing a bit, so I refactored the code by integrating the second loop into the first. This needed the removal of the break portion, as we need to find the highest priority value among the nodes already determined as synchronous.

> 2. Comment inside the function SyncRepReleaseWaiters:
> /*
>  * We should have found ourselves at least, except if it is not expected
>  * to find any synchronous nodes.
>  */
> Assert(num_sync > 0);
>
> I think "except if it is not expected to find any synchronous nodes" is
> not required, as reaching this point means at least this node is
> synchronous.

Yes, removed.

> 3. The documentation says that s_s_num should be less than
> max_wal_senders, but code-wise there is no protection for the same.
> IMHO, s_s_num should be less than or equal to max_wal_senders,
> otherwise COMMIT will never return to the console without any knowledge
> of the user. I see that some discussion has happened regarding this,
> but I think just adding documentation for this is not enough. I am not
> sure what issue was observed in adding a check during GUC
> initialization, but if there is an unavoidable issue during GUC
> initialization then can't we try to add the check at a later point?

The trick here is that you cannot really return a warning at GUC loading level to the user, as a warning could easily be triggered if for example s_s_num is present before max_wal_senders in postgresql.conf. I am open to any solutions if there are any (like an error when initializing WAL senders?!). Documentation seems enough to me to warn the user.

> 4. Similarly for the interaction between the parameters s_s_names and
> s_s_num: I see some discussion has happened regarding this, and it is
> acceptable to have s_s_num greater than s_s_names. But I was thinking
> we should at least give some notice message to the user for such a
> case, along with some documentation.

Done. I added the following to the paragraph "Server will wait":
Hence it is recommended to not set <varname>synchronous_standby_num</> to a value higher than the number of elements in <varname>synchronous_standby_names</>.

> 5. "At any one time there will be at a number of active synchronous
> standbys": this sentence is not proper.

What about this: "At any one time there can be a number of active synchronous standbys up to the number defined by <xref linkend="guc-synchronous-standby-num">"?

> 6. When this parameter is set to <literal>0</>, all the standby
> nodes will be considered as asynchronous.
>
> Can we make this:
> When this parameter is set to <literal>0</>, all the standby
> nodes will be considered as asynchronous irrespective of the value of
> synchronous_standby_names.

Done. This seems proper for the user, as we do not care at all about s_s_names if _num = 0.

> 7. Are considered as synchronous the first elements of
> <xref linkend="guc-synchronous-standby-names"> in number of
> <xref linkend="guc-synchronous-standby-num"> that are connected.
>
> The start of this sentence looks to be incomplete.

OK, I reworked this part as well. I hope it is clearer.

> 8. Standbys listed after this will take over the role
> of synchronous standby if the first one should fail.
>
> Should we not make it:
> Standbys listed after this will take over the role of synchronous
> standby if any of the first synchronous-standby-num standbys fails.

Fixed as proposed. At the same time I found a bug with pg_stat_get_wal_senders caused by a NULL pointer that was freed when s_s_num = 0.

An updated patch addressing these comments is attached. On top of that, the documentation has been reworked a bit by replacing the too-high number of <xref> blocks with <varname>, keeping the link to a given variable only at its first mention.

Regards,
--
Michael
Attachment
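To visualize the single-pass selection discussed in point 1 above, here is an outline of how one scan of the walsender array can keep the allowed_sync_nodes entries with the lowest (best) priority values. Spinlocking and walsender state checks are omitted for brevity, and this is a sketch of the approach, not code from the attached patch:

int         i, j;
int         num_sync = 0;
int         worst = 0;      /* index in sync_standbys[] of the worst kept priority */
int        *sync_standbys = (int *) palloc(allowed_sync_nodes * sizeof(int));

for (i = 0; i < max_wal_senders; i++)
{
    volatile WalSnd *walsnd = &WalSndCtl->walsnds[i];
    int         priority = walsnd->sync_standby_priority;

    if (walsnd->pid == 0 || priority == 0)
        continue;                           /* unused slot, or async standby */

    if (num_sync < allowed_sync_nodes)
        sync_standbys[num_sync++] = i;      /* room left, take it */
    else if (priority <
             WalSndCtl->walsnds[sync_standbys[worst]].sync_standby_priority)
        sync_standbys[worst] = i;           /* better than our worst, evict it */
    else
        continue;

    /* the kept set changed: recompute which entry is now the worst */
    worst = 0;
    for (j = 1; j < num_sync; j++)
        if (WalSndCtl->walsnds[sync_standbys[j]].sync_standby_priority >
            WalSndCtl->walsnds[sync_standbys[worst]].sync_standby_priority)
            worst = j;
}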
On 08/28/2014 10:10 AM, Michael Paquier wrote:
> + #synchronous_standby_num = -1 # number of standbys servers using sync rep

To be honest, that's a horrible name for the GUC. Back when synchronous replication was implemented, we had looong discussions on this feature. It was called "quorum commit" back then. I'd suggest using the "quorum" term in this patch, too; that's a fairly well-known term in distributed computing for this.

When synchronous replication was added, quorum was left out to keep things simple; the current feature set was the most we could all agree on as being useful. If you search the archives for "quorum commit" you'll see what I mean. There was a lot of confusion on what is possible and what is useful, but regarding this particular patch: people wanted to be able to describe more complicated scenarios. For example, imagine that you have a master and two standbys in the primary data center, and two more standbys in a different data center. It should be possible to specify that you must get acknowledgment from at least one standby in both data centers. Maybe you could hack that by giving the standbys in the same data center the same name, but it gets ugly, and it still won't scale to even more complex scenarios.

Maybe that's OK - we don't necessarily need to solve all scenarios at once. But it's worth considering.

BTW, how does this patch behave if there are multiple standbys connected with the same name?

- Heikki
On Thu, Sep 11, 2014 at 5:21 AM, Heikki Linnakangas <hlinnakangas@vmware.com> wrote:
> To be honest, that's a horrible name for the GUC. Back when synchronous
> replication was implemented, we had looong discussions on this feature.
> It was called "quorum commit" back then. I'd suggest using the "quorum"
> term in this patch, too; that's a fairly well-known term in distributed
> computing for this.

I am open to any suggestions. Then what about the following parameter names?
- synchronous_quorum_num
- synchronous_standby_quorum
- synchronous_standby_quorum_num
- synchronous_quorum_commit

> When synchronous replication was added, quorum was left out to keep
> things simple [...] It should be possible to specify that you must get
> acknowledgment from at least one standby in both data centers. Maybe
> you could hack that by giving the standbys in the same data center the
> same name, but it gets ugly, and it still won't scale to even more
> complex scenarios.

Currently two nodes can only have the same priority if they have the same application_name, so we could for example add a new connstring parameter called, let's say, application_group, to define groups of nodes that will have the same priority (if a node does not define application_group, it defaults to application_name; if application_name is NULL, well, we don't care much, it cannot be a sync candidate). That's a first idea that we could use to control groups of nodes. And we could switch syncrep.c to use application_group in s_s_names instead of application_name. That would be backward-compatible, and could open the door to more improvements for quorum commits as we could control groups of nodes. Well, this is a super-set of what application_name can already do, but there is no problem identifying single nodes of the same data center and how far behind in replication they are, so I think that this would be really user-friendly. An idea similar to that would be base work for the next thing... See below.

Now, in your case the two nodes in the second data center need to have the same priority either way. With this patch you can achieve that with the same node name. Where things are not that cool with this patch is something like this though:
- 5 slaves: 1 with the master (node_local), 2 in a 2nd data center (node_center1), 2 more in a 3rd data center (node_center2)
- s_s_num = 3
- s_s_names = 'node_local,node_center1,node_center2'

In this case the nodes have the following priorities:
- node_local => 1
- the 2 nodes with node_center1 => 2
- the 2 nodes with node_center2 => 3

In this {1,2,2,3,3} schema, the patch makes the system wait for node_local and the two nodes in node_center1, without caring about the ones in node_center2, as it will pick up only the nodes with the highest priorities. If the user expects the system to wait for a node in node_center2, he'll be disappointed.
That's perhaps where we could improve things, by adding an extra parameter able to control the priority ranks, say synchronous_priority_check:
- [absolute|individual]: wait for the first s_s_num nodes having the lowest priorities; in this case we'll wait for {1,2,2}
- group: wait for only one node in each of the lowest s_s_num priorities; here we'll wait for {1,2,3}

Note that we may not even need this parameter if we assume by default that we wait for only one node in a given group that has the same priority.

> Maybe that's OK - we don't necessarily need to solve all scenarios at
> once. But it's worth considering.

Parametrizing and covering user expectations is tricky. Either way, not everybody can be happy :) There are even people unhappy now because we can only define one single sync node.

> BTW, how does this patch behave if there are multiple standbys
> connected with the same name?

All the nodes have the same priority. For example, in the case of a cluster with 5 slaves having the same application_name and s_s_num = 3, the first three nodes found when scanning the WAL sender array are expected to return a COMMIT confirmation before the master commits locally:

=# show synchronous_standby_num ;
 synchronous_standby_num
-------------------------
 3
(1 row)

=# show synchronous_standby_names ;
 synchronous_standby_names
---------------------------
 node
(1 row)

=# SELECT application_name, client_port,
   pg_xlog_location_diff(sent_location, flush_location) AS replay_delta,
   sync_priority, sync_state
   FROM pg_stat_replication ORDER BY replay_delta ASC, application_name;
 application_name | client_port | replay_delta | sync_priority | sync_state
------------------+-------------+--------------+---------------+------------
 node             |       50251 |            0 |             1 | sync
 node             |       50252 |            0 |             1 | sync
 node             |       50253 |            0 |             1 | sync
 node             |       50254 |            0 |             1 | potential
 node             |       50255 |            0 |             1 | potential
(5 rows)

After writing this long message, and thinking more about it, I kind of like the group approach. Thoughts welcome.

Regards,
--
Michael
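If it helps to picture the application_group idea, here is how it might look from the standby and master sides; application_group is purely hypothetical at this point and exists in no patch:

# Hypothetical recovery.conf on a standby in the 2nd data center:
primary_conninfo = 'host=master port=5432 application_name=standby2a application_group=node_center1'

# On the master, s_s_names would then match groups instead of names:
synchronous_standby_names = 'node_local,node_center1,node_center2'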
On Thu, Sep 11, 2014 at 9:10 AM, Michael Paquier <michael.paquier@gmail.com> wrote:
> On Thu, Sep 11, 2014 at 5:21 AM, Heikki Linnakangas
> <hlinnakangas@vmware.com> wrote:
> > On 08/28/2014 10:10 AM, Michael Paquier wrote:
> >>
> >> + #synchronous_standby_num = -1 # number of standbys servers using sync
> >> rep
> >
> >
> > To be honest, that's a horrible name for the GUC. Back when synchronous
> > replication was implemented, we had looong discussions on this feature. It
> > was called "quorum commit" back then. I'd suggest using the "quorum" term in
> > this patch, too, that's a fairly well-known term in distributed computing
> > for this.
> I am open to any suggestions. Then what about the following parameter names?
> - synchronous_quorum_num
> - synchronous_standby_quorum
> - synchronous_standby_quorum_num
> - synchronous_quorum_commit
or simply synchronous_standbys
> > When synchronous replication was added, quorum was left out to keep things
> > simple; the current feature set was the most we could all agree on to be
> > useful. If you search the archives for "quorum commit" you'll see what I
> > mean. There was a lot of confusion on what is possible and what is useful,
> > but regarding this particular patch: people wanted to be able to describe
> > more complicated scenarios. For example, imagine that you have a master and
> > two standbys in one the primary data center, and two more standbys in a
> > different data center. It should be possible to specify that you must get
> > acknowledgment from at least one standby in both data centers. Maybe you
> > could hack that by giving the standbys in the same data center the same
> > name, but it gets ugly, and it still won't scale to even more complex
> > scenarios.
Won't this problem be handled if synchronous mode is supported in
cascading replication?
I am not sure about the feasibility of that, but I think the basic problem
mentioned above could be addressed, and maybe others as well.
On Thu, Aug 28, 2014 at 12:40 PM, Michael Paquier <michael.paquier@gmail.com> wrote:
>
> On Wed, Aug 27, 2014 at 2:46 PM, Rajeev rastogi
> <rajeev.rastogi@huawei.com> wrote:
> > I have done some more review, below are my comments:
> Thanks!
>
> > 1. There are currently two loops on *num_sync, Can we simplify the function SyncRepGetSynchronousNodes by moving the priority calculation inside the upper loop
> > if (*num_sync == allowed_sync_nodes)
> > {
> > for (j = 0; j < *num_sync; j++)
> > {
> > Anyway we require priority only if *num_sync == allowed_sync_nodes condition matches.
> > So in this loop itself, we can calculate the priority as well as assigment of new standbys with lower priority.
> > Let me know if you see any issue with this.
>
> OK, I see, yes this can minimize process a bit so I refactored the
> code by integrating the second loop to the first. This has needed the
> removal of the break portion as we need to find the highest priority
> value among the nodes already determined as synchronous.
>
> > 2. Comment inside the function SyncRepReleaseWaiters,
> > /*
> > * We should have found ourselves at least, except if it is not expected
> > * to find any synchronous nodes.
> > */
> > Assert(num_sync > 0);
> >
> > I think "except if it is not expected to find any synchronous nodes" is not required.
> > As if it has come till this point means at least this node is synchronous.
> Yes, removed.
>
> > 3. Document says that s_s_num should be lesser than max_wal_senders but code wise there is no protection for the same.
> > IMHO, s_s_num should be lesser than or equal to max_wal_senders otherwise COMMIT will never return back the console without
> > any knowledge of user.
> > I see that some discussion has happened regarding this but I think just adding documentation for this is not enough.
> > I am not sure what issue is observed in adding check during GUC initialization but if there is unavoidable issue during GUC initialization then can't we try to add check at later points.
>
> The trick here is that you cannot really return a warning at GUC
> loading level to the user as a warning could be easily triggered if
> for example s_s_num is present before max_wal_senders in
> postgresql.conf. I am open to any solutions if there are any (like an
> error when initializing WAL senders?!). Documentation seems enough for
> me to warn the user.
How about making it a PGC_POSTMASTER parameter and then
if (max_wal_senders > 0 && wal_level == WAL_LEVEL_MINIMAL)
ereport(ERROR,
(errmsg("WAL streaming (max_wal_senders > 0) requires wal_level \"archive\", \"hot_standby\", or \"logical\"")));
>
> On Wed, Aug 27, 2014 at 2:46 PM, Rajeev rastogi
> <rajeev.rastogi@huawei.com> wrote:
> > I have done some more review, below are my comments:
> Thanks!
>
> > 1. There are currently two loops on *num_sync, Can we simplify the function SyncRepGetSynchronousNodes by moving the priority calculation inside the upper loop
> > if (*num_sync == allowed_sync_nodes)
> > {
> > for (j = 0; j < *num_sync; j++)
> > {
> > Anyway we require priority only if *num_sync == allowed_sync_nodes condition matches.
> > So in this loop itself, we can calculate the priority as well as assigment of new standbys with lower priority.
> > Let me know if you see any issue with this.
>
> OK, I see, yes this can minimize process a bit so I refactored the
> code by integrating the second loop to the first. This has needed the
> removal of the break portion as we need to find the highest priority
> value among the nodes already determined as synchronous.
>
> > 2. Comment inside the function SyncRepReleaseWaiters,
> > /*
> > * We should have found ourselves at least, except if it is not expected
> > * to find any synchronous nodes.
> > */
> > Assert(num_sync > 0);
> >
> > I think "except if it is not expected to find any synchronous nodes" is not required.
> > As if it has come till this point means at least this node is synchronous.
> Yes, removed.
>
> > 3. Document says that s_s_num should be lesser than max_wal_senders but code wise there is no protection for the same.
> > IMHO, s_s_num should be lesser than or equal to max_wal_senders otherwise COMMIT will never return back the console without
> > any knowledge of user.
> > I see that some discussion has happened regarding this but I think just adding documentation for this is not enough.
> > I am not sure what issue is observed in adding check during GUC initialization but if there is unavoidable issue during GUC initialization then can't we try to add check at later points.
>
> The trick here is that you cannot really return a warning at GUC
> loading level to the user as a warning could be easily triggered if
> for example s_s_num is present before max_wal_senders in
> postgresql.conf. I am open to any solutions if there are any (like an
> error when initializing WAL senders?!). Documentation seems enough for
> me to warn the user.
How about making it a PGC_POSTMASTER parameter and then
adding a check similar to the ones below in PostmasterMain()?
    /*
     * Check for invalid combinations of GUC settings.
     */
    if (ReservedBackends >= MaxConnections)
    {
        write_stderr("%s: superuser_reserved_connections must be less than max_connections\n", progname);
        ExitPostmaster(1);
    }
    if (max_wal_senders >= MaxConnections)
    {
        write_stderr("%s: max_wal_senders must be less than max_connections\n", progname);
        ExitPostmaster(1);
    }
    if (XLogArchiveMode && wal_level == WAL_LEVEL_MINIMAL)
        ereport(ERROR,
                (errmsg("WAL archival (archive_mode=on) requires wal_level \"archive\", \"hot_standby\", or \"logical\"")));
    if (max_wal_senders > 0 && wal_level == WAL_LEVEL_MINIMAL)
        ereport(ERROR,
                (errmsg("WAL streaming (max_wal_senders > 0) requires wal_level \"archive\", \"hot_standby\", or \"logical\"")));
On Wed, Sep 10, 2014 at 11:40 PM, Michael Paquier <michael.paquier@gmail.com> wrote:
> Currently two nodes can only have the same priority if they have the
> same application_name, so we could for example add a new connstring
> parameter called, let's say, application_group, to define groups of
> nodes that will have the same priority (if a node does not define
> application_group, it defaults to application_name; if app_name is
> NULL, well, we don't care much, it cannot be a sync candidate). That's a
> first idea that we could use to control groups of nodes. And we could
> switch syncrep.c to use application_group in s_s_names instead of
> application_name. That would be backward-compatible, and could open
> the door for more improvements for quorum commits as we could control
> groups of nodes. Well, this is a super-set of what application_name
> can already do, but there is no problem to identify single nodes of
> the same data center and how much they could be late in replication,
> so I think that this would be really user-friendly. An idea similar to
> that would be a base work for the next thing... See below.

In general, I think the user's requirement for which synchronous
standbys need to acknowledge a commit could be an arbitrary Boolean
expression - well, probably no NOT, but any amount of AND and OR that
you want to use. Can someone want A OR (((B AND C) OR (D AND E)) AND
F)? Maybe! Based on previous discussions, it seems not unlikely that
as soon as we decide we don't want to support that, someone will tell
us they can't live without it. In general, though, I'd expect the two
common patterns to be more or less what you've set forth above: any K
servers from set X plus any L servers from set Y plus any M servers
from set Z, etc.

However, I'm not confident it's right to control this by adding more
configuration on the client side. I think it would be better to stick
with the idea that each client specifies an application_name, and then
the master specifies the policy in some way. One advantage of that is
that you can change the rules in ONE place - the master - rather than
potentially having to update every client.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
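To make the quoted application_group idea concrete: a standby in data center 1 might connect with the hypothetical lines below. application_group does not exist as a connection parameter; this is purely a sketch of the proposal.

    # hypothetical primary_conninfo on a standby in data center 1
    primary_conninfo = 'host=master application_name=node1a application_group=dc1'
    # on the master, s_s_names would then name groups instead of individual nodes
    synchronous_standby_names = 'dc1,dc2'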
On Fri, Sep 12, 2014 at 12:48 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Wed, Sep 10, 2014 at 11:40 PM, Michael Paquier
> <michael.paquier@gmail.com> wrote:
>> Currently two nodes can only have the same priority if they have the
>> same application_name, so we could for example add a new connstring
>> parameter called, let's say, application_group, to define groups of
>> nodes that will have the same priority (if a node does not define
>> application_group, it defaults to application_name; if app_name is
>> NULL, well, we don't care much, it cannot be a sync candidate). That's a
>> first idea that we could use to control groups of nodes. And we could
>> switch syncrep.c to use application_group in s_s_names instead of
>> application_name. That would be backward-compatible, and could open
>> the door for more improvements for quorum commits as we could control
>> groups of nodes. Well, this is a super-set of what application_name
>> can already do, but there is no problem to identify single nodes of
>> the same data center and how much they could be late in replication,
>> so I think that this would be really user-friendly. An idea similar to
>> that would be a base work for the next thing... See below.
>
> In general, I think the user's requirement for which synchronous
> standbys need to acknowledge a commit could be an arbitrary Boolean
> expression - well, probably no NOT, but any amount of AND and OR that
> you want to use. Can someone want A OR (((B AND C) OR (D AND E)) AND
> F)? Maybe! Based on previous discussions, it seems not unlikely that
> as soon as we decide we don't want to support that, someone will tell
> us they can't live without it. In general, though, I'd expect the two
> common patterns to be more or less what you've set forth above: any K
> servers from set X plus any L servers from set Y plus any M servers
> from set Z, etc. However, I'm not confident it's right to control this
> by adding more configuration on the client side. I think it would be
> better to stick with the idea that each client specifies an
> application_name, and then the master specifies the policy in some
> way. One advantage of that is that you can change the rules in ONE
> place - the master - rather than potentially having to update every
> client.

OK. I see your point.

Now, what about the following assumptions (somewhat restrictive, to
facilitate the user experience of setting up syncrep and the
parametrization of this feature):
- Nodes are defined within the same set (or group) if they have the
same priority, aka the same application_name.
- One node cannot be a part of two sets. That's obvious...

The current patch has its own merit, but it fails in the case you and
Heikki are describing: wait for k nodes in set 1 (nodes with the lowest
priority value), l nodes in set 2 (nodes with the 2nd-lowest priority
value), etc.
What it does is: if for example we have a set of nodes with priorities
{0,1,1,2,2,3,3}, backends will wait for flush_position from the first
s_s_num nodes. By setting s_s_num to 3, we'll wait for {0,1,1}; to 4,
{0,1,1,2}; etc.

Now, what about this: instead of waiting for the nodes in "absolute"
order the way the current patch does, let's do it in a "relative" way.
By that I mean that a backend waits for flush_position confirmation
only from *1* node among a set of nodes having the same priority. So by
using s_s_num = 3, we'll wait for {0, "one node with 1", "one node with
2"}, and you can guess the rest.
The point is as well that we can keep the s_s_num behavior as it is now:
- if set to -1, we rely on the default behavior of s_s_names (empty
means all nodes async, at least one entry meaning that we need to wait
for a node)
- if set to 0, all nodes are forced to be async'd
- if set to n > 1, we have to wait for one node in each set of the n
lowest priority values.
I'd see enough users happy with those improvements, and that would help
improve the coverage of the test cases that Heikki and you envisioned.

By the way, as the CF is running out of time, I am going to mark this
patch as "Returned with Feedback", as I have received enough feedback.
I am still planning to work on this for the next CF, so it would be
great if there is an agreement on what can be done for this feature, to
avoid blind progress. Particularly, I see some merit in the last idea,
which we could still extend by allowing values of the type "k,l,m" in
s_s_num to let the user decide: wait for 3 sets, k nodes in set 1, l
nodes in set 2 and m nodes in set 3. Having a GUC parameter with a list
of integer values is not that user-friendly though, so I think I'd hold
to having only one node per set. Thoughts?
--
Michael
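To illustrate the "relative" semantics described above with a concrete configuration (all names hypothetical): suppose two standbys connect with application_name=london and one with application_name=paris.

    synchronous_standby_names = 'london,paris'  # london nodes share priority 1, paris gets 2
    synchronous_standby_num = 2                 # wait for one london node plus the paris node

A commit would then wait for a flush confirmation from any one of the two london nodes plus the paris node, rather than from the first two connected senders in absolute priority order.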
On Fri, Sep 12, 2014 at 1:13 AM, Michael Paquier <michael.paquier@gmail.com> wrote:
> OK. I see your point.
>
> Now, what about the following assumptions (somewhat restrictive, to
> facilitate the user experience of setting up syncrep and the
> parametrization of this feature):
> - Nodes are defined within the same set (or group) if they have the
> same priority, aka the same application_name.
> - One node cannot be a part of two sets. That's obvious...

I feel pretty strongly that we should encourage people to use a
different application_name for every server. The fact that a server is
interchangeable for one purpose does not mean that it's interchangeable
for all purposes; let's try to keep application_name as the identifier
for a server, and design the other facilities we need around that.

> The current patch has its own merit, but it fails in the case you and
> Heikki are describing: wait for k nodes in set 1 (nodes with the lowest
> priority value), l nodes in set 2 (nodes with the 2nd-lowest priority
> value), etc.
> What it does is: if for example we have a set of nodes with priorities
> {0,1,1,2,2,3,3}, backends will wait for flush_position from the first
> s_s_num nodes. By setting s_s_num to 3, we'll wait for {0,1,1}; to 4,
> {0,1,1,2}; etc.
>
> Now, what about this: instead of waiting for the nodes in "absolute"
> order the way the current patch does, let's do it in a "relative" way.
> By that I mean that a backend waits for flush_position confirmation
> only from *1* node among a set of nodes having the same priority. So by
> using s_s_num = 3, we'll wait for {0, "one node with 1", "one node with
> 2"}, and you can guess the rest.
>
> The point is as well that we can keep the s_s_num behavior as it is now:
> - if set to -1, we rely on the default behavior of s_s_names (empty
> means all nodes async, at least one entry meaning that we need to wait
> for a node)
> - if set to 0, all nodes are forced to be async'd
> - if set to n > 1, we have to wait for one node in each set of the n
> lowest priority values.
> I'd see enough users happy with those improvements, and that would help
> improve the coverage of the test cases that Heikki and you envisioned.

Sounds confusing.

I hate to be the guy always suggesting a mini-language (cf. recent
discussion of an expression syntax for pgbench), but we could do much
more powerful and flexible things here if we had one. For example,
suppose we let each element of synchronous_standby_names use the
constructs (X,Y,Z,...) [meaning one of the parenthesized servers] and
N(X,Y,Z,...) [meaning N of the parenthesized servers]. Then if you want
to consider a commit acknowledged when you have any two of foo, bar,
and baz, you can write:

synchronous_standby_names = 2(foo,bar,baz)

And if you want to acknowledge when you've got either foo or both bar
and baz, you can write:

synchronous_standby_names = (foo,2(bar,baz))

And if you want one of foo and bar and one of baz and bletch, you can
write:

synchronous_standby_names = 2((foo,bar),(baz,bletch))

The crazy-complicated policy I mentioned upthread would be:

synchronous_standby_names = (a,2((2(b,c),2(d,e)),f))

or (equivalently and more simply)

synchronous_standby_names = (a,3(b,c,f),3(d,e,f))

We could have a rule that we fall back to the next rule in
synchronous_standby_names when the first rule can never be satisfied by
the connected standbys.
For example, if you have foo, bar, and baz, and you want any two of the
three, but wish to prefer waiting for foo over the others when it's
connected, then you could write:

synchronous_standby_names = 2(foo,(bar,baz)), 2(bar,baz)

If foo disconnects, the first rule can never be met, so we use the
second rule. It's still 2 out of 3, just as if we'd written
2(foo,bar,baz), but we won't accept an ack from bar and baz as
sufficient unless foo is dead.

The exact syntax here is of course debatable; maybe somebody can come
up with something better. But it doesn't seem like it would be
incredibly painful to implement, and it would give us a lot of
flexibility.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
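To show how cheap such a grammar is to evaluate, here is a toy, self-contained evaluator for the nested constructs sketched above. It is illustrative only and not part of any patch: the function names and the way acknowledged standbys are passed in are invented, whitespace is not handled, and a real implementation would operate on the WAL sender array in syncrep.c instead.

    #include <ctype.h>
    #include <stdbool.h>
    #include <stdio.h>
    #include <string.h>

    static const char *p;       /* parse cursor */
    static const char **acked;  /* NULL-terminated list of acked standbys */

    static bool
    is_acked(const char *name, size_t len)
    {
        for (int i = 0; acked[i]; i++)
            if (strlen(acked[i]) == len && strncmp(acked[i], name, len) == 0)
                return true;
        return false;
    }

    /* expr := NUM? '(' expr (',' expr)* ')' | NAME; returns 1 if satisfied */
    static int
    parse_expr(void)
    {
        int need = 1;           /* bare parentheses mean "one of" */

        if (isdigit((unsigned char) *p))
        {
            need = 0;
            while (isdigit((unsigned char) *p))
                need = need * 10 + (*p++ - '0');
        }
        if (*p == '(')
        {
            int have = 0;

            p++;
            for (;;)
            {
                have += parse_expr();
                if (*p != ',')
                    break;
                p++;
            }
            if (*p == ')')
                p++;
            return have >= need;
        }
        else
        {
            const char *start = p;

            while (*p && *p != ',' && *p != ')')
                p++;
            return is_acked(start, p - start);
        }
    }

    int
    main(void)
    {
        static const char *acks[] = {"bar", "baz", NULL};

        acked = acks;
        p = "(foo,2(bar,baz))";  /* foo, or both bar and baz */
        printf("satisfied: %s\n", parse_expr() ? "yes" : "no");  /* yes */
        return 0;
    }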
On 9/12/14, 2:28 PM, Robert Haas wrote:
> I hate to be the guy always suggesting a mini-language (cf. recent
> discussion of an expression syntax for pgbench), but we could do much
> more powerful and flexible things here if we had one. For example,
> suppose we let each element of synchronous_standby_names use the
> constructs (X,Y,Z,...)

While I have my old list history hat on this afternoon: when the 9.1
deadline was approaching, I said that some people were not going to be
happy until "is it safe to commit?" calls an arbitrary function that is
passed the names of all the active servers, and then they could plug
whatever consensus rule they wanted into there. And then I said that if
we actually wanted to ship something, it should be some stupid simple
thing like just putting a list of servers in synchronous_standby_names
and proceeding if one is active. One of those two ideas worked out...

Can you make a case for why it needs to be a mini-language instead of a
function?

--
Greg Smith
greg.smith@crunchydatasolutions.com
Chief PostgreSQL Evangelist - http://crunchydatasolutions.com/
On Fri, Sep 12, 2014 at 2:44 PM, Gregory Smith <gregsmithpgsql@gmail.com> wrote:
> On 9/12/14, 2:28 PM, Robert Haas wrote:
>> I hate to be the guy always suggesting a mini-language (cf. recent
>> discussion of an expression syntax for pgbench), but we could do much
>> more powerful and flexible things here if we had one. For example,
>> suppose we let each element of synchronous_standby_names use the
>> constructs (X,Y,Z,...)
>
> While I have my old list history hat on this afternoon: when the 9.1
> deadline was approaching, I said that some people were not going to be
> happy until "is it safe to commit?" calls an arbitrary function that is
> passed the names of all the active servers, and then they could plug
> whatever consensus rule they wanted into there. And then I said that if
> we actually wanted to ship something, it should be some stupid simple
> thing like just putting a list of servers in synchronous_standby_names
> and proceeding if one is active. One of those two ideas worked out...
>
> Can you make a case for why it needs to be a mini-language instead of a
> function?

I think so. If we make it a function, then it's either the kind of
function that you access via pg_proc, or it's the kind that's written
in C and installed by storing a function pointer in a hook variable
from _PG_init(). The first approach is a non-starter because it would
require walsender to be connected to the database where that function
lives, which is a non-starter at least for logical replication, where
we need walsender to be connected to the database being replicated.
Even if we found some way around that problem, and I'm not sure there
is one, I suspect the overhead would be pretty high.

The second approach - a hook that can be accessed directly by loadable
modules - seems like it would work fine; the only problem is that
you've got to write your policy function in C. But I have no issue with
exposing it that way if someone wants to write a patch. There is no joy
in getting between the advanced user and his nutty-complicated sync rep
configuration. However, I suspect that most users will prefer a more
user-friendly interface.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
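For reference, the loadable-module route described here would follow the usual PostgreSQL hook pattern. The sketch below is hypothetical: sync_rep_policy_hook, its signature, and my_sync_policy are all invented; only the module boilerplate and the _PG_init() entry point are real.

    #include "postgres.h"
    #include "fmgr.h"
    #include "nodes/pg_list.h"

    PG_MODULE_MAGIC;

    /* In a contrib module; the hook variable and its signature are invented. */
    static bool
    my_sync_policy(List *acked_standby_names)
    {
        /* Consider the commit safe once any two standbys have acknowledged. */
        return list_length(acked_standby_names) >= 2;
    }

    void
    _PG_init(void)
    {
        sync_rep_policy_hook = my_sync_policy;
    }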
On Sat, Sep 13, 2014 at 3:56 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> I think so. If we make it a function, then it's either the kind of
> function that you access via pg_proc, or it's the kind that's written
> in C and installed by storing a function pointer in a hook variable
> from _PG_init(). The first approach is a non-starter because it would
> require walsender to be connected to the database where that function
> lives, which is a non-starter at least for logical replication, where
> we need walsender to be connected to the database being replicated.
> Even if we found some way around that problem, and I'm not sure there
> is one, I suspect the overhead would be pretty high. The second
> approach - a hook that can be accessed directly by loadable modules -
> seems like it would work fine; the only problem is that you've got to
> write your policy function in C. But I have no issue with exposing it
> that way if someone wants to write a patch. There is no joy in getting
> between the advanced user and his nutty-complicated sync rep
> configuration. However, I suspect that most users will prefer a more
> user-friendly interface.

Reading both your answers, I'd tend to think that having a set of hooks
to satisfy all the potential user requests would be enough. We could
let the server code decide the priority of the standbys using the
information in synchronous_standby_names, then have the hooks interact
with SyncRepReleaseWaiters and pg_stat_get_wal_senders. We would need
two hooks:
- one able to get an array of WAL sender positions defining all the
nodes considered as sync nodes. This is enough for
pg_stat_get_wal_senders. SyncRepReleaseWaiters would need it as well...
- a second able to define the update policy of the write and flush
positions in walsndctl (SYNC_REP_WAIT_FLUSH and SYNC_REP_WAIT_WRITE),
as well as the next failover policy. This would be needed when a WAL
sender calls SyncRepReleaseWaiters.

Perhaps that's overthinking it, but I am getting the impression that
whatever decision is taken, it would involve modifications of the
sync-standby parametrization at the GUC level, and whatever the path
chosen (dedicated language, set of integer params), there will be
complaints about what things can or cannot be done. At least a set of
hooks has the merit to say: do what you like with your synchronous node
policy.

--
Michael
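As a sketch, declarations for those two hooks could look like the following; the type names, parameters, and hook variables are invented for illustration and match nothing in the tree:

    /*
     * Hook 1: fill sync_positions with the slots in WalSndCtl of the nodes
     * currently considered synchronous; returns how many were found.
     */
    typedef int (*SyncRepGetSyncNodes_hook_type) (int *sync_positions, int max_nodes);
    extern PGDLLIMPORT SyncRepGetSyncNodes_hook_type SyncRepGetSyncNodes_hook;

    /*
     * Hook 2: decide how the write and flush positions tracked in walsndctl
     * (SYNC_REP_WAIT_WRITE and SYNC_REP_WAIT_FLUSH) advance when a WAL
     * sender reports progress.
     */
    typedef void (*SyncRepUpdatePolicy_hook_type) (XLogRecPtr writePtr, XLogRecPtr flushPtr);
    extern PGDLLIMPORT SyncRepUpdatePolicy_hook_type SyncRepUpdatePolicy_hook;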
On Mon, Sep 15, 2014 at 3:00 PM, Michael Paquier <michael.paquier@gmail.com> wrote:
> At least a set of hooks has the merit to say: do what you like with
> your synchronous node policy.

Sure. I dunno if people will find that terribly user-friendly, so we
might not want that to be the ONLY thing we offer. But even if it is,
it is certainly better than a poke in the eye with a sharp stick.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Tue, Sep 16, 2014 at 5:25 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Mon, Sep 15, 2014 at 3:00 PM, Michael Paquier
> <michael.paquier@gmail.com> wrote:
>> At least a set of hooks has the merit to say: do what you like with
>> your synchronous node policy.
>
> Sure. I dunno if people will find that terribly user-friendly, so we
> might not want that to be the ONLY thing we offer.

Well, a user-friendly interface is actually the reason why a simple
integer GUC was used in the first series of patches on this thread to
set as sync the N nodes with the lowest priority. I could not come up
with anything simpler. Hence, what about doing the following:
- A patch refactoring the code for pg_stat_get_wal_senders and
SyncRepReleaseWaiters, as both duplicate code in this area to select
the synchronous node as the one connected with the lowest priority.
- A patch defining the necessary hooks; I suspect that two of them are
needed, as mentioned upthread.
- A patch for a contrib module implementing an example of a simple
policy. It can be a fancy thing with a custom language or something
simpler.
Thoughts? Patch 1, refactoring the code, is a win in all cases.

Regards,
--
Michael
On Tue, Sep 16, 2014 at 2:19 PM, Michael Paquier <michael.paquier@gmail.com> wrote:
> On Tue, Sep 16, 2014 at 5:25 AM, Robert Haas <robertmhaas@gmail.com> wrote:
>> On Mon, Sep 15, 2014 at 3:00 PM, Michael Paquier
>> <michael.paquier@gmail.com> wrote:
>>> At least a set of hooks has the merit to say: do what you like with
>>> your synchronous node policy.
>>
>> Sure. I dunno if people will find that terribly user-friendly, so we
>> might not want that to be the ONLY thing we offer.
>
> Well, a user-friendly interface is actually the reason why a simple
> integer GUC was used in the first series of patches on this thread to
> set as sync the N nodes with the lowest priority. I could not come up
> with anything simpler. Hence, what about doing the following:
> - A patch refactoring the code for pg_stat_get_wal_senders and
> SyncRepReleaseWaiters, as both duplicate code in this area to select
> the synchronous node as the one connected with the lowest priority.

A strong +1 for this idea. I have never liked that, and cleaning it up
seems eminently sensible.

> - A patch defining the necessary hooks; I suspect that two of them are
> needed, as mentioned upthread.
> - A patch for a contrib module implementing an example of a simple
> policy. It can be a fancy thing with a custom language or something
> simpler.

I'm less convinced about this part. There's a big usability gap between
a GUC and a hook, and I think Heikki's comments upthread were meant to
suggest that even in GUC-land we can probably satisfy more use cases
than what this patch does now. I think that's right.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Fri, Sep 19, 2014 at 12:18 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Tue, Sep 16, 2014 at 2:19 PM, Michael Paquier
> <michael.paquier@gmail.com> wrote:
>> On Tue, Sep 16, 2014 at 5:25 AM, Robert Haas <robertmhaas@gmail.com> wrote:
>>> On Mon, Sep 15, 2014 at 3:00 PM, Michael Paquier
>>> <michael.paquier@gmail.com> wrote:
>>>> At least a set of hooks has the merit to say: do what you like with
>>>> your synchronous node policy.
>>>
>>> Sure. I dunno if people will find that terribly user-friendly, so we
>>> might not want that to be the ONLY thing we offer.
>> Well, a user-friendly interface is actually the reason why a simple
>> integer GUC was used in the first series of patches on this thread to
>> set as sync the N nodes with the lowest priority. I could not come up
>> with anything simpler. Hence, what about doing the following:
>> - A patch refactoring the code for pg_stat_get_wal_senders and
>> SyncRepReleaseWaiters, as both duplicate code in this area to select
>> the synchronous node as the one connected with the lowest priority.
>
> A strong +1 for this idea. I have never liked that, and cleaning it up
> seems eminently sensible.

Interestingly, the syncrep code has in some of its code paths the idea
that a synchronous node is unique, while other code paths assume that
there can be multiple synchronous nodes. If that's fine, I think it
would be better to just make the code multiple-sync-node aware, by
having a single function call in walsender.c and syncrep.c that returns
an integer array of WAL sender positions (WalSndCtl), as that seems
more extensible long-term. Well, for now the array would have a single
element, being the WAL sender with the lowest priority > 0. Feel free
to protest about that approach though :)

>> - A patch defining the necessary hooks; I suspect that two of them are
>> needed, as mentioned upthread.
>> - A patch for a contrib module implementing an example of a simple
>> policy. It can be a fancy thing with a custom language or something
>> simpler.
>
> I'm less convinced about this part. There's a big usability gap between
> a GUC and a hook, and I think Heikki's comments upthread were meant to
> suggest that even in GUC-land we can probably satisfy more use cases
> than what this patch does now. I think that's right.

Hehe. OK, then let's see how something with a GUC would go. No existing
parameter uses a custom language as its format, but I guess it is fine
to have a text parameter with a validation callback. No?

--
Michael
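For what it's worth, the validation-callback machinery mentioned here already exists for string GUCs; a check hook has the shape below. The function name and the parsing body are illustrative only - nothing like this exists for synchronous_standby_names today.

    static bool
    check_synchronous_standby_names(char **newval, void **extra, GucSource source)
    {
        /*
         * Parse *newval according to the custom format here; on a syntax
         * error, report details with GUC_check_errdetail() and return
         * false so that the new setting is rejected.
         */
        return true;
    }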