Thread: Support for N synchronous standby servers
Hi all,

Please find attached a patch to add support of synchronous replication for multiple standby servers. This is controlled by the addition of a new GUC parameter called synchronous_standby_num, which makes the server wait for transaction commit on the first N standbys defined in synchronous_standby_names. The implementation is really straightforward, and has needed only a couple of modifications in walsender.c for pg_stat_get_wal_senders and in syncrep.c.

When a commit is cancelled manually by the user or when ProcDiePending shows up, the message returned to the user does not show the list of walsenders on which the commit has not been confirmed, as it was only partially confirmed. I have not done anything for that, but let me know if it would be useful; it would need a scan of the walsenders to get their application_name.

Thanks,
--
Michael
Attachment
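A minimal postgresql.conf sketch of the behavior just described (the standby names here are placeholders, not taken from the patch):

# Commit waits for confirmation from the first 2 standbys listed below.
synchronous_standby_names = 'node_a,node_b,node_c'   # placeholder names
synchronous_standby_num = 2                          # GUC proposed by this patch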
On Sat, Aug 9, 2014 at 3:03 PM, Michael Paquier <michael.paquier@gmail.com> wrote:
> Hi all,
>
> Please find attached a patch to add support of synchronous replication
> for multiple standby servers. This is controlled by the addition of a
> new GUC parameter called synchronous_standby_num, which makes the server
> wait for transaction commit on the first N standbys defined in
> synchronous_standby_names. The implementation is really
> straightforward, and has needed only a couple of modifications in
> walsender.c for pg_stat_get_wal_senders and in syncrep.c.

Great! This is really a feature I want. Though I forget why we left this feature out when we added synchronous replication; it may be worth reading the old discussion, which may point out the potential problems of N sync standbys.

I just tested this feature with synchronous_standby_num = 2. I started up only one synchronous standby and ran a write transaction. The transaction completed successfully, i.e., it didn't wait for two standbys. Probably this is a bug in the patch.

And you forgot to add the line for synchronous_standby_num to postgresql.conf.sample.

Regards,

--
Fujii Masao
On Mon, Aug 11, 2014 at 1:31 AM, Fujii Masao <masao.fujii@gmail.com> wrote:
> Great! This is really a feature I want. Though I forget why we left
> this feature out when we added synchronous replication; it may be
> worth reading the old discussion, which may point out the potential
> problems of N sync standbys.

Sure, I'll double check. Thanks for your comments.

> I just tested this feature with synchronous_standby_num = 2. I started
> up only one synchronous standby and ran a write transaction. The
> transaction completed successfully, i.e., it didn't wait for two
> standbys. Probably this is a bug in the patch.

Oh OK, yes this is a bug in what I did. The number of standbys to wait for should take precedence over the number of standbys found in the list of active WAL senders. I changed the patch to take that behavior into account. So for example if you have only one sync standby connected and synchronous_standby_num = 2, the client waits indefinitely.

> And you forgot to add the line for synchronous_standby_num to
> postgresql.conf.sample.

Yep, right. On top of that, I refactored the code in such a way that pg_stat_get_wal_senders and SyncRepReleaseWaiters rely on a single API to get the list of synchronous standbys found. This reduces code duplication, duplication that already exists in HEAD...

Regards,
--
Michael
Attachment
On Mon, Aug 11, 2014 at 11:54 AM, Michael Paquier <michael.paquier@gmail.com> wrote:
> Oh OK, yes this is a bug in what I did. The number of standbys to wait
> for should take precedence over the number of standbys found in the
> list of active WAL senders. I changed the patch to take that behavior
> into account. So for example if you have only one sync standby
> connected and synchronous_standby_num = 2, the client waits
> indefinitely.

Thanks for updating the patch! Again I tested the feature and found something wrong. I set synchronous_standby_num to 2 and started three standbys. Two of them are included in synchronous_standby_names, i.e., they are synchronous standbys. That is, the other standby is always asynchronous. When I shut down one of the synchronous standbys and executed a write transaction, the transaction completed successfully. So the transaction doesn't wait for two sync standbys in that case. Probably this is a bug.

Regards,

--
Fujii Masao
On Mon, Aug 11, 2014 at 1:26 PM, Fujii Masao <masao.fujii@gmail.com> wrote:
> Thanks for updating the patch! Again I tested the feature and found
> something wrong. I set synchronous_standby_num to 2 and started three
> standbys. Two of them are included in synchronous_standby_names, i.e.,
> they are synchronous standbys. That is, the other standby is always
> asynchronous. When I shut down one of the synchronous standbys and
> executed a write transaction, the transaction completed successfully.
> So the transaction doesn't wait for two sync standbys in that case.
> Probably this is a bug.
Well, that's working in my case :)
Please see below with 4 nodes: 1 master and 3 standbys on the same host. The master listens on 5432, the other nodes on 5433, 5434 and 5435. Each standby's application_name is node_$PORT.
=# show synchronous_standby_names ;
synchronous_standby_names
---------------------------
node_5433,node_5434
(1 row)
=# show synchronous_standby_num ;
synchronous_standby_num
-------------------------
2
(1 row)
=# SELECT application_name,
pg_xlog_location_diff(sent_location, flush_location) AS replay_delta,
sync_priority,
sync_state
FROM pg_stat_replication ORDER BY replay_delta ASC, application_name;
 application_name | replay_delta | sync_priority | sync_state
------------------+--------------+---------------+------------
 node_5433        |            0 |             1 | sync
 node_5434        |            0 |             2 | sync
 node_5435        |            0 |             0 | async
(3 rows)
=# create table aa (a int);
CREATE TABLE
[...]
-- Stopped node with port 5433:
[...]
=# SELECT application_name,
pg_xlog_location_diff(sent_location, flush_location) AS replay_delta,
sync_priority,
sync_state
FROM pg_stat_replication ORDER BY replay_delta ASC, application_name;
 application_name | replay_delta | sync_priority | sync_state
------------------+--------------+---------------+------------
 node_5434        |            0 |             2 | sync
 node_5435        |            0 |             0 | async
(2 rows)
=# create table ab (a int);
^CCancel request sent
WARNING: 01000: canceling wait for synchronous replication due to user request
DETAIL: The transaction has already committed locally, but might not have been replicated to the standby(s).
LOCATION: SyncRepWaitForLSN, syncrep.c:227
CREATE TABLE
--
Michael
On Mon, Aug 11, 2014 at 2:10 PM, Michael Paquier <michael.paquier@gmail.com> wrote:
> Well, that's working in my case :)

Oh, it worked on my machine too, this time... I did something wrong. Sorry for the noise.

Regards,

--
Fujii Masao
On Mon, Aug 11, 2014 at 4:26 PM, Fujii Masao <masao.fujii@gmail.com> wrote:
> Oh, it worked on my machine too, this time... I did something wrong.
> Sorry for the noise.

No problem, thanks for spending time testing.
--
Michael
On Mon, Aug 11, 2014 at 4:38 PM, Michael Paquier <michael.paquier@gmail.com> wrote:
> No problem, thanks for spending time testing.

I probably hit a similar but distinct problem. I set synchronous_standby_num to 2 and started up two synchronous standbys. When I ran write transactions, they completed successfully. That's OK.

I then sent the SIGSTOP signal to the walreceiver process in one of the sync standbys, and ran write transactions again. In this case, they must not complete, because their WAL cannot be replicated to the standby whose walreceiver was stopped. But they completed successfully.

Regards,

--
Fujii Masao
On Wed, Aug 13, 2014 at 2:10 PM, Fujii Masao <masao.fujii@gmail.com> wrote:
> I sent the SIGSTOP signal to the walreceiver process in one of the sync
> standbys, and ran write transactions again. In this case, they must not
> complete, because their WAL cannot be replicated to the standby whose
> walreceiver was stopped. But they completed successfully.

At the end of SyncRepReleaseWaiters, SYNC_REP_WAIT_WRITE and SYNC_REP_WAIT_FLUSH in walsndctl could be updated with only one WAL sender in sync, waking up backends even if the other standbys had not caught up. But we need to scan all the synchronous WAL senders, find the minimum write and flush positions, and update walsndctl with those values. Well, that's a code path I forgot to cover.

Attached is an updated patch fixing the problem you reported.

Regards,
--
Michael
Attachment
On Wed, Aug 13, 2014 at 4:10 PM, Michael Paquier <michael.paquier@gmail.com> wrote:
> At the end of SyncRepReleaseWaiters, SYNC_REP_WAIT_WRITE and
> SYNC_REP_WAIT_FLUSH in walsndctl could be updated with only one WAL
> sender in sync, waking up backends even if the other standbys had not
> caught up. But we need to scan all the synchronous WAL senders, find
> the minimum write and flush positions, and update walsndctl with those
> values. Well, that's a code path I forgot to cover.
>
> Attached is an updated patch fixing the problem you reported.

+ At any one time there will be at a number of active synchronous standbys
+ defined by <varname>synchronous_standby_num</>; transactions waiting

It's better to use <xref linkend="guc-synchronous-standby-num">, instead.

+ for commit will be allowed to proceed after those standby servers
+ confirms receipt of their data. The synchronous standbys will be

Typo: confirms -> confirm

+ <para>
+ Specifies the number of standbys that support
+ <firstterm>synchronous replication</>, as described in
+ <xref linkend="synchronous-replication">, and listed as the first
+ elements of <xref linkend="guc-synchronous-standby-names">.
+ </para>
+ <para>
+ Default value is 1.
+ </para>

synchronous_standby_num is defined with PGC_SIGHUP. So the following should be added to the document:

This parameter can only be set in the postgresql.conf file or on the server command line.

The name of the parameter "synchronous_standby_num" sounds to me like the transaction must wait for its WAL to be replicated to s_s_num standbys. But that's not true in your patch. If s_s_names is empty, replication works asynchronously whatever the value of s_s_num is. I'm afraid that it's confusing.

The description of s_s_num is not sufficient. I'm afraid that users can easily misunderstand that they can use a quorum commit feature by using s_s_names and s_s_num, i.e., that the transaction waits for its WAL to be replicated to any s_s_num standbys listed in s_s_names.

When s_s_num is set to a value larger than max_wal_senders, shouldn't we warn about that?

+ for (i = 0; i < num_sync; i++)
+ {
+     volatile WalSnd *walsndloc = &WalSndCtl->walsnds[sync_standbys[i]];
+
+     if (min_write_pos > walsndloc->write)
+         min_write_pos = walsndloc->write;
+     if (min_flush_pos > walsndloc->flush)
+         min_flush_pos = walsndloc->flush;
+ }

I don't think that it's safe to read those shared values without a spinlock.

Regards,

--
Fujii Masao
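A minimal sketch of the spinlock-protected variant being asked for here, reusing the variable names from the quoted hunk and the locking pattern walsender.c uses elsewhere; min_write_pos and min_flush_pos are assumed initialized by the caller, and this is an illustration, not the patch's final code:

for (i = 0; i < num_sync; i++)
{
    volatile WalSnd *walsndloc = &WalSndCtl->walsnds[sync_standbys[i]];
    XLogRecPtr  write_pos;
    XLogRecPtr  flush_pos;

    /* copy the shared positions while holding the walsender's spinlock */
    SpinLockAcquire(&walsndloc->mutex);
    write_pos = walsndloc->write;
    flush_pos = walsndloc->flush;
    SpinLockRelease(&walsndloc->mutex);

    if (min_write_pos > write_pos)
        min_write_pos = write_pos;
    if (min_flush_pos > flush_pos)
        min_flush_pos = flush_pos;
}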
On Thu, Aug 14, 2014 at 8:34 PM, Fujii Masao <masao.fujii@gmail.com> wrote:
> + At any one time there will be at a number of active synchronous standbys
> + defined by <varname>synchronous_standby_num</>; transactions waiting
>
> It's better to use <xref linkend="guc-synchronous-standby-num">, instead.

Fixed.

> + for commit will be allowed to proceed after those standby servers
> + confirms receipt of their data. The synchronous standbys will be
>
> Typo: confirms -> confirm

Fixed.

> synchronous_standby_num is defined with PGC_SIGHUP. So the following
> should be added to the document:
>
> This parameter can only be set in the postgresql.conf file or on
> the server command line.

Fixed.

> The name of the parameter "synchronous_standby_num" sounds to me like
> the transaction must wait for its WAL to be replicated to s_s_num
> standbys. But that's not true in your patch. If s_s_names is empty,
> replication works asynchronously whatever the value of s_s_num is. I'm
> afraid that it's confusing.
>
> The description of s_s_num is not sufficient. I'm afraid that users can
> easily misunderstand that they can use a quorum commit feature by using
> s_s_names and s_s_num, i.e., that the transaction waits for its WAL to
> be replicated to any s_s_num standbys listed in s_s_names.

I reworked the docs to mention all that. Yes, things are a bit different from any quorum commit facility (how to parametrize that simply without a parameter mapping one-to-one to the items of s_s_names?), as this facility relies on the order of the items of s_s_names and on which standbys are connected at a given time.

> When s_s_num is set to a value larger than max_wal_senders, shouldn't
> we warn about that?

Actually I have done a bit more than that by forbidding setting s_s_num to a value higher than max_wal_senders. Thoughts?

Now that we discuss the interactions with other parameters, another thing that I am wondering about is: what should we do if we set s_s_num to a number higher than the number of elements in s_s_names? Currently, the patch gives the priority to s_s_num; in short, if we set s_s_num to 100, the server will wait for 100 servers to confirm commit even if there are fewer than 100 elements in s_s_names. I chose this way because it looks saner, particularly if s_s_names = '*'. Thoughts once again?

> + for (i = 0; i < num_sync; i++)
> + {
> +     volatile WalSnd *walsndloc = &WalSndCtl->walsnds[sync_standbys[i]];
> +
> +     if (min_write_pos > walsndloc->write)
> +         min_write_pos = walsndloc->write;
> +     if (min_flush_pos > walsndloc->flush)
> +         min_flush_pos = walsndloc->flush;
> + }
>
> I don't think that it's safe to read those shared values without a
> spinlock.

Looking at walsender.c you are right. I have updated the code to take the mutex lock of the walsender whose values are being read.

Regards,
--
Michael
Attachment
On Fri, Aug 15, 2014 at 4:05 PM, Michael Paquier <michael.paquier@gmail.com> wrote:
>> When s_s_num is set to a value larger than max_wal_senders, shouldn't
>> we warn about that?
> Actually I have done a bit more than that by forbidding setting
> s_s_num to a value higher than max_wal_senders. Thoughts?

You added check_synchronous_standby_num() as the GUC check function for synchronous_standby_num, and checked that there. But that seems to be wrong. You can easily see the following error messages even if synchronous_standby_num is smaller than max_wal_senders. The point is that synchronous_standby_num should be located before max_wal_senders in postgresql.conf.

LOG: invalid value for parameter "synchronous_standby_num": 0
DETAIL: synchronous_standby_num cannot be higher than max_wal_senders.

> Now that we discuss the interactions with other parameters, another
> thing that I am wondering about is: what should we do if we set
> s_s_num to a number higher than the number of elements in s_s_names?
> Currently, the patch gives the priority to s_s_num; in short, if we set
> s_s_num to 100, the server will wait for 100 servers to confirm commit
> even if there are fewer than 100 elements in s_s_names. I chose this
> way because it looks saner, particularly if s_s_names = '*'. Thoughts
> once again?

I'm fine with this. As you gave an example, the number of entries in s_s_names can be smaller than the number of actual active sync standbys. For example, when s_s_names is set to 'hoge', more than one standby with the name 'hoge' can connect to the server in sync mode.

I still think that it's strange that replication can be async even when s_s_num is larger than zero. That is, I think that the transaction must wait for s_s_num sync standbys whether s_s_names is empty or not. OTOH, if s_s_num is zero, replication must be async whether s_s_names is empty or not. At least for me, it's intuitive to use s_s_num primarily to control the sync mode. Of course, other hackers may have different thoughts, so we need to keep our ears open for them.

In the above design, one problem is that the number of parameters that those who want to set up only one sync standby need to change is incremented by one. That is, they need to change s_s_num additionally. If we are really concerned about this, we can treat a value of -1 in s_s_num as a special value, which allows us to control sync replication only by s_s_names as we do now. That is, if s_s_names is empty, replication would be async. Otherwise, only the standby with the highest priority in s_s_names becomes the sync one. Probably the default of s_s_num should be -1. Thoughts?

The source code comments at the top of syncrep.c need to be updated. It's worth checking whether there are other comments to be updated.

Regards,

--
Fujii Masao
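To make the ordering problem concrete: a GUC check hook that validates against another GUC sees whatever value that other GUC happens to hold at that point of postgresql.conf processing, so the result depends on file order. A sketch of the problematic pattern (the hook body is a guess at the patch's code, not a copy of it):

static bool
check_synchronous_standby_num(int *newval, void **extra, GucSource source)
{
    /*
     * Unreliable: if synchronous_standby_num appears before
     * max_wal_senders in postgresql.conf, max_wal_senders still holds
     * its previous value here, so this check can fire spuriously.
     */
    if (*newval > max_wal_senders)
    {
        GUC_check_errdetail("synchronous_standby_num cannot be higher than max_wal_senders.");
        return false;
    }
    return true;
}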
On Fri, Aug 15, 2014 at 8:28 AM, Fujii Masao <masao.fujii@gmail.com> wrote:
>> Currently, the patch gives the priority to s_s_num; in short, if we
>> set s_s_num to 100, the server will wait for 100 servers to confirm
>> commit even if there are fewer than 100 elements in s_s_names. I chose
>> this way because it looks saner, particularly if s_s_names = '*'.
>> Thoughts once again?
>
> I'm fine with this. As you gave an example, the number of entries in
> s_s_names can be smaller than the number of actual active sync
> standbys. For example, when s_s_names is set to 'hoge', more than one
> standby with the name 'hoge' can connect to the server in sync mode.

This is a bit tricky. Suppose there is one standby connected which has reached the relevant WAL position. We then lose that connection, and a new standby connects. When or if the second standby is known to have reached the relevant WAL position, can we release waiters? It depends. If the old and new connections are to two different standbys that happen to have the same name, yes. But if it's the same standby reconnecting, then no.

> I still think that it's strange that replication can be async even when
> s_s_num is larger than zero. That is, I think that the transaction must
> wait for s_s_num sync standbys whether s_s_names is empty or not.

+1.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Fri, Aug 15, 2014 at 9:28 PM, Fujii Masao <masao.fujii@gmail.com> wrote:
> You added check_synchronous_standby_num() as the GUC check function for
> synchronous_standby_num, and checked that there. But that seems to be wrong.
> You can easily see the following error messages even if synchronous_standby_num
> is smaller than max_wal_senders. The point is that synchronous_standby_num
> should be located before max_wal_senders in postgresql.conf.
>
> LOG: invalid value for parameter "synchronous_standby_num": 0
> DETAIL: synchronous_standby_num cannot be higher than max_wal_senders.
I am not sure what I can do here, so I am removing this check from the code and simply adding a note in the docs that a value of _num higher than max_wal_senders does not have much meaning.
> I still think that it's strange that replication can be async even when
> s_s_num is larger than zero. That is, I think that the transaction must
> wait for s_s_num sync standbys whether s_s_names is empty or not.
> OTOH, if s_s_num is zero, replication must be async whether s_s_names
> is empty or not. At least for me, it's intuitive to use s_s_num primarily
> to control the sync mode. Of course, other hackers may have different
> thoughts, so we need to keep our ears open for them.
Sure, the compromise looks to be what you propose, and I am fine with that.
> In the above design, one problem is that the number of parameters
> that those who want to set up only one sync standby need to change is
> incremented by one. That is, they need to change s_s_num additionally.
> If we are really concerned about this, we can treat a value of -1 in
> s_s_num as a special value, which allows us to control sync replication
> only by s_s_names as we do now. That is, if s_s_names is empty,
> replication would be async. Otherwise, only the standby with the
> highest priority in s_s_names becomes the sync one. Probably the
> default of s_s_num should be -1. Thoughts?
Taking into account those comments, attached is a patch doing the following things depending on the values of _num and _names:
- If _num = -1 and _names is empty, all the standbys are considered as async (same behavior as 9.1~, and default).
- If _num = -1 and _names has at least one item, wait for one standby, even if it is not connected at the time of commit. If one node is found as sync, other standbys listed in _names with higher priority than the sync one are in potential state (same as existing behavior).
- If _num = 0, all the standbys are async, whatever the values in _names. Priority is enforced to 0 for all the standbys. SyncStandbysDefined is set to false in this case.
- If _num > 0, the server must wait for _num standbys, whatever the values in _names.
The default value of _num is -1. Documentation has been updated accordingly.
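A compact sketch of the commit-wait decision these rules imply (a hypothetical helper for illustration, not the patch's actual function; SyncStandbysDefined() is the existing check on s_s_names):

/* Hypothetical summary of the rules above, not code from the patch. */
static bool
SyncRepCommitNeedsWait(void)
{
    if (synchronous_standby_num == 0)
        return false;                   /* all standbys async, _names ignored */
    if (synchronous_standby_num == -1)
        return SyncStandbysDefined();   /* legacy behavior driven by _names */
    return true;                        /* _num > 0: wait for _num standbys */
}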
> The source code comments at the top of syncrep.c need to be updated.
> It's worth checking whether there are other comments to be updated.
Done. I have updated some comments in other places than the header.
Regards,
--
Michael
Attachment
On 09 August 2014 11:33, Michael Paquier Wrote:
> Please find attached a patch to add support of synchronous replication
> for multiple standby servers. This is controlled by the addition of a
> new GUC parameter called synchronous_standby_num, which makes the server
> wait for transaction commit on the first N standbys defined in
> synchronous_standby_names. The implementation is really
> straightforward, and has needed only a couple of modifications in
> walsender.c for pg_stat_get_wal_senders and in syncrep.c.

I have just started looking into this patch. Please find below my first level of observations from the patch:

1. Allocation of memory for sync_nodes in function SyncRepGetSynchronousNodes should be equivalent to allowed_sync_nodes instead of max_wal_senders, as anyway we are not going to store more sync standbys than allowed_sync_nodes.
sync_nodes = (int *) palloc(allowed_sync_nodes * sizeof(int));

2. The logic for deciding the highest-priority nodes seems to be incorrect.
Assume s_s_num = 3, s_s_names = 3,4,2,1,
and standby nodes in the order 1,2,3,4,5,6,7.
As per the logic in the patch, node 4 with priority 2 will not be added to the list, whereas 1,2,3 will be added.
The problem is that the priority tracked for the next comparison is not the highest priority seen so far in that iteration, just the priority of the last node added to the list. So it may happen that a node with higher priority is left out while we are comparing against some other, smaller priority.

3. Can we optimize the function SyncRepGetSynchronousNodes in such a way that it gets the number of standby nodes from s_s_names itself? With this it will be useful to stop scanning the moment we get the first s_s_num potential standbys.

Thanks and Regards,
Kumar Rajeev Rastogi
On Fri, Aug 22, 2014 at 7:14 PM, Rajeev rastogi <rajeev.rastogi@huawei.com> wrote:
> I have just started looking into this patch.
> Please find below my first level of observations from the patch:
Thanks! Updated patch attached.
> 1. Allocation of memory for sync_nodes in function SyncRepGetSynchronousNodes should be equivalent to allowed_sync_nodes instead of max_wal_senders, as anyway we are not going to store more sync standbys than allowed_sync_nodes.
> sync_nodes = (int *) palloc(allowed_sync_nodes * sizeof(int));
Fixed.
> 2. The logic for deciding the highest-priority nodes seems to be incorrect.
> Assume s_s_num = 3, s_s_names = 3,4,2,1,
> and standby nodes in the order 1,2,3,4,5,6,7.
> As per the logic in the patch, node 4 with priority 2 will not be added to the list, whereas 1,2,3 will be added.
> The problem is that the priority tracked for the next comparison is not the highest priority seen so far in that iteration, just the priority of the last node added to the list. So it may happen that a node with higher priority is left out while we are comparing against some other, smaller priority.
Fixed. Nice catch!
> 3. Can we optimize the function SyncRepGetSynchronousNodes in such a way that it gets the number of standby nodes from s_s_names itself? With this it will be useful to stop scanning the moment we get the first s_s_num potential standbys.
By doing so, we would need to scan the WAL sender array more than once (or once, if we can find N sync nodes with a name matching the first entry, something unlikely to happen). We would also need to recalculate, for a given item in the _names list, what its priority is and compare it with the existing entries in the WAL sender list. So this is not worth the shot.
Also, using the priority instead of s_s_names is more solid as s_s_names is now used only in SyncRepGetStandbyPriority to calculate the priority for a given WAL sender, and is a function only called by a WAL sender itself when it initializes.
Regards,
Michael
Attachment
On Fri, Aug 22, 2014 at 11:42 PM, Michael Paquier <michael.paquier@gmail.com> wrote:
>> 2. The logic for deciding the highest-priority nodes seems to be incorrect.
>> Assume s_s_num = 3, s_s_names = 3,4,2,1,
>> and standby nodes in the order 1,2,3,4,5,6,7.
>> [...]
>
> Fixed. Nice catch!

Actually, by re-reading the code I wrote yesterday, I found that the fix for that in v6 is not correct. It is really fixed in the attached v7.

Regards,
--
Michael
Attachment
On 23 August 2014 11:22, Michael Paquier Wrote:
> >> 2. The logic for deciding the highest-priority nodes seems to be incorrect.
> >> Assume s_s_num = 3, s_s_names = 3,4,2,1,
> >> and standby nodes in the order 1,2,3,4,5,6,7.
> >> [...]
> >
> > Fixed. Nice catch!
>
> Actually, by re-reading the code I wrote yesterday, I found that the fix
> for that in v6 is not correct. It is really fixed in the attached v7.

I have done some more review, below are my comments:

1. There are currently two loops on *num_sync. Can we simplify the function SyncRepGetSynchronousNodes by moving the priority calculation inside the upper loop?
if (*num_sync == allowed_sync_nodes)
{
    for (j = 0; j < *num_sync; j++)
    {
Anyway we require priority only if the *num_sync == allowed_sync_nodes condition matches. So in this loop itself, we can calculate the priority as well as the assignment of new standbys with lower priority. Let me know if you see any issue with this.

2. Comment inside the function SyncRepReleaseWaiters:
/*
 * We should have found ourselves at least, except if it is not expected
 * to find any synchronous nodes.
 */
Assert(num_sync > 0);
I think "except if it is not expected to find any synchronous nodes" is not required, as reaching this point means at least this node is synchronous.

3. The documentation says that s_s_num should be less than max_wal_senders, but code-wise there is no protection for the same. IMHO, s_s_num should be less than or equal to max_wal_senders, otherwise COMMIT will never return to the console without any knowledge of the user. I see that some discussion has happened regarding this, but I think just adding documentation for this is not enough. I am not sure what issue was observed in adding a check during GUC initialization, but if there is an unavoidable issue during GUC initialization then can't we try to add the check at a later point?

4. Similarly for the interaction between the parameters s_s_names and s_s_num: I see some discussion has happened regarding this, and it is acceptable to have s_s_num greater than s_s_names. But I was thinking we should at least give some notice message to the user for such a case, along with some documentation.

config.sgml

5. "At any one time there will be at a number of active synchronous standbys": this sentence is not proper.

6. When this parameter is set to <literal>0</>, all the standby nodes will be considered as asynchronous.
Can we make this:
When this parameter is set to <literal>0</>, all the standby nodes will be considered as asynchronous irrespective of the value of synchronous_standby_names.

7. Are considered as synchronous the first elements of <xref linkend="guc-synchronous-standby-names"> in number of <xref linkend="guc-synchronous-standby-num"> that are connected.
The start of this sentence looks to be incomplete.

high-availability.sgml

8. Standbys listed after this will take over the role of synchronous standby if the first one should fail.
Should we not make it:
Standbys listed after this will take over the role of synchronous standby if any of the first synchronous-standby-num standbys fails.

Let me know in case something is not clear.

Thanks and Regards,
Kumar Rajeev Rastogi
On Wed, Aug 27, 2014 at 2:46 PM, Rajeev rastogi <rajeev.rastogi@huawei.com> wrote:
> I have done some more review, below are my comments:

Thanks!

> 1. There are currently two loops on *num_sync. Can we simplify the
> function SyncRepGetSynchronousNodes by moving the priority calculation
> inside the upper loop?
> if (*num_sync == allowed_sync_nodes)
> {
>     for (j = 0; j < *num_sync; j++)
>     {
> Anyway we require priority only if the *num_sync == allowed_sync_nodes
> condition matches. So in this loop itself, we can calculate the
> priority as well as the assignment of new standbys with lower priority.
> Let me know if you see any issue with this.

OK, I see; yes this can minimize processing a bit, so I refactored the code by integrating the second loop into the first. This needed the removal of the break portion, as we need to find the highest priority value among the nodes already determined as synchronous.

> 2. Comment inside the function SyncRepReleaseWaiters:
> /*
>  * We should have found ourselves at least, except if it is not expected
>  * to find any synchronous nodes.
>  */
> Assert(num_sync > 0);
>
> I think "except if it is not expected to find any synchronous nodes" is
> not required, as reaching this point means at least this node is
> synchronous.

Yes, removed.

> 3. The documentation says that s_s_num should be less than
> max_wal_senders, but code-wise there is no protection for the same.
> IMHO, s_s_num should be less than or equal to max_wal_senders,
> otherwise COMMIT will never return to the console without any knowledge
> of the user. I see that some discussion has happened regarding this,
> but I think just adding documentation for this is not enough. I am not
> sure what issue was observed in adding a check during GUC
> initialization, but if there is an unavoidable issue during GUC
> initialization then can't we try to add the check at a later point?

The trick here is that you cannot really return a warning at GUC loading level to the user, as a warning could easily be triggered if for example s_s_num is present before max_wal_senders in postgresql.conf. I am open to any solutions if there are any (like an error when initializing WAL senders?!). Documentation seems enough to me to warn the user.

> 4. Similarly for the interaction between the parameters s_s_names and
> s_s_num: I see some discussion has happened regarding this, and it is
> acceptable to have s_s_num greater than s_s_names. But I was thinking
> we should at least give some notice message to the user for such a
> case, along with some documentation.

Done. I added the following to the paragraph "Server will wait":
Hence it is recommended to not set <varname>synchronous_standby_num</> to a value higher than the number of elements in <varname>synchronous_standby_names</>.

> 5. "At any one time there will be at a number of active synchronous
> standbys": this sentence is not proper.

What about this: "At any one time there can be a number of active synchronous standbys up to the number defined by <xref linkend="guc-synchronous-standby-num">"?

> 6. When this parameter is set to <literal>0</>, all the standby
> nodes will be considered as asynchronous.
>
> Can we make this:
> When this parameter is set to <literal>0</>, all the standby
> nodes will be considered as asynchronous irrespective of the value of
> synchronous_standby_names.

Done. This seems proper for the user, as we do not care at all about s_s_names if _num = 0.

> 7. Are considered as synchronous the first elements of
> <xref linkend="guc-synchronous-standby-names"> in number of
> <xref linkend="guc-synchronous-standby-num"> that are connected.
>
> The start of this sentence looks to be incomplete.

OK, I reworked this part as well. I hope it is clearer.

> 8. Standbys listed after this will take over the role
> of synchronous standby if the first one should fail.
>
> Should we not make it:
> Standbys listed after this will take over the role of synchronous
> standby if any of the first synchronous-standby-num standbys fails.

Fixed as proposed. At the same time I found a bug with pg_stat_get_wal_senders caused by a NULL pointer that was freed when s_s_num = 0.

An updated patch addressing these comments is attached. On top of that, the documentation has been reworked a bit by replacing the too-high number of <xref> blocks with <varname>, keeping the link to a given variable only at its first mention.

Regards,
--
Michael
Attachment
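To visualize the single-pass selection discussed in point 1 above, here is an outline of how one scan of the walsender array can keep the allowed_sync_nodes entries with the lowest (best) priority values. Spinlocking and walsender state checks are omitted for brevity, and this is a sketch of the approach, not code from the attached patch:

int         i, j;
int         num_sync = 0;
int         worst = 0;      /* index in sync_standbys[] of the worst kept priority */
int        *sync_standbys = (int *) palloc(allowed_sync_nodes * sizeof(int));

for (i = 0; i < max_wal_senders; i++)
{
    volatile WalSnd *walsnd = &WalSndCtl->walsnds[i];
    int         priority = walsnd->sync_standby_priority;

    if (walsnd->pid == 0 || priority == 0)
        continue;                           /* unused slot, or async standby */

    if (num_sync < allowed_sync_nodes)
        sync_standbys[num_sync++] = i;      /* room left, take it */
    else if (priority <
             WalSndCtl->walsnds[sync_standbys[worst]].sync_standby_priority)
        sync_standbys[worst] = i;           /* better than our worst, evict it */
    else
        continue;

    /* the kept set changed: recompute which entry is now the worst */
    worst = 0;
    for (j = 1; j < num_sync; j++)
        if (WalSndCtl->walsnds[sync_standbys[j]].sync_standby_priority >
            WalSndCtl->walsnds[sync_standbys[worst]].sync_standby_priority)
            worst = j;
}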
On 08/28/2014 10:10 AM, Michael Paquier wrote:
> + #synchronous_standby_num = -1 # number of standbys servers using sync rep

To be honest, that's a horrible name for the GUC. Back when synchronous replication was implemented, we had looong discussions on this feature. It was called "quorum commit" back then. I'd suggest using the "quorum" term in this patch, too; that's a fairly well-known term in distributed computing for this.

When synchronous replication was added, quorum was left out to keep things simple; the current feature set was the most we could all agree on as being useful. If you search the archives for "quorum commit" you'll see what I mean. There was a lot of confusion on what is possible and what is useful, but regarding this particular patch: people wanted to be able to describe more complicated scenarios. For example, imagine that you have a master and two standbys in the primary data center, and two more standbys in a different data center. It should be possible to specify that you must get acknowledgment from at least one standby in both data centers. Maybe you could hack that by giving the standbys in the same data center the same name, but it gets ugly, and it still won't scale to even more complex scenarios.

Maybe that's OK - we don't necessarily need to solve all scenarios at once. But it's worth considering.

BTW, how does this patch behave if there are multiple standbys connected with the same name?

- Heikki
On Thu, Sep 11, 2014 at 5:21 AM, Heikki Linnakangas <hlinnakangas@vmware.com> wrote:
> To be honest, that's a horrible name for the GUC. Back when synchronous
> replication was implemented, we had looong discussions on this feature.
> It was called "quorum commit" back then. I'd suggest using the "quorum"
> term in this patch, too; that's a fairly well-known term in distributed
> computing for this.

I am open to any suggestions. Then what about the following parameter names?
- synchronous_quorum_num
- synchronous_standby_quorum
- synchronous_standby_quorum_num
- synchronous_quorum_commit

> When synchronous replication was added, quorum was left out to keep
> things simple [...] It should be possible to specify that you must get
> acknowledgment from at least one standby in both data centers. Maybe
> you could hack that by giving the standbys in the same data center the
> same name, but it gets ugly, and it still won't scale to even more
> complex scenarios.

Currently two nodes can only have the same priority if they have the same application_name, so we could for example add a new connstring parameter called, let's say, application_group, to define groups of nodes that will have the same priority (if a node does not define application_group, it defaults to application_name; if application_name is NULL, well, we don't care much, it cannot be a sync candidate). That's a first idea that we could use to control groups of nodes. And we could switch syncrep.c to use application_group in s_s_names instead of application_name. That would be backward-compatible, and could open the door to more improvements for quorum commits as we could control groups of nodes. Well, this is a super-set of what application_name can already do, but there is no problem identifying single nodes of the same data center and how far behind in replication they are, so I think that this would be really user-friendly. An idea similar to that would be base work for the next thing... See below.

Now, in your case the two nodes in the second data center need to have the same priority either way. With this patch you can achieve that with the same node name. Where things are not that cool with this patch is something like this though:
- 5 slaves: 1 with the master (node_local), 2 in a 2nd data center (node_center1), 2 more in a 3rd data center (node_center2)
- s_s_num = 3
- s_s_names = 'node_local,node_center1,node_center2'

In this case the nodes have the following priorities:
- node_local => 1
- the 2 nodes with node_center1 => 2
- the 2 nodes with node_center2 => 3

In this {1,2,2,3,3} schema, the patch makes the system wait for node_local and the two nodes in node_center1, without caring about the ones in node_center2, as it will pick up only the nodes with the highest priorities. If the user expects the system to wait for a node in node_center2, he'll be disappointed.
That's perhaps where we could improve things, by adding an extra parameter able to control the priority ranks, say synchronous_priority_check:
- [absolute|individual]: wait for the first s_s_num nodes having the lowest priorities; in this case we'll wait for {1,2,2}
- group: wait for only one node in each of the lowest s_s_num priorities; here we'll wait for {1,2,3}

Note that we may not even need this parameter if we assume by default that we wait for only one node in a given group that has the same priority.

> Maybe that's OK - we don't necessarily need to solve all scenarios at
> once. But it's worth considering.

Parametrizing and covering user expectations is tricky. Either way, not everybody can be happy :) There are even people unhappy now because we can only define one single sync node.

> BTW, how does this patch behave if there are multiple standbys
> connected with the same name?

All the nodes have the same priority. For example, in the case of a cluster with 5 slaves having the same application_name and s_s_num = 3, the first three nodes found when scanning the WAL sender array are expected to return a COMMIT confirmation before the master commits locally:

=# show synchronous_standby_num ;
 synchronous_standby_num
-------------------------
 3
(1 row)

=# show synchronous_standby_names ;
 synchronous_standby_names
---------------------------
 node
(1 row)

=# SELECT application_name, client_port,
   pg_xlog_location_diff(sent_location, flush_location) AS replay_delta,
   sync_priority, sync_state
   FROM pg_stat_replication ORDER BY replay_delta ASC, application_name;
 application_name | client_port | replay_delta | sync_priority | sync_state
------------------+-------------+--------------+---------------+------------
 node             |       50251 |            0 |             1 | sync
 node             |       50252 |            0 |             1 | sync
 node             |       50253 |            0 |             1 | sync
 node             |       50254 |            0 |             1 | potential
 node             |       50255 |            0 |             1 | potential
(5 rows)

After writing this long message, and thinking more about it, I kind of like the group approach. Thoughts welcome.

Regards,
--
Michael
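If it helps to picture the application_group idea, here is how it might look from the standby and master sides; application_group is purely hypothetical at this point and exists in no patch:

# Hypothetical recovery.conf on a standby in the 2nd data center:
primary_conninfo = 'host=master port=5432 application_name=standby2a application_group=node_center1'

# On the master, s_s_names would then match groups instead of names:
synchronous_standby_names = 'node_local,node_center1,node_center2'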
On Thu, Sep 11, 2014 at 9:10 AM, Michael Paquier <michael.paquier@gmail.com> wrote:
> On Thu, Sep 11, 2014 at 5:21 AM, Heikki Linnakangas
> <hlinnakangas@vmware.com> wrote:
> > On 08/28/2014 10:10 AM, Michael Paquier wrote:
> >>
> >> + #synchronous_standby_num = -1 # number of standbys servers using sync
> >> rep
> >
> >
> > To be honest, that's a horrible name for the GUC. Back when synchronous
> > replication was implemented, we had looong discussions on this feature. It
> > was called "quorum commit" back then. I'd suggest using the "quorum" term in
> > this patch, too, that's a fairly well-known term in distributed computing
> > for this.
> I am open to any suggestions. Then what about the following parameter names?
> - synchronous_quorum_num
> - synchronous_standby_quorum
> - synchronous_standby_quorum_num
> - synchronous_quorum_commit
or simply synchronous_standbys
> > When synchronous replication was added, quorum was left out to keep things
> > simple; the current feature set was the most we could all agree on to be
> > useful. If you search the archives for "quorum commit" you'll see what I
> > mean. There was a lot of confusion on what is possible and what is useful,
> > but regarding this particular patch: people wanted to be able to describe
> > more complicated scenarios. For example, imagine that you have a master and
> > two standbys in one the primary data center, and two more standbys in a
> > different data center. It should be possible to specify that you must get
> > acknowledgment from at least one standby in both data centers. Maybe you
> > could hack that by giving the standbys in the same data center the same
> > name, but it gets ugly, and it still won't scale to even more complex
> > scenarios.
Won't this problem be handled if synchronous mode is supported in
cascading replication?
I am not sure about the feasibility of that, but I think the basic problem
mentioned above could be addressed, and maybe others as well.
On Thu, Aug 28, 2014 at 12:40 PM, Michael Paquier <michael.paquier@gmail.com> wrote:
>
> On Wed, Aug 27, 2014 at 2:46 PM, Rajeev rastogi
> <rajeev.rastogi@huawei.com> wrote:
> > I have done some more review, below are my comments:
> Thanks!
>
> > 1. There are currently two loops on *num_sync, Can we simplify the function SyncRepGetSynchronousNodes by moving the priority calculation inside the upper loop
> > if (*num_sync == allowed_sync_nodes)
> > {
> > for (j = 0; j < *num_sync; j++)
> > {
> > Anyway we require priority only if *num_sync == allowed_sync_nodes condition matches.
> > So in this loop itself, we can calculate the priority as well as assigment of new standbys with lower priority.
> > Let me know if you see any issue with this.
>
> OK, I see, yes this can minimize process a bit so I refactored the
> code by integrating the second loop to the first. This has needed the
> removal of the break portion as we need to find the highest priority
> value among the nodes already determined as synchronous.
>
> > 2. Comment inside the function SyncRepReleaseWaiters,
> > /*
> > * We should have found ourselves at least, except if it is not expected
> > * to find any synchronous nodes.
> > */
> > Assert(num_sync > 0);
> >
> > I think "except if it is not expected to find any synchronous nodes" is not required.
> > As if it has come till this point means at least this node is synchronous.
> Yes, removed.
>
> > 3. Document says that s_s_num should be lesser than max_wal_senders but code wise there is no protection for the same.
> > IMHO, s_s_num should be lesser than or equal to max_wal_senders otherwise COMMIT will never return back the console without
> > any knowledge of user.
> > I see that some discussion has happened regarding this but I think just adding documentation for this is not enough.
> > I am not sure what issue is observed in adding check during GUC initialization but if there is unavoidable issue during GUC initialization then can't we try to add check at later points.
>
> The trick here is that you cannot really return a warning at GUC
> loading level to the user as a warning could be easily triggered if
> for example s_s_num is present before max_wal_senders in
> postgresql.conf. I am open to any solutions if there are any (like an
> error when initializing WAL senders?!). Documentation seems enough for
> me to warn the user.
How about making it a PGC_POSTMASTER parameter and then
if (max_wal_senders > 0 && wal_level == WAL_LEVEL_MINIMAL)
ereport(ERROR,
(errmsg("WAL streaming (max_wal_senders > 0) requires wal_level \"archive\", \"hot_standby\", or \"logical\"")));
>
> On Wed, Aug 27, 2014 at 2:46 PM, Rajeev rastogi
> <rajeev.rastogi@huawei.com> wrote:
> > I have done some more review, below are my comments:
> Thanks!
>
> > 1. There are currently two loops on *num_sync, Can we simplify the function SyncRepGetSynchronousNodes by moving the priority calculation inside the upper loop
> > if (*num_sync == allowed_sync_nodes)
> > {
> > for (j = 0; j < *num_sync; j++)
> > {
> > Anyway we require priority only if *num_sync == allowed_sync_nodes condition matches.
> > So in this loop itself, we can calculate the priority as well as assigment of new standbys with lower priority.
> > Let me know if you see any issue with this.
>
> OK, I see, yes this can minimize process a bit so I refactored the
> code by integrating the second loop to the first. This has needed the
> removal of the break portion as we need to find the highest priority
> value among the nodes already determined as synchronous.
>
> > 2. Comment inside the function SyncRepReleaseWaiters,
> > /*
> > * We should have found ourselves at least, except if it is not expected
> > * to find any synchronous nodes.
> > */
> > Assert(num_sync > 0);
> >
> > I think "except if it is not expected to find any synchronous nodes" is not required.
> > As if it has come till this point means at least this node is synchronous.
> Yes, removed.
>
> > 3. Document says that s_s_num should be lesser than max_wal_senders but code wise there is no protection for the same.
> > IMHO, s_s_num should be lesser than or equal to max_wal_senders otherwise COMMIT will never return back the console without
> > any knowledge of user.
> > I see that some discussion has happened regarding this but I think just adding documentation for this is not enough.
> > I am not sure what issue is observed in adding check during GUC initialization but if there is unavoidable issue during GUC initialization then can't we try to add check at later points.
>
> The trick here is that you cannot really return a warning at GUC
> loading level to the user as a warning could be easily triggered if
> for example s_s_num is present before max_wal_senders in
> postgresql.conf. I am open to any solutions if there are any (like an
> error when initializing WAL senders?!). Documentation seems enough for
> me to warn the user.
How about making it a PGC_POSTMASTER parameter and then
adding a check similar to the ones below in PostmasterMain()?
    /*
     * Check for invalid combinations of GUC settings.
     */
    if (ReservedBackends >= MaxConnections)
    {
        write_stderr("%s: superuser_reserved_connections must be less than max_connections\n", progname);
        ExitPostmaster(1);
    }
    if (max_wal_senders >= MaxConnections)
    {
        write_stderr("%s: max_wal_senders must be less than max_connections\n", progname);
        ExitPostmaster(1);
    }
    if (XLogArchiveMode && wal_level == WAL_LEVEL_MINIMAL)
        ereport(ERROR,
                (errmsg("WAL archival (archive_mode=on) requires wal_level \"archive\", \"hot_standby\", or \"logical\"")));
    if (max_wal_senders > 0 && wal_level == WAL_LEVEL_MINIMAL)
        ereport(ERROR,
                (errmsg("WAL streaming (max_wal_senders > 0) requires wal_level \"archive\", \"hot_standby\", or \"logical\"")));
On Wed, Sep 10, 2014 at 11:40 PM, Michael Paquier <michael.paquier@gmail.com> wrote:
> Currently two nodes can only have the same priority if they have the
> same application_name, so we could for example add a new connstring
> parameter called, let's say, application_group, to define groups of
> nodes that will have the same priority (if a node does not define
> application_group, it defaults to application_name; if app_name is
> NULL, well, we don't care much, it cannot be a sync candidate). That's a
> first idea that we could use to control groups of nodes. And we could
> switch syncrep.c to use application_group in s_s_names instead of
> application_name. That would be backward-compatible, and could open
> the door for more improvements for quorum commits as we could control
> groups of nodes. Well, this is a super-set of what application_name
> can already do, but there is no problem to identify single nodes of
> the same data center and how much they could be late in replication,
> so I think that this would be really user-friendly. An idea similar to
> that would be a base work for the next thing... See below.

In general, I think the user's requirement for which synchronous
standbys need to acknowledge a commit could be an arbitrary Boolean
expression - well, probably no NOT, but any amount of AND and OR that
you want to use. Can someone want A OR (((B AND C) OR (D AND E)) AND
F)? Maybe! Based on previous discussions, it seems not unlikely that
as soon as we decide we don't want to support that, someone will tell
us they can't live without it. In general, though, I'd expect the two
common patterns to be more or less what you've set forth above: any K
servers from set X plus any L servers from set Y plus any M servers
from set Z, etc.

However, I'm not confident it's right to control this by adding more
configuration on the client side. I think it would be better to stick
with the idea that each client specifies an application_name, and then
the master specifies the policy in some way. One advantage of that is
that you can change the rules in ONE place - the master - rather than
potentially having to update every client.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
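To make the quoted application_group idea concrete: a standby in data center 1 might connect with the hypothetical lines below. application_group does not exist as a connection parameter; this is purely a sketch of the proposal.

    # hypothetical primary_conninfo on a standby in data center 1
    primary_conninfo = 'host=master application_name=node1a application_group=dc1'
    # on the master, s_s_names would then name groups instead of individual nodes
    synchronous_standby_names = 'dc1,dc2'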
On Fri, Sep 12, 2014 at 12:48 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Wed, Sep 10, 2014 at 11:40 PM, Michael Paquier
> <michael.paquier@gmail.com> wrote:
>> Currently two nodes can only have the same priority if they have the
>> same application_name, so we could for example add a new connstring
>> parameter called, let's say, application_group, to define groups of
>> nodes that will have the same priority (if a node does not define
>> application_group, it defaults to application_name; if app_name is
>> NULL, well, we don't care much, it cannot be a sync candidate). That's a
>> first idea that we could use to control groups of nodes. And we could
>> switch syncrep.c to use application_group in s_s_names instead of
>> application_name. That would be backward-compatible, and could open
>> the door for more improvements for quorum commits as we could control
>> groups of nodes. Well, this is a super-set of what application_name
>> can already do, but there is no problem to identify single nodes of
>> the same data center and how much they could be late in replication,
>> so I think that this would be really user-friendly. An idea similar to
>> that would be a base work for the next thing... See below.
>
> In general, I think the user's requirement for which synchronous
> standbys need to acknowledge a commit could be an arbitrary Boolean
> expression - well, probably no NOT, but any amount of AND and OR that
> you want to use. Can someone want A OR (((B AND C) OR (D AND E)) AND
> F)? Maybe! Based on previous discussions, it seems not unlikely that
> as soon as we decide we don't want to support that, someone will tell
> us they can't live without it. In general, though, I'd expect the two
> common patterns to be more or less what you've set forth above: any K
> servers from set X plus any L servers from set Y plus any M servers
> from set Z, etc. However, I'm not confident it's right to control this
> by adding more configuration on the client side. I think it would be
> better to stick with the idea that each client specifies an
> application_name, and then the master specifies the policy in some
> way. One advantage of that is that you can change the rules in ONE
> place - the master - rather than potentially having to update every
> client.

OK. I see your point.

Now, what about the following assumptions (somewhat restrictive, to
facilitate the user experience of setting up syncrep and the
parametrization of this feature):
- Nodes are defined within the same set (or group) if they have the
same priority, aka the same application_name.
- One node cannot be a part of two sets. That's obvious...

The current patch has its own merit, but it fails in the case you and
Heikki are describing: wait for k nodes in set 1 (nodes with the lowest
priority value), l nodes in set 2 (nodes with the 2nd-lowest priority
value), etc.
What it does is: if for example we have a set of nodes with priorities
{0,1,1,2,2,3,3}, backends will wait for flush_position from the first
s_s_num nodes. By setting s_s_num to 3, we'll wait for {0,1,1}; to 4,
{0,1,1,2}; etc.

Now, what about this: instead of waiting for the nodes in "absolute"
order the way the current patch does, let's do it in a "relative" way.
By that I mean that a backend waits for flush_position confirmation
only from *1* node among a set of nodes having the same priority. So by
using s_s_num = 3, we'll wait for {0, "one node with 1", "one node with
2"}, and you can guess the rest.
The point is as well that we can keep the s_s_num behavior as it is now:
- if set to -1, we rely on the default behavior of s_s_names (empty
means all nodes async, at least one entry meaning that we need to wait
for a node)
- if set to 0, all nodes are forced to be async'd
- if set to n > 1, we have to wait for one node in each set of the n
lowest priority values.
I'd see enough users happy with those improvements, and that would help
improve the coverage of the test cases that Heikki and you envisioned.

By the way, as the CF is running out of time, I am going to mark this
patch as "Returned with Feedback", as I have received enough feedback.
I am still planning to work on this for the next CF, so it would be
great if there is an agreement on what can be done for this feature, to
avoid blind progress. Particularly, I see some merit in the last idea,
which we could still extend by allowing values of the type "k,l,m" in
s_s_num to let the user decide: wait for 3 sets, k nodes in set 1, l
nodes in set 2 and m nodes in set 3. Having a GUC parameter with a list
of integer values is not that user-friendly though, so I think I'd hold
to having only one node per set. Thoughts?
--
Michael
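To illustrate the "relative" semantics described above with a concrete configuration (all names hypothetical): suppose two standbys connect with application_name=london and one with application_name=paris.

    synchronous_standby_names = 'london,paris'  # london nodes share priority 1, paris gets 2
    synchronous_standby_num = 2                 # wait for one london node plus the paris node

A commit would then wait for a flush confirmation from any one of the two london nodes plus the paris node, rather than from the first two connected senders in absolute priority order.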
On Fri, Sep 12, 2014 at 1:13 AM, Michael Paquier <michael.paquier@gmail.com> wrote:
> OK. I see your point.
>
> Now, what about the following assumptions (somewhat restrictive, to
> facilitate the user experience of setting up syncrep and the
> parametrization of this feature):
> - Nodes are defined within the same set (or group) if they have the
> same priority, aka the same application_name.
> - One node cannot be a part of two sets. That's obvious...

I feel pretty strongly that we should encourage people to use a
different application_name for every server. The fact that a server is
interchangeable for one purpose does not mean that it's interchangeable
for all purposes; let's try to keep application_name as the identifier
for a server, and design the other facilities we need around that.

> The current patch has its own merit, but it fails in the case you and
> Heikki are describing: wait for k nodes in set 1 (nodes with the lowest
> priority value), l nodes in set 2 (nodes with the 2nd-lowest priority
> value), etc.
> What it does is: if for example we have a set of nodes with priorities
> {0,1,1,2,2,3,3}, backends will wait for flush_position from the first
> s_s_num nodes. By setting s_s_num to 3, we'll wait for {0,1,1}; to 4,
> {0,1,1,2}; etc.
>
> Now, what about this: instead of waiting for the nodes in "absolute"
> order the way the current patch does, let's do it in a "relative" way.
> By that I mean that a backend waits for flush_position confirmation
> only from *1* node among a set of nodes having the same priority. So by
> using s_s_num = 3, we'll wait for {0, "one node with 1", "one node with
> 2"}, and you can guess the rest.
>
> The point is as well that we can keep the s_s_num behavior as it is now:
> - if set to -1, we rely on the default behavior of s_s_names (empty
> means all nodes async, at least one entry meaning that we need to wait
> for a node)
> - if set to 0, all nodes are forced to be async'd
> - if set to n > 1, we have to wait for one node in each set of the n
> lowest priority values.
> I'd see enough users happy with those improvements, and that would help
> improve the coverage of the test cases that Heikki and you envisioned.

Sounds confusing.

I hate to be the guy always suggesting a mini-language (cf. recent
discussion of an expression syntax for pgbench), but we could do much
more powerful and flexible things here if we had one. For example,
suppose we let each element of synchronous_standby_names use the
constructs (X,Y,Z,...) [meaning one of the parenthesized servers] and
N(X,Y,Z,...) [meaning N of the parenthesized servers]. Then if you want
to consider a commit acknowledged when you have any two of foo, bar,
and baz, you can write:

synchronous_standby_names = 2(foo,bar,baz)

And if you want to acknowledge when you've got either foo or both bar
and baz, you can write:

synchronous_standby_names = (foo,2(bar,baz))

And if you want one of foo and bar and one of baz and bletch, you can
write:

synchronous_standby_names = 2((foo,bar),(baz,bletch))

The crazy-complicated policy I mentioned upthread would be:

synchronous_standby_names = (a,2((2(b,c),2(d,e)),f))

or (equivalently and more simply)

synchronous_standby_names = (a,3(b,c,f),3(d,e,f))

We could have a rule that we fall back to the next rule in
synchronous_standby_names when the first rule can never be satisfied by
the connected standbys.
For example, if you have foo, bar, and baz, and you want any two of the
three, but wish to prefer waiting for foo over the others when it's
connected, then you could write:

synchronous_standby_names = 2(foo,(bar,baz)), 2(bar,baz)

If foo disconnects, the first rule can never be met, so we use the
second rule. It's still 2 out of 3, just as if we'd written
2(foo,bar,baz), but we won't accept an ack from bar and baz as
sufficient unless foo is dead.

The exact syntax here is of course debatable; maybe somebody can come
up with something better. But it doesn't seem like it would be
incredibly painful to implement, and it would give us a lot of
flexibility.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
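To show how cheap such a grammar is to evaluate, here is a toy, self-contained evaluator for the nested constructs sketched above. It is illustrative only and not part of any patch: the function names and the way acknowledged standbys are passed in are invented, whitespace is not handled, and a real implementation would operate on the WAL sender array in syncrep.c instead.

    #include <ctype.h>
    #include <stdbool.h>
    #include <stdio.h>
    #include <string.h>

    static const char *p;       /* parse cursor */
    static const char **acked;  /* NULL-terminated list of acked standbys */

    static bool
    is_acked(const char *name, size_t len)
    {
        for (int i = 0; acked[i]; i++)
            if (strlen(acked[i]) == len && strncmp(acked[i], name, len) == 0)
                return true;
        return false;
    }

    /* expr := NUM? '(' expr (',' expr)* ')' | NAME; returns 1 if satisfied */
    static int
    parse_expr(void)
    {
        int need = 1;           /* bare parentheses mean "one of" */

        if (isdigit((unsigned char) *p))
        {
            need = 0;
            while (isdigit((unsigned char) *p))
                need = need * 10 + (*p++ - '0');
        }
        if (*p == '(')
        {
            int have = 0;

            p++;
            for (;;)
            {
                have += parse_expr();
                if (*p != ',')
                    break;
                p++;
            }
            if (*p == ')')
                p++;
            return have >= need;
        }
        else
        {
            const char *start = p;

            while (*p && *p != ',' && *p != ')')
                p++;
            return is_acked(start, p - start);
        }
    }

    int
    main(void)
    {
        static const char *acks[] = {"bar", "baz", NULL};

        acked = acks;
        p = "(foo,2(bar,baz))";  /* foo, or both bar and baz */
        printf("satisfied: %s\n", parse_expr() ? "yes" : "no");  /* yes */
        return 0;
    }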
On 9/12/14, 2:28 PM, Robert Haas wrote:
> I hate to be the guy always suggesting a mini-language (cf. recent
> discussion of an expression syntax for pgbench), but we could do much
> more powerful and flexible things here if we had one. For example,
> suppose we let each element of synchronous_standby_names use the
> constructs (X,Y,Z,...)

While I have my old list history hat on this afternoon: when the 9.1
deadline was approaching, I said that some people were not going to be
happy until "is it safe to commit?" calls an arbitrary function that is
passed the names of all the active servers, and then they could plug
whatever consensus rule they wanted into there. And then I said that if
we actually wanted to ship something, it should be some stupid simple
thing like just putting a list of servers in synchronous_standby_names
and proceeding if one is active. One of those two ideas worked out...

Can you make a case for why it needs to be a mini-language instead of a
function?

--
Greg Smith
greg.smith@crunchydatasolutions.com
Chief PostgreSQL Evangelist - http://crunchydatasolutions.com/
On Fri, Sep 12, 2014 at 2:44 PM, Gregory Smith <gregsmithpgsql@gmail.com> wrote:
> On 9/12/14, 2:28 PM, Robert Haas wrote:
>> I hate to be the guy always suggesting a mini-language (cf. recent
>> discussion of an expression syntax for pgbench), but we could do much
>> more powerful and flexible things here if we had one. For example,
>> suppose we let each element of synchronous_standby_names use the
>> constructs (X,Y,Z,...)
>
> While I have my old list history hat on this afternoon: when the 9.1
> deadline was approaching, I said that some people were not going to be
> happy until "is it safe to commit?" calls an arbitrary function that is
> passed the names of all the active servers, and then they could plug
> whatever consensus rule they wanted into there. And then I said that if
> we actually wanted to ship something, it should be some stupid simple
> thing like just putting a list of servers in synchronous_standby_names
> and proceeding if one is active. One of those two ideas worked out...
>
> Can you make a case for why it needs to be a mini-language instead of a
> function?

I think so. If we make it a function, then it's either the kind of
function that you access via pg_proc, or it's the kind that's written
in C and installed by storing a function pointer in a hook variable
from _PG_init(). The first approach is a non-starter because it would
require walsender to be connected to the database where that function
lives, which is a non-starter at least for logical replication, where
we need walsender to be connected to the database being replicated.
Even if we found some way around that problem, and I'm not sure there
is one, I suspect the overhead would be pretty high.

The second approach - a hook that can be accessed directly by loadable
modules - seems like it would work fine; the only problem is that
you've got to write your policy function in C. But I have no issue with
exposing it that way if someone wants to write a patch. There is no joy
in getting between the advanced user and his nutty-complicated sync rep
configuration. However, I suspect that most users will prefer a more
user-friendly interface.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
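For reference, the loadable-module route described here would follow the usual PostgreSQL hook pattern. The sketch below is hypothetical: sync_rep_policy_hook, its signature, and my_sync_policy are all invented; only the module boilerplate and the _PG_init() entry point are real.

    #include "postgres.h"
    #include "fmgr.h"
    #include "nodes/pg_list.h"

    PG_MODULE_MAGIC;

    /* In a contrib module; the hook variable and its signature are invented. */
    static bool
    my_sync_policy(List *acked_standby_names)
    {
        /* Consider the commit safe once any two standbys have acknowledged. */
        return list_length(acked_standby_names) >= 2;
    }

    void
    _PG_init(void)
    {
        sync_rep_policy_hook = my_sync_policy;
    }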
On Sat, Sep 13, 2014 at 3:56 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> I think so. If we make it a function, then it's either the kind of
> function that you access via pg_proc, or it's the kind that's written
> in C and installed by storing a function pointer in a hook variable
> from _PG_init(). The first approach is a non-starter because it would
> require walsender to be connected to the database where that function
> lives, which is a non-starter at least for logical replication, where
> we need walsender to be connected to the database being replicated.
> Even if we found some way around that problem, and I'm not sure there
> is one, I suspect the overhead would be pretty high. The second
> approach - a hook that can be accessed directly by loadable modules -
> seems like it would work fine; the only problem is that you've got to
> write your policy function in C. But I have no issue with exposing it
> that way if someone wants to write a patch. There is no joy in getting
> between the advanced user and his nutty-complicated sync rep
> configuration. However, I suspect that most users will prefer a more
> user-friendly interface.

Reading both your answers, I'd tend to think that having a set of hooks
to satisfy all the potential user requests would be enough. We could
let the server code decide the priority of the standbys using the
information in synchronous_standby_names, then have the hooks interact
with SyncRepReleaseWaiters and pg_stat_get_wal_senders. We would need
two hooks:
- one able to get an array of WAL sender positions defining all the
nodes considered as sync nodes. This is enough for
pg_stat_get_wal_senders. SyncRepReleaseWaiters would need it as well...
- a second able to define the update policy of the write and flush
positions in walsndctl (SYNC_REP_WAIT_FLUSH and SYNC_REP_WAIT_WRITE),
as well as the next failover policy. This would be needed when a WAL
sender calls SyncRepReleaseWaiters.

Perhaps that's overthinking it, but I am getting the impression that
whatever decision is taken, it would involve modifications of the
sync-standby parametrization at the GUC level, and whatever the path
chosen (dedicated language, set of integer params), there will be
complaints about what things can or cannot be done. At least a set of
hooks has the merit to say: do what you like with your synchronous node
policy.

--
Michael
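As a sketch, declarations for those two hooks could look like the following; the type names, parameters, and hook variables are invented for illustration and match nothing in the tree:

    /*
     * Hook 1: fill sync_positions with the slots in WalSndCtl of the nodes
     * currently considered synchronous; returns how many were found.
     */
    typedef int (*SyncRepGetSyncNodes_hook_type) (int *sync_positions, int max_nodes);
    extern PGDLLIMPORT SyncRepGetSyncNodes_hook_type SyncRepGetSyncNodes_hook;

    /*
     * Hook 2: decide how the write and flush positions tracked in walsndctl
     * (SYNC_REP_WAIT_WRITE and SYNC_REP_WAIT_FLUSH) advance when a WAL
     * sender reports progress.
     */
    typedef void (*SyncRepUpdatePolicy_hook_type) (XLogRecPtr writePtr, XLogRecPtr flushPtr);
    extern PGDLLIMPORT SyncRepUpdatePolicy_hook_type SyncRepUpdatePolicy_hook;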
On Mon, Sep 15, 2014 at 3:00 PM, Michael Paquier <michael.paquier@gmail.com> wrote:
> At least a set of hooks has the merit to say: do what you like with
> your synchronous node policy.

Sure. I dunno if people will find that terribly user-friendly, so we
might not want that to be the ONLY thing we offer. But even if it is,
it is certainly better than a poke in the eye with a sharp stick.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Tue, Sep 16, 2014 at 5:25 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Mon, Sep 15, 2014 at 3:00 PM, Michael Paquier
> <michael.paquier@gmail.com> wrote:
>> At least a set of hooks has the merit to say: do what you like with
>> your synchronous node policy.
>
> Sure. I dunno if people will find that terribly user-friendly, so we
> might not want that to be the ONLY thing we offer.

Well, a user-friendly interface is actually the reason why a simple
integer GUC was used in the first series of patches on this thread to
set as sync the N nodes with the lowest priority. I could not come up
with anything simpler. Hence, what about doing the following:
- A patch refactoring the code for pg_stat_get_wal_senders and
SyncRepReleaseWaiters, as both duplicate code in this area to select
the synchronous node as the one connected with the lowest priority.
- A patch defining the necessary hooks; I suspect that two of them are
needed, as mentioned upthread.
- A patch for a contrib module implementing an example of a simple
policy. It can be a fancy thing with a custom language or something
simpler.
Thoughts? Patch 1, refactoring the code, is a win in all cases.

Regards,
--
Michael
On Tue, Sep 16, 2014 at 2:19 PM, Michael Paquier <michael.paquier@gmail.com> wrote:
> On Tue, Sep 16, 2014 at 5:25 AM, Robert Haas <robertmhaas@gmail.com> wrote:
>> On Mon, Sep 15, 2014 at 3:00 PM, Michael Paquier
>> <michael.paquier@gmail.com> wrote:
>>> At least a set of hooks has the merit to say: do what you like with
>>> your synchronous node policy.
>>
>> Sure. I dunno if people will find that terribly user-friendly, so we
>> might not want that to be the ONLY thing we offer.
>
> Well, a user-friendly interface is actually the reason why a simple
> integer GUC was used in the first series of patches on this thread to
> set as sync the N nodes with the lowest priority. I could not come up
> with anything simpler. Hence, what about doing the following:
> - A patch refactoring the code for pg_stat_get_wal_senders and
> SyncRepReleaseWaiters, as both duplicate code in this area to select
> the synchronous node as the one connected with the lowest priority.

A strong +1 for this idea. I have never liked that, and cleaning it up
seems eminently sensible.

> - A patch defining the necessary hooks; I suspect that two of them are
> needed, as mentioned upthread.
> - A patch for a contrib module implementing an example of a simple
> policy. It can be a fancy thing with a custom language or something
> simpler.

I'm less convinced about this part. There's a big usability gap between
a GUC and a hook, and I think Heikki's comments upthread were meant to
suggest that even in GUC-land we can probably satisfy more use cases
than what this patch does now. I think that's right.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Fri, Sep 19, 2014 at 12:18 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Tue, Sep 16, 2014 at 2:19 PM, Michael Paquier
> <michael.paquier@gmail.com> wrote:
>> On Tue, Sep 16, 2014 at 5:25 AM, Robert Haas <robertmhaas@gmail.com> wrote:
>>> On Mon, Sep 15, 2014 at 3:00 PM, Michael Paquier
>>> <michael.paquier@gmail.com> wrote:
>>>> At least a set of hooks has the merit to say: do what you like with
>>>> your synchronous node policy.
>>>
>>> Sure. I dunno if people will find that terribly user-friendly, so we
>>> might not want that to be the ONLY thing we offer.
>> Well, a user-friendly interface is actually the reason why a simple
>> integer GUC was used in the first series of patches on this thread to
>> set as sync the N nodes with the lowest priority. I could not come up
>> with anything simpler. Hence, what about doing the following:
>> - A patch refactoring the code for pg_stat_get_wal_senders and
>> SyncRepReleaseWaiters, as both duplicate code in this area to select
>> the synchronous node as the one connected with the lowest priority.
>
> A strong +1 for this idea. I have never liked that, and cleaning it up
> seems eminently sensible.

Interestingly, the syncrep code has in some of its code paths the idea
that a synchronous node is unique, while other code paths assume that
there can be multiple synchronous nodes. If that's fine, I think it
would be better to just make the code multiple-sync-node aware, by
having a single function call in walsender.c and syncrep.c that returns
an integer array of WAL sender positions (WalSndCtl), as that seems
more extensible long-term. Well, for now the array would have a single
element, being the WAL sender with the lowest priority > 0. Feel free
to protest about that approach though :)

>> - A patch defining the necessary hooks; I suspect that two of them are
>> needed, as mentioned upthread.
>> - A patch for a contrib module implementing an example of a simple
>> policy. It can be a fancy thing with a custom language or something
>> simpler.
>
> I'm less convinced about this part. There's a big usability gap between
> a GUC and a hook, and I think Heikki's comments upthread were meant to
> suggest that even in GUC-land we can probably satisfy more use cases
> than what this patch does now. I think that's right.

Hehe. OK, then let's see how something with a GUC would go. No existing
parameter uses a custom language as its format, but I guess it is fine
to have a text parameter with a validation callback. No?

--
Michael
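For what it's worth, the validation-callback machinery mentioned here already exists for string GUCs; a check hook has the shape below. The function name and the parsing body are illustrative only - nothing like this exists for synchronous_standby_names today.

    static bool
    check_synchronous_standby_names(char **newval, void **extra, GucSource source)
    {
        /*
         * Parse *newval according to the custom format here; on a syntax
         * error, report details with GUC_check_errdetail() and return
         * false so that the new setting is rejected.
         */
        return true;
    }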